
Big Data Warehouse

Location:
Fort Worth, TX
Posted:
May 30, 2025

Resume:

Pavan Supreeth Bendi

+1-872-***-**** ***********@*****.*** LinkedIn: www.linkedin.com/in/bendi-pavan

Summary

● 6 years of IT experience in Data Engineering, Analytics, Data Modeling, and Big Data using Scala, PySpark, Hadoop, and HDFS environments, with expertise in Python and cloud technologies (AWS, GCP).

● Implemented Big Data solutions using the Hadoop technology stack, including PySpark, Hive, Sqoop, Avro & Thrift.

● Developed and scheduled ETL workflows for data extraction, transformation, and loading using Informatica PowerCenter.

● Hands-on experience with Google Cloud Platform (GCP) big data products, including BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer (Airflow as a service).

● Defined data warehouse schemas (star and snowflake schemas), fact tables, cubes, dimensions, and measures using SQL Server Analysis Services.

● Implemented UNIX shell scripts to run the Informatica workflows and control the ETL flow.

● Firm understanding of Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce programming.

● Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie.

● Experienced in optimizing PySpark jobs to run on Kubernetes clusters for faster data processing.

● Converted Hive queries into Spark actions and transformations by creating RDDs and DataFrames from the required files in HDFS.

● Experienced in developing scripts using Python and shell scripting to extract, load, and transform data; working knowledge of GCP SSH access.

● Designed star and snowflake data models for the enterprise data warehouse and created/updated data models using ER Studio.

● Hands-on experience integrating Informatica PowerCenter with various applications and relational databases.

● Developed AWS CI/CD data pipelines and an AWS data lake using EC2, AWS Glue, and AWS Lambda.

● Experience with the Snowflake cloud data warehouse and AWS S3 for integrating data from multiple source systems, including loading nested JSON data into Snowflake tables.

● Hands-on experience with AWS API Gateway endpoint types (edge-optimized, regional, and private); configured controls at the API key, method, and account levels.

● Experienced in optimizing Hive queries using best practices and appropriate parameters, working with technologies such as Hadoop, YARN, Python, and PySpark.

● Experienced in requirement analysis, application development, application migration, and maintenance using the Software Development Lifecycle (SDLC) and Python/Java technologies.

Technical Skills

Big Data Ecosystem: HDFS, MapReduce, Hive, YARN, Pig, Sqoop, Flume, HBase, Kafka Connect, Impala, StreamSets, Oozie, Spark, Zookeeper, NiFi, Amazon Web Services
Hadoop Distributions: Apache Hadoop 1.x/2.x, Cloudera CDP, Hortonworks HDP
Programming Languages: Python, Scala, Java, HiveQL, Shell Scripting, Bash
Software Methodologies: Agile, SDLC Waterfall
IDEs: Eclipse, NetBeans, IntelliJ, Spring Tool Suite
Databases: MySQL, MS SQL Server, Oracle, PostgreSQL, DB2, DynamoDB, Workbench
NoSQL: HBase, MongoDB, Cassandra
ETL/BI: Power BI, Tableau, Talend, Informatica, SSIS, SSRS, SSAS
Version Control: Git, SVN, Bitbucket
Web Development: JavaScript, Node.js, HTML, CSS, Spring, J2EE, JDBC, Hibernate
Operating Systems: Windows, Linux (Unix, Ubuntu), macOS
Cloud Technologies: Amazon Web Services (EC2, S3), Azure Databricks, GCP

Work Experiences

Capital One, Plano, TX Jan’24 - Present

Sr. Big Data Engineer (AWS, Snowflake)

Responsibilities:

• Designed and developed end-to-end data pipelines on AWS to ingest both batch and streaming data from DCIS into structured data products using S3, Glue, and Lambda.

• Engineered multi-layered data transformation frameworks (D0, D1, D2A, D2B) for secure and role-specific data access across raw, analyst, and business domains.

• Leveraged AWS Glue and Step Functions to orchestrate scalable ETL processes, ensuring seamless ingestion, transformation, and loading of data into downstream systems.

• Implemented robust data aggregation logic and pushed curated datasets into Snowflake for enterprise-level analytics and reporting.

• Integrated DBT (Data Build Tool) with Snowflake to modularize and automate SQL transformations, enabling version-controlled and testable data models.

• Built real-time data sinks in DynamoDB for fast-access analytics use cases and API consumption across business platforms.

• Collaborated with cross-functional teams including data analysts and business analysts, ensuring governance and data accessibility through D2A and D2B layers.

• Created reusable Terraform modules (IaC) to automate the deployment of AWS resources like Glue jobs, Lambdas, and MWAA DAGs, enabling consistent infrastructure across environments.

• Performed most-used analysis in Snowflake across all D1 and D2 layer tables to identify the potential data grain of each data asset, map it to data products, and publish it in the Data Product Universe.

• Built an ETL process that combined six data assets into a single denormalized data asset in DynamoDB and Snowflake, and standardized the data attributes so they were consistent across platforms and accessible to data consumers and enterprise organizations.

• Developed custom Python scripts to validate data integrity and trigger downstream actions post-transformation using AWS Lambda.

• Designed and developed Spark Structured Streaming jobs to ingest and process real-time data from Kafka and Kinesis into data lakes and warehouses.

• Built scalable ETL pipelines using Apache Spark (Scala/PySpark) for batch and streaming use cases, handling multi-terabyte datasets.

• Optimized Spark jobs by tuning partitions, memory configurations, and broadcast joins, improving performance by up to 40% (a brief tuning sketch appears at the end of this section).

• Implemented data transformation and cleansing logic in Spark to prepare raw data for downstream analytics and ML models.

• Used Spark SQL for complex aggregations, window functions, and incremental data processing.

• Orchestrated Spark jobs via Apache Airflow and integrated them with AWS services like S3, Redshift, and Glue.

• Developed idempotent and fault-tolerant Spark streaming pipelines, ensuring exactly-once processing using checkpointing and watermarking (see the sketch below).
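
A minimal PySpark sketch of the streaming pattern described above, assuming a hypothetical Kafka topic, broker, event schema, and S3 checkpoint path (none of these names come from the resume):

    # Structured Streaming sketch: Kafka -> Parquet sink with watermarking
    # and checkpointing (names and schema are illustrative assumptions).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    schema = StructType([
        StructField("account_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
           .option("subscribe", "transactions")                # hypothetical topic
           .option("startingOffsets", "latest")
           .load())

    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", schema).alias("e"))
              .select("e.*")
              .withWatermark("event_time", "10 minutes")       # bound how late data may arrive
              .dropDuplicates(["account_id", "event_time"]))   # idempotent re-processing

    query = (events.writeStream
             .format("parquet")
             .option("path", "s3://example-bucket/curated/transactions/")        # hypothetical path
             .option("checkpointLocation", "s3://example-bucket/checkpoints/txn/")  # enables recovery
             .outputMode("append")
             .start())

    query.awaitTermination()

Including the watermark column in dropDuplicates lets Spark expire old state, and the checkpoint location is what allows the query to resume with exactly-once guarantees after a failure.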

Environment: AWS, GCP, Python, PySpark, Azure, HDFS, Spark, Kafka, Hive, Yarn, Cassandra, HBase, Jenkins, Terraform, Docker, MySQL.
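
A short PySpark sketch of the partition and broadcast-join tuning mentioned in this section; the table paths, partition count, and memory setting are illustrative assumptions rather than the production values:

    # Tuning sketch: explicit shuffle partitions, repartitioning on the join
    # key, and a broadcast join for a small dimension table (all values assumed).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = (SparkSession.builder
             .appName("spark-tuning-sketch")
             .config("spark.sql.shuffle.partitions", "400")      # sized to cluster cores
             .config("spark.executor.memory", "8g")              # illustrative memory setting
             .getOrCreate())

    facts = spark.read.parquet("s3://example-bucket/facts/")       # large fact table (hypothetical)
    dim = spark.read.parquet("s3://example-bucket/dim_accounts/")  # small dimension table (hypothetical)

    # Repartition the large side on the join key to reduce skewed shuffles,
    # and broadcast the small side so it is never shuffled at all.
    joined = (facts.repartition(400, "account_id")
              .join(broadcast(dim), "account_id", "left"))

    joined.write.mode("overwrite").parquet("s3://example-bucket/curated/joined/")

Broadcasting the dimension table avoids shuffling it, and sizing spark.sql.shuffle.partitions to the cluster is typically where most of the gain comes from.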

University of Phoenix, Phoenix, AZ April ’22 – Dec ’23

Sr. Data Engineer / Data Scientist (AWS)

Responsibilities:

● Worked with various AWS services, including Glue, Redshift, Step Functions, DynamoDB, MWAA, Lambda, and SageMaker.

● Built automated deployments using Terraform (IaC) with AWS Step Functions, in which resources such as Glue jobs, Lambda functions, and SageMaker jobs were defined in Python and stored in an S3 bucket.

● Used Python to build a classification model, mapping the string-typed “states” field and other categorical fields to numerical values, and sent the results along with the predicted values to the AEP team.

● Involved in setting up AWS data storage services & managing buckets along with access control configurations.

● Wrote Scala-based MERGE statements for incremental loads of daily historical data keyed on time-stamped metrics.

● Used the UNNEST function in BigQuery to analyze event parameters generated from Google Analytics 4 (GA4) and executed SQL queries for the analysis.

● Worked on Apache Airflow with AWS MWAA to manage end-to-end ETL workloads in AWS.

● Prepared scripts to automate the ingestion process using PySpark and Scala from various sources, including APIs, AWS S3, AWS Step Functions, Teradata, and Snowflake.

● Developed AWS Glue jobs to merge data from source S3 buckets using Scala, Python, and the pandas framework (see the sketch at the end of this section).

● Utilized Athena services to create optimized SQL queries for querying and analyzing large datasets stored in AWS S3. Implemented SQL query optimization techniques to enhance performance and reduce query execution time.

● Developed SQL queries to examine data integrity, completeness, and accuracy, ensuring high data quality standards in AWS Athena.

● Used DBeaver with Redshift SSO to create the database and new lookup tables, and used SQL LEFT JOINs in the ETL Glue script to join the data sets.

● Developed end-to-end automation for data transformation using serverless AWS Step Functions.

● Used Athena services for creating the tables for querying the data and performing QA checks.

● Used SageMaker Processing jobs to create machine learning models for prediction.

● Built Docker images, stored them in AWS ECR, and applied them in the model to perform predictive analysis.

● Used AWS ECS to create a cluster for running task definitions as an alternative to standalone AWS EC2 instances.

● Pulled data from the server through an SFTP portal and processed it using AWS Lambda functions.

● Used Terraform (IaC) with both AWS and GCP services to create resources such as Dataproc, BigQuery, Cloud Storage buckets, Dataflow, Google Analytics, and Pub/Sub.

● Worked on transformations using Dataproc clusters in Scala and used BigQuery for data analysis, running SQL queries to inspect schema granularity.

● Used Terraform scripts to provision the clusters and buckets used to store and process marketing media data (about 30 GB per day, roughly 1 TB in total) in both AWS and GCP.

● Experience in event logging through AWS Lambda for Glue ETL, SageMaker, Athena, DynamoDB, Redshift, MWAA, Step Functions, and KMS.

Environment: AWS, GCP, Python, PySpark, Azure, HDFS, Spark, Kafka, Hive, Yarn, Cassandra, HBase, Jenkins, Terraform, Docker, MySQL.
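
A simplified sketch of the kind of Glue-style PySpark merge job described above; it is written as plain PySpark, and the bucket names, join key, and columns are placeholders:

    # Sketch of a Glue-style PySpark job that merges two S3 datasets on a
    # shared key and writes the result back to S3 (all names are placeholders).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("glue-merge-sketch").getOrCreate()

    orders = spark.read.parquet("s3://example-source/orders/")      # hypothetical dataset
    customers = spark.read.json("s3://example-source/customers/")   # hypothetical dataset

    merged = (orders.join(customers, on="customer_id", how="left")
              .dropDuplicates(["order_id"]))

    # Small lookup extracts can be handled in pandas when convenient.
    lookup_pd = merged.select("customer_id", "segment").distinct().toPandas()
    print(lookup_pd.head())

    merged.write.mode("overwrite").partitionBy("order_date").parquet(
        "s3://example-target/merged_orders/")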

Value Labs, Hyderabad, India Feb’20 – Aug ’21

Big Data / Hadoop (Spark) Engineer

Responsibilities:

● Responsible for designing, implementing, and architecting large-scale data intelligence solutions on big data platforms.

● Analyzed large and critical datasets using HDFS, HBase, Hive, HQL, Pig, Sqoop, and Zookeeper.

● Developed multiple POCs using Spark and Scala and deployed them on the Yarn Cluster, compared the performance of Spark with Hive and SQL.

● Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as a storage mechanism.

● Wrote custom PySpark scripts to replicate the functionality of Teradata code in the data transformation process from Teradata to AWS One Lake.

● Designed and implemented automated testing and deployment processes for Scala code, using tools such as sbt, Jenkins, and Git.

● Extensively used the SET, UPDATE & MERGE statements for creating, updating & merging various SAS data sets.

● Extracted, transformed, and loaded data using SAS/ETL.

● Developed Mappings, Sessions, and Workflows to extract, validate, and transform data according to the business rules using Data Stage.

● Worked on NiFi data Pipeline to process large sets of data & configured Lookups for Data Validation & Integrity.

● Installed and configured Hadoop MapReduce and HDFS, developed multiple MapReduce jobs in Python and NiFi for data cleaning and preprocessing.

● Contributed to open-source Scala projects, such as Scala Standard Library, Akka, and Cats, by submitting bug fixes, feature requests, and code contributions.

● Worked with different file formats such as JSON, Avro, and Parquet and compression codecs such as Snappy.

● Worked on SQL queries in dimensional data warehouses and relational data warehouses. Performed Data Analysis and Data Profiling using Complex SQL queries on various systems.

● Used Model Mart of Erwin for effective model management of sharing, dividing, and reusing model information and design for productivity improvement.

● Extensively used DATA _NULL_ steps and SAS procedures such as PRINT, REPORT, TABULATE, FREQ, MEANS, SUMMARY, and TRANSPOSE for producing ad-hoc and customized reports and external files.

● Wrote Spark programs using Python, PySpark, and the pandas package for performance tuning, optimization, and data quality validation.

● Translated the business requirements into workable functional and non-functional requirements at a detailed production level using Workflow Diagrams, Sequence Diagrams, Activity Diagrams, and Use Case Modelling with the help of Erwin.

● Developed Kafka producers and consumers for streaming millions of events per second (see the sketch at the end of this section).

● Implemented a distributed messaging queue using Apache Kafka to integrate with Cassandra.

● Hands-on experience fetching live stream data from UDB into an HBase table using PySpark streaming & Apache Kafka.

● Worked on Tableau to build customized interactive reports, worksheets, and dashboards.

Environment: HDFS, Python, SQL, Web Services, MapReduce, Spark, Kafka, Teradata, Hive, Yarn, Pig, Flume, Zookeeper, Sqoop, UDB, Tableau, AWS, GitHub, Shell Scripting.
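
A bare-bones sketch of the Kafka producer/consumer pattern mentioned above, using the kafka-python client; the broker address and topic name are assumptions:

    # Minimal Kafka producer/consumer sketch with kafka-python
    # (broker address and topic are illustrative assumptions).
    import json
    from kafka import KafkaProducer, KafkaConsumer

    TOPIC = "clickstream-events"          # hypothetical topic
    BROKERS = ["localhost:9092"]          # hypothetical broker list

    # Producer: serialize dicts to JSON bytes and publish.
    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, {"user_id": "u123", "action": "page_view"})
    producer.flush()

    # Consumer: read from the beginning of the topic and deserialize JSON.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        consumer_timeout_ms=5000,         # stop iterating when no new messages arrive
    )
    for message in consumer:
        print(message.value)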

Syntel, Hyderabad, India June ’18 – Jan ’20

Data Engineer (Snowflake, Informatica)

Responsibilities:

● Worked on Informatica to create data pipelines with complex transformations & mappings to the target databases.

● Used transformations like Filter, Expression, Sequence Generator, Update Strategy, Joiner, Router, and Aggregator to create robust mappings in the Informatica Power Center Designer.

● Used FTP scripts to transfer data between consumer and HCA account portals via XML data formats.

● Involved in identifying dynamic data fields and implemented Slowly Changing Dimensions (SCD) for the modified data between native systems and end-user data-consuming layers.

● Worked on multiple Informatica Mappings & transformations like Source qualifier, Aggregator, Connected & unconnected Lookups, Filter, Rank, Stored Procedure, Expression & Sequence Generator & Reusable transformations.

● Designed fact & dimension tables & utilized multiple snowflake functions such as COPY, UNLOAD, MERGE & TRANSPOSE.

● Involved in designing and implementing data warehouse solutions for multiple business groups and providing multi-dimensional analytical services for quick and easy metrics.

● Performed data profiling, cleansing, and aggregation in the Snowflake databases.

● Worked on performance tuning by optimizing Snowflake queries and workloads in the ETL pipelines, using optimization techniques such as indexing, partitioning, and query rewriting as the data dynamics required.

● Used the EXPLAIN and PROFILE commands for query optimization and utilized clustering keys and materialized views to improve query execution.

● Worked on Snowflake integration with other systems, such as data warehouses and data lakes, for effective business-user data consumption.

● Implemented data governance practices with effective policies and procedures, change requests, and traceability for all historical database modifications.

● Set up role-based access control for user groups based on Active Directory and maintained custom security layers to handle new security scenarios.

● Led teams performing Data mining, Data Modeling, Data/Business Analytics, Data Visualization, Data Governance & Operations & Business Intelligence (BI) Analysis & communicated insights & results to the stakeholders.
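
A hedged sketch of running an incremental Snowflake MERGE through the snowflake-connector-python client, as referenced above; the connection parameters, tables, and columns are placeholders:

    # Sketch of an incremental MERGE into a Snowflake dimension table using
    # snowflake-connector-python (credentials, tables, and columns are placeholders).
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="example_account",     # placeholder credentials
        user="example_user",
        password="********",
        warehouse="ETL_WH",
        database="EDW",
        schema="STAGE",
    )

    merge_sql = """
    MERGE INTO EDW.DIM.CUSTOMER_DIM AS tgt
    USING EDW.STAGE.CUSTOMER_STG AS src
        ON tgt.CUSTOMER_ID = src.CUSTOMER_ID
    WHEN MATCHED THEN UPDATE SET
        tgt.NAME = src.NAME,
        tgt.SEGMENT = src.SEGMENT,
        tgt.UPDATED_AT = CURRENT_TIMESTAMP()
    WHEN NOT MATCHED THEN INSERT (CUSTOMER_ID, NAME, SEGMENT, UPDATED_AT)
        VALUES (src.CUSTOMER_ID, src.NAME, src.SEGMENT, CURRENT_TIMESTAMP());
    """

    try:
        cur = conn.cursor()
        cur.execute(merge_sql)
        print(f"Rows affected: {cur.rowcount}")
    finally:
        conn.close()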

Environment: Python, Spark ETL, Kafka, Tableau, MongoDB, Hadoop, SQL Server, SDLC, ETL, SSIS.

I Labs, Hyderabad, India October ’17 – April ’18

BI Developer (ETL, SQL Server)

Responsibilities:

● Worked as a Data Engineer to generate Data Models and develop a relational database system.

● Involved in Data Mapping activities for the data warehouse.

● Produced PL/SQL statements and stored procedures in DB2 for extracting and writing data.

● Developed star and snowflake schema-based dimensional models to build the data warehouse (see the sketch at the end of this section).

● Used forward engineering to create a Physical Data Model with DDL that best suits the requirements of the Logical Data Model.

● Provided source-to-target mappings to the ETL team to perform initial, full & incremental loads into the target data mart.

● Responsible for migrating the data and data models from the SQL server to the Oracle environment.

● Worked closely with the ETL SSIS developers to explain the complex data transformation logic.

● Developed normalized Logical & Physical database models to design OLTP systems for insurance applications.

● Created a dimensional model for the reporting system by identifying required dimensions and facts using ER Studio Data Architect.

● Worked with Database Administrators, Business Analysts, and Content Developers to conduct design reviews and validate the developed models.

● Identified detailed business rules and Use Cases based on requirements analysis.

● Facilitated development, testing & maintenance of quality guidelines & procedures & necessary documentation.

● Responsible for defining the naming standards for the data warehouse.

● Generated ad-hoc SQL queries using joins, database connections, and transformation rules to fetch data from legacy DB2 and SQL Server database systems.
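
A small illustrative sketch of a star-schema fact/dimension pair and an ad-hoc join, executed against SQL Server via pyodbc; the driver string, table names, and columns are placeholders and not the original model:

    # Sketch: create a simple star-schema pair and run an ad-hoc join
    # against SQL Server via pyodbc (all names are placeholders).
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=example-server;DATABASE=InsuranceDW;"
        "UID=example_user;PWD=********"
    )
    cur = conn.cursor()

    cur.execute("""
    CREATE TABLE DimPolicyHolder (
        PolicyHolderKey INT IDENTITY(1,1) PRIMARY KEY,
        PolicyHolderId  VARCHAR(20) NOT NULL,
        State           CHAR(2)     NOT NULL
    );
    """)
    cur.execute("""
    CREATE TABLE FactClaims (
        ClaimKey        INT IDENTITY(1,1) PRIMARY KEY,
        PolicyHolderKey INT NOT NULL REFERENCES DimPolicyHolder(PolicyHolderKey),
        ClaimDate       DATE NOT NULL,
        ClaimAmount     DECIMAL(12, 2) NOT NULL
    );
    """)
    conn.commit()

    # Ad-hoc join across the fact and dimension tables.
    cur.execute("""
    SELECT d.State, SUM(f.ClaimAmount) AS TotalClaims
    FROM FactClaims f
    JOIN DimPolicyHolder d ON d.PolicyHolderKey = f.PolicyHolderKey
    GROUP BY d.State;
    """)
    for state, total in cur.fetchall():
        print(state, total)
    conn.close()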

Environment: SQL Server, SQL database, ETL, SSIS, T-SQL, PL/SQL.

Education

UNIVERSITY OF NORTH TEXAS, TX, USA

Master's in Data Science (ML, AI)

• 3.8 GPA

Malla Reddy College of Engineering and Technology, HYD, IND

Bachelor's in Computer Science

• 3.6 GPA

Certification

GCP Certification, Charlotte, USA

Professional Data Engineer Series ID: 41354

https://www.credly.com/badges/ed507ce5-16b0-4143-9d20-a0c143c31533/public_url


