Pravali Rao
Phone: 330-***-****
Email: ************@*****.***
Summary:
Data Engineer with almost 5 years of hands-on experience in designing and implementing scalable data solutions. Proficient in SQL, AWS, Hadoop, and PySpark, with a strong focus on building efficient ETL pipelines, optimizing big data workflows, and delivering high-performance analytics systems.
Developed an ETL pipeline using Python & Apache Airflow, automating data ingestion from REST APIs to Amazon Redshift.
Optimized a PySpark batch job that processed 1TB+ of data, reducing execution time by 40% using partitioning and caching.
Designed a data pipeline in Python, extracting and transforming raw JSON data from an external API and loading it into Snowflake.
Implemented a real-time data processing pipeline using Kafka & PySpark, streaming transaction data into AWS S3 for analytics.
Built an automated data validation system in Python, ensuring 99.9% data accuracy before loading into PostgreSQL.
Integrated Python scripts with dbt, automating transformations in a modern data stack (Snowflake & BigQuery).
Designed and implemented batch data pipelines on Amazon EMR using Apache Spark, Python, and Scala, enabling cost-effective and scalable processing of terabytes of data.
Experienced in processing serialized data in Spark using formats like Avro, Parquet, and ORC with a deep understanding of their features and limitations.
Expertise in AWS services such as S3, Redshift, Lambda, EMR, and Glue for building scalable, cloud-native data solutions and data lakes.
Hands-on experience working with the Hadoop ecosystem, including HDFS, Hive, and MapReduce, to manage and analyze massive datasets.
Expertise in designing and building efficient ETL/ELT pipelines to process large-scale data.
Hands-on experience in building and maintaining data lakes with partitioning and columnar formats like Parquet and ORC.
Developed custom data transformations in PySpark, leveraging AWS S3 for storage.
Automated Spark job executions on AWS EMR clusters using AWS Step Functions.
Tuned PySpark performance on AWS EMR by adjusting resource configurations.
Executed large-scale data processing using PySpark and Apache Hive on AWS EMR.
Built a Python-based ingestion script to move data from S3 to Snowflake, scheduled via AWS Lambda (see the sketch at the end of this summary).
Experienced with GCP services, including Cloud Functions, Cloud Storage, Dataproc, and Compute Engine, for event-driven processing, data governance, and cost-efficient scaling (detailed under the Google engagement below).
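As a brief illustration of the S3-to-Snowflake ingestion pattern noted above, the following is a minimal sketch of a Lambda-triggered loader; the table name, stage name, and environment variables are placeholders rather than details of an actual project.

```python
import os
import snowflake.connector

def handler(event, context):
    # The S3 event notification carries the bucket and key of the newly landed file
    record = event["Records"][0]["s3"]
    key = record["object"]["key"]

    conn = snowflake.connector.connect(
        user=os.environ["SF_USER"],
        password=os.environ["SF_PASSWORD"],
        account=os.environ["SF_ACCOUNT"],
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="RAW",
    )
    try:
        # COPY INTO reads the file through an external stage pointing at the bucket
        conn.cursor().execute(
            f"COPY INTO RAW_EVENTS FROM @MY_S3_STAGE/{key} "
            "FILE_FORMAT = (TYPE = 'JSON')"
        )
    finally:
        conn.close()
```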
Technical Skills:
Big Data: Apache Hadoop, HDFS, MapReduce, Hive, Sqoop, Apache Spark, Kafka, Cloudera Manager, PySpark
Languages: C, Python, SQL
Cloud: AWS, Azure, GCP
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL, NoSQL databases
IDEs and Tools: Eclipse, RStudio, IntelliJ IDEA, JIRA
Version Control: Git, GitHub
ETL and Orchestration Tools: Apache Spark, Apache Kafka, Apache Airflow
Professional Experience:
Amazon Oct 2021-Dec 2022
Data Engineer
Responsibilities:
Integrated AWS S3 with PySpark to process large datasets in a distributed environment efficiently.
Oversaw ETL workflows using PySpark on AWS EMR, leveraging AWS S3 for scalable data storage.
Configured Spark jobs on AWS EMR to efficiently read and write data from AWS S3.
Developed custom data transformations in PySpark, leveraging AWS S3 for storage.
Performed large-scale data processing with PySpark and Apache Hive on AWS EMR for efficient data management and analytics.
Designed a batch ETL pipeline using Python (PySpark), processing 500M+ records from S3 and loading them into Redshift (a simplified sketch follows this section).
Optimized performance by using partitioning & bucketing, reducing job execution time by 30%.
Configured EC2 instances within AWS EMR clusters to run PySpark jobs for scalable data processing.
Orchestrated Spark jobs on AWS EMR using AWS Step Functions to automate and streamline data processing workflows.
Maintained and upgraded complex study data pipelines, ensuring data integrity and optimizing storage efficiency.
Optimized data storage solutions, integrating PostgreSQL, Snowflake, and AWS S3 for structured/unstructured data.
Enhanced PySpark applications on AWS EMR by optimizing shuffling and partitioning strategies for improved performance and efficiency.
Utilized AWS EC2 spot instances to run cost-effective AWS EMR clusters for PySpark jobs.
Implemented Git-based CI/CD workflows, ensuring smooth code integration and deployment.
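Illustrative sketch of the S3-to-Redshift batch pipeline described above, assuming Parquet input and placeholder bucket, column, and table names; the final Redshift COPY would run as a separate step.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-batch-etl").getOrCreate()

# Raw order records previously landed in S3 as Parquet
orders = spark.read.parquet("s3://raw-bucket/orders/")

daily = (
    orders
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date", "customer_id")
    .agg(F.sum("amount").alias("daily_spend"))
)

# Partitioning the output by date keeps downstream scans (and the Redshift COPY,
# run as a separate step) limited to the slices they need
(daily
 .repartition("order_date")
 .write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("s3://curated-bucket/orders_daily/"))
```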
HSBC Feb 2020-Oct 2021
Data Engineer
Responsibilities:
Designed, developed, and managed robust ETL pipelines to efficiently extract, transform, and load data from various sources into centralized data warehouses.
Automated daily data ingestion from APIs to PostgreSQL using Apache Airflow (DAGs written in Python); a minimal DAG sketch follows this section.
Implemented error handling & logging with Python & CloudWatch, improving pipeline reliability to 99.9%.
Automated daily data ingestion from multiple sources (MySQL, APIs, CSVs) into Snowflake, reducing manual processing time by 90%.
Designed an incremental ETL pipeline using Python & dbt, improving data freshness and reducing query costs by 30%.
Built a batch ETL workflow in Python, extracting data from Kafka, transforming it using PySpark, and loading it into Google BigQuery.
Optimized ETL performance by implementing partitioning & bucketing in PySpark, reducing job execution time by 40%.
Collaborated with data architects to design data storage solutions.
Implemented data partitioning and shuffling strategies for optimization.
Developed data validation and quality checks in Spark pipelines.
Implemented automated data pipelines using Sqoop and other Hadoop ecosystem components for large-scale data processing.
Adept in using Sqoop to load data into Hive tables for further processing and analysis.
Developed and implemented Sqoop-based solutions to integrate external databases with Hadoop-based applications and workflows.
Utilized Sqoop to perform data validation and reconciliation checks, ensuring the accuracy and completeness of data during the transfer process.
Implemented Sqoop-based data synchronization solutions to keep Hadoop data in sync with external databases.
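Minimal Airflow DAG sketch of the daily API-to-PostgreSQL ingestion described above; the endpoint URL, connection ID, and staging table are placeholders.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

def ingest_api_data(**context):
    # Pull the day's records from the (placeholder) API endpoint
    rows = requests.get("https://example.com/api/transactions", timeout=30).json()
    hook = PostgresHook(postgres_conn_id="warehouse_pg")
    # Load into a staging table before downstream transformations
    hook.insert_rows(
        table="staging.transactions",
        rows=[(r["id"], r["amount"], r["created_at"]) for r in rows],
        target_fields=["id", "amount", "created_at"],
    )

with DAG(
    dag_id="daily_api_ingestion",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_api_data", python_callable=ingest_api_data)
```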
Cognizant (Client: Google) Sept 2018-Feb 2020
Data Engineer
Responsibilities:
Applied data quality and data cleansing techniques in ETL testing.
Managed Google Compute Engine instances, including image creation, network configuration, and instance scaling.
Used Google Cloud Functions triggers for seamless integration with event-driven workflows.
Implemented data replication and backup strategies using Google Cloud Storage to ensure data durability and disaster recovery.
Designed and executed data processing pipelines on Google Dataproc, incorporating tools like Spark and Hadoop for data transformation and analysis.
Utilized Google Compute Engine for high-performance computing tasks, such as machine learning model training and batch processing.
Developed serverless applications on Google Cloud Functions, leveraging event-driven architecture for real-time data processing and automation (an illustrative sketch follows this section).
Used Google Cloud Storage versioning and object archiving features to ensure data retention and compliance with data governance policies.
Optimized Google Dataproc jobs through cluster tuning, task scheduling, and efficient resource utilization.
Managed auto-scaling configurations on Google Compute Engine instances to adapt to fluctuating workloads and reduce operational costs.
Designed and deployed serverless APIs using Google Cloud Functions, enabling seamless integration with other cloud services and applications.
Implemented data encryption and access controls in Google Cloud Storage to ensure data security and privacy compliance.
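Illustrative sketch of an event-driven Cloud Function of the kind described above, assuming a Cloud Storage object-finalize trigger; the BigQuery destination table is a placeholder and only one possible downstream action.

```python
from google.cloud import bigquery

def on_object_finalized(event, context):
    """Triggered by a google.storage.object.finalize event on a Cloud Storage bucket."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )
    # Load the newly arrived file into a raw BigQuery table (placeholder name)
    client.load_table_from_uri(
        uri, "analytics.raw_events", job_config=job_config
    ).result()
```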
Education:
Master’s in Computing and Information Systems (Big Data Engineering), Youngstown State University, Ohio. Jan 2023-Dec 2024
Bachelor’s in Computer Science and Technology, Malla Reddy University, India. 2014-2018