
Data Engineer Big

Location:
Chagrin Falls, OH
Posted:
July 25, 2025


Bhavana T

Sr. AWS Data Engineer

Mobile: +1-216-***-****

Email: *********.*@*****.***

PROFESSIONAL SUMMARY

I'm a Data Engineer with over 8 years of experience building and enhancing data systems that support smarter business decisions. My expertise lies in creating scalable data pipelines, managing large datasets, and leveraging big data technologies. I'm proficient in SQL, Python, and PySpark, and I have strong hands-on experience with cloud platforms such as AWS. I enjoy tackling complex data problems, optimizing system performance, and ensuring smooth, reliable data flow across platforms. I also collaborate closely with cross-functional teams to translate business requirements into practical, impactful data solutions.

Core Skills:

• Over 8 years of hands-on experience in Data Engineering, Big Data processing, and cloud-based data solutions, with a strong focus on AWS, SQL, Python, and PySpark.

• Proficient in the design, development, and optimization of scalable ETL pipelines with Redshift, AWS Glue, EMR, Lambda, and Step Functions.

• Excellent SQL development skills, including complex queries, stored procedures, indexing, and query optimization for high-speed data processing.

• Proven data engineer with over eight years of experience in big data solutions, ETL workflows, and data pipeline design, development, and optimization using AWS, Python, PySpark, and SQL.

• AWS cloud expert, adept at using AWS Glue, Lambda, Redshift, S3, EMR, Athena, and Step Functions to implement scalable, serverless data pipelines.

• Integrated OpenAI's GPT API into internal logging tools to auto-generate data pipeline summaries and incident descriptions, improving issue triage speed by 40%.

• Expert in complex query design, indexing, performance tuning, and data modeling for efficient data processing in AWS Redshift, PostgreSQL, and Snowflake.

• Extensive experience with PySpark for large-scale data processing, real-time streaming using Kafka, and implementing complex data transformations with performance in mind.

• Strong background in Python scripting for automation, API integration, and supporting machine learning workflows.

• Experienced in managing high-volume data ingestion and transformation tasks, optimizing cloud resources for performance and cost efficiency.

• Skilled in developing serverless data pipelines on AWS using Lambda, API Gateway, Step Functions, and the Glue Data Catalog, minimizing operational overhead.

• Skilled in integrating SQL, Python, and PySpark to build scalable ETL frameworks and data transformation pipelines.

• Well-versed in monitoring and troubleshooting data pipelines using AWS CloudWatch, SNS, and CloudTrail, ensuring reliability and performance.

• Hands-on with orchestration tools like Apache Airflow, AWS Step Functions, and AWS Batch for scheduling and maintaining robust data workflows.

• Solid understanding of data governance, security best practices, IAM roles, and access control to ensure data integrity and regulatory compliance.

• Developed Python-based solutions using OpenAI's API to generate dynamic SQL queries and ETL scripts, improving development efficiency across multiple projects.

• Strong analytical and problem-solving skills, collaborating with cross-functional teams to drive business intelligence, reporting, and data-driven decision-making.

TECHNICAL SKILLS

Big Data Technologies: AWS (Glue, EMR, S3, Lambda, Step Functions, Amazon MSK, CloudWatch, Athena, CloudTrail, CodePipeline, CodeBuild), PySpark

Big Data & Processing: Apache Spark, PySpark, Hadoop, Hive, Kafka, Airflow, Delta Lake, BigQuery

AWS Services: S3, Glue, EMR, Lambda, Redshift, Athena, DynamoDB, IAM, CloudWatch, Step Functions, Kinesis, SNS, SQS

Databases: PostgreSQL, MySQL, Redshift, Snowflake, SQL Server

Relational Databases: MS SQL Server 2016/2014/2012, MS Access, Oracle 11g/12c, Snowflake, Cassandra

Languages: SQL, PL/SQL, Python, PySpark, HiveQL, Spark SQL, SnowSQL, Scala

GenAI Tools: OpenAI, LangChain, Azure OpenAI, AWS Bedrock

CI/CD Tools: Docker, Git, Jenkins, Terraform

Version Control & Collaboration: Git, GitHub, Microsoft Teams, JIRA

IDEs: Eclipse, IntelliJ, Visual Studio, SQL Server Management Studio (SSMS)

Programming Languages: Python, SQL, Bash, Shell Scripting

WORK EXPERIENCE

Client: Harman International – Detroit, MI (Remote) Aug 2021 – Present
Role: Sr. AWS Data Engineer

Responsibilities:

• Developed and optimized ETL pipelines using AWS Glue to process large datasets, integrating them with AWS S3 for scalable data storage.

• Leveraged Python and PySpark for data transformation tasks on AWS Glue, improving data processing speeds by 30%.

• Configured and managed AWS EMR clusters for high-performance distributed data processing, reducing operational costs by optimizing cluster usage.

• Architected ETL solutions on AWS Glue for processing and transforming large-scale datasets, achieving a 20% increase in processing efficiency.

• Developed complex PySpark transformations in Glue, enabling smooth data integration with downstream analytics and reporting systems.

• Managed AWS EMR clusters for batch and real-time processing of data using PySpark, achieving highly optimized and scalable workflows.

• Implemented data lakes on AWS S3, integrating data ingestion from multiple sources, and providing cost-effective, scalable storage.

• Employed Apache Kafka for real-time data ingestion into AWS Glue, enabling efficient streaming data flow across systems.

• Integrated Generative AI models (GPT, Claude) into data pipelines to automate data quality checks, anomaly detection, and metadata generation for large-scale data lake projects.

• Used Databricks Autoloader to support continuous file ingestion from cloud storage with incremental checkpointing and schema inference.

• Automated Glue job scheduling and orchestration using AWS Lambda and CloudWatch Events for efficient data pipeline management (see the sketch at the end of this section).

• Built scalable data pipelines that ingested real-time streaming data from Kafka into AWS S3, ensuring low-latency data availability.

• Conducted detailed data analysis on AWS Glue using Python scripts to identify bottlenecks and enhance performance.

• Built a GenAI-driven assistant using LangChain and Snowflake to help business analysts generate ad-hoc queries from natural language prompts.

• Applied prompt engineering and embeddings-based search (FAISS) to build a prototype Slack bot that answers questions about pipeline architecture and ingestion flows.

• Integrated AWS Glue with Athena to provide fast querying capabilities on large datasets stored in S3 without the need for complex infrastructure.

• Set up CI/CD processes for Glue jobs using AWS CodePipeline and CodeBuild, streamlining deployment and version control.

• Implemented AWS Glue workflows to orchestrate multiple PySpark jobs, ensuring data processing tasks are completed in sequence.

• Deployed custom Python libraries in Glue jobs for specialized data cleaning and validation operations, improving data accuracy.

• Automated backup and data retention strategies in S3 using AWS Lambda and lifecycle policies, ensuring long-term storage of critical datasets.

Environment: AWS (Glue, S3, EMR, Lambda, Athena), PySpark, Python, Kafka, Databricks, LangChain, GPT, FAISS, Snowflake, CI/CD with CodePipeline.
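A minimal sketch of the Lambda-driven Glue scheduling pattern referenced above, assuming a hypothetical job name (daily-sales-etl) and a CloudWatch Events/EventBridge schedule rule as the trigger; only the boto3 Glue call itself is standard, everything else is illustrative.

import json
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

glue = boto3.client("glue")

# Hypothetical Glue job name; in practice this would come from configuration.
GLUE_JOB_NAME = "daily-sales-etl"


def lambda_handler(event, context):
    """Invoked by a CloudWatch Events / EventBridge schedule rule.

    Starts the Glue job and passes the schedule time through as a job
    argument so the job can pick the correct partition to process.
    """
    run = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments={
            # Glue job arguments must be prefixed with "--".
            "--run_date": event.get("time", "unknown"),
        },
    )
    logger.info("Started Glue job run %s", run["JobRunId"])
    return {"statusCode": 200, "body": json.dumps({"jobRunId": run["JobRunId"]})}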

Client: Try Cry Fly Services – Hyderabad, India Dec 2018 – Jan 2021
Role: Data Engineer

Responsibilities:

• Built scalable distributed data processing applications using PySpark, leveraging RDDs, DataFrames, and Spark SQL for large-scale data transformations.

• Developed Python-based ETL scripts for data extraction, transformation, and loading from multiple sources (SQL, APIs, S3, NoSQL).

• Created API-based data integrations using Python, consuming REST and GraphQL APIs and storing data in Redshift and Snowflake.

• Automated data ingestion workflows using Python, Pandas, and PySpark, reducing manual effort by 60%.

• Designed and implemented data quality frameworks using Great Expectations and Pytest to validate data integrity before ingestion.

• Developed multi-threaded Python scripts for parallel data processing, improving performance on high-volume datasets.

• Used Azure Databricks with PySpark to process large datasets from streaming sources and write results to Synapse and Power BI workspaces.

• Improved Spark job execution by implementing partitioning, caching, bucketing, and efficient join strategies.

• Developed PySpark-based ETL pipelines, processing terabytes of structured and unstructured data for analytics and reporting.

• Stored critical passwords and database connection details securely in AWS Secrets Manager.

• Optimized PySpark jobs by tuning shuffle partitions, broadcast joins, and caching techniques, reducing execution time by 40%.

• Implemented Spark SQL and DataFrame transformations for efficient data processing, improving performance on AWS EMR and Databricks.

• Designed real-time streaming pipelines using PySpark Structured Streaming and Kafka, ensuring low-latency data ingestion (see the sketch at the end of this section).

• Configured Spark dynamic resource allocation for efficient cluster utilization, reducing costs while improving performance.

• Created data pipelines using Delta Lake for ACID-compliant processing on AWS S3 and Databricks, enabling incremental data loads and updates.

• Developed ML pipelines in PySpark, integrating Spark MLlib for predictive analytics and feature engineering.

• Enabled fault-tolerant ETL processing using checkpointing and WAL (Write-Ahead Logging) in Spark Streaming to ensure data recovery and reliability.

• Developed PySpark UDFs (User Defined Functions) for advanced data transformations, optimizing execution with vectorized Pandas UDFs.

• Implemented parallel processing and job orchestration using Apache Airflow with PySpark for batch data workflows.

Environment: PySpark, Python, AWS (S3, Redshift, EMR), Azure Databricks, Kafka, Delta Lake, Airflow, Great Expectations, Synapse.
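A minimal sketch of the Kafka-to-S3 Structured Streaming pattern referenced above; the broker address, topic, event schema, and S3 paths are placeholders, the spark-sql-kafka connector is assumed to be available, and only the PySpark readStream/writeStream APIs themselves are taken as given.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-s3-stream").getOrCreate()

# Illustrative event schema; the real schema would match the Kafka payload.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read the raw stream from Kafka (broker and topic are placeholders).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "orders")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes; parse the JSON value into typed columns.
parsed = raw.select(
    from_json(col("value").cast("string"), event_schema).alias("e")
).select("e.*")

# Write to S3 as Parquet; the checkpoint location enables recovery on restart.
query = (
    parsed.writeStream.format("parquet")
    .option("path", "s3://example-bucket/streams/orders/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/orders/")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()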

Client: Milestone Technologies – Hyderabad, India May 2017 – Nov 2018
Role: Big Data Developer

Responsibilities:

• Engineered Python-based ETL pipelines to automate data ingestion and transformation tasks, reducing manual processes by 60%.

• Improved processing speed by 40% through the use of multi-threaded Python applications for handling high-volume data workloads.

• Integrated Python with AWS services (S3, Lambda, Glue, DynamoDB, Step Functions) to streamline cloud-based data workflows.

• Built and deployed RESTful APIs using Flask/FastAPI, enabling real-time data access for analytics and reporting.

• Created automation scripts in Python for validating data quality, running integrity checks, and monitoring pipelines, enhancing data accuracy to 99%.

• Increased ETL robustness by adding structured error handling, logging, and retry logic in Python scripts, reducing job failures by 35%.

• Created data transformation pipelines for ML models, leveraging Pandas, NumPy, and Scikit-Learn, improving model accuracy.

• Tuned complex SQL queries using advanced techniques like window functions, indexing, and partitioning to cut execution time in half.

• Developed and tuned stored procedures, triggers, and materialized views, improving data retrieval efficiency.

• Designed and built frameworks for migrating data from Teradata to the Hadoop file system using MapReduce, Hive, Spark, and Oozie.

• Applied Change Data Capture (CDC) strategies to enable near real-time, incremental data updates and processing.

• Built OLAP data models using star and snowflake schemas to support scalable BI and reporting systems.

• Ensured consistent and accurate data across systems by automating reconciliation and validation tasks using Python and SQL.

• Migrated on-premises databases to AWS RDS and Redshift, reducing infrastructure costs by 40%.

• Optimized ETL workflows by integrating SQL with PySpark, enabling seamless transformation of petabyte-scale data.

• Developed PySpark-based ETL pipelines, processing terabytes of structured and unstructured data in real-time.

• Optimized Spark jobs using partitioning, caching, and broadcast joins, improving performance by 40%.

• Implemented PySpark Structured Streaming with Kafka, enabling low-latency event-driven data processing.

• Designed Delta Lake solutions on AWS S3 to support incremental loads with full ACID compliance.

• Automated PySpark job scheduling and orchestration using Apache Airflow, improving data pipeline efficiency.

• Configured and managed AWS EMR and Databricks clusters, optimizing resource allocation and reducing costs.

• Developed PySpark UDFs to handle complex transformations, improving data processing speed by 30%.

• Architected and built AWS-based ETL pipelines, integrating AWS Glue, Lambda, Redshift, and Step Functions.

• Optimized AWS Redshift performance by leveraging distribution keys, sort keys, and workload management (WLM).

• Implemented AWS Kinesis and Kafka-based real-time data streaming, reducing data processing latency by 90%.

• Automated AWS infrastructure provisioning using Terraform and CloudFormation, reducing deployment time.

• Configured IAM roles, security policies, and VPC setups, ensuring data security and compliance.

• Reduced AWS operational costs by 50% by implementing Auto Scaling, Spot Instances, and EMR optimization.

• Developed serverless data pipelines using AWS Lambda, API Gateway, and S3, eliminating infrastructure overhead.

• Connected AWS Athena to the Glue Data Catalog to support cost-effective querying of S3 data stored in Parquet and ORC formats (see the sketch at the end of this section).

Environment: AWS (S3, Lambda, Redshift, Glue, Kinesis), PySpark, Python, Kafka, Teradata, HIVE, Oozie, Delta Lake, Terraform, Airflow.
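A minimal sketch of the Athena-over-Glue-Catalog query pattern referenced above; the database, table, query, and results bucket are placeholders, while the boto3 Athena calls (start_query_execution, get_query_execution, get_query_results) are standard.

import time

import boto3

athena = boto3.client("athena")

# Placeholder Glue Catalog database, query, and S3 results location.
DATABASE = "analytics_db"
QUERY = "SELECT event_type, COUNT(*) AS cnt FROM events_parquet GROUP BY event_type"
RESULTS_S3 = "s3://example-bucket/athena-results/"


def run_athena_query(sql: str, database: str, output_location: str) -> list:
    """Submit a query, poll until it finishes, and return the result rows."""
    submission = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = submission["QueryExecutionId"]

    # Poll the execution status until Athena reports a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} finished in state {state}")

    results = athena.get_query_results(QueryExecutionId=query_id)
    return results["ResultSet"]["Rows"]


if __name__ == "__main__":
    # The first returned row is the column header row.
    for row in run_athena_query(QUERY, DATABASE, RESULTS_S3):
        print([field.get("VarCharValue") for field in row["Data"]])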

EDUCATION

B.Tech in Electronics and Communication Engineering, Bhoj Reddy Engineering College for Women.
Master’s in Computer Science, Cleveland State University.


