DATA ENGINEER
PROFESSIONAL SUMMARY
Data Engineer with 5 years of experience in designing, developing, and optimizing AWS and Big Data solutions. Expertise in building scalable data pipelines, ETL processes, and cloud-based architectures using cutting-edge technologies like AWS Glue, Redshift, EMR, S3, Apache Spark, Hadoop, and PySpark. Strong background in data integration, real-time analytics, and data warehousing, with a focus on performance optimization and automation.
CORE QUALIFICATIONS
Expertise in AWS services: S3, Glue, Redshift, Lambda, EMR, Athena, Step Functions, Kinesis, and CloudFormation.
Strong experience in Big Data technologies: Hadoop, Hive, HDFS, Spark, MapReduce, and Kafka for large-scale data processing.
Proficient in Python, Scala & PySpark for ETL development, data transformations, and real-time streaming data processing.
Skilled in SQL and NoSQL databases: Redshift, PostgreSQL, DynamoDB, MySQL, and Cassandra.
Designed and implemented scalable data pipelines using AWS Glue, Apache Airflow, and Step Functions for data orchestration.
Built real-time data streaming applications leveraging AWS Kinesis and Kafka.
Expertise in CI/CD pipelines and Infrastructure-as-Code (IaC) using AWS CloudFormation and Terraform.
Strong experience in data ingestion, storage, and processing using Hadoop ecosystem tools such as Sqoop, Hive, Pig, Spark, MapReduce, Spark Streaming, Flume, Kafka, HBase, Oozie, Zookeeper, and HDFS.
Excellent experience importing and exporting data using Sqoop from RDBMS to HDFS and vice versa.
Well-versed in implementing partitioning, dynamic partitioning, and bucketing in Hive to compute data metrics (a minimal PySpark sketch follows this list).
Processed data in HDFS by developing MapReduce, Hive, and Pig solutions and delivering summarized results from Hadoop to downstream systems.
Experience in data modeling, predictive analytics, and developing best practices for data integration, analytics, and operational solutions.
Utilized Jira to track and manage data engineering tasks, prioritize backlog, and collaborate with team members to ensure timely delivery of data pipeline projects.
Strong experience in global delivery models (Onsite-Offshore model) involving multiple vendors and cross-functional engineering teams.
Excellent oral and written communication skills and a great team player.
Hands-on experience with data lake architectures, data warehousing, and performance tuning in cloud environments.
Strong analytical and problem-solving skills, with a proven ability to work in agile environments and collaborate with cross-functional teams.
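To illustrate the Hive partitioning and bucketing pattern noted above, here is a minimal PySpark sketch; the database, table, and column names (raw_db.events, event_date, customer_id) are hypothetical placeholders, not details from any project listed here.

# Minimal PySpark sketch: write a Hive-managed ORC table that is partitioned
# by date and bucketed by customer id. All names below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioning-sketch")
    .enableHiveSupport()          # connect to the Hive metastore
    .getOrCreate()
)

# Dynamic-partition settings apply when inserting into an existing Hive table
# with INSERT ... PARTITION; shown here for completeness.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

events = spark.table("raw_db.events")    # hypothetical source table

(
    events.write
    .mode("overwrite")
    .format("orc")
    .partitionBy("event_date")           # one partition directory per day
    .bucketBy(32, "customer_id")         # bucketing speeds up joins and sampling
    .sortBy("customer_id")
    .saveAsTable("analytics_db.events_partitioned")
)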
TECHNICAL SKILLS
Big Data & Cloud Platforms: AWS (Glue, Redshift, S3, EMR, Lambda, Kinesis), Hadoop (HDFS, Hive, Spark, MapReduce, Kafka)
Programming & Scripting: Python, PySpark, SQL, Shell Scripting, Scala
ETL & Data Processing: Apache Spark, AWS Glue, Hive, Sqoop, Airflow
Databases & Warehouses: Amazon Redshift, Snowflake, PostgreSQL, DynamoDB, MySQL, Cassandra
Version Control & DevOps: Git, GitHub, AWS CloudFormation, Terraform, CI/CD pipelines
Visualization & Analytics: AWS QuickSight, Tableau
Workflow Orchestration: Apache Airflow, AWS Step Functions
PROFESSIONAL EXPERIENCE
Client: Duke Energy, Charlotte, NC Aug 2022 - Feb 2025
Role: Data Engineer
Responsibilities:
Designed and developed scalable Big Data solutions using AWS (EMR, Glue, Redshift) and Hadoop (HDFS, YARN, MapReduce) for distributed processing at petabyte scale.
Designed real-time data streaming pipelines with Amazon Kinesis, Kafka, and Firehose, reducing data latency by 30% and enabling real-time analytics.
Built and optimized ETL pipelines using Spark (PySpark, Scala), AWS Glue, and Snowflake for data transformation, achieving a 40% reduction in processing time.
Deployed and secured AWS infrastructure (EC2, S3, Lambda, RDS, Redshift), implementing automation, cost optimization, and disaster recovery strategies.
Used AWS Lambda for event-driven processing, automating file uploads, metadata extraction, and transformations, reducing infrastructure overhead.
Automated ETL workflows with AWS Glue, Step Functions, and Python, loading processed data into Redshift and Snowflake for analytics (a minimal Glue job sketch follows this role).
Configured S3 for secure storage with lifecycle policies, encryption, and CloudFront CDN for optimized content delivery.
Developed CI/CD pipelines with AWS CodePipeline, Terraform, Drone, and Vela, streamlining deployments and infrastructure provisioning.
Built Spark jobs for Data Quality (DQ) and Data Integrity (DI), optimizing performance with caching, partitioning, and in-memory processing.
Optimized Hive queries with partitioning, bucketing, and ORC file formats, improving performance in large-scale data processing.
Integrated Redshift and Snowflake for data warehousing, optimizing queries, workload management, and materialized views for faster analytics.
Used Hyperion, Hue, Xenon, Knox, Shepherd, and Data Miner for real-time monitoring, job orchestration, and performance tuning.
Automated job scheduling with Apache Oozie and AWS Step Functions, reducing manual effort and ensuring seamless pipeline execution.
Used JIRA for managing data engineering tasks, sprints, and cross-team collaboration, optimizing workflows and ensuring timely delivery.
Environment: AWS, EMR, Glue, Redshift, S3, Lambda, EC2, Spark, Python, Scala, SQL, ETL, HDFS, YARN, MapReduce, Kinesis, Kafka, Snowflake, Jira.
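For reference, a minimal sketch of the kind of Glue PySpark job described in this role: it reads a Glue Data Catalog table, casts one ambiguous field, and writes the result to Redshift through a Glue connection. The database, table, connection, and staging-bucket names are illustrative placeholders, not the actual pipeline.

# Hedged sketch of a Glue-style PySpark job; all resource names are placeholders.
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Standard Glue job bootstrap
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw records registered in the Glue Data Catalog (placeholder database/table)
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="meter_readings"
)

# Light cleanup: drop a junk field and cast an ambiguous column to double
cleaned = raw.drop_fields(["_corrupt_record"]).resolveChoice(
    specs=[("reading", "cast:double")]
)

# Load into Redshift via a pre-defined Glue connection (placeholder names)
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=cleaned,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "analytics.meter_readings", "database": "dw"},
    redshift_tmp_dir="s3://example-staging-bucket/redshift-tmp/",
)

job.commit()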
Client: Genpact, Hyderabad Oct 2017 - Jan 2020
Role: Hadoop Developer
Responsibilities:
Utilized Sqoop to import data from MySQL and Oracle to Hadoop Distributed File System (HDFS) on a regular basis, ensuring seamless data integration.
Aggregated large volumes of data using Apache Spark and Scala, storing processed data in Hive Warehouse for further analysis.
Worked extensively with Data Lakes and big data ecosystems, including Hadoop, Spark, Hortonworks, and Cloudera, leveraging their capabilities for efficient data processing.
Developed Hive queries to analyze data and meet specific business requirements, using Hive Query Language (HiveQL) as a SQL-like layer that executes as MapReduce jobs under the hood.
Built HBase tables by leveraging HBase integration with Hive, facilitating efficient storage and retrieval of data.
Applied Kafka and Spark Streaming to process streaming data in specific use cases, enabling real-time data analysis and insights generation.
Used Spark-Scala (RDDs, DataFrames, Spark SQL) and Spark-Cassandra-Connector APIs for various tasks such as data migration and business report generation.
Extensively worked on creating combiners, partitioning, and distributed cache to improve the performance of MapReduce jobs.
Designed and implemented a data pipeline using Kafka, Spark, and Hive, ensuring seamless data ingestion, transformation, and analysis.
Implemented Continuous Integration and Continuous Deployment (CI/CD) pipelines to build and deploy projects in the Hadoop environment, ensuring streamlined development and deployment processes.
Worked with Spark using Python (PySpark) and Spark SQL for faster data testing and processing, enabling efficient data analysis and insights generation.
Created and maintained HiveQL scripts and jobs using tools such as Apache Oozie and Apache Airflow.
Built and improved automation in AutoSys to speed up testing and reduce manual work, using tools like Selenium and Python.
Employed Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing, facilitating real-time data processing and analytics.
Utilized Zookeeper for coordination, synchronization, and serialization of servers within clusters, ensuring efficient and reliable distributed data processing.
Integrated serialization techniques with distributed caching systems, improving data retrieval speeds by implementing efficient serialization and deserialization of cached data.
Worked on Oozie workflow engine for job scheduling, enabling seamless execution and management of data processing workflows.
Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.
Created Hive Queries and Pig scripts to study customer behavior by analyzing data.
Developed Spark applications in PySpark in distributed environments to load large numbers of CSV files with differing schemas into Hive ORC tables (a minimal sketch follows this role).
Leveraged Git as a version control tool to maintain code repositories, ensuring efficient collaboration, version tracking, and code management.
Improved system resilience by implementing partition-level backup and recovery strategies, enabling efficient restoration of specific partitions in case of data corruption or failure.
Environment: Sqoop, MySQL, HDFS, Apache Spark, Scala, Hive, Hadoop, Cloudera, HBase, Kafka, MapReduce, Zookeeper, Oozie, Data Pipelines, RDBMS, Python, PySpark, JIRA, Autosys, YARN, Pig, Git.
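A minimal sketch of the CSV-to-Hive-ORC loading pattern referenced in this role: each feed is aligned to one target schema and appended into an ORC-backed Hive table. The input paths, target column list, and table name are hypothetical placeholders.

# Minimal PySpark sketch: align CSV feeds with differing columns to one target
# schema and append them into an ORC-backed Hive table. Names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("csv-to-hive-orc-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

TARGET_COLUMNS = ["customer_id", "txn_date", "amount", "channel"]   # hypothetical

def align_to_target(df):
    # Add any missing target columns as nulls and drop extras, so every feed
    # conforms to the same schema before the write.
    for col in TARGET_COLUMNS:
        if col not in df.columns:
            df = df.withColumn(col, F.lit(None).cast("string"))
    return df.select(*TARGET_COLUMNS)

input_paths = [                       # hypothetical feed locations
    "hdfs:///data/incoming/feed_a/",
    "hdfs:///data/incoming/feed_b/",
]

for path in input_paths:
    df = spark.read.option("header", "true").csv(path)
    (
        align_to_target(df)
        .write
        .mode("append")
        .format("orc")
        .saveAsTable("analytics_db.transactions_orc")   # Hive ORC table
    )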
EDUCATION
Master’s in Computer Science, Jawaharlal Nehru Technological University, Hyderabad, 2017