Swapnith Gaddameedhi
+1-315-***-**** ***********@*****.***
https://www.linkedin.com/in/swapnithg99/
Professional Summary:
5 years of experience in IT consulting as a Data Engineer, developing scalable and reliable enterprise applications within Agile development environments.
Designed and implemented solutions for challenges in Big Data Analytics, Data Warehousing, and Data Integration to support business intelligence and operational reporting.
Worked in data-centric roles across the Healthcare, Financial, Food & Agriculture, and Manufacturing sectors, delivering robust data engineering solutions.
Performed core data engineering functions, including data extraction, loading, transformation (ETL), and integration, supporting enterprise data infrastructures such as data warehouses, operational data stores, and master data management systems.
Developed and optimized big data pipelines using the Hadoop (2.0) framework within Cloudera Hadoop Ecosystem projects, leveraging Apache Spark/PySpark, MapReduce, Sqoop, Pig, HDFS, HBase, Hive, Oozie, and Flume for large-scale data processing.
Designed and implemented data ingestion models using AWS Step Functions, AWS Glue, and Python modules to streamline ETL processes.
Built and managed real-time data pipelines using Apache Kafka, integrating Spark Streaming with Kafka for high-throughput data processing.
Performed performance tuning and optimizations in Oracle databases, utilizing SQL tuning, execution plans, and complex SQL queries to validate data transitions across different layers of the data warehouse.
Worked extensively with the Azure Data Engineering stack, including Azure Data Factory (ADF), Databricks, SQL Database, Function Apps, Event Hubs, Stream Analytics, Synapse Analytics, Blob Storage, Data Lake Gen2, and Delta Lake.
Designed and optimized data pipelines using DBT for modular, reusable, and scalable data transformation workflows in Snowflake.
Developed end-to-end Snowflake data solutions, leveraging SnowSQL, Snowflake Streams, and Tasks for efficient data warehousing and analytics.
Orchestrated complex ETL workflows with Apache Airflow, ensuring reliable scheduling, monitoring, and execution of data pipelines.
Proficient in writing optimized SQL queries, stored procedures, and performance tuning for large-scale enterprise data warehouses.
Implemented ETL pipelines using Azure Data Factory (ADF), executing data ingestion, transformation, and cleaning within Operational Data Stores and Data Warehouses.
Configured and managed Azure Databricks clusters, performed data transformations, and orchestrated data pipeline execution with Azure services.
Handled batch and real-time data processing using Azure Event Hubs, Azure Stream Analytics, and Azure Databricks for scalable analytics workflows.
Worked with version control systems such as Git and SVN, utilizing SourceTree, GitKraken, Git Bash, and Mac Terminal for source code management and collaboration.
Provided technical support by troubleshooting, maintaining, and resolving complex escalated issues, ensuring application stability and performance.
Followed industry-standard methodologies, including the System Development Life Cycle (SDLC), Agile (Scrum/Kanban), and Waterfall, to deliver high-quality software solutions.
Demonstrated strong communication, analytical, and problem-solving skills, excelling both as an individual contributor and a collaborative team member.
Technical Skills
Big Data & Streaming
Apache Spark (PySpark, Scala), Apache Kafka, Apache Airflow, Delta Lake, Hadoop (HDFS, Yarn, MapReduce, Hive, HBase, Flume, Sqoop), Snowflake, Databricks
Cloud Platforms
Microsoft Azure (ADF, Databricks, SQL Server, Synapse, Event Hubs, ADLS), AWS (S3, Redshift, Glue, Lambda, EMR, QuickSight), GCP (BigQuery, Cloud Storage, Pub/Sub)
Databases & Storage
SQL Server, PostgreSQL, MySQL, Oracle, Snowflake, CosmosDB, MongoDB, Aerospike
ETL & Data Processing
Azure Data Factory, Informatica IICS, DBT, SnowSQL, Python (Pandas, NumPy, PySpark), Scala
DevOps & CI/CD
Docker, Kubernetes, Terraform, Jenkins, GitHub Actions, Apache NiFi
Data Governance & Monitoring
Data Lineage, Data Quality, Splunk, Datadog, CloudWatch
Version Control & Collaboration
Git, Bitbucket, SVN, Jira, Confluence
Operating Systems
Linux, Unix, macOS, Windows
PROFESSIONAL EXPERIENCE
Client: Molina Healthcare - Remote
Data Engineer November 2023 - Present
Responsibilities:
Developed Python applications to transfer data from Snowflake to Aerospike NoSQL database using SnowSQL and Aerospike Loader utilities, optimizing data migration workflows.
Optimized and tuned performance for batch data loading from Snowflake to NoSQL databases, ensuring efficient data ingestion and retrieval.
Implemented batch data processing pipelines, using SnowSQL to load/unload bulk data between Snowflake and upstream data sources.
Developed ETL pipelines using Informatica IICS, optimizing data integration and transformation workflows for healthcare data processing.
Designed and implemented DBT models for standardized transformations in Snowflake, enabling efficient ELT workflows.
Developed real-time and batch data pipelines using Apache Airflow, scheduling Snowflake ELT jobs with dependency management.
Integrated observability and monitoring into data pipelines, leveraging Datadog, Splunk, and CloudWatch for proactive alerting.
Collaborated with cross-functional teams to define SLA requirements and enforce data quality standards in Operational Business Intelligence Mart.
Developed Kafka Producer applications to ingest data from Snowflake and publish it to Kafka topics, improving real-time data streaming capabilities.
Performed ETL performance tuning, optimizing extraction, transformation, and load processes to ensure high efficiency and reliability.
Optimized Power BI data refresh schedules and incremental data loading, reducing report processing times.
Integrated row-level security (RLS) in Power BI, ensuring compliance with healthcare data privacy regulations (HIPAA).
Designed and implemented Kafka Producer and Consumer applications to publish to and consume from on-premises and cloud-based Kafka topics, ensuring seamless data integration and message reliability.
Developed Spring Boot-based Java controller web applications from the initial phases, integrating backend services with data pipelines and APIs.
Developed interactive Power BI reports and dashboards, providing real-time insights into healthcare analytics.
Maintained and optimized multiple data pipelines across various teams, ensuring high availability, scalability, and reliability of data processing workflows.
Performed performance tuning and optimization of Kafka Producer/Consumer applications, reducing latency and improving throughput for real-time streaming workloads.
Implemented CI/CD pipelines with Terraform and uDeploy, ensuring automated deployment and versioning of ETL workflows.
Designed data masking solutions to enforce compliance with HIPAA and SOX regulations, securing sensitive healthcare data.
Automated bulk data extraction processes, using SnowSQL and Python scripts to unload data into JSON and CSV formats, facilitating downstream analytics and reporting.
Deployed applications to Amazon EKS (Elastic Kubernetes Service) using Jenkins and Docker, improving CI/CD automation and container orchestration.
Enhanced and maintained a scalable and reusable Big Data framework to support data-driven analytical products.
Utilized Git version control for code development, branch management, and collaborative workflows throughout the project lifecycle.
Engaged in requirement gathering sessions with Product Owners (POs) and Business Analysts (BAs) to define technical specifications and create detailed technical documentation.
Showcased sprint deliverables and project work to stakeholders through live demonstrations, final documentation, and code reviews, ensuring project transparency and alignment with business goals.
Technology Stack:
Linux, Kafka, Python, Pandas, Power BI, Java, DBT, Snowflake, Aerospike, SnowSQL, Spring Boot, AWS, Datadog, Insomnia, EKS, Git, Jenkins, Maven
Client: JPMC, Texas
Data Engineer March 2022 - October 2023
Responsibilities:
Designed and developed AWS Data Pipelines to facilitate data movement across AWS compute and storage services, ensuring efficient data ingestion and processing.
Built and maintained ETL pipelines to extract data from MySQL and Oracle sources, storing it in AWS S3, and loading it into AWS Redshift, achieving high success metrics.
Developed and optimized data warehouse applications using Apache Spark, AWS Athena, Python, and Apache Airflow, ensuring seamless management of large datasets in AWS S3.
Engaged in requirement gathering and analysis, conducting workshops and meetings with business users to document functional and technical specifications.
Implemented backend business logic using Python to process and manipulate large datasets for performance optimization and accurate reporting.
Built cloud-native data pipelines using DBT, Snowflake, and AWS Athena, ensuring optimized query performance for data analytics.
Implemented Airflow DAGs to schedule and monitor Snowflake ELT workflows, reducing operational overhead.
Built and maintained DBT transformation models, enforcing data quality and lineage for analytics in AWS Redshift and Snowflake.
Developed SQL-based ETL solutions to extract, transform, and load data from MySQL and Oracle into Snowflake.
Implemented workflow orchestration using Apache Airflow, ensuring scheduled and event-driven execution of ETL jobs.
Enhanced SQL query performance in Snowflake and PostgreSQL, optimizing stored procedures and indexing strategies.
Designed and developed CI/CD pipelines for Snowflake transformations, ensuring automated deployments and version control with GitHub Actions and Jenkins.
Worked extensively with SQL & PL/SQL for data modeling, transformation, and stored procedure optimization.
Designed and optimized database structures, creating tables, functions, stored procedures, and PL/SQL queries for efficient data retrieval and transformation.
Utilized AWS EMR for processing large-scale data transformations, integrating data across AWS S3, DynamoDB, and other AWS data stores.
Developed data transformation scripts using Python and SQL, integrating structured and semi-structured data into AWS Data Lake.
Designed real-time data streaming pipelines using Apache Kafka and Spark Streaming API to handle large-scale, real-time event processing.
Implemented real-time data ingestion and processing workflows with Kafka, ensuring low-latency data availability for downstream analytics.
Debugged, troubleshot, and deployed Python bug fixes, ensuring the stability and reliability of critical applications used by both customers and internal teams.
Analyzed and transformed data in AWS Data Lake, generating business intelligence reports using AWS QuickSight, incorporating filters, parameters, and calculated fields.
Monitored and analyzed data pipelines using Splunk and Datadog, ensuring real-time data visualization, performance tracking, and system reliability.
Technology Stack:
AWS (S3, Redshift, EMR, Data Pipeline, Athena, QuickSight), DBT, Python, Apache Spark, Apache Kafka, Airflow, SQL, PostgreSQL, DBeaver, Presto SQL, Git, Splunk, Datadog.
Client: Void Main Technologies, Hyderabad, India
AWS Data Engineer June 2020 - July 2021
Responsibilities:
Collaborated with clients to gather business requirements, conduct analysis, and design data warehouse solutions, ensuring optimal data architecture.
Created and optimized ER diagrams and logical data models for Oracle and Teradata using ER Studio, aligning with business needs.
Developed ETL workflows using Apache Sqoop to extract and load structured data from MySQL and DB2 into the Data Lake Raw Zone, ensuring seamless data ingestion.
Designed and implemented Big Data processing solutions leveraging Hadoop, MapReduce, HBase, and Hive to enable efficient data-at-rest processing.
Built AWS Lambda functions to perform data validation, filtering, sorting, and transformation, synchronizing DynamoDB tables and loading transformed data into AWS storage.
Executed predictive and what-if analysis using Python on HDFS, successfully migrating data from Teradata to HDFS and integrating it with Hive for analytical processing.
Developed and maintained data acquisition and integration pipelines using Informatica IICS, improving data flow efficiency across multiple cloud environments.
Migrated legacy ETL jobs from on-prem DataStage to Informatica IICS, ensuring seamless transition and optimization for cloud platforms.
Built Airflow DAGs to automate and monitor end-to-end ELT processes across AWS data lake and Snowflake.
Developed advanced SQL scripts and Snowflake procedures to handle incremental data loads and deduplication.
Optimized AWS-based ETL pipelines by leveraging Glue, Redshift, and S3, enhancing data storage and retrieval speed.
Implemented data quality and lineage tracking using Informatica, ensuring data integrity and traceability across the organization.
Migrated on-prem data pipelines to Snowflake, developing DBT transformations for efficient data processing.
Created DBT models for data cleansing and transformation, enabling better governance and data lineage tracking.
Developed data lineage tracking and metadata processing for AWS-based data pipelines, enhancing visibility into data movements.
Built monitoring dashboards using Datadog and AWS CloudWatch, providing real-time insights into data pipeline health and SLA adherence.
Designed Dimensional and Snowflake Schema models, ensuring efficient storage and retrieval in Snowflake and AWS Redshift.
Developed data models and extracted metadata from Amazon Redshift, AWS Kinesis Data Streams, and Elasticsearch, writing SQL queries for insightful reporting.
Implemented real-time data streaming solutions using Apache Kafka, ensuring efficient data ingestion and pipeline execution.
Built scalable data transformation pipelines using PySpark (DataFrames, Spark SQL, Spark MLlib) and Databricks, optimizing data extraction, transformation, and aggregation for advanced analytics.
Automated ad-hoc reporting and data extracts by developing a customized reporting framework using Oozie, streamlining daily business intelligence operations.
Designed and optimized stored procedures, functions, and packages, performing SQL tuning and troubleshooting to enhance database performance.
Developed and implemented containerized applications using Docker Swarm and Docker Compose, ensuring scalability and streamlined deployments.
Deployed and managed microservices architectures on Kubernetes clusters, configuring Operators, Deployments, ConfigMaps, Secrets, and Services to enhance infrastructure efficiency.
Created and optimized interactive dashboards and reports using Tableau, incorporating filters, quick filters, sets, parameters, and calculated fields for data visualization.
Technology Stack:
Python, AWS Glue, AWS Athena, AWS S3, AWS Redshift, AWS EMR, AWS RDS, DynamoDB, SQL, QuickSight, DBT, Snowflake, Apache Spark, Kafka, MongoDB, Databricks, Hadoop, Linux, PySpark, Oozie, HDFS, MapReduce, Cloudera, HBase, Hive, Pig, Docker, Kubernetes, Tableau
Client: Cybersoft, Hyderabad, India
Data Engineer May 2019 - June 2020
Responsibilities:
Developed scalable data pipelines using PySpark and Spark SQL in Azure Databricks, extracting, transforming, and aggregating data from multiple sources to uncover insights into customer usage, consumption patterns, and behavior.
Designed and optimized dimensional data models using Star Schema, Snowflake Schema, and Slowly Changing Dimensions (SCDs) to support forecasting and analytics on large-scale datasets.
Implemented ETL processes using Azure Data Factory (ADF) to ingest and transform structured and semi-structured data into Azure SQL Server for downstream analytics.
Designed and developed ETL solutions using Informatica PowerCenter and IICS, extracting data from Azure SQL Server, Oracle, and PostgreSQL.
Built batch and real-time data processing pipelines leveraging Azure Data Factory (ADF) and Informatica IICS, optimizing data workflows.
Developed data masking and transformation solutions, ensuring compliance with regulatory requirements such as SOX and GDPR.
Implemented ETL performance tuning techniques, optimizing SQL queries and Informatica workflows for faster data processing.
Developed scripts to automate data transfer from an FTP server to the ingestion layer using Azure CLI commands, improving efficiency in data processing workflows.
Configured and managed Azure HDInsight clusters using PowerShell scripts, ensuring automation of cluster provisioning and efficient big data processing.
Developed real-time data processing pipelines in Azure Stream Analytics and Apache Spark, integrating streaming data from various sources into Azure Data Lake and Azure SQL Server.
Built interactive dashboards and reports in Tableau, leveraging filters, quick filters, sets, parameters, and calculated fields to deliver actionable business insights.
Optimized SQL queries and stored procedures in Azure SQL Server, performing query tuning, indexing, and partitioning to enhance data retrieval and processing efficiency.
Developed and deployed containerized applications using Docker and Kubernetes, orchestrating data pipeline workloads across distributed environments.
Automated data workflow execution and monitoring using Apache Airflow, scheduling and managing pipeline dependencies efficiently.
Implemented data validation and transformation logic using Python and Pandas, ensuring high data quality in analytical and reporting layers.
Wrote Unix shell scripts to schedule and monitor batch jobs, automating data movement and system health checks in Azure and on-premises environments.
Technology Stack:
Python, PySpark, SQL Server, Azure Data Factory (ADF), Azure Databricks, Azure SQL Server, Azure Blob Storage, Azure Table Storage, Apache Spark, Apache Hive, Apache Airflow, Tableau, Unix Shell Scripting, Docker, Kubernetes, PowerShell, FTP, Teradata SQL Assistant, Oracle 12c
Education
University of Central Missouri
Master of Science, Computer Information Systems and Information Technology
Lovely Professional University
Bachelor of Technology, Information Technology