Professional Summary
Experienced Cloud Data Engineer and Analyst with over 7 years of expertise in designing, developing, and managing end-to-end data pipelines, data warehousing solutions, and BI reporting systems. Proficient in leveraging modern cloud platforms such as AWS, Azure, and GCP to build scalable, efficient, and secure data architectures. Skilled in ETL processes, data modeling, data governance, and visualization using tools such as Informatica, IBM DataStage, Power BI, and Tableau. Adept at programming and automation with Python, PySpark, R, and Terraform to streamline workflows and enhance system performance.
Strong knowledge of database systems, including Snowflake, Netezza, Oracle SQL, PostgreSQL, MySQL, SnowSQL, and big data processing technologies such as Apache Spark, Kafka, and Hadoop. Demonstrated expertise in utilizing advanced cloud-based tools and services, including Google BigQuery, Amazon Redshift, Hive, Amazon Kinesis, Google Cloud Dataflow, Google Cloud Pub/Sub, Azure Blob Storage, Azure Data Lake Storage, Azure Data Factory, Azure Databricks, and Azure Cosmos DB. Proven success in integrating these technologies to design and implement efficient, real-time, and batch data processing systems, enabling scalable and high-performance analytics.
Solely managed the adoption and integration of cutting-edge technologies within the organization, leading a complete migration to cloud platforms. Designed and executed the data engineering roadmap, aligning it with business goals and managing the budget to ensure cost-effective and high-performing solutions. Played a pivotal role in transforming the organization’s data ecosystem, streamlining processes, and driving innovation.
Extensive hands-on experience with various ETL tools and technologies for managing production streams, ensuring seamless data processing and troubleshooting. Actively participated in on-call duties to resolve critical production issues within tight deadlines, minimizing downtime and maintaining system reliability. Proven ability to diagnose, address, and resolve complex technical challenges under pressure, ensuring smooth operations.
Collaborated with cross-functional teams and stakeholders to gather requirements and deliver effective data solutions tailored to business needs. Consistently focused on performance optimization and cost-effectiveness by leveraging cloud-native services and technologies, ensuring scalable and sustainable implementations. Provided guidance and mentorship to junior team members, fostering professional growth and ensuring best practices across projects. Emphasized the importance of working collaboratively within cross-functional Agile teams to deliver data engineering solutions efficiently.
Demonstrated expertise in addressing and responding to security concerns, ensuring cloud environments and data workflows adhere to company security guidelines and industry standards. Proven ability to handle large-scale migrations, manage financial data warehouses, and ensure data quality and consistency in complex environments. A collaborative team player with a track record of delivering insights through robust dashboards, advanced SQL queries, and interactive visualizations. Skilled in aligning business objectives with data-driven strategies, enabling informed decision-making and driving organizational growth.
Committed to aligning all data engineering solutions with business objectives, driving data-driven decision-making and organizational growth. Continually learns emerging technologies and applies them in live projects, whether optimizing cloud infrastructure, designing scalable data architectures, or leading high-performing teams. This continuous-learning approach keeps the systems designed and managed modern, enhances data accessibility, business agility, and operational efficiency, and ensures alignment with evolving business needs.
Technical Skills
Big Data & Cloud
Apache Airflow, Hadoop, Spark, Kafka, Google BigQuery, Amazon Redshift, Snowflake, AWS, Azure, Google Cloud
ETL & Data Tools
IBM DataStage, Informatica, Talend, AWS Glue, dbt, Data Modeling (Erwin, IBM Infosphere), SQL
Programming
Python, Scala, T-SQL, PL-SQL, Shell, Perl, R, Java
Cloud Platforms
AWS, Azure, Google Cloud (GCP)
Data Visualization
Tableau, Power BI, Sisense, Sigma Computing
Databases
MySQL, PostgreSQL, SQL Server, Oracle, Teradata, Snowflake, Cosmos DB
Data Governance
Master Data Management, Metadata Management, Data Lineage, Data Integrity Checks
Automation & Orchestration
Terraform, Jenkins, AWS Step Functions, Apache Airflow, Autosys
CI/CD & Security
AWS IAM, Terraform, Lambda, Cloud Security Best Practices
Operating Systems
Linux, Windows, Unix
Work History
Role: Senior Cloud Data Engineer
Client: Radial Inc., King of Prussia, PA | September 2022 to Present
●Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC, Parquet, CSV, and text files) into Amazon Redshift, leveraging Amazon EMR for scalable data processing.
●Created and implemented complex Directed Acyclic Graphs (DAGs) in Apache Airflow to manage data workflows, incorporating dependencies, parallel execution, and error handling, with seamless integration into Amazon EMR (a representative DAG sketch follows this list).
●Implemented AWS IAM for managing user permissions and access control across EC2 instances and related applications.
●Integrated AWS Glue with services like S3, Lambda, and Athena, facilitating seamless data processing and analysis within the AWS ecosystem.
●Proficient in Python for developing scalable data solutions, including custom scripts, automation tools, and data processing applications using libraries such as Pandas and NumPy.
●Utilized advanced SQL techniques (Oracle SQL, T-SQL) to improve query performance and enhance database structures.
●Designed and implemented a robust ETL architecture leveraging AWS Glue, Redshift, and EMR for scalable and efficient data migration, transforming campaign data into actionable insights.
●Architected complex data workflows using Apache Airflow, designing Directed Acyclic Graphs (DAGs) to handle dependencies, ensure error resilience, and achieve parallel execution for optimized data pipelines.
●Developed shell scripts to call and execute Perl scripts, optimizing file processing workflows.
●Experienced in relational databases (MySQL, PostgreSQL, SQL Server, Teradata) and NoSQL technologies (MongoDB, Cassandra), with expertise in optimized data modeling and management.
●Performed end-to-end architecture and implementation assessments of various AWS services, including Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis.
●Developed real-time data streaming solutions using Apache Kafka to process and analyze high-velocity data from multiple sources, ensuring low-latency and reliable data ingestion.
●Migrated on-premises ETL jobs to AWS Cloud, leveraging Amazon EMR for efficient and cost-effective data processing.
●Configured EMR clusters for data ingestion, transformation, and integration with Redshift using dbt.
●Deployed applications on AWS Lambda with HTTP triggers, integrating them with API Gateway for serverless architecture implementation.
●Implemented log management in AWS S3 using shell scripts to enhance security and data integrity by monitoring cluster termination activities.
●Extracted data from multiple source systems (S3, Redshift, RDS) and created tables and databases in the AWS Glue Data Catalog using Glue Crawlers.
●Built event-driven and scheduled AWS Lambda functions to trigger various AWS resources.
●Orchestrated large-scale data integration and transformation tasks using IBM DataStage, optimizing ETL processes for efficiency and performance.
●Led the design and management of complex workflows using Autosys, optimizing task dependencies and ensuring smooth task execution.
●Created AWS Lambda functions in Python for deployment management in AWS; designed, investigated, and implemented public-facing websites on Amazon Web Services and integrated them with other application infrastructure.
●Created AWS Lambda functions and API Gateway endpoints so that data submitted through API Gateway is processed by the corresponding Lambda function (a minimal handler sketch also follows this list).
●Used the AWS Glue Data Catalog with crawlers to catalog data from S3 and ran SQL queries against it using Amazon Athena.
●Performed ETL, data profiling, data quality checks, and cleanup for SSIS packages.
●Created monitors, alarms, notifications, and logs for Lambda functions and Glue jobs using Amazon CloudWatch.
●Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as S3 and DynamoDB.
●Migrated a quality monitoring program from AWS EC2 to AWS Lambda and created logical datasets to administer quality monitoring on Snowflake warehouses.
●Created data ingestion modules using AWS Glue to load data into various layers in S3, with reporting through Athena and Amazon QuickSight.
●Designed and optimized Sisense dashboards and ElastiCubes for high-performance analytics, low query latency, and user-friendly visualizations.
●Led the development of scalable, real-time data pipelines, integrating various data sources and ensuring seamless data flow from ingestion to visualization for product-level analytics.
●Managed code repositories using Git, implementing branching strategies (e.g., GitFlow) to streamline development workflows and improve team collaboration on large-scale projects.
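Illustrative example for the Airflow orchestration above: a minimal DAG sketch showing dependencies, parallel execution, and retry-based error handling. The DAG id, schedule, task names, callables, and alert address are hypothetical placeholders, not production code.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Retry and alerting defaults provide basic error handling (values are illustrative).
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["data-alerts@example.com"],  # placeholder address
}

def extract_from_s3(**context):
    """Placeholder: pull campaign files from S3 (hypothetical logic)."""

def transform_orc_parquet(**context):
    """Placeholder: transform ORC/Parquet inputs (hypothetical logic)."""

def transform_csv_text(**context):
    """Placeholder: transform CSV/text inputs (hypothetical logic)."""

def load_to_redshift(**context):
    """Placeholder: load curated data into Redshift (hypothetical logic)."""

with DAG(
    dag_id="campaign_data_pipeline",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_from_s3", python_callable=extract_from_s3)
    transform_a = PythonOperator(task_id="transform_orc_parquet", python_callable=transform_orc_parquet)
    transform_b = PythonOperator(task_id="transform_csv_text", python_callable=transform_csv_text)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    # Dependencies: one extract fans out to two parallel transforms, then a single load.
    extract >> [transform_a, transform_b] >> load
```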
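Illustrative example for the Lambda and API Gateway work above: a minimal handler sketch for a proxy-integration HTTP trigger. The payload fields and response shape are assumptions for illustration only.

```python
import json

def lambda_handler(event, context):
    """Minimal AWS Lambda handler behind an API Gateway HTTP trigger.

    The request/response fields are illustrative assumptions.
    """
    # API Gateway proxy integrations deliver the request body as a JSON string.
    body = json.loads(event.get("body") or "{}")
    record_id = body.get("record_id")  # hypothetical field

    # ... persist or forward the record here (e.g., to S3 or DynamoDB) ...

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"accepted": True, "record_id": record_id}),
    }
```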
Role: Senior Data Engineer
Client: HealthStream, Nashville, TN | September 2021 to August 2022
●Designed and implemented an Enterprise Data Lake to support diverse use cases, including analytics, processing, storage, and reporting of large-scale, rapidly evolving data.
●Ensured quality reference data by performing operations such as cleaning, transformation, and integrity checks in a relational environment, collaborating closely with stakeholders and solution architects.
●Developed a security framework to enable fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
●Implemented real-time ETL pipelines using Apache Kafka and Spark Streaming to ingest, process, and transform streaming data, ensuring near real-time availability for analytics and reporting systems (a minimal sketch follows this list).
●Built and optimized data pipelines in Google Cloud Platform (GCP) using Apache Airflow, leveraging various operators to handle complex ETL workflows.
●Hands-on experience with GCP tools, including Dataproc, GCS, Cloud Functions, and BigQuery, for data processing and analytics tasks.
●Developed frameworks for automated daily ad hoc reports and data extracts from enterprise data in BigQuery, improving reporting efficiency.
●Utilized Google Data Catalog and Cloud APIs for monitoring, querying, and analyzing BigQuery usage, including billing insights.
●Implemented secure network communication by configuring Kerberos authentication principals and testing HDFS, Hive, Pig, and MapReduce for new user access.
●Conducted end-to-end architecture assessments of various AWS services, such as Amazon EMR, Redshift, and S3, ensuring optimal utilization.
●Used AWS EMR to process and move large datasets between Amazon S3 and DynamoDB.
●Developed Spark SQL-based solutions in Scala and Python, leveraging schema RDDs to perform complex computations and generate actionable insights.
●Imported data from multiple sources such as HDFS and HBase into Spark RDDs and processed it with PySpark to deliver structured outputs.
●Created AWS Lambda functions with Boto3 to deregister unused AMIs across regions, reducing EC2 costs.
●Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS).
●Wrote Teradata BTEQ scripts to load and transform data, addressing issues such as SCD Type 2 date chaining and duplicate cleanup.
●Configured and maintained Git-based workflows, including pull requests, code reviews, and version tagging, to ensure consistent code quality and release management.
●Developed a reusable framework to automate ETL processes from RDBMS systems to the Data Lake using Spark Data Sources and Hive data objects.
●Conducted data preparation and blending using Alteryx and SQL for Tableau, publishing data sources to Tableau Server for visualization.
●Designed Kibana dashboards based on Logstash data, integrating multiple source and target systems into Elasticsearch for near real-time log analysis of end-to-end transactions.
●Implemented AWS Step Functions to automate tasks such as publishing data to S3, training machine learning models with Amazon SageMaker, and deploying them for prediction.
●Integrated Apache Airflow with AWS to monitor multi-stage ML workflows, managing tasks executed on Amazon SageMaker.
●Developed and maintained Tableau extracts and workbooks, optimizing performance and aligning with business requirements.
●Designed and deployed interactive Tableau dashboards to visualize key metrics and insights, enabling data-driven decision-making across business units.
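Illustrative example for the Kafka and Spark Streaming pipelines above: a minimal PySpark Structured Streaming sketch. The broker address, topic, event schema, and S3 paths are placeholder assumptions, not the actual pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Requires the spark-sql-kafka connector package on the cluster.
spark = SparkSession.builder.appName("streaming-etl-sketch").getOrCreate()

# Hypothetical schema for the incoming Kafka messages.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("value", DoubleType()),
    StructField("event_ts", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Parse the Kafka value payload, then compute 5-minute counts per event type.
windowed_counts = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withWatermark("event_ts", "10 minutes")
    .groupBy(F.window("event_ts", "5 minutes"), "event_type")
    .count()
)

# Write near-real-time aggregates to a curated zone for reporting (paths are placeholders).
query = (
    windowed_counts.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "s3a://example-bucket/curated/event_counts/")
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/event_counts/")
    .start()
)
query.awaitTermination()
```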
Role: Senior Data Engineer
Client: Travelport, India | May 2018 to June 2021
●Developed and managed ETL pipelines using Informatica to extract, transform, and load data from various sources, ensuring accuracy and reliability.
●Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL DW) and processed the data in Azure Databricks.
●Implemented Azure Databricks for processing large datasets, enabling faster insights and analytics for real-time decision-making.
●Utilized Azure Data Factory to orchestrate ETL processes, guaranteeing timely and accurate data delivery.
●Efficiently stored and managed inventory data using Azure SQL Database.
●Designed and implemented ad hoc data pulls from on-premises sources into Azure SQL DW using Azure Data Factory.
●Developed Spark jobs using PySpark and Spark SQL for data extraction, transformation, and aggregation across multiple file formats (a simplified job sketch follows this list).
●Designed and implemented robust stream-processing systems using Kafka, Storm, and Spark-Streaming to handle real-time data ingestion, processing, and analysis.
●Designed and built ETL pipelines to automate the ingestion of structured and unstructured data, using REST APIs to retrieve analytics data from different data feeds.
●Implemented robust data quality checks within the ETL pipeline to identify and address discrepancies or errors in the data.
●Developed a framework for converting existing Informatica PowerCenter mappings to PySpark (Python and Spark) jobs.
●Employed clustering algorithms to identify patterns and group similar data points, improving data organization and analysis.
●Strong skills in visualization tools such as Power BI and Excel, including formulas, pivot tables, charts, and DAX commands.
●Developed live streaming jobs using Kafka and Kinesis as sources and loaded data onto the Databricks Data Lake.
●Implemented data quality checks within Informatica workflows to identify and address discrepancies or errors in the data, ensuring high data quality standards.
●Integrated Informatica with other tools and technologies such as Azure Databricks and Power BI for seamless end-to-end data processing and visualization.
●Developed and maintained dynamic data pipelines in Informatica for real-time inventory updates and analytics for travel agents.
●Implemented Infrastructure as Code (IaC) practices for deploying and managing cloud infrastructure, ensuring version control and repeatable deployment processes.
●Queried and analyzed data from Cosmos DB for quick searching, sorting, and grouping through CQL.
●Developed Spark scripts using Scala shell commands as per requirements, and processed both schema-oriented and non-schema-oriented data using Scala and Spark.
●Implemented data quality checks within Databricks pipelines to ensure high data integrity.
●Used Python and shell scripting to automate and schedule workflows to run on Azure.
●Worked on data partitioning and distribution strategies to enhance the scalability of MapReduce jobs.
●Integrated Terraform with Jenkins to automate the execution of infrastructure provisioning and updates as part of the continuous integration process.
●Designed and built ETL pipelines for structured and unstructured data ingestion, leveraging Databricks and REST APIs.
●Experienced in performance troubleshooting and tuning of Hadoop clusters.
●Coordinated daily Git-based code merges and resolved conflicts in collaborative development environments, ensuring smooth integration and delivery of software updates.
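Illustrative example for the PySpark extraction, transformation, and aggregation work above: a simplified batch job sketch. The storage paths, column names, and aggregation logic are hypothetical assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-aggregation-sketch").getOrCreate()

# Hypothetical source paths and columns (ADLS Gen2-style URIs used as placeholders).
bookings = spark.read.parquet("abfss://raw@examplestorage.dfs.core.windows.net/bookings/")
agents = (
    spark.read.option("header", True).option("inferSchema", True)
    .csv("abfss://raw@examplestorage.dfs.core.windows.net/agents/")
)

# Basic cleansing: de-duplicate, drop incomplete rows, derive a partition column.
cleaned = (
    bookings
    .dropDuplicates(["booking_id"])
    .filter(F.col("status").isNotNull())
    .withColumn("booking_date", F.to_date("booking_ts"))
)

# Aggregate bookings per agent per day.
daily_by_agent = (
    cleaned.join(agents, on="agent_id", how="left")
    .groupBy("booking_date", "agent_name")
    .agg(
        F.count("booking_id").alias("bookings"),
        F.sum("fare_amount").alias("total_fare"),
    )
)

# Write the curated aggregate, partitioned by day (path is a placeholder).
(
    daily_by_agent.write
    .mode("overwrite")
    .partitionBy("booking_date")
    .parquet("abfss://curated@examplestorage.dfs.core.windows.net/daily_agent_bookings/")
)
```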