
Data Engineer Lead

Location:
Jersey City, NJ
Posted:
March 26, 2025

Sagar Karakala

Lead Data Engineer

+1-551-***-**** ***************@*****.***

www.linkedin.com/in/sagarkarakala

Jersey City, NJ, USA

SUMMARY

10+ years of experience in cloud-based data engineering, specializing in Azure, AWS, and GCP solutions for enterprise-scale data transformation, cloud migration, and real-time analytics.

Extensive expertise in Azure services, including Azure Data Factory (ADF), Azure Synapse Analytics, Azure Databricks, Azure SQL Database, Cosmos DB, and Salesforce Data Cloud for cloud-native data solutions.

Proficient in AWS data services, including S3, Redshift, EMR, Lambda, DynamoDB, Glue, Athena, and GCP integration with Pub/Sub, BigQuery, and Dataflow for scalable cloud architectures.

Architected, developed, and optimized batch and real-time ETL pipelines, leveraging Apache Kafka, AWS Kinesis, Azure Event Hubs, Apache Flink, and Spark Streaming, reducing data latency from 5 minutes to under 30 seconds (a minimal streaming sketch follows this summary).

Deep expertise in Hadoop ecosystem (HDFS, Hive, Pig, Sqoop) and Apache Spark (Spark SQL, MLlib, GraphX, Streaming) for processing large-scale structured and unstructured datasets.

Designed and managed Databricks workspaces, clusters, and ML lifecycle operations across Azure and AWS, integrating MLflow and SageMaker for machine learning model deployment.

Strong database engineering skills in SQL Server, Azure SQL Data Warehouse, Oracle, and NoSQL (Cosmos DB, DynamoDB, MongoDB), specializing in query optimization, indexing, partitioning, and performance tuning, improving query speeds by 50%+.

Automated database performance monitoring and tuning using SQL Profiler, Extended Events, Query Store, and cloud-native analytics tools, ensuring high availability and cost efficiency.

Built and optimized scalable ETL pipelines for enterprise data warehouses (Azure Synapse, AWS Redshift, BigQuery), ensuring seamless data ingestion, transformation, and governance.

Led multi-cloud migration projects, executing Lift & Shift, Re-Architecting, and Hybrid Cloud strategies using Azure Migrate, AWS Snowball, and Terraform, reducing migration costs by 30%.

Implemented cloud security best practices, including IAM, RBAC, Azure Key Vault, AWS KMS, data encryption, and access control policies, ensuring GDPR, HIPAA, and SOC 2 compliance.

Developed and enforced enterprise data governance frameworks, integrating data lineage tracking, PII data masking, and automated compliance reporting.

Designed and deployed CI/CD pipelines using Jenkins, Docker, Kubernetes, GitHub Actions, and Azure DevOps, automating infrastructure provisioning with Terraform, Bicep, and Ansible, accelerating deployment by 40%.

Developed RESTful APIs and data integration pipelines using Python, FastAPI, and Flask, enabling seamless third-party data ingestion into cloud ecosystems.

Expert in business intelligence & reporting tools, including Power BI, Tableau, SSRS, and Excel Pivot Tables, developing scalable analytics solutions for financial, healthcare, and e-commerce industries.

Engineered interactive web-based dashboards and front-end analytics tools using ReactJS, JavaScript, HTML, and CSS, integrating with cloud-based data applications for improved user engagement.

Led and mentored teams of 10+ data engineers, establishing best practices in cloud data engineering, big data, DevOps, and automation across global enterprise teams.

Proficient in Agile (Scrum) & Waterfall methodologies, with expertise in JIRA for sprint planning, backlog management, and end-to-end project tracking.
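
The Kafka and Spark Streaming latency figure above refers to micro-batch ingestion pipelines. The sketch below illustrates the pattern only, assuming a hypothetical "transactions" topic, placeholder broker address and storage paths, and a Delta Lake sink on Databricks; real schemas, sinks, and trigger intervals varied by engagement.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("low-latency-ingest").getOrCreate()

# Hypothetical event schema; real topics and fields are project-specific.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")  # requires the spark-sql-kafka package on the cluster
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "transactions")               # placeholder topic
       .load())

events = (raw
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Short micro-batch triggers keep end-to-end latency well under 30 seconds.
query = (events.writeStream
         .format("delta")                                    # assumes a Delta Lake sink
         .option("checkpointLocation", "/chk/transactions")  # placeholder path
         .outputMode("append")
         .trigger(processingTime="10 seconds")
         .start("/lake/transactions"))

query.awaitTermination()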

EDUCATION

Bachelor of Technology in Computer Science, Aug 2014

Osmania University

TECHNICAL SKILLS

AWS Services

S3, EC2, EMR, Redshift, RDS, Lambda, Kinesis, SNS, SQS, AMI, IAM, CloudFormation

Hadoop Components / Big Data

HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Salesforce, Impala, ZooKeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Airflow, Snowflake, Spark components

Databases

Oracle, Microsoft SQL Server, MySQL, DB2, Teradata

Programming Languages

Java, Scala, Impala, JavaScript, Python.

Web Servers

Apache Tomcat, WebLogic.

BI Tools

Tableau, SSRS, Looker, Amazon QuickSight, Excel Pivot Tables, Power BI

Methodologies

Agile (Scrum), Waterfall, UML, Design Patterns, SDLC.

Currently Exploring

Apache Flink, Drill, Tachyon.

Cloud services

AWS, Azure, Azure Data Factory (ETL/ELT/SSIS), Azure Data Lake

Other Tools

JIRA, Confluence, Slack, Microsoft Teams

EXPERIENCE

Client: Capital One, Chicago, IL Jan 2023-Present

Role: Lead Data Engineer

Description: Capital One is a diversified bank that offers a broad array of financial products and services to consumers, small businesses, and commercial clients. Capital One Financial Corporation is a bank holding company specializing in credit cards, auto loans, and banking and savings products, headquartered in McLean, Virginia.

Responsibilities:

Led development activities for 15+ high-performance data solutions, gathering requirements, designing scalable architectures, and mentoring a team of 10+ developers to ensure secure, efficient, and compliant implementations.

Designed and implemented 30+ ETL pipelines, extracting, transforming, and loading terabytes of data daily into Azure Data Storage services, leveraging Azure Data Factory (ADF), T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics.

Developed and optimized 50+ Azure Data Factory pipelines, configuring Linked Services, Datasets, and Triggers to automate data movement across Azure SQL, Blob Storage, and Azure Synapse (DW), reducing manual processing time by 40%.

Engineered JSON scripts to automate deployment of ADF pipelines, cutting down deployment time by 60% while ensuring seamless SQL activity execution.

Optimized 100+ complex transformations using T-SQL and Spark SQL, delivering 50% faster query execution and enhancing data processing efficiency in Azure Synapse Analytics (SQL DW).

Developed and deployed 25+ high-performance Databricks ETL pipelines leveraging Spark DataFrames, Spark SQL, and Python scripting, reducing batch processing time by 35%.

Integrated and configured JDBC connectors in Azure Databricks to streamline ETL operations across 10+ relational databases, improving cross-platform data accessibility.

Implemented Hadoop Distributed File System (HDFS) solutions, enabling large-scale parallel data processing and seamless integration with Azure-based services.

Designed and optimized 20+ data warehouse schemas and tables, improving query performance by 45% and enabling real-time analytics.

Migrated 50TB+ of data from on-premises data warehouses to Azure Data Lake Gen2, ensuring 99.9% uptime and data integrity during the transition.

Automated 80% of data movement workflows by deploying and configuring Azure Data Factory pipelines, significantly reducing manual effort and operational costs.

Implemented Change Data Capture (CDC) and incremental data load mechanisms in ADF pipelines, eliminating redundant processing and improving ETL efficiency (a minimal watermark-based sketch of this pattern follows the responsibilities list).

Built and deployed real-time data processing solutions using Azure Stream Analytics, Azure Event Hub, and Service Bus Queue, improving event-driven analytics latency by 70%.

Developed and launched 50+ Power BI dashboards and reports, implementing advanced DAX queries and Row-Level Security (RLS) to enable secure, data-driven decision-making for 500+ business users.

Implemented role-playing dimensions in Power BI tabular models using DAX functions, optimizing analytical capabilities for multi-perspective reporting.

Configured and managed Power BI Enterprise and Personal Gateway, ensuring 99.5% data availability with automated refresh schedules and real-time synchronization.

Developed and deployed Informatica mappings to meet client-specific ETL and data transformation requirements, improving data integration workflows.

Transferred large datasets using PySpark connectors from the Azure Synapse workspace, ensuring efficient cross-platform data movement.

Configured and monitored end-to-end Azure integrations using Azure Monitor, proactively identifying and resolving performance bottlenecks.

Integrated Azure Logic Apps to automate workflow-based data processing, improving task execution efficiency and enabling real-time decision-making.

Led end-to-end functional and regression testing on 100+ data models, eliminating 95% of data inconsistencies and enforcing governance rules for data security, access control, and compliance.
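
A minimal sketch of the watermark-based incremental (CDC-style) load referenced above, assuming a Databricks workspace with Delta tables and hypothetical control-table, source, and target names (control.load_watermarks, dbo.orders, curated.orders); the production pipelines were orchestrated by ADF with per-source watermarks and secrets pulled from Key Vault.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read the last successful watermark for this source; default to an early date on first run.
row = spark.sql(
    "SELECT MAX(last_modified) AS wm FROM control.load_watermarks WHERE source = 'orders'"
).collect()[0]
watermark = row["wm"] or "1900-01-01"

# Pull only rows changed since the watermark from the source system over JDBC.
incremental = (spark.read
               .format("jdbc")
               .option("url", "jdbc:sqlserver://<server>;database=<db>")  # placeholder connection
               .option("dbtable",
                       f"(SELECT * FROM dbo.orders WHERE last_modified > '{watermark}') src")
               .option("user", "<user>")
               .option("password", "<secret>")  # in practice, fetched from Azure Key Vault
               .load())

# Append the delta to the curated table with a load timestamp.
(incremental
 .withColumn("load_ts", F.current_timestamp())
 .write.mode("append")
 .saveAsTable("curated.orders"))

# Advance the watermark only after a successful write (UPDATE assumes a Delta control table).
new_wm = incremental.agg(F.max("last_modified")).collect()[0][0]
if new_wm is not None:
    spark.sql(f"UPDATE control.load_watermarks SET last_modified = '{new_wm}' WHERE source = 'orders'")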

Environment: SQL Database, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Synapse workspace, Synapse SQL pool, Power BI, Python, data masking, Azure Databricks, Azure SQL Data Warehouse, Azure Stream Analytics.

Client: Aditi Consulting, Seattle, WA Sep 2021-Dec 2022

Role: Azure Data Architect

Description: Aditi is a digital engineering services company that partners with established and emerging enterprises, leveraging borderless talent across three continents to achieve transformative outcomes that reshape their trajectory.

Responsibilities:

Migrated client data warehouse from on-premises to Azure Cloud, enhancing data retrieval and minimizing maintenance overhead.

Engineered ETL pipelines for seamless migration of 100+ TB of data to Azure Synapse Analytics & Azure SQL DB, optimizing data processing.

Created Linked Services in Azure Data Factory (ADF) to seamlessly extract, transform, and load (ETL) data from multiple sources, including Azure SQL, Blob Storage, and Azure SQL Data Warehouse.

Automated 50+ data pipelines in Azure Data Factory, streamlining cloud data movement and reducing manual workloads.

Designed batch processing solutions using Azure Data Factory & Databricks, improving data transformation speed.

Developed Azure Databricks clusters, notebooks, jobs, and auto-scaling features, optimizing compute costs.

Mounted Azure Data Lake & Blob Storage in Databricks, enabling real-time analytics with improved query performance (a minimal mount sketch follows the responsibilities list).

Automated file validation using Python scripts in Databricks, triggered through ADF, reducing data processing errors.

Developed a Near Real-Time (NRT) data pipeline with Azure Stream Analytics & Event Hub, cutting processing latency.

Integrated Cosmos DB for catalog storage and event sourcing in order processing pipelines, improving processing speed.

Created cloud-based storage accounts for end-to-end job execution, ensuring high availability and security.

Implemented data auditing, masking, and security frameworks, ensuring full GDPR & SOC 2 compliance.

Implemented end-to-end monitoring using Azure Monitor & Application Insights, reducing system downtime.

Secured credentials and access management using Azure Key Vault, strengthening data security.

Developed and executed ETL test strategies, ensuring data pipeline accuracy.

Conducted business requirement analysis through direct stakeholder interactions, improving data accuracy and decision-making speed.

Reviewed project plans, identified gaps, and provided execution feasibility feedback, streamlining development time.

Designed 50+ Power BI dashboards with advanced DAX functions for real-time insights and secure reporting.

Optimized Power BI dashboards, reducing report generation time and enhancing real-time insights.

Applied advanced Excel formulas for data analysis, improving financial reporting accuracy.

Deployed automated CI/CD pipelines via Azure DevOps, accelerating release cycles and infrastructure provisioning.

Optimized data modeling and query execution using Azure Synapse Analytics, SQL Azure, and Azure Analysis Services, improving report efficiency.
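
A minimal sketch of the Databricks mount described above, using the standard OAuth service-principal configuration for ADLS Gen2. The storage account, container, tenant ID, and Key Vault-backed secret scope names are hypothetical; dbutils, spark, and display are provided by the Databricks notebook runtime.

# Service-principal credentials come from a Key Vault-backed secret scope (names are placeholders).
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("kv-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("kv-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the 'raw' container of an ADLS Gen2 account under /mnt/raw.
dbutils.fs.mount(
    source="abfss://raw@<storageaccount>.dfs.core.windows.net/",
    mount_point="/mnt/raw",
    extra_configs=configs,
)

# Once mounted, notebooks can read the lake like a local path.
df = spark.read.format("parquet").load("/mnt/raw/sales/2022/")
display(df.limit(10))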

Environment: SQL Database, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Synapse SQL pool, Power BI, Python, data masking, Azure Databricks, Azure SQL Data Warehouse, Azure Cosmos DB, Azure Stream Analytics.

Client: Prudential Financial, Newark, New Jersey, USA Mar 2020-Aug 2021

Role: Senior Azure Data Engineer

Description: Prudential Financial is a global financial services company offering insurance, investment management, and retirement solutions. Played a key role in designing, implementing, and maintaining data pipelines and storage solutions, ensuring efficient data integration, security, and process automation while enhancing data engineering practices.

Responsibilities:

Designed, built, and expanded over 25 data pipelines using Azure Data Factory and other Azure services, processing and integrating data from 10+ disparate sources, leading to a 40% improvement in data availability.

Engineered and composed data storage solutions with Azure Data Lake Storage, Azure Blob Storage, Salesforce Data Cloud, and other services, handling data volumes exceeding 100 TB.

Initiated and secured relational databases using Azure SQL Database and Azure Synapse Analytics, reducing query processing times by 35% and improving database security compliance by 20%.

Devised a multi-layered data security strategy that included encryption and strict access controls; resulted in 100% compliance during internal audits, reinforcing the organization's commitment to safeguarding customer data.

Monitored and resolved issues in data pipelines, cutting pipeline downtime in half and improving system reliability.

Evaluated and recommended 5+ new Azure technologies and tools, which resulted in a 25% improvement in data engineering workflows and cost savings.

Deployed resources (code and infrastructure) in Azure using Azure DevOps & Jenkins pipelines, streamlining deployment processes and reducing release times by 30%.

Automated over 15 business processes and workflows using Azure Logic Apps, leading to a 20% reduction in manual intervention and operational costs.

Used Azure Data Factory (ADF) extensively to ingest data from 20+ relational and unstructured data sources, meeting business functional requirements with a 99% accuracy rate.

Integrated data from a wide array of internal and external sources, including APIs, flat files, and third-party services, into Azure-based data systems, resulting in a 45% reduction in data processing times.

Built and launched CI/CD pipelines for data engineering projects using Azure DevOps, GitHub Actions, and Jenkins, reducing build, test, and deployment cycle times by 25%.

Wrote Python scripts to automate ETL processes and interact with Azure services, reducing manual coding effort by 30% and boosting data processing efficiency.

Monitored servers using Nagios, CloudWatch, and the ELK Stack (Elasticsearch, Kibana), achieving 99.9% system uptime and timely identification of server performance issues.

Applied Python libraries such as Pandas, NumPy, and PySpark to process and analyze large datasets, improving data processing speeds by 40% (a minimal PySpark cleanup sketch follows the responsibilities list).

Integrated and transformed data from both on-premises (MySQL, Cassandra) and cloud sources (Blob Storage, Azure SQL DB) using Azure Data Factory, improving data pipeline performance by 35%.
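
A minimal sketch of the PySpark-style cleanup and aggregation referenced above. Container, column, and table names (policies, premium, analytics.policy_summary) are hypothetical, and storage credentials are assumed to be configured on the cluster; the production jobs included additional validation and were triggered through ADF.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("policy-cleanup").getOrCreate()

# Read raw CSV extracts from Blob Storage (placeholder account and container).
policies = (spark.read
            .option("header", True)
            .csv("wasbs://raw@<account>.blob.core.windows.net/policies/*.csv"))

# Deduplicate, enforce types, and drop unusable rows.
clean = (policies
         .dropDuplicates(["policy_id"])
         .withColumn("premium", F.col("premium").cast("double"))
         .filter(F.col("premium").isNotNull()))

# Aggregate to the reporting grain consumed by downstream dashboards.
summary = (clean.groupBy("product_line")
           .agg(F.count("*").alias("policy_count"),
                F.round(F.avg("premium"), 2).alias("avg_premium")))

summary.write.mode("overwrite").saveAsTable("analytics.policy_summary")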

Environment: API, Azure, Azure Analysis Services, Azure Synapse Analytics, Azure Blob Storage, Cassandra, CI/CD, Azure Data Factory, Elasticsearch, ETL, HBase, HDFS, Java, Jenkins, Kafka, Kubernetes, Azure Data Lake, Oracle, Power BI, PySpark, Python, RDBMS.

Client: Bristol-Myers Squibb, Princeton, New Jersey, USA May 2018-Feb 2020

Role: AWS Data Engineer

Description: Bristol-Myers Squibb is a global biopharmaceutical company dedicated to discovering, developing, and delivering innovative medicines for serious diseases. Contributed to designing and delivering scalable cloud infrastructure, automating deployments, and implementing data processing and disaster recovery solutions.

Responsibilities:

Designed, deployed, and maintained scalable and secure AWS cloud infrastructure to support over 50 applications and services, ensuring 99.9% availability and reliability across AWS services.

Developed and administered cloud architecture strategies aligned with business goals, reducing infrastructure costs by 20% while increasing performance by 30%.

Automated infrastructure provisioning and management using AWS CloudFormation, Terraform, and Ansible, reducing manual effort by 85% and accelerating deployment times by 40%.

Ensured security and compliance of AWS environments by implementing best practices for access control, data protection, and adhering to HIPAA and GDPR regulations, achieving 100% compliance in security audits.

Monitored AWS infrastructure and applications using Amazon CloudWatch, identifying and addressing performance bottlenecks and driving a 25% increase in system efficiency.

Developed and overhauled disaster recovery plans and backup strategies, ensuring data integrity and 100% availability during system failures.

Performed data transformation and cleansing using AWS Glue, AWS Lambda, and AWS EMR, processing data streams in real-time using Amazon Kinesis, improving data pipeline efficiency by 35%.

Configured and monitored AWS infrastructure using Terraform, reducing infrastructure management time by 90% and improving system uptime to 99.9%.

Participated in all phases of the Software Development Lifecycle (SDLC), including requirements gathering, design, development, deployment, and analysis, shortening overall project delivery timelines by 20%.

Integrated Kubernetes with AWS EKS and GCP GKE, achieving a 50% improvement in scalability and management of cloud-native services.

Revitalized Kibana dashboards for near real-time log analysis, integrating multiple source and target systems into Elasticsearch, reducing troubleshooting time by 40%.

Wrote AWS Lambda functions and Spark jobs with cross-functional dependencies, automating data ingestion and refinement into the data lake, reducing manual processes by 60%.

Automated deployments to diverse environments (development, staging, production) using AWS CodeDeploy, AWS Elastic Beanstalk, and Kubernetes, increasing deployment frequency by 50%.

Established secure network communication using Kerberos authentication principles, setting up and testing HDFS, Hive, Pig, and MapReduce for over 200 users across the cluster.

Used the AWS SDK for Python (Boto3) to programmatically interact with AWS services such as EC2, S3, RDS, and Lambda, automating processes to reduce manual intervention by 75% (a minimal Boto3 sketch follows the responsibilities list).

Crafted a data optimization strategy using Python libraries, leading to a refined process for handling 750,000 data points daily; this initiative improved data accuracy and reliability, supporting key business outcomes.
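
A minimal Boto3 sketch of the scripted automation referenced above. The bucket, prefixes, and Lambda function name are hypothetical; it illustrates the general pattern (paginate over S3 objects, archive them, then trigger a downstream function) rather than the exact production scripts.

import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

def archive_old_objects(bucket: str, prefix: str, archive_prefix: str) -> int:
    """Copy objects under `prefix` to an archive prefix and delete the originals."""
    moved = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            s3.copy_object(
                Bucket=bucket,
                CopySource={"Bucket": bucket, "Key": key},
                Key=key.replace(prefix, archive_prefix, 1),
            )
            s3.delete_object(Bucket=bucket, Key=key)
            moved += 1
    return moved

if __name__ == "__main__":
    count = archive_old_objects("clinical-data-lake", "landing/2019/", "archive/2019/")
    # Fire-and-forget invocation of a downstream refinement Lambda once archiving completes.
    lam.invoke(FunctionName="refine-clinical-batch", InvocationType="Event")
    print(f"Archived {count} objects")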

Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python, SSRS, Tableau.

Client: NCL Industries Ltd, Hyderabad, India Jan 2017-Apr 2018

Role: GCP Data Engineer

Description: NCL Industries Ltd. is a significant player in the Indian manufacturing sector, focusing on building materials and related products. Involved in designing, deploying, and upgrading scalable data infrastructure and workflows to ensure efficient data processing and integration while maintaining security and compliance.

Responsibilities:

Designed, deployed, and revamped data infrastructure on Google Cloud Platform (GCP), utilizing services such as Google BigQuery, Google Cloud Storage, and Google Cloud SQL, supporting datasets of up to 100 TB.

Ensured scalable data infrastructure capable of managing a 50% annual growth in data workloads, improving system performance by 30%.

Built and maintained ETL pipelines using Google Cloud Dataflow, Google Cloud Composer, and Apache Airflow, processing over 10 million records per day and reducing data processing time by 40%.

Automated data ingestion and processing workflows, reducing manual intervention by 70% and improving data processing efficiency by 35%.

Built and managed APIs to integrate data across 15+ internal and external systems, improving data accessibility and reducing API response times by 25%.

Used Google BigQuery and Google Data Studio to analyze and visualize data, providing actionable insights that reduced decision-making time by 20%.

Enhanced the execution of queries and data workflows, resulting in a 40% reduction in processing costs and a 35% improvement in query performance.

Enforced security measures to protect data through encryption and access controls, achieving 100% compliance with GDPR and HIPAA regulations.

Implemented audit logging and conducted regular security assessments, identifying and resolving potential vulnerabilities and reducing security incidents by 30%.

Stayed updated with the latest GCP technologies and implemented 5+ improvements, increasing overall data engineering efficiency by 25%.

Wrote Python DAGs in Airflow to orchestrate end-to-end data pipelines for multiple applications, improving workflow automation and reducing pipeline failure rates by 20% (a minimal DAG sketch follows the responsibilities list).

Maintained and upgraded Azure cloud services, including Azure Data Lake Storage, Azure SQL Database, Azure Synapse Analytics, and Azure Blob Storage, handling data migrations of over 50 TB.

Facilitated data movement between GCP and Azure using Azure Data Factory, streamlining cross-cloud data processes, and improving data transfer speeds by 35%.

Applied expertise in Cloud Dataflow and Apache Beam, deploying scalable services via cloud shell, leading to a 20% reduction in task completion times.
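
A minimal Airflow DAG sketch of the orchestration referenced above, written against Airflow 2.x provider imports with hypothetical project, dataset, task, and schedule names; the original Cloud Composer DAGs included additional validation, alerting, and backfill handling.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

default_args = {"owner": "data-eng", "retries": 2, "retry_delay": timedelta(minutes=5)}

def validate_source_files(**_):
    # Placeholder for the real checks (row counts, schema drift, late files).
    print("source files validated")

with DAG(
    dag_id="daily_sales_load",          # hypothetical pipeline name
    start_date=datetime(2017, 1, 1),
    schedule_interval="0 2 * * *",
    catchup=False,
    default_args=default_args,
) as dag:
    validate = PythonOperator(
        task_id="validate_source_files",
        python_callable=validate_source_files,
    )

    load_to_bq = BigQueryInsertJobOperator(
        task_id="load_to_bigquery",
        configuration={
            "query": {
                "query": "SELECT * FROM `my-project.staging.sales_raw`",
                "destinationTable": {
                    "projectId": "my-project",
                    "datasetId": "analytics",
                    "tableId": "sales_daily",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    validate >> load_to_bq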

Environment: GCP, BigQuery, Cloud SQL, Cloud Storage, Airflow, BQ-ML, Data Studio, MySQL, Git, Azure, Atlassian, Natural Language, Windows Server, Python, shell scripts, IAM Security, Service Data Transfer, VPC Configuration, Data Catalog, VPN Google-Client.

Client: L&T Finance Holdings Ltd, Hyderabad, India Nov 2014-Dec 2016

Role: Data Engineer

Description: L&T Finance Holdings Ltd is a prominent financial services company in India, part of the L&T Group. Built data pipelines and warehousing solutions to automate and streamline the processing of financial data, enhancing performance and reporting.

Responsibilities:

Redesigned and maintained over 50 ETL pipelines to automate data ingestion, transformation, and loading processes, reducing manual intervention by 75% and increasing data throughput by 40%.

Built and optimized scalable data pipelines to process large volumes of market data, trade transactions, and financial reports, handling over 1 billion transactions monthly using Apache Kafka, Apache NiFi, and AWS Glue.

Created and enhanced ETL workflows that integrated data from 20+ market feeds, trading platforms, and internal databases into data lakes, improving data availability by 30%.

Deployed and managed data warehousing solutions using Amazon Redshift, Google BigQuery, and Microsoft Azure Synapse Analytics, organizing over 100 TB of trading data and reducing data retrieval times by 40%.

Accelerated SQL queries and data processing tasks, improving performance and reducing latency by 35% in retrieving and analyzing financial data.

Integrated data into BI tools such as Tableau, Power BI, and Looker for analysis and reporting, developing over 30 dashboards and visualizations for stakeholders and improving decision-making speed by 25%.

Set up monitoring and alerting systems using AWS CloudWatch and Google Cloud Monitoring, reducing pipeline downtime by 50% and improving the overall health of data workflows.

Integrated data from diverse sources, including relational databases, NoSQL databases, APIs, and file systems, into AWS-based data solutions with AWS Glue, Amazon RDS, and AWS Lambda, processing over 10 TB of data daily (a minimal Glue job sketch follows the responsibilities list).

Connected AWS data solutions with business intelligence tools such as Amazon QuickSight, Tableau, and Power BI, enabling real-time data analysis and improving stakeholder reporting by 35%.
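
A minimal AWS Glue (PySpark) job sketch of the relational-to-lake integration referenced above. The Glue Catalog database and table, column names, and S3 bucket are hypothetical; production jobs added job bookmarks, schema checks, and error handling.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue = GlueContext(sc)
spark = glue.spark_session
job = Job(glue)
job.init(args["JOB_NAME"], args)

# Read trade records registered in the Glue Data Catalog (placeholder database/table).
trades = glue.create_dynamic_frame.from_catalog(database="trading_db", table_name="trades")

# Keep only settled trades and standardize a column name before landing in the lake.
settled = (trades.toDF()
           .filter("status = 'SETTLED'")
           .withColumnRenamed("trd_dt", "trade_date"))

(settled.write
 .mode("append")
 .partitionBy("trade_date")
 .parquet("s3://lt-finance-data-lake/curated/trades/"))

job.commit()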

Environment: AWS, CI/CD, cluster, Data Factory, Docker, ETL, HBase, Hive, Java, Power BI, AWS Glue, Amazon RDS, AWS CloudWatch, Google Cloud.


