Data Engineer Azure

Location:

St. Louis, MO

Salary:

Posted:

March 10, 2025

Contact this candidate

Resume:

Sr. Data Engineer

Name: Navya Kakani

Email: ************@*****.***

Contact: +1-314-***-****

Linked In: https://www.linkedin.com/in/navya-kakani-42558521b/

PROFESSIONAL SUMMARY:

•10+ years of experience in Data Engineering, specializing in designing and implementing scalable data ingestion pipelines using Azure Data Factory, ADLS Gen2, and Snowflake, optimizing data integration and transformation.

•Expertise in ETL/ELT workflows using Apache Spark, Apache Beam, and Apache Airflow, ensuring efficient data extraction, transformation, and loading. Integrated ML flow for tracking and managing machine learning model lifecycles within data pipelines, ensuring reproducibility and performance monitoring.

•Designed and configured data lineage and cataloging using Azure Purview, providing end-to-end visibility into data flows and transformations.

•Strong experience in SQL development, working with DDL, DML, stored procedures, and indexing, optimizing query performance across Vertica, Teradata, and Snowflake environments.

•Implemented event-driven data processing using EventHub, Apache Kafka, and Azure Functions, enhancing real-time analytics capabilities.

•Configured and optimized MySQL and PostgreSQL databases, ensuring data integrity and performance.

•Experience in data modeling and optimization using Apache Hive, HBase, and Snowflake, applying compression techniques for efficient data storage and retrieval.

•Expertise in handling structured and unstructured data, leveraging various file formats like Avro, Parquet, JSON, ORC, and text to enable efficient storage and processing across Hadoop and RDBMS environments.

•Built and maintained CI/CD frameworks for data pipelines using Jenkins, Git, and Azure DevOps, ensuring automated and seamless deployments. Deployed containerized data workloads using Kubernetes, improving scalability and fault tolerance in cloud-based data engineering projects.

•Developed Power BI reports and dashboards, enabling data-driven decision-making across multiple business units.

•Skilled in implementing data quality checks and cleansing techniques to ensure data accuracy and integrity throughout the pipeline.

•Replaced AWS CloudWatch with Google Cloud Monitoring, providing real-time monitoring, logging, and alerting for GCP infrastructure, improving proactive issue detection and system performance optimization.

•Skilled in cloud-based data solutions with Azure and hybrid architectures, integrating on-premise and cloud data sources while maintaining security and compliance standards.

•Worked in Agile environments, participating in sprint planning, daily stand-ups, and iterative development to streamline data engineering processes and ensure continuous improvements.

TECHNICAL SKILLS:

Big Data Technologies

HDFS, MapReduce, Hive, Sqoop, Oozie, Zookeeper, Kafka, Apache Spark, Spark

Streaming, Apache Spark, Hadoop, Hive, Control-M, Airflow, Informatica

Hadoop Distribution

Cloudera, Horton Works

Cloud Services

Azure Data Factory, Databricks, Logic Apps, Functional App, Synapse Analytics,

HDInsight, Stream Analytics, EventHub, Azure Data Lake Power Bi, Snowflake, Azure DevOps, Azure Data Lake Storage Gen 2 & Gen 1

Languages

Java, SQL, PL/SQL, Python, HiveQL, Scala, MS SQL, C, C++, HTML, Spark, T-SQL

Shell Scripting

Bash, PowerShell, Azure CLI

Operating Systems

Windows (XP/7/8/10), UNIX, LINUX, UBUNTU, CENTOS

Automation Tools

Ant, Maven, Terraform, Teradata, Redshift, Snowflake

Version Control & CI/CD

GIT, GitHub, Jenkins, Bitbucket, GitLab, Azure DevOps

Tools

Eclipse, Visual Studio, Notepad ++, Share Point

IDE & Build Tools, Design

Databases

MS SQL Server 2016/2014/2012, Azure SQL DB, Azure Synapse, MS Excel, MS Access,

Oracle 11g/12c, Cosmos DB, PostgreSQL

Visualization Tools

Power BI, Tableau, SSRS, MS Excel

EDUCATION:

Bachelors In Electronics and Communication Engineering - Jawaharlal Nehru Technological University, India 2013.

PROFESSIONAL EXPERIENCE:

Client: Cisco, IL/Remote Aug 2023 - Present

Role: Azure Data Engineer

Responsibilities:

•Designed and implemented scalable data ingestion pipelines using Azure Data Factory to integrate SQL, Oracle, CSV, and REST API data into Azure Blob Storage, improving data accessibility and analytics.

•Designed and implemented data models for storing and organizing raw, transformed, and aggregated data in the Azure Data Lake, optimizing for cost and performance.

• Managed Control-M job scheduling, optimizing workflow dependencies and troubleshooting failures.

•Developed optimized data processing workflows in Azure Databricks using Spark, accelerating real-time analytics for operational reporting.

•Developed and maintained large-scale data solutions using Azure Data Lake for storing and processing structured and unstructured data.

•Established data quality checks and cleansing protocols using Azure Data Factory and Databricks, ensuring high data accuracy.

•Created end-to-end data pipelines from data ingestion to transformation using Azure Data Factory, moving data from on-premises databases and third-party systems to Azure Data Lake and Azure SQL Database.

•Developed end-to-end ETL pipelines using copy activity for data movement and lookup activity for validation, leveraging linked services to connect on-premise and cloud data sources.

•Implemented optimized query techniques and indexing strategies for improved data fetching, utilizing SQL and ADLS Gen2 for scalable storage.

•Integrated Snowflake with Azure cloud services, enabling secure data warehousing for strategic marketing analysis.

•Architected real-time data processing solutions using Kafka and Spark Streaming, supporting high-volume streaming analytics.

•Implemented machine learning models in Databricks to analyze network traffic patterns and detect anomalies, improving security monitoring.

•Integrated PySpark with Azure Data Factory and Azure Blob Storage, ensuring seamless data ingestion and processing.

•Designed Databricks Lakehouse solutions for centralized network traffic, customer usage, and IoT device logs.

•Deployed Delta Lake for real-time event streams, ensuring fault tolerance and ACID compliance.

•Applied data warehousing techniques such as SCD handling, surrogate key assignment, and CDC for Snowflake modeling.

•Optimized PySpark jobs using partitioning, caching, and broadcast joins, reducing processing times by 25%.

•Developed Azure Functions to extract, transform, and load data from databases, APIs, and file systems using ADLS Gen2.

•Created complex SQL queries and data models in Azure Synapse Analytics, enhancing big data processing.

•Built ETL workflows using SSIS for integrating multiple data sources into SQL Server databases.

•Developed ETL transformations and validations using Spark SQL and Azure Databricks, leveraging lookup activity for data accuracy.

•Automated Databricks cluster deployment using Terraform for consistent, repeatable deployments.

•Built and optimized data models using Hive, HBase, and Snowflake to support efficient analytics and reporting.

•Integrated GitHub with Azure DevOps and Pipelines to enhance collaboration and automate deployment workflows.

•Collaborated with Azure DevOps teams to improve CI/CD pipeline efficiency.

•Developed interactive Power BI dashboards, providing stakeholders with actionable data insights.

•Leveraged ML algorithms to generate predictive insights from operational data, enhancing decision-making capabilities.

•Partnered with data analysts and business teams to develop customized reports driving strategic business decision.

Environment: Azure Data bricks, Azure Data Lake Data Factory, Logic Apps, Functional App, Snowflake, MS SQL, Oracle, Spark, Hive, SQL, Python, Scala, PySpark, Shell scripting, GIT, JIRA, Jenkins, Kafka, ADF Pipeline, Control-M, Apache Airflow, AWS, Snowflake, Redshift, Hadoop, Spark, Kafka, Tableau, Power BI, SQL, Python, Git, Docker, Kubernetes

Client: CIBC Bank, IL Aug 2021 - Jul 2023

Role: Azure Data Engineer Responsibilities:

•Developed and documented use cases performed requirement analysis, and prepared comprehensive project documentation.

•Designed and developed data pipelines for ingesting and transforming massive datasets from multiple sources into Azure Data Lake using Azure Data Factory.

•Designed and maintained web applications as per the Project Lead’s specifications, ensuring seamless integration with backend services.

•Applied Apache Spark with Databricks for big data analytics and optimized data processing, allowing for faster insights and predictive modeling.

•Designed and developed RESTful Web Services APIs to support integration with the company’s website.

•Developed Azure Synapse Analytics-based data warehousing solutions, optimizing query performance and storage through partitioning, indexing, and materialized views.

•Integrated MLflow with Azure Databricks, enabling experiment tracking, model lifecycle management, and version control for predictive analytics.

• Automated data transformations and optimized data storage using Azure Databricks and SQL queries to create data-ready formats for downstream analytics and reporting.

•Conducted performance tuning of Spark jobs to optimize resource usage and reduce cost in Azure Databricks.

•Collaborated with data scientists and business intelligence teams to create automated data pipelines for machine learning models using Databricks Notebooks and MLflow.

•Designed and implemented scalable ETL pipelines using Azure Data Factory (ADF), Azure Databricks, and PySpark, improving data ingestion and transformation efficiency.

•Automated machine learning model deployments using MLflow Models and Azure DevOps, streamlining transition from development to production.

•Built real-time data pipelines using Azure Event Hubs, Apache Kafka, and Spark Streaming, enabling instant financial transaction analysis.

•Migrated on-prem SQL workloads to Azure SQL Database and Cosmos DB, ensuring 99.9% availability and performance optimization.

•Optimized Spark jobs on Azure Databricks, reducing execution time by 30% through caching, optimized shuffle operations, and broadcast joins.

•Developed Delta Lake architecture (Bronze, Silver, Gold layers) to enhance data integrity, ACID compliance, and real-time updates.

•Designed, developed, and maintained ETL pipelines using PySpark and AWS Glue to process and load metadata into Enterprise Data Catalog (EDC).

•Managed and optimized email marketing campaigns using HTML and CSS, ensuring high engagement rates.

•Leveraged extensive experience with AWS services (EC2, VPC, S3, IAM, RDS, ELB, Route53, Cloud Watch, CloudFormation, Elastic Beanstalk, DynamoDB, etc.) for scalable cloud solutions

•Developed CI/CD pipelines with Azure DevOps, automating ETL deployments and improving operational efficiency.

•Built Power BI dashboards for real-time financial insights, integrating data from Synapse, ADLS, and Azure SQL Database.

•Applied Data Lakehouse architecture principles to unify structured and semi-structured data for customer behavior analytics.

•Supported real-time data processing using Azure Stream Analytics and Event Hubs for processing live event data streams.

•Optimized SQL queries for Azure Synapse Analytics, improving the performance of reporting queries and reducing processing time by 80%.

•Conducted schema evolution in Snowflake, implementing time travel and SCD Type 2 for historical financial reporting.

•Improved data security and governance by implementing RBAC, Managed Identities, and encryption policies in ADLS Gen2 and Synapse.

•Integrated Azure Functions with Event Grid for serverless data processing, improving scalability and cost efficiency.

•Designed and maintained Kubernetes clusters on Azure Kubernetes Service (AKS) to deploy containerized data engineering workloads.

Environment: Azure Data Factory (ADF), Azure Synapse Analytics, Azure Databricks, Azure SQL Database, Cosmos DB, ADLS Gen2, Power BI, PySpark, MLflow, Apache Kafka, Azure Event Hubs, Delta Lake, Azure DevOps, Azure Functions, Kubernetes (AKS), Snowflake, Terraform, PowerShell, CI/CD Pipelines

Client: Delta, GA Mar 2020 – Jul 2021

Role: Data Engineer

Responsibilities:

•Designed and implemented an Enterprise Data Lake to support analytics, real-time processing, and reporting for high-volume, rapidly changing data, improving business decision-making.

•Developed a reusable ETL framework to automate migration from RDBMS to the Data Lake, leveraging Spark and Hive, reducing manual effort by 30%.

•Built and managed data pipelines utilizing Control-M, Airflow, and Python-based ETL scripts.

•Automated ETL processes to integrate data into OLAP systems, ensuring timely and accurate financial data for business operations.

•Built data pipelines and complex transformations using Azure Data Factory and PySpark with Azure Databricks.

•Facilitated secure data sharing between airlines, airport authorities, and travel agencies using Delta Sharing in Databricks, enhancing collaboration and efficiency.

•Developed Databricks workflows to analyze historical and real-time booking data, optimizing dynamic pricing models for over 200 airline routes and increasing revenue by 15%.

•Built Kubernetes-based containerized workflows to scale ETL pipelines, ensuring high availability and efficient resource utilization for data processing workloads.

•Developed frameworks to ingest and process aviation weather data, historical route efficiency, and passenger preferences for optimized route planning.

•Worked with Azure BLOB and Data Lake Storage, loading data into Azure SQL Synapse Analytics (DW).

•Created PySpark-based models to analyze ticketing and pricing data, enhancing revenue management strategies through demand trend analysis.

•Integrated multiple data sources using PySpark to create a unified view of passenger profiles, improving personalized customer experiences.

•Worked on Apache Spark (SQL & Streaming) for intraday and real-time data processing.

•Implemented Kerberos authentication for secure network communication on clusters, ensuring compliance with enterprise security standards.

•Optimized Spark performance using partitioning, bucketing, broadcast joins, and distributed caching.

•Developed Tableau dashboards to visualize complex airline datasets, enabling data-driven decision-making for operational efficiency.

•Implemented Agile methodologies by participating in sprint planning, daily stand-ups, and retrospectives, ensuring continuous delivery of high-quality data solutions.

•Developed and deployed tabular models on Azure Analysis Services to meet business reporting requirements.

•Automated data pipeline deployment using Git-integrated CI/CD pipelines, improving testing and deployment efficiency.

Environment: Azure, Azure Data Factory, Databricks, PySpark, Python, Apache Spark, HBase, HIVE, SQOOP, Snowflake, SSIS, Tableau Control-M, Apache Airflow, AWS, Snowflake, Redshift, Hadoop, Spark, Kafka, Tableau, Power BI, SQL, Python, Git, Docker, Kubernetes

Client: Wells Fargo, TX May 2018- Feb 2020

Role: Big Data Developer

Responsibilities:

•Participated in gathering business requirements and defining the scope of Big Data Analytics to enhance manufacturing and financial data processing.

•Designed and developed data lake applications to transform and prepare structured/unstructured data for business analytics and reporting.

•Developed and optimized SSIS packages to efficiently process large-scale financial data, including transaction records, customer databases, and financial statements.

•Worked extensively with Hadoop architecture and ecosystem components such as HDFS, MapReduce, NameNode, DataNode, and YARN for large-scale data storage and processing.

•Built data pipelines using Flume, Sqoop, and Pig to ingest and process customer behavioral data and purchase histories into HDFS for analytics.

•Developed Spark SQL queries to load, transform, and structure JSON data, creating schema RDDs and Hive tables for efficient querying.

•Utilized Hive transformations, event joins, and pre-aggregations before storing structured data in HDFS, ensuring optimal performance.

•Engineered and optimized stored procedures and T-SQL functions for high-performance data processing in financial and patient data management systems.

•Implemented Apache Oozie workflows to automate data pipeline orchestration and task scheduling.

•Designed and developed MapReduce programs to process and parse large volumes of log files, structuring data for efficient querying and analysis.

•Automated end-to-end data management and cluster synchronization, optimizing data consistency across distributed environments.

•Established Fair Schedulers on Hadoop clusters, ensuring efficient resource allocation for MapReduce jobs.

•Integrated CI/CD pipelines using Jenkins, Maven, Nexus, and GitHub, streamlining deployment and version control for big data applications.

•Followed Agile methodologies, participating in sprint planning, daily stand-ups, and retrospectives, ensuring continuous delivery of scalable data solutions.

Environment: Cloudera CDH three-fourths, Hadoop, HDFS, MapReduce, Hive, Oozie, Pig, Shell Scripting, MySQL, Control-M, Apache Airflow, AWS, Azure, Snowflake, Redshift, Hadoop, Spark, Kafka, Tableau, Power BI, SQL, Python, Git, Docker, Kubernetes.

Client: ADP, India May 2013 - Dec 2016

Role: SQL / ETL Developer

Responsibilities:

•Developed complex stored procedures, triggers, and indexed views to optimize SQL Server performance and enhance query efficiency.

•Designed and implemented ETL data flows using SSIS, extracting data from heterogeneous sources (SQL Server, Access, Excel) and transforming it into data warehouse integration.

•Built OLAP cubes to support complex analytical queries, optimizing data retrieval and enhancing business intelligence reporting.

•Developed SSIS ETL processes for automated data integration, ensuring accurate and timely data flow into the data warehouse and SSAS cubes.

•Designed and optimized dimensional data models for data marts, identifying facts and dimensions while implementing Slowly Changing Dimensions (SCDs).

•Implemented T-SQL scripts for data validation, cleansing, and reconciliation, ensuring data quality and integrity.

•Automated report generation and cube refresh packages using SSIS jobs, improving reporting efficiency and reducing manual effort.

•Leveraged ML-based anomaly detection in ETL workflows to identify data inconsistencies and improve error handling in data pipelines.

•Integrated SQL Server Reporting Services (SSRS) to develop and deploy interactive and automated reports, supporting business intelligence initiatives.

•Followed Agile methodology, participating in sprint planning, stand-ups, and retrospectives to enhance collaboration and timely delivery of data solutions.

•Optimized OLAP cube performance using aggregation strategies, partitioning, and indexing, improving query execution times and overall reporting efficiency.

•Developed Kubernetes-based containerized deployments for ETL processes, ensuring scalability and streamlined data processing.

•Integrated Git and Team Foundation Server (TFS) for version control, ensuring smooth collaboration and CI/CD pipeline efficiency.

Environment: MS SQL Server 2016, Visual Studio 2017/2019, SSIS, SSRS, T- SQL, Share point, MS Access, Team Foundation server, Git.

Contact this candidate