Sr Azure Data Engineer

Location:
Denton, TX, 76201
Posted:
May 21, 2025

KALADHAR G

Sr. Azure Data Engineer

Email: ***********@*******.***

Phone: +1-469-***-****

CAREER OBJECTIVE

Experienced Data Engineer with 8 years of expertise in designing, building, and optimizing scalable data pipelines and architectures. Skilled in handling large datasets, real-time data processing, and cloud-based solutions across AWS, Azure, and GCP. Proficient in SQL, Python, Spark, and Kafka, with a strong focus on ETL development, data modeling, and workflow automation. Seeking a challenging role where I can leverage my technical skills to improve data infrastructure, enhance processing efficiency, and support data-driven decision-making.

PROFESSIONAL SUMMARY

Designed and developed end-to-end ETL pipelines using Azure Data Factory, migrating data from on-premises sources to Azure SQL, Data Lake, and Blob Storage to enable scalable storage and analytics for large-scale enterprise applications.

Developed PySpark-based ETL solutions for processing structured and semi-structured data, optimizing workflows to enhance efficiency, reduce latency, and improve data transformation speeds for analytics and reporting.

Created complex Hive queries to extract, transform, and load data into HDFS and Azure Data Lake, implementing partitioning and indexing strategies to significantly boost data retrieval speed and query performance.
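
A minimal PySpark sketch of this partitioning pattern, assuming hypothetical storage paths, table names, and columns:

    from pyspark.sql import SparkSession

    # Start a Spark session with Hive support so tables land in the metastore.
    spark = (SparkSession.builder
             .appName("hive-partitioned-load")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical source data already staged in the data lake.
    events = spark.read.parquet("abfss://raw@datalake.dfs.core.windows.net/events/")

    # Write a Hive table partitioned by event_date so date-filtered queries
    # only scan the relevant partitions.
    (events.write
        .mode("overwrite")
        .partitionBy("event_date")
        .format("parquet")
        .saveAsTable("analytics.events_partitioned"))

    # Queries filtering on the partition column benefit from partition pruning.
    daily = spark.sql("""
        SELECT customer_id, COUNT(*) AS event_count
        FROM analytics.events_partitioned
        WHERE event_date = '2024-01-15'
        GROUP BY customer_id
    """)
    daily.show()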

Built custom Hadoop applications using the Spark framework to optimize big data analytics in Azure, ensuring efficient processing of large-scale datasets for business intelligence and predictive analytics.

Integrated Azure Data Lake with Azure Databricks for real-time data processing and analytics, enabling advanced machine learning models with seamless access to structured and unstructured datasets.

Leveraged Azure Analysis Services to design optimized tabular models for business intelligence, improving data modeling, reporting capabilities, and decision-making processes.

Developed real-time data ingestion pipelines for high-velocity streaming data using Kafka and Spark Streaming, ensuring low-latency processing and near-instant data availability.
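
A compact sketch of this kind of ingestion path using Spark Structured Streaming; broker, topic, schema, and path names are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("kafka-stream-ingest").getOrCreate()

    # Expected shape of each Kafka message (hypothetical schema).
    schema = StructType([
        StructField("order_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Read from Kafka as a streaming DataFrame.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "orders")
           .option("startingOffsets", "latest")
           .load())

    # Kafka delivers the payload as bytes; cast to string and parse the JSON body.
    orders = (raw.select(from_json(col("value").cast("string"), schema).alias("o"))
                 .select("o.*"))

    # Land the parsed stream in the data lake, with checkpointing for reliable file output.
    query = (orders.writeStream
             .format("parquet")
             .option("path", "/mnt/datalake/curated/orders")
             .option("checkpointLocation", "/mnt/datalake/checkpoints/orders")
             .outputMode("append")
             .start())

    query.awaitTermination()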

Implemented Lambda architectures combining Azure Data Lake, Azure SQL, and Azure Machine Learning, integrating batch and real-time processing to enable scalable and flexible data analytics solutions.

Utilized Azure Data Lake Analytics and HDInsight/Databricks for ad hoc analysis and reporting, providing fast, interactive querying of massive datasets for real-time business insights.

Built Spark applications to aggregate and analyze customer behavior data, applying machine learning models to predict customer trends, preferences, and engagement patterns.

Managed Azure Data Lake Storage (ADLS) and Databricks Delta Lake, optimizing data processing workflows for scalability, efficiency, and seamless transformation of structured and semi-structured data.

Automated HDFS and Hive data loading using Oozie workflows, reducing manual intervention and enhancing the reliability of data pipelines in production environments.

Designed Azure Data Factory pipelines to facilitate Data Science and AI projects, enabling seamless data movement and transformation for machine learning training and inferencing.

Implemented Role-Based Access Control (RBAC) and security measures for Azure resources, ensuring data governance compliance and preventing unauthorized access to sensitive data.

Designed and deployed AWS-based big data solutions for real-time analytics, constructing scalable data pipelines to process and analyze petabyte-scale datasets efficiently.

Migrated on-premises ETL workloads to AWS, utilizing AWS Glue, Lambda, and S3, automating data transformations and enhancing pipeline performance for cloud-native analytics.

Built data lakes on AWS S3 to support large-scale storage and analytics, enabling efficient querying using AWS Athena and Redshift Spectrum for cost-effective insights.

Designed Redshift-based data warehouses, optimizing query execution with indexing, distribution keys, and column encoding to significantly improve analytics performance.
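
An illustrative sketch of this kind of Redshift table design, issued from Python; the cluster, credentials, table, and key choices are all hypothetical:

    import psycopg2

    # Connection details are placeholders for an existing Redshift cluster.
    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="etl_user",
        password="********",
    )

    ddl = """
    CREATE TABLE IF NOT EXISTS sales_fact (
        sale_id      BIGINT        ENCODE az64,
        customer_id  BIGINT        ENCODE az64,
        sale_date    DATE          ENCODE az64,
        amount       DECIMAL(12,2) ENCODE az64,
        channel      VARCHAR(32)   ENCODE lzo
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)   -- co-locate rows joined on customer_id on the same slice
    SORTKEY (sale_date);    -- date-range scans skip blocks outside the filter
    """

    with conn, conn.cursor() as cur:
        cur.execute(ddl)
    conn.close()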

Developed real-time streaming pipelines using AWS Kinesis, Apache Spark, and Kafka, enabling low-latency data processing for business-critical insights and decisions.

Processed large-scale datasets using AWS EMR, Spark, and Hadoop, optimizing cluster configurations and auto-scaling to enhance processing speeds and resource efficiency.

Built serverless ETL workflows using AWS Lambda and Step Functions, reducing infrastructure costs while maintaining high availability and seamless orchestration.
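
A minimal sketch of one Lambda transform step that a Step Functions state machine might invoke; the event fields, bucket layout, and business rule are illustrative:

    import json
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        """One transform step in a Step Functions-driven ETL flow.

        The state machine passes the object location in the event payload;
        bucket and prefix names here are hypothetical.
        """
        bucket = event["bucket"]
        key = event["key"]

        # Read the raw JSON records staged by the previous state.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        records = json.loads(body)

        # Simple transformation: keep only completed orders and normalize field names.
        cleaned = [
            {"order_id": r["id"], "amount": round(float(r["total"]), 2)}
            for r in records
            if r.get("status") == "COMPLETED"
        ]

        # Write the curated output where the next state expects it.
        out_key = key.replace("raw/", "curated/")
        s3.put_object(Bucket=bucket, Key=out_key, Body=json.dumps(cleaned).encode("utf-8"))

        # The returned payload becomes the input of the next Step Functions state.
        return {"bucket": bucket, "key": out_key, "record_count": len(cleaned)}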

Implemented fine-grained security frameworks using AWS IAM roles and policies, ensuring access control and compliance with security best practices across cloud environments.

Created custom UDFs in PySpark and Scala for advanced data transformations, building reusable functions that enhanced ETL pipelines with complex business logic.

Developed data dashboards and reports using Power BI, Tableau, and Kibana, providing real-time visibility into key metrics, operational performance, and business trends.

Integrated AWS SageMaker with data pipelines for machine learning model training, automating model deployment and inferencing to support AI-driven applications.

Designed ETL frameworks supporting multi-stage ML workflows, ensuring seamless data preparation, feature engineering, model training, and evaluation.

Designed and maintained MySQL and MongoDB databases for customer review data storage, optimizing queries and indexing to enhance database performance and accessibility.

Developed SQL-based ETL pipelines to filter, aggregate, and join datasets, ensuring efficient data transformation for business intelligence and reporting.

Built data ingestion workflows using Python and Hadoop on AWS EC2, automating data collection and preprocessing for high-volume data sources.

Conducted exploratory data analysis (EDA) to identify patterns and trends, using statistical techniques and visualization tools to uncover insights from data.

Built machine learning models using Logistic Regression, SVMs, Random Forest, and Naive Bayes, applying model evaluation techniques to ensure accuracy and performance.

Built and maintained CI/CD pipelines for data workflows using Jenkins, Airflow, and Git, ensuring seamless deployment, automation, and version control.

Developed RESTful APIs using Flask and AWS Lambda, enabling secure and efficient third-party integrations for real-time data exchange between applications.

Worked in Agile environments, collaborating with cross-functional teams, driving data-driven decision-making through real-time analytics and continuous delivery.

Prepared project status reports for stakeholders and management, communicating key milestones, risks, and progress updates to ensure successful project execution.

Leveraged Amazon Q Business to streamline access to AWS documentation and service-related insights during the development lifecycle, reducing dependency on external documentation and accelerating troubleshooting across services like Kinesis, Glue, and S3.

Utilized Amazon Bedrock’s orchestration capabilities to simplify integration testing workflows involving multiple AWS services, particularly when validating data flow across ingestion, transformation, and storage layers.

Designed and developed advanced MLOps pipelines using Azure ML, Azure DevOps, and Databricks, enabling continuous integration and deployment (CI/CD) for machine learning models across development, staging, and production environments.

Designed scalable architecture for vLLM-based services that support data annotation, entity recognition, and summarization at runtime, embedding these services into existing Spark-based analytics workflows for data enrichment.

PROFESSIONAL EXPERIENCE

Navy Federal Credit Union

Sr. Azure Data Engineer September 2023 – Present.

Designed and developed Hadoop-based Big Data analytics solutions, engaging clients in technical discussions to ensure the system met their requirements and provided robust data processing capabilities.

Worked extensively with multiple Azure platforms, including Azure Data Factory, Data Lake, SQL Database, and HDInsight, leveraging them for seamless data integration, storage, and processing in cloud environments.

Developed and implemented custom Hadoop applications in the Azure environment, enabling efficient data processing and integration with Azure's advanced data analytics tools.

Created Azure Data Factory Pipelines for loading data from on-premises systems to Azure SQL Server and Data Lake, automating the process of data migration and ensuring data consistency across platforms.

Developed complex Hive queries to extract data from various sources within the Data Lake, ensuring efficient data storage and retrieval in HDFS for subsequent analysis.

Used Azure Data Lake Analytics, HDInsight, and Databricks to conduct ad-hoc analysis, providing timely insights and generating actionable reports for business decision-making.

Built custom ETL solutions, implemented batch processing, and developed real-time data ingestion pipelines using PySpark and shell scripting to efficiently move and transform data within the Hadoop ecosystem.

Implemented large-scale Lambda architectures using Azure services, enabling scalable, fault-tolerant data processing pipelines that supported real-time analytics and machine learning applications.

Worked with multiple Azure services for data ingestion, including Data Lake, SQL Server, and Data Warehouse, processing data in Azure Databricks for further analytics and transformation.

Contributed to all aspects of data management, including collection, cleaning, model development, and validation, applying best practices to ensure data integrity and accuracy across various stages.

Gained hands-on experience managing Azure Data Lake Storage and Databricks Delta Lake, integrating them with other Azure services to optimize data processing and storage solutions.

Developed and managed data pipelines using Azure Data Factory and Databricks to automate the flow of data from various sources to Azure Data Lake, ensuring smooth data processing and accessibility.

Utilized Sqoop to import and export enterprise data into HDFS, transforming the data with Hive and MapReduce and loading it into HBase tables for further processing.

Responsible for cluster sizing, monitoring, and troubleshooting Hadoop clusters, ensuring optimal performance and minimizing downtime during data processing tasks.

Used Zeppelin, Jupyter notebooks, and Spark-Shell to develop and test Spark jobs before scheduling customized jobs for large-scale data processing tasks.

Worked with Azure Blob and Data Lake storage to manage and load data into Azure Synapse Analytics, supporting complex data processing and analysis.

Applied Hive tuning techniques like partitioning, bucketing, and memory optimization to improve query performance and ensure efficient data retrieval from large datasets.

Developed Spark applications using PySpark and Spark-SQL to extract, transform, and analyze data from various file formats, uncovering insights into customer usage patterns and trends.

Designed and built modern data solutions using Azure PaaS services, enabling effective visualization and analysis of data for business intelligence and decision-making.

Integrated Azure Data Lake Storage with Databricks to streamline the process of passing parameters dynamically between Azure Data Factory and Databricks, improving workflow automation.
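
A small sketch of how such parameters can be consumed on the Databricks side (where spark and dbutils are predefined in the notebook); widget names and paths are hypothetical and must match the base parameters configured on the ADF Notebook activity:

    # Databricks notebook cell: read parameters passed from the ADF Notebook activity.
    dbutils.widgets.text("run_date", "")
    dbutils.widgets.text("source_path", "")

    run_date = dbutils.widgets.get("run_date")
    source_path = dbutils.widgets.get("source_path")

    # Use the parameters to drive the load, so one notebook serves every daily run.
    df = spark.read.parquet(source_path)
    df_filtered = df.where(df.ingest_date == run_date)

    df_filtered.write.mode("overwrite").parquet(
        f"abfss://curated@datalake.dfs.core.windows.net/daily/{run_date}/"
    )

    # Return a value to ADF; it appears in the activity output for downstream steps.
    dbutils.notebook.exit(str(df_filtered.count()))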

Designed data lake storage solutions tailored for data science projects using Azure Data Factory Pipelines, enabling scalable data processing and transformation capabilities.

Gained expertise in integrating data storage options with Spark, particularly leveraging Azure Data Lake Storage and Blob storage for efficient data processing.

Hands-on experience in creating and managing Spark clusters in both HDInsight and Azure Databricks environments, supporting batch and continuous streaming data processing needs.

Automated data loading processes using Oozie workflows, integrating HDFS and Hive to streamline the ETL pipeline and improve operational efficiency.

Created NoSQL tables using HBase to manage semi-structured data and handled large volumes of data migration and transformation for processing in Hadoop ecosystems.

Provisioned and managed Databricks clusters for both batch and continuous data processing, installing necessary libraries and ensuring performance optimization.

Worked on creating tabular models in Azure Analysis Services to support business reporting and enable advanced data analytics for decision-making.

Developed and deployed Business Intelligence solutions using SSIS, SSRS, and SSAS, ensuring timely data movement and visualization for reporting purposes.

Implemented complex MapReduce tasks in Scala for data cleansing and analysis in Impala, ensuring data quality and performance for analytics.

Retrieved live streaming data using Spark Streaming and Kinesis, enabling real-time data processing and analytics to support business operations.

Imported and exported data between HDFS and relational databases using Sqoop, transforming data into Hive tables and partitioning it for optimal storage and retrieval.

Integrated Large Language Models (LLMs) like GPT-4 via Azure OpenAI Service to automate business workflows, including intelligent document summarization, chatbot integration, and customer support automation.
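
A minimal sketch of this kind of summarization call using the openai Python SDK's AzureOpenAI client; the endpoint, API version, and deployment name are placeholders:

    import os
    from openai import AzureOpenAI

    # Placeholders for an Azure OpenAI resource with a GPT-4 deployment.
    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-02-01",
    )

    def summarize(document: str) -> str:
        """Return a short summary of a document using the deployed GPT-4 model."""
        response = client.chat.completions.create(
            model="gpt-4-summarizer",  # Azure deployment name, not the base model name
            messages=[
                {"role": "system", "content": "Summarize the document in three sentences."},
                {"role": "user", "content": document},
            ],
            temperature=0.2,
        )
        return response.choices[0].message.content

    print(summarize("Quarterly servicing report text goes here..."))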

Implemented real-time GenAI APIs using Flask, Azure Functions, and Azure API Management, enabling external applications to interact with fine-tuned GPT models deployed on Azure infrastructure.

Pepsico

Sr. AWS Data Engineer January 2021 – July 2023.

Led initiatives that utilized Spark, SQL, and AWS cloud services to design and implement data solutions from ingestion through transformation, ensuring efficient processing at scale for both real-time and batch business requirements.

Designed and managed AWS infrastructure in a managed service environment, ensuring its scalability and reliability for large data workloads, while maintaining performance and security standards to support growing business demands.

Developed data governance frameworks that brought structure and consistency to previously unregulated data environments, ensuring data integrity and compliance across various departments and aligning data management practices with industry standards.

Managed the end-to-end process of data ingestion, transformation, and processing using AWS tools like Kinesis, SQL, Glue, and Spark, facilitating seamless data flow and enabling real-time analytics for business decision-making.

Participated in requirement gathering sessions to understand key business needs, working closely with system analysts to analyze upstream data formats and patterns, and using this input to develop robust data pipelines and solutions that aligned with organizational goals.

Led the migration of data from traditional relational databases to NoSQL systems, streamlining access and improving processing efficiency across systems, and creating a unified view of data that enhanced data accessibility and performance.

Designed and implemented full-scale data solutions for storage, integration, processing, and visualization on AWS, creating a cohesive analytics ecosystem that empowered data-driven decision-making and improved overall business insights.

Oversaw the migration of datasets and ETL processes from on-premises infrastructure to AWS using Python, improving data processing speed, scalability, and the overall efficiency of data pipelines and workflow automation.

Developed PySpark-based data pipelines leveraging Spark DataFrame operations, processing vast amounts of data on AWS EMR and storing results in S3, enabling highly scalable and efficient processing for large data volumes.
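
A brief sketch of this DataFrame pattern on EMR; the bucket names, columns, and aggregation are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("emr-orders-aggregation").getOrCreate()

    # Source and target locations are placeholders for the actual S3 buckets.
    orders = spark.read.parquet("s3://example-raw-bucket/orders/")

    # DataFrame transformations: drop bad records, then aggregate daily revenue by region.
    daily_revenue = (orders
        .where(F.col("amount") > 0)
        .withColumn("order_date", F.to_date("order_timestamp"))
        .groupBy("region", "order_date")
        .agg(F.sum("amount").alias("revenue"),
             F.countDistinct("customer_id").alias("customers")))

    # Partition the output by date so downstream Athena/Redshift Spectrum queries can prune.
    (daily_revenue.write
        .mode("overwrite")
        .partitionBy("order_date")
        .parquet("s3://example-curated-bucket/daily_revenue/"))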

Set up an Enterprise Data Lake on AWS to centralize data from multiple sources, supporting large-scale data processing, analytics, and reporting, which streamlined data access and facilitated real-time analytics for improved decision-making.

Partnered with clients and solution architects to ensure data quality through consistent data cleansing, transformation, and integrity maintenance in relational databases, ensuring the provision of high-quality data for business intelligence applications.

Configured and managed self-hosted integration runtimes on virtual machines, enabling secure access to private networks and improving data security and workflow flexibility, ensuring seamless data integration while maintaining privacy standards.

Created comprehensive data visualizations and dashboards using Power BI, delivering clear, actionable insights from complex datasets that simplified data analysis and helped decision-makers identify emerging trends.

Built and managed Apache Airflow workflows on AWS, automating multi-stage machine learning processes, integrating Amazon SageMaker for model training and evaluation, and reducing manual intervention in data processing pipelines.
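
A minimal Airflow 2.x sketch of this orchestration shape; the DAG id, task bodies, and S3 locations are hypothetical, and the SageMaker call is left as a placeholder rather than a full job definition:

    from datetime import datetime

    import boto3
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def stage_training_data(**_):
        # Placeholder: pull curated training data and stage it for SageMaker.
        print("staging training data to s3://example-ml-bucket/train/")

    def start_training(**_):
        # Placeholder: a real implementation would call
        # boto3.client("sagemaker").create_training_job(...) with the job definition.
        sagemaker = boto3.client("sagemaker")
        print("would start a SageMaker training job here")

    with DAG(
        dag_id="ml_training_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        stage = PythonOperator(task_id="stage_training_data", python_callable=stage_training_data)
        train = PythonOperator(task_id="start_sagemaker_training", python_callable=start_training)

        # Staging must finish before training starts.
        stage >> train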

Developed real-time streaming data pipelines using Apache Spark and Python, processing live data streams to deliver immediate insights, enabling quick decision-making and ensuring business agility.

Designed and implemented a security framework for AWS S3, leveraging AWS Lambda to enforce fine-grained access control, ensuring the security and compliance of sensitive data stored in the cloud.

Utilized AWS EMR to move and process large datasets across AWS services like S3 and DynamoDB, enabling seamless data processing at scale and supporting the efficient storage and analysis of big data workloads.

Built AWS Lambda functions and Step Functions to automate complex data pipeline workflows, improving efficiency and reducing manual oversight, ensuring seamless operation of data processing tasks.

Developed PL/SQL packages, database triggers, and comprehensive user manuals, providing detailed documentation for new programs and ensuring the smooth implementation and understanding of new data processes.

Created custom User-Defined Functions (UDFs) in Scala and PySpark to meet specific business requirements, enabling tailored data transformations that enhanced the flexibility and scalability of the data processing pipeline.
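
A short PySpark sketch of this UDF pattern; the masking rule and column names are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-example").getOrCreate()

    def mask_account(account_number: str) -> str:
        """Hypothetical business rule: expose only the last four digits."""
        if account_number is None or len(account_number) < 4:
            return None
        return "****" + account_number[-4:]

    # Register the Python function as a Spark UDF so it runs inside DataFrame pipelines.
    mask_account_udf = udf(mask_account, StringType())

    accounts = spark.createDataFrame(
        [("1234567890", "retail"), ("9876543210", "commercial")],
        ["account_number", "segment"],
    )

    masked = accounts.withColumn("account_masked", mask_account_udf(col("account_number")))
    masked.show()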

Collaborated in an Agile environment, managing tasks and user stories using Rally, ensuring project deadlines and priorities were met and promoting a dynamic and responsive development process that adapted to evolving business needs.

Utilized Informatica tools such as PowerCenter, MDM, and Workflow Monitor for comprehensive data management tasks, optimizing data integration processes and ensuring seamless data flow and consistency across systems.

Built Kibana dashboards and integrated various source and target systems into Elasticsearch, providing real-time tracking of transactions and improving visibility across departments to support timely decision-making based on live data.

Assisted enterprise data teams in maintaining Hadoop environments, including applying updates, patches, and resolving issues post-upgrade, ensuring continuous operation and optimal performance of Hadoop clusters.

Implemented test scripts for test-driven development and continuous integration, enhancing the quality and reliability of data processing workflows, minimizing errors, and improving overall system stability.

Leveraged Spark for parallel data processing, optimizing performance and enabling faster insights from large datasets, reducing processing time, and significantly improving operational efficiency.

Extracted data from web sources using Python, enabling real-time analysis and reporting from external data sources, adding valuable dimensions to the data analysis process and enhancing insights for decision-making.

Conducted training sessions on Big Data technologies, empowering teams with the knowledge to leverage new tools and techniques for data management, improving their ability to handle large-scale data challenges and enhancing technical proficiency.

Designed and executed personalized marketing campaigns using UNICA, creating customized customer offers that boosted engagement and sales, improving customer segmentation and enhancing marketing effectiveness.

Fidelity Investments

Data Engineer May 2018 – December 2020.

Followed Agile methodologies throughout the software development life cycle (SDLC), ensuring effective collaboration at every project stage, including requirements gathering, design analysis, planning, development, and testing of applications.

Used PySpark for implementing data transformations, deploying them in Azure HDInsight for processes like data ingestion, hygiene, and identity resolution.

Built a file-based data lake solution using Azure Blob Storage, Azure Data Factory, and Azure HDInsight, with HBase used for efficient data storage and retrieval.

Applied business rules for deduplication of sales and marketing contacts using Spark transformations in both Spark Scala and PySpark.

Developed graph database nodes and relationships using the Cypher query language to efficiently model connections and data flows.

Created a Spark job that used Spark DataFrames to transform complex JSON documents into a flat file format, simplifying further analysis and processing.
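
A minimal sketch of this flattening approach; the nested fields and paths are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("flatten-json").getOrCreate()

    # Hypothetical input: one JSON document per line with a nested customer object
    # and an array of orders.
    docs = spark.read.json("/mnt/raw/customer_documents/")

    # Explode the orders array into one row per order, then pull nested fields
    # up into flat, analysis-friendly columns.
    flat = (docs
        .withColumn("order", explode(col("orders")))
        .select(
            col("customer.id").alias("customer_id"),
            col("customer.email").alias("email"),
            col("order.order_id").alias("order_id"),
            col("order.amount").alias("order_amount"),
        ))

    # Write the flattened result as CSV for downstream consumers expecting flat files.
    (flat.write
        .mode("overwrite")
        .option("header", "true")
        .csv("/mnt/curated/customer_orders_flat/"))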

Built microservices with AWS Lambda to handle API calls to third-party vendors like Melissa and StrikeIron for data verification and enrichment.

Developed batch processing and scheduling pipelines in Azure Data Factory to automate data workflows, improving efficiency in data management.

Designed and implemented machine learning pipelines in Azure ML Studio, using Python modules to run Naive Bayes and XGBoost classifiers for persona mapping.

Managed the ingestion of data from various sources, including Salesforce, SAP, SQL Server, and Teradata, into Azure Data Lake via Azure Data Factory, ensuring smooth integration.

Developed a deduplication module for sales and marketing contact data, removing redundant records to enhance data accuracy in a confidential project.

Executed complex Hive queries on Parquet tables stored in Hive to generate insights and conduct data analysis.

Created a REST API with the Flask framework in Python, allowing the front-end UI to interact with and consume backend data efficiently.
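
A small Flask sketch of this style of backend API; the routes and in-memory data stand in for the real data store:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # In-memory stand-in for the backend data the UI consumes.
    CONTACTS = {
        "c-1001": {"name": "Jane Doe", "segment": "enterprise"},
        "c-1002": {"name": "John Roe", "segment": "mid-market"},
    }

    @app.route("/api/contacts/<contact_id>", methods=["GET"])
    def get_contact(contact_id):
        contact = CONTACTS.get(contact_id)
        if contact is None:
            return jsonify({"error": "not found"}), 404
        return jsonify(contact)

    @app.route("/api/contacts", methods=["POST"])
    def create_contact():
        payload = request.get_json(force=True)
        contact_id = f"c-{len(CONTACTS) + 1001}"
        CONTACTS[contact_id] = payload
        return jsonify({"id": contact_id}), 201

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)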

Conducted thorough testing of the REST API using Python scripts to ensure the system was reliable and functional.

Wrote automation scripts in Azure Automation Runbooks to efficiently spin up, scale, and shut down HDInsight clusters, optimizing cluster management.

Stored the results of REST API calls in Redis to enable fast retrieval for repeated queries, improving overall performance.
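
A minimal caching sketch of this approach; the Redis connection, key scheme, TTL, and external endpoint are placeholders:

    import json

    import redis
    import requests

    # Hypothetical cache in front of a slower verification API.
    cache = redis.Redis(host="localhost", port=6379, db=0)

    def verify_address(address: str) -> dict:
        """Return a verification result, serving repeated queries from Redis."""
        key = f"verify:{address}"

        cached = cache.get(key)
        if cached is not None:
            # Cache hit: skip the external call entirely.
            return json.loads(cached)

        # Cache miss: call the (hypothetical) external verification endpoint.
        response = requests.get("https://api.example.com/verify", params={"address": address})
        result = response.json()

        # Keep the result for an hour so repeated lookups stay fast.
        cache.setex(key, 3600, json.dumps(result))
        return result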

Cigna Healthcare

Data Analyst May 2017 – April 2018.

Worked closely with the engineering team to design and manage MySQL databases for storing and retrieving customer review data.

Used SQL to develop ETL pipelines that filter, aggregate, and join multiple tables to extract relevant data from MySQL databases.

Imported, explored, cleaned, and combined data from MySQL and MongoDB databases hosted on AWS EC2, using Python and Hadoop for initial analysis and pattern discovery.

Provided business intelligence analysis to help the marketing team assess the project's impact on key performance metrics.

Queried data, performed statistical analysis, and generated reports and dashboards using R.

Regularly prepared and submitted project progress and status reports to the management team.

Designed engaging visualizations and dashboards in Tableau to present actionable insights.

Built feature engineering pipelines in Python for data preprocessing, including normalization, scaling, and tokenization of categorical variables. Also applied Principal Component Analysis (PCA) to reduce dimensionality.

Assisted in developing machine learning models using Python’s scikit-learn library, including Logistic Regression, Support Vector Machines (SVMs), Random Forest, and Naive Bayes models.
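
A compact scikit-learn sketch combining the preprocessing and modeling steps described above; the dataset is synthetic and every parameter is illustrative:

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Synthetic data standing in for the prepared review dataset.
    X, y = make_classification(n_samples=2000, n_features=40, n_informative=12, random_state=42)

    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "svm": SVC(),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
        "naive_bayes": GaussianNB(),
    }

    for name, model in models.items():
        # Scale features, reduce dimensionality with PCA, then fit the classifier.
        pipeline = Pipeline([
            ("scale", StandardScaler()),
            ("pca", PCA(n_components=10)),
            ("model", model),
        ])
        scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
        print(f"{name}: mean accuracy {scores.mean():.3f}")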

TECHNICAL SKILLS

Azure: Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Blob Storage, Azure SQL Database, Azure SQL Data Warehouse (Synapse Analytics), Azure Analysis Services, Azure ML

AWS: AWS EMR, AWS Glue, AWS Lambda, AWS Redshift, AWS RDS, AWS S3, AWS DynamoDB, Amazon SageMaker, AWS Step Functions

Big Data & Distributed Computing: Hadoop, HDFS, MapReduce, Apache Spark, HDInsight, Kafka, Oozie, NiFi

Data Integration & Pipelines: Apache Airflow, Apache NiFi, Informatica PowerCenter, UNICA, Jenkins, SSIS

ETL & Data Processing: Azure Data Factory, AWS Glue, Sqoop, Spark, Hive, HBase, Pig, MapReduce, Spark-SQL

Batch & Real-time Processing: Spark Streaming, AWS Kinesis, Azure Event Hubs

Programming & Scripting: Python, SQL, Scala, PySpark, Shell Scripting, Boto3

Relational Databases: MySQL, PostgreSQL, Azure SQL, AWS RDS, Redshift

NoSQL Databases: MongoDB, HBase, DynamoDB

Cloud Storage: Azure Data Lake, Azure Blob Storage, AWS S3

Data Lake Technologies: Databricks Delta Lake, Azure Synapse Analytics, Enterprise Data Lake

ML & Statistical Analysis: Scikit-learn, Amazon SageMaker, Azure ML

Feature Engineering: PCA, Normalization, Tokenization

ML Models: Logistic Regression, SVMs, Random Forest, Naive Bayes

BI & Data Visualization: Power BI, Tableau, Kibana, Matplotlib, Seaborn

Workflow & Automation: Apache Airflow, Oozie, Step Functions, Jenkins, Git, CI/CD

Security & Access Control: AWS IAM, AWS KMS, VPC, Fine-grained access control in S3

Development & Deployment: Docker, Git, Jenkins, Apache Zeppelin, Jupyter Notebooks


