Sai Ram

Sr. Data Engineer

612-***-****

************.***@*****.***

SUMMARY:

6+ years of expertise as a Sr. Azure Data Engineer in the design, development, maintenance, integration, and analysis of Big Data systems. Extensive expertise in data extraction, processing, storage, and analysis using cloud services and Big Data technologies such as MapReduce, Hive, Pig, HBase, Sqoop, Oozie, Flume, Spark, and Kafka. Experience working with cloud platforms including Microsoft Azure, Google Cloud Platform (GCP), and Amazon Web Services (AWS). Worked on a variety of projects including data lakes, migration, support, and application development.

Experience in developing, supporting, and maintaining ETL (Extract, Transform, and Load) processes using Informatica.

Experience in developing complex mappings, reusable transformations, sessions, and workflows using the Informatica ETL tool to extract data from various sources and load it into targets.

Proficiency in multiple databases like MongoDB, Cassandra, MySQL, ORACLE, and MS SQL Server.

Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.

Used various file formats such as Parquet, Sequence, JSON, and text for loading data, parsing, gathering, and performing transformations.

Good experience with the Hortonworks and Cloudera Apache Hadoop distributions.

Designed and created Hive external tables using shared meta-store with Static & Dynamic partitioning, bucketing, and indexing.

Configured and maintained fully automated pipelines with custom and integrated monitoring methods using Jenkins and Docker.

Experience in designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive.

Hands-on experience working with data analytics and Big Data services in the AWS cloud such as EMR, Redshift, Athena, DynamoDB, Kinesis, and Glue. In parallel, orchestrated data pipelines in AWS to optimize data processing.

Installed and configured Apache Airflow against an S3 bucket and a Snowflake data warehouse, and created DAGs to orchestrate the Airflow workflows.
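
A minimal sketch of such an Airflow DAG, assuming Airflow 2.x; the DAG name, schedule, and the load helper below are illustrative placeholders rather than actual project code:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def load_s3_to_snowflake():
        # Placeholder: in practice this would run a COPY INTO from an external
        # S3 stage into a Snowflake table (e.g. via the Snowflake Python connector).
        pass

    with DAG(
        dag_id="s3_to_snowflake_load",        # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="load_s3_to_snowflake",
            python_callable=load_s3_to_snowflake,
        )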

Proficient in developing Spark applications using Spark SQL in Azure Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.

Experience managing Big Data platforms deployed in Azure Cloud.

Capable of working with Snowflake multi-cluster and virtual warehouses.

Developed and deployed Apache Spark applications using Spark APIs such as RDD, the DataFrame API, MLlib, Spark Streaming, and Spark SQL.

Expertise in designing and deploying SSIS packages for data extraction, transformation, and loading into Azure SQL Database and Data Lake Storage.

Experienced in Optimizing the PySpark jobs to run on Kubernetes Cluster for faster data processing.

Experience in writing PySpark jobs in AWS Glue, to merge data from various tables and populate the AWS Glue catalog with metadata table definitions.
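
A short sketch of such a Glue PySpark job; the catalog database, table names, join key, and S3 output path are hypothetical:

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Hypothetical Glue catalog database and table names
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders").toDF()
    customers = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="customers").toDF()

    # Merge the two tables and write the result back to S3 (hypothetical path)
    merged = orders.join(customers, "customer_id", "left")
    merged.write.mode("overwrite").parquet("s3://example-bucket/curated/orders_enriched/")

    job.commit()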

Experienced in working with structured data using HiveQL, and optimizing Hive queries.

Experience writing complex SQL queries using joins, GROUP BY, and nested subqueries.

Experience with HBase, loading data using connectors and writing NoSQL queries.

TECHNICAL SKILLS:

Operating Systems

Windows, Linux, Unix

Hadoop Ecosystem

Hive, Spark, Oozie, Sqoop, Flume, Hadoop, MapReduce, Impala, HDFS, Pig, YARN

Languages

Python, Scala, Bash, Java, PL/SQL

Relational Databases

Oracle, SQL Server, DB2, MySQL, PostgreSQL

Other Tools

Jira, Informatica, ETL

Build Tools

Maven, Jenkins, Docker, Ansible, Ant

Web Technologies

Django, HTML5, CSS3, Bootstrap, JavaScript, JSON, XML, AJAX

Methodologies

AGILE, SCRUM, Waterfall

NoSQL Databases

HBase, MongoDB, Cassandra

Version Control

Git, GitHub, CVS

Cloud Environments

Microsoft Azure - Azure Databricks, Azure Data Factory, Azure Logic Apps, Azure Data Lake Storage, Azure Active Directory; Amazon Web Services (AWS) - IAM, S3, ELB, VPC, RDS, DNS, EC2, Route 53, DynamoDB, EMR, Lambda, SNS, Glue, Athena, Redshift; Google Cloud Platform (GCP) - Dataflow, BigQuery, Dataproc, Cloud Pub/Sub, Cloud Storage.

PROFESSIONAL EXPERIENCE:

Boston Children's Hospital, Minneapolis, MN Feb 2023 - Present

Sr. Azure Data Engineer

Responsibilities:

Involved in migrating objects from Teradata to Snowflake and created Snowpipe for continuous data loading.

Created notebooks in Azure Databricks and integrated them with ADF to automate the workflow.

Designed various Azure Data Factory pipelines to pull data from various data sources and load it into Azure SQL Database.

Developed batch processing solutions using Azure Data Factory and Azure Databricks.

Designed and implemented CI/CD pipelines using GitHub Actions to automate build, test, and deployment processes.

Implemented ETL/ELT pipelines in Snowflake, leveraging tools like Snowpipe and stored procedures to automate data ingestion, transformation, and loading processes, reducing manual intervention and improving data accuracy.
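
A minimal sketch of this kind of Snowflake automation using the Snowflake Python connector; the connection parameters, stage, table, and stored procedure names are placeholders:

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="xy12345",        # placeholder account identifier
        user="etl_user",
        password="***",
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="RAW",
    )
    try:
        cur = conn.cursor()
        # Load newly staged files, then run a transformation stored procedure
        cur.execute(
            "COPY INTO RAW.ORDERS FROM @RAW.ORDERS_STAGE "
            "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
        )
        cur.execute("CALL ANALYTICS.PUBLIC.TRANSFORM_ORDERS()")
    finally:
        conn.close()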

Developed and maintained complex ETL jobs using IBM DataStage for extracting, transforming, and loading data from various sources (RDBMS, flat files, APIs) into data warehouses.

Designed and developed data solutions using Azure technologies such as Azure Data Factory, Databricks, Data Lake Storage, and Synapse Analytics.

Experience with Apache Airflow for authoring, scheduling, and monitoring Data Pipelines.

Developed Spark code using Python and Spark-SQL for faster testing and processing of data.
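
A representative, simplified pattern for this kind of PySpark + Spark SQL code; the input path, view name, and columns are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

    events = spark.read.json("/mnt/raw/events/")    # hypothetical input path
    events.createOrReplaceTempView("events")

    daily_usage = spark.sql("""
        SELECT user_id,
               to_date(event_ts) AS event_date,
               COUNT(*)          AS event_count
        FROM events
        GROUP BY user_id, to_date(event_ts)
    """)
    daily_usage.write.mode("overwrite").parquet("/mnt/curated/daily_usage/")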

Utilized GitHub Actions to ensure continuous integration and continuous delivery, improving software quality and delivery speed.

Integrated Utility Catalog components into CI/CD pipelines using Azure DevOps, Jenkins, or GitHub Actions, automating infrastructure provisioning, configuration, and deployment processes.

Integrated data from DataStage into Business Intelligence platforms (such as Tableau, Power BI, or Cognos) to create dashboards and reports for business decision-making.

Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.

Created Databricks notebooks and scheduled a Spark job to extract data from files in ADLS Gen2.
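
A minimal Databricks-notebook-style sketch of reading from ADLS Gen2 (the spark and dbutils objects are provided by Databricks; the storage account, container, secret scope, and paths are placeholders):

    # Storage account key pulled from a secret scope (names are placeholders)
    spark.conf.set(
        "fs.azure.account.key.examplestorage.dfs.core.windows.net",
        dbutils.secrets.get(scope="adls-scope", key="storage-account-key"),
    )

    raw = (spark.read
           .option("header", "true")
           .csv("abfss://raw@examplestorage.dfs.core.windows.net/claims/2023/"))

    cleaned = raw.dropDuplicates().na.drop(subset=["claim_id"])
    (cleaned.write
     .format("delta")
     .mode("overwrite")
     .save("abfss://curated@examplestorage.dfs.core.windows.net/claims/"))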

Provided training on Snowflake best practices, SQL optimization, and data warehouse architecture, fostering team proficiency.

Optimized DataStage jobs for improved performance, reducing run-time and resource consumption by streamlining data transformation logic and parallelism techniques.

Utilized dbt for transforming raw data into a more structured, analyzable format so that it can be loaded into Snowflake.

Built ETL/ELT pipelines using data technologies such as PySpark, Hive, Presto, and Databricks.

Modeled data integration workflows to consolidate data from various sources, including on-premises databases, cloud storage (Azure Data Lake), and external APIs, ensuring accurate and timely data availability.

Automated Hive job execution and data workflows using Python-based scheduling tools such as Apache Airflow or custom scripts.

Integrated monitoring and logging tools within GitHub Actions workflows to track pipeline performance and detect issues.

Created Tableau and PowerBI reports, incorporating complex calculations for strategic planning and ad-hoc reporting.

Developed batch scripts to fetch data from ADLS and perform the required transformations in PySpark using the Spark framework.

Designed and implemented highly performant data ingestion pipelines from multiple sources using Azure Data Factory and Azure Databricks.

Ensured compliance with data governance standards by implementing data access controls and auditing mechanisms within Business Intelligence solutions.

Optimized DBT models for performance, reducing data processing times and improving query efficiency.

Developed JSON scripts for deploying the pipeline in Azure Data Factory that processes data using the Cosmos activity.

Created complex ETL Azure Data Factory pipelines using mapping data flows with multiple Input/output transformations, Schema Modifier transformations, row modifier transformations using Scala Expressions.

Deployed Kubernetes clusters in cloud and on-premises environments, configured the master/worker architecture, and created resources such as pods, deployments, autoscaling, and load balancers with YAML files.

Used BI tools such as Tableau and Power BI for data interpretation, modeling, analysis, and reporting within AWS to deliver decision-making insights to stakeholders.

Implemented data validation and error handling mechanisms in DataStage to ensure high data quality and consistency across all ETL processes.

Optimized data models for performance by implementing indexing strategies, partitioning schemes, and query optimization techniques to enhance data retrieval and processing speed.

Utilized standardized ARM templates and Terraform modules from the Utility Catalog to streamline the creation and management of Azure resources, reducing manual configuration errors.

Implemented Python-based data orchestration and workflow management solutions using tools like Apache Airflow and Luigi.

Developed data processing solutions using Azure Stream Analytics, Azure Functions, and Azure Databricks to analyze and process events from EventHub.

Implemented version control for Snowflake objects using tools like DBT (data build tool).

Worked on Azure Databricks to run Spark-Python notebooks through ADF pipelines.

Implemented Azure Databricks clusters, notebooks, jobs, and autoscaling.

Leveraged Azure EventHub for real-time data ingestion and streaming analytics, enabling near real-time insights and decision-making.

Extensively worked on Azure Data Lake Analytics with the help of Azure Databricks to implement SCD-1 and SCD-2 approaches.
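
A simplified sketch of the SCD Type 2 pattern with Delta Lake in a Databricks notebook (spark is predefined there; the table paths, keys, and the hash_diff change-detection column are assumptions):

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    target = DeltaTable.forPath(spark, "/mnt/curated/dim_patient")   # placeholder path
    updates = spark.read.parquet("/mnt/staging/patient_updates")     # placeholder path

    # Close out current rows whose attributes changed
    (target.alias("t")
     .merge(updates.alias("s"), "t.patient_id = s.patient_id AND t.is_current = true")
     .whenMatchedUpdate(
         condition="t.hash_diff <> s.hash_diff",
         set={"is_current": "false", "end_date": "current_date()"})
     .execute())

    # Append the incoming rows as the new current versions
    new_rows = (updates
                .withColumn("is_current", F.lit(True))
                .withColumn("start_date", F.current_date())
                .withColumn("end_date", F.lit(None).cast("date")))
    new_rows.write.format("delta").mode("append").save("/mnt/curated/dim_patient")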

Collaborated with data architects and analysts to design data warehouse schemas (Star and Snowflake) optimized for Business Intelligence reporting and analytics.

Conducted code reviews using GitHub Pull Requests, ensuring code quality and adherence to coding standards.

Led the development of DBT models, macros, and data marts in Snowflake showcasing proficiency in crafting efficient and structured data models to support advanced analytics and reporting needs.

Implemented Azure Synapse for efficient data warehousing, enhancing analytics capabilities and decision-making processes.

Created pipelines, data flows, and complex data transformations and manipulations in ADF and PySpark with Databricks.

Worked on cloud deployments using Maven, Docker, and Jenkins.

Interacted with business analysts and technology teams to create necessary requirements documentation such as ETL/ELT mappings, interface specifications, and business rules.

Developed Custom Email notification in Azure Data Factory pipelines using Logic Apps and standard notifications using Azure Monitor.

Worked closely with data engineers to define ETL processes, data transformations, and data integration workflows.

Environment: Teradata, Snowflake, Azure Databricks, Azure Data Factory, ETL, Python, Spark-SQL, JSON, PySpark, Azure Data Lake Storage, Scala, Maven, Docker, Jenkins, Oozie, Azure Logic Apps, SQL.

Elevance Health, Alexandria, VA Dec 2021 - Jan 2023

Sr. Azure Data Engineer

Responsibilities:

Created pipelines in Azure Data Factory using Linked Services, Datasets, and Pipelines to extract, transform, and load (ETL) data into Azure Databricks from different sources such as Azure SQL, Data Lake Storage, and Azure SQL Data Warehouse, including write-back in the reverse direction.

Architected and managed modern data warehouses and analytics solutions using Azure Synapse Analytics, integrating data lakes, data warehouses, and analytics tools to deliver a unified analytics platform for advanced data-driven decision-making.

Designed and implemented comprehensive data models for Azure-based data solutions, aligning data structures with business requirements and supporting analytics, reporting, and decision-making processes.

Led the migration of legacy ETL processes to DataStage, ensuring minimal downtime and data integrity during the transition.

Experienced with modern ETL tools such as Informatica, DBT, Apache Airflow, Talend, and Fivetran, streamlining data ingestion and integration processes.

Conducted ETL/ELT process automation using tools such as Apache Airflow and dbt (data build tool), integrating them with Snowflake for seamless data pipeline management and execution.

Developed an ETL workflow that pushes webserver logs to Azure Blob Storage.
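
A minimal sketch of the upload step using the azure-storage-blob SDK; the connection string, container name, and log directory are placeholders:

    from pathlib import Path
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(
        "DefaultEndpointsProtocol=https;AccountName=example;AccountKey=***;EndpointSuffix=core.windows.net"
    )
    container = service.get_container_client("webserver-logs")

    # Push each rotated access log into the raw/ prefix of the container
    for log_file in Path("/var/log/nginx").glob("access.log*"):
        with open(log_file, "rb") as fh:
            container.upload_blob(name=f"raw/{log_file.name}", data=fh, overwrite=True)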

Empowered business users to create their own reports by designing easy-to-use data models and cubes within the Business Intelligence environment.

Integrated third-party services and tools within GitHub Actions workflows to extend pipeline capabilities.

Utilized Glue and Fully Managed Kafka for seamless data streaming, transformation, and preparation.

Used Terraform to orchestrate infrastructure services in Azure.

Good knowledge of building interactive dashboards, performing ad-hoc analysis, and generating reports and visualizations using Tableau and Power BI.

Configured and managed Azure EventHub for real-time data ingestion and event streaming from various sources.

Managed source code repositories in GitHub, ensuring version control, branching strategies, and collaboration best practices.

Created comprehensive documentation for Kafka configurations, processes, and best practices.

Designed and implemented data warehouse schemas, including star and snowflake schemas, to support efficient querying and reporting in Azure Synapse Analytics and Azure SQL Data Warehouse.

Combined DBT with Azure Data Factory to automate processes for end-to-end data processing, making operations like loading, transforming, and inputting data easier.

Utilized advanced data modeling techniques (e.g., dimensional modeling, star schema, snowflake schema) to design highly performant databases and data warehouses in Azure, supporting enterprise-level analytics and reporting solutions.

Contributed towards maintaining Spark Applications using Scala and Python in conjunction with data development and software engineering teams.

Used Terraform to create new resource groups, organize resources effectively, and maintain a structured environment.

Reviewed and refined Terraform configurations for shared resources, removing redundancies and optimizing the setup.

Experience in creating Kubernetes replication controllers, clusters, and label services to deploy microservices in Docker.

Participated in the upgrade of IBM DataStage environments, ensuring compatibility of existing jobs and optimizing new features for enhanced data processing capabilities.

Developed and tested disaster recovery plans within Snowflake, ensuring data resilience and minimizing potential data loss or downtime in case of unexpected events.

Handled Power BI dataset refreshes through Azure Data Factory pipelines and REST APIs.
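
A hedged sketch of triggering a refresh through the Power BI REST API; the workspace/dataset GUIDs and the Azure AD token acquisition are assumed to come from elsewhere:

    import requests

    access_token = "<aad-access-token>"    # obtained via a service principal (not shown)
    group_id = "<workspace-guid>"
    dataset_id = "<dataset-guid>"

    url = (f"https://api.powerbi.com/v1.0/myorg/groups/{group_id}"
           f"/datasets/{dataset_id}/refreshes")
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        json={"notifyOption": "NoNotification"},
    )
    resp.raise_for_status()    # HTTP 202 means the refresh was queued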

Communicated effectively with team members and management regarding Kafka-related initiatives, challenges, and solutions.

Developed Apache Airflow modules for seamless workflow management across cloud services.

Successfully integrated Azure Data Factory (ADF) with various Azure services, including Azure Databricks, Azure Synapse Analytics, and Azure SQL Data Warehouse, creating seamless end-to-end data solutions.

Experience in data pipelines and all phases of ETL/ELT processing, converting big data and unstructured data sets (JSON, log data) to structured data sets for product analysts and data scientists.

Created and provisioned the different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries on the clusters.

Implemented effective data integration patterns, including star schema and snowflake schema, optimizing data flow and storage structures within Azure Data Factory (ADF) pipelines.

Implemented monitoring and logging utilities from the catalog to enhance observability of Azure services, using Azure Monitor, Log Analytics, and Application Insights for proactive issue detection.

Implemented SQL scripts and queries in Python code to handle user requests and work with the data in the database.

Responsible for onboarding application teams to build and deploy code using GitHub and Jenkins.

Collaborated with stakeholders to gather requirements and create logical and physical data models that represent complex business processes, enabling more effective data management and decision-making.

Responsible for fully understanding the source systems, documenting issues, following up on issues and seeking resolutions.

Created databases, tables, triggers, macros, views, stored procedures, functions, indexes in Teradata database.

Designed and developed ETL jobs to load data from multiple source systems into the target schema in the Teradata database.

Designed workflows using Airflow to automate the services developed for Change data capture.

Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks clusters.

Involved in migration projects to migrate data from data warehouses on multiple databases to Teradata.

Wrote Python scripts to perform data transformation and migration from various data sources and built different databases to store the raw and filtered data.

Wrote Spark applications to perform data cleansing, transformations, and aggregations, and to produce other datasets required by downstream teams.

Involved in event enrichment, data aggregation, de-normalization and data preparation needed for downstream model learning and reporting.

Developed SQL queries, stored procedures, and triggers using Oracle.

Implemented SQL scripts and queries in Python code to work with various databases and data sources.
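
A minimal sketch of this pattern with pyodbc against SQL Server; the server, credentials, table, and columns are placeholders:

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=example-server.database.windows.net;DATABASE=analytics;"
        "UID=etl_user;PWD=***"
    )
    cur = conn.cursor()

    # Parameterized read
    cur.execute(
        "SELECT order_id, status, updated_at FROM dbo.orders WHERE updated_at >= ?",
        "2023-01-01",
    )
    rows = cur.fetchall()

    # Parameterized update
    cur.execute("UPDATE dbo.orders SET status = ? WHERE order_id = ?", "ARCHIVED", 1001)
    conn.commit()
    conn.close()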

Worked on migration of data from on-prem SQL server to Cloud database (Azure SQL DB).

Developed remote integrations with third-party platforms using RESTful web services and successfully implemented Apache Spark and Spark Streaming applications for large-scale data.

Responsible for onboarding application teams to build and deploy code using GitHub, Jenkins, Nexus, and Ansible.

Environment: Azure Databricks, Azure Data Factory, Azure SQL, ETL, Scala, Python, AWS IAM, S3, ELB, VPC, RDS, DNS, EC2, Route 53, DynamoDB, AWS EMR, Lambda, SNS, Glue, Athena, Redshift, PySpark, Kafka, Hive, Apache Spark, HBase, Tableau, JSON, SQL, REST API, Django, HTML5, CSS, JavaScript, Jira, Bootstrap, GitHub, Jenkins, Teradata, Oracle, MongoDB, Ansible, Docker, Nexus.

CVS Health, Boston, MA Jan 2020 - Dec 2021

Data Engineer

Responsibilities:

●Developed and managed PostgreSQL functions and triggers to automate data processing tasks, maintaining data integrity across systems.

●Leveraged AWS services, including S3, EC2, RDS, and Lambda, to develop and deploy scalable, highly available cloud-based data solutions.

●Containerized ETL workflows using Docker, ensuring consistency in development and production environments while streamlining deployment.

●Orchestrated data processing workflows using Kubernetes, enabling scalable and high-availability data solutions.

●Efficiently created and managed ETL workflows using Python and Informatica, ensuring robust pipeline management with minimal downtime.

●Automated infrastructure provisioning and deployment using Terraform, enabling repeatable, consistent cloud deployments and simplifying infrastructure management.

●Architected and implemented scalable data models in PostgreSQL, optimizing data storage and retrieval through SQL, PL/SQL, and stored procedures.

●Incorporated Hadoop ecosystem tools such as HDFS, Hive, and Spark to handle large datasets and perform complex data analysis.

●Set up Jenkins pipelines to automate CI/CD processes, improving the reliability and speed of deploying data engineering code.

●Established and enforced data quality and security protocols to adhere to regulatory compliance requirements, ensuring robust data governance.

●Performed performance tuning and optimization of ETL processes and database queries, significantly boosting system efficiency and minimizing latency.

●Automated data workflows using AWS Lambda and integrated them with Kubernetes, streamlining operations and reducing manual intervention.

●Utilized Sqoop for efficient data transfer between Hadoop and relational databases, improving data integration and workflow automation.

●Managed code versions using Git, enabling seamless collaboration across cross-functional teams and ensuring effective version control.

●Employed ServiceNow to handle data-related incidents and service requests, ensuring timely resolutions and minimizing operational disruptions.

●Implemented system monitoring using CloudWatch and Grafana, ensuring the reliability of data pipelines and proactively identifying potential issues.

●Collaborated with cross-functional teams to align data solutions with business objectives, ensuring clear communication and successful project execution.

Environment: AWS (S3, EC2, RDS, Lambda), Python, Informatica, PostgreSQL, SQL, PL/SQL, Docker, Kubernetes, Terraform, Jenkins, Hadoop ecosystem (HDFS, Hive, Spark), ServiceNow, Sqoop, Git, CloudWatch, Grafana.

Citigroup, NY Mar 2018 - Jan 2020

Data Engineer

Responsibilities:

Developed CI/CD pipelines for Azure Big Data solutions, focusing on seamless code deployment to production.

Designed custom dashboards in Azure Monitor for real-time visualization of critical metrics.

Implemented security monitoring using Azure Security Center and Log Analytics.

Demonstrated proficiency in Azure services such as Azure DevOps, contributing to streamlined development and deployment processes.

Automated application deployments to various environments (e.g., development, staging, production) using GitHub Actions.

Established cloud-based data warehouse solutions on Azure using Snowflake for rapid analysis of real-time customer data.

Designed and deployed SSIS packages for Azure service data extraction, transformation, and loading.

Maximized fault tolerance by configuring and utilizing Kafka brokers for data writes, and leveraged Spark for faster processing where needed.
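
A minimal sketch of consuming a Kafka topic with Spark Structured Streaming (assumes the spark-sql-kafka package is available; broker addresses, topic, and paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
              .option("subscribe", "transactions")
              .option("startingOffsets", "latest")
              .load())

    parsed = stream.selectExpr("CAST(key AS STRING) AS key",
                               "CAST(value AS STRING) AS value")

    # Checkpointing gives the stream restart/fault tolerance
    (parsed.writeStream
     .format("parquet")
     .option("path", "/data/streams/transactions")
     .option("checkpointLocation", "/data/checkpoints/transactions")
     .start()
     .awaitTermination())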

Integrated the BigQuery ML engine across the data ingestion, preprocessing, and data transformation stages to seamlessly incorporate ML features into structured, semi-structured, and unstructured data formats for enhanced analytics and automation.

Built and installed servers, created resources in Azure using Azure Resource Manager (ARM) templates or the Azure portal, and provisioned them with Terraform templates.

Orchestrated data extraction and ingestion from varied external repositories such as MySQL Server and Oracle DB into BigQuery, employing Big Data technologies.

Utilized GitHub Pages for deploying static websites and documentation directly from the repository.

Integrated datasets and crafted materialized views for streamlined data visualization and analysis.

Developed visually appealing and interactive dashboards, using data aggregation techniques, to help users gain insights into key business metrics.

Used Terraform and AWS CloudFormation to define and provision infrastructure resources as code, enabling reproducible and scalable infrastructure deployments in both AWS and Azure environments

Integrated with Looker for exploratory analysis and visualization.

Developed custom applications to consume and process event data from EventHub, integrating with downstream systems.

Proposed and implemented continuous improvements to enhance the efficiency and reliability of Kafka clusters.

Seamlessly integrated Snowflake into the data processing workflow, elevating data warehousing capabilities.

Designed and implemented both batch and streaming data pipelines using Google Cloud technologies such as Pub/Sub, Cloud Storage, Dataflow, and BigQuery.
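
A minimal Apache Beam sketch of the streaming path (Pub/Sub into BigQuery on Dataflow); the project, subscription, bucket, and table names are placeholders, and the destination table is assumed to already exist:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        streaming=True,
        project="example-project",
        region="us-central1",
        runner="DataflowRunner",
        temp_location="gs://example-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadEvents" >> beam.io.ReadFromPubSub(
               subscription="projects/example-project/subscriptions/events-sub")
         | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               "example-project:analytics.events",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))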

Optimized IBM MQ configurations to enhance message throughput, reduce latency, and improve overall performance.

Built and provisioned servers and Azure resources using ARM templates and Terraform, adhering to infrastructure-as-code principles.

Integrated dbt Cloud into existing data workflows, allowing for modular and reusable data models.

Involved in advanced scripting and coding for Infrastructure as Code (IaC) using tools such as Terraform, Ansible, and SaltStack to automate deployment and management tasks for efficient operations.

Optimized query performance and empowered data-driven decision making through Snowflake's scalable data solutions and SQL-based analysis.

Managed diverse data formats including JSON, CSV, PARQUET, and XML.

Implemented event-driven architecture using Azure Event Grid for real-time event processing.

Utilized Python scripts to gather data from CSV, Parquet, JSON, and SQL Server sources and transfer it to cloud storage.
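
A simplified sketch of this gather-and-transfer step with pandas and the Cloud Storage client; the file paths, bucket, and object names are placeholders, and the inputs are assumed to share a schema:

    import pandas as pd
    from google.cloud import storage

    frames = [
        pd.read_csv("exports/orders.csv"),
        pd.read_parquet("exports/orders_history.parquet"),
        pd.read_json("exports/orders_events.json", lines=True),
    ]
    combined = pd.concat(frames, ignore_index=True)
    combined.to_parquet("/tmp/orders_combined.parquet", index=False)

    client = storage.Client()
    bucket = client.bucket("example-landing-bucket")
    bucket.blob("landing/orders_combined.parquet").upload_from_filename(
        "/tmp/orders_combined.parquet"
    )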

Employed Cloud Functions for data transformation, modeling, and loading into BigQuery.

Generated KPI reports and interactive dashboards for data visualization employing MS Power BI, Tableau.

Leveraged Azure Databricks and HDInsight for large-scale data processing and analytics using PySpark.

Orchestrated data pipelines utilizing Azure Data Factory, T-SQL, Spark SQL, and Azure Data Lake Analytics.

Designed ADF Pipelines for data extraction, transformation, and loading.

Specialized in Data Migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.

Efficiently transformed and loaded the extracted data into BigQuery for further analysis, while ensuring secure storage and retrieval in Cloud Storage.

Environment: Azure Data Factory, Azure SQL, GCP Dataflow, Azure Data Lake Storage, BigQuery, Dataproc, ETL, MySQL, Oracle, DB2, Hive, Cloud Pub/Sub, Cloud Storage, JSON, XML, CSV, CI/CD, PySpark, Python, SQL.


