Senior Data Engineer

Location:
Redmond, WA
Posted:
February 13, 2025

AISHWARYA SINGIREDDY

Senior Data Engineer

+1-248-***-**** *********.****@*****.*** linkedin.com/in/aish1412

Professional Summary

Senior Data Engineer with nearly 8 years of expertise in designing, developing, and optimizing data pipelines and architectures across multiple cloud platforms, including Azure, AWS, and GCP. Proven track record of leading successful data migration projects, utilizing advanced ETL frameworks such as Apache NiFi, Talend, and Databricks, and implementing cloud-based solutions that drive business growth and operational efficiency. Skilled in building scalable data lakes, leveraging distributed data processing technologies such as Apache Spark, Hadoop, and Azure Databricks, and optimizing data storage formats like Parquet and Delta Lake. Experienced with Apache Airflow, Terraform, and Kubernetes for orchestrating cloud workflows and managing infrastructure as code (IaC). Strong background in SQL, Python, and PySpark for data transformation and analysis, as well as integrating real-time data streams using Kafka and Azure Stream Analytics. Adept at collaborating with cross-functional teams and ensuring high availability, fault tolerance, and security for data-driven solutions.

Known for delivering high-performance data solutions, ensuring seamless data integration, and optimizing analytics workflows for organizations in various industries.

Skills

Development & Scripting: Python, Scala, Java, Shell Scripting, JavaScript, SQL Scripting

Data Engineering & ETL: Apache NiFi, Talend, Databricks, Apache Spark, PySpark, MapReduce, SSIS, Informatica, Informatica Cloud, Apache Airflow, ETL/ELT

Data Storage & Databases: Snowflake, Oracle 19c, SQL Server, SQL, T-SQL, PL/SQL, MySQL, PostgreSQL (AWS Aurora), MongoDB, Redshift, DynamoDB, Data Lakes, Delta Lake, HBase, Hive, Parquet, GCP BigQuery, GCP Cloud Storage, AWS S3, AWS Kinesis

Big Data & Computing: Hadoop, Spark Streaming, Kafka, Databricks, GCP Dataflow, Google Dataproc

Data Analytics & BI Tools: Power BI, Tableau, MicroStrategy, SSRS, SSAS, Google Data Studio, Excel

Infrastructure & Orchestration: Terraform, Kubernetes, Apache Airflow, Jenkins, Chef, Docker, AWS Lambda, GCP Composer, GCP Dataflow, Step Functions

Cloud Platforms: Azure, AWS, GCP

Machine Learning & Data Science: scikit-learn, TensorFlow, Machine Learning Models, NLP, Predictive Analytics

Version Control & CI/CD: GitHub, GitLab, Jenkins

Reporting & Automation: Power BI (Dashboarding, Data Modeling, Data Refresh Automation), SSRS (Report Subscriptions, Snapshots), Tableau (Dashboard Development), Airflow (Orchestration)

Monitoring & Logging: Stackdriver, CloudWatch, Datadog, New Relic

Tools & Technologies: Kubernetes, Terraform, JIRA, Apache Tomcat, Quarkus, Actimize ActOne, AWS Redshift, GCP Composer, Kafka

Experience

CARDINAL HEALTH – Dublin, OH

Senior Data Engineer February 2023 – Present

Key Contributions:

Led successful on-premises data migrations to Azure using Apache NiFi, including migrating Oracle 19c to the Snowflake Data Warehouse.

Defined optimized Snowflake virtual warehouse sizing for various workloads.

Designed migration plans and selected Azure services for hosting Oracle databases.

Developed and tested data extraction scripts for smooth transfers.

Utilized Apache NiFi for diverse data ingestion workflows into Azure.

Built data integration solutions with Oracle Data Integrator (ODI) and customized Talend components.

Set up an Azure Data Lake with Azure Blob Storage, Azure Data Factory, and Azure HDInsight.

Automated workflows with Apache Airflow for Change Data Capture (CDC) services.

Implemented Generative AI models for automated data quality checks, reducing manual effort in anomaly detection and ensuring data accuracy.

Configured Azure services using Azure CLI and cloud shell SDK, including Azure Data Lake Storage, Azure Databricks, and Azure Data Factory.

Implemented scalable data solutions with Hadoop, including MapReduce programs and ETL pipelines in Databricks using Python.

Successfully migrated data to Azure HDInsight and Azure Blob Storage.

Optimized Apache NiFi pipelines that utilized Kafka and Spark Streaming, and designed data ingestion pipelines into Druid on Azure.

Created Databricks Spark Jobs with PySpark for various operations.

Extracted and analyzed data from data lakes and the EDW, and validated data flows using SQL and PySpark.

Designed Talend jobs, ODI mappings, and Sqoop scripts for data movement.

Conducted data transformation and cleansing using SQL and PySpark.

Implemented data pipelines, flows, and transformations with Azure Data Factory and Databricks.

Proficient in Azure, SQL, Python scripting, PySpark, Airflow, Kubernetes, Terraform, and data modeling.

Skilled in infrastructure as code (IaC) and cloud shell usage.

Experience with the complete SDLC process, including code reviews, source code management, and build process.

Ingested data in mini-batches and performed RDD transformations on them using Spark Streaming for streaming analytics in Databricks.
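
A minimal sketch of that mini-batch pattern using the classic Spark Streaming (DStream) API; the socket source, host, and comma-separated record layout are illustrative assumptions, not details taken from the original pipeline.

```python
# Hedged sketch: mini-batch ingestion with Spark Streaming's DStream API.
# Source endpoint and record layout are illustrative assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="MiniBatchAnalytics")
ssc = StreamingContext(sc, batchDuration=10)        # one RDD per 10-second mini-batch

lines = ssc.socketTextStream("localhost", 9999)     # hypothetical source
events = lines.map(lambda line: line.split(","))    # RDD transformation per mini-batch
counts = (events.map(lambda fields: (fields[0], 1))
                .reduceByKey(lambda a, b: a + b))   # per-batch aggregation
counts.pprint()                                     # emit results each interval

ssc.start()
ssc.awaitTermination()
```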

Developed and architected multiple data pipelines, end-to-end ETL and ELT processes for data ingestion and transformation in Azure.

Utilized LLMs (Large Language Models) to generate automated documentation for complex data pipelines, improving collaboration across teams.

Employed AI-powered chatbots for real-time monitoring and troubleshooting of data pipeline workflows, enabling faster issue resolution.

Loaded data to Snowflake using Azure Data Factory, Databricks, and Azure Blob Storage.

Developed Terraform scripts and deployed them via a cloud deployment manager to spin up resources such as cloud virtual networks.

Used Spark SQL to load JSON data, create a SchemaRDD, and load it into Hive tables, handling structured data with Spark SQL.
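
As an illustration of that flow, a minimal PySpark sketch; the path, view, table, and column names are hypothetical.

```python
# Hedged sketch: load JSON via Spark SQL and persist it to a Hive table.
# Path, view, table, and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("JsonToHive")
         .enableHiveSupport()                        # needed for saveAsTable into Hive
         .getOrCreate())

df = spark.read.json("/data/raw/events.json")        # JSON inferred into a DataFrame
df.createOrReplaceTempView("events_stg")

spark.sql("""
    SELECT event_id,
           event_type,
           CAST(event_ts AS TIMESTAMP) AS event_ts
    FROM events_stg
    WHERE event_id IS NOT NULL
""").write.mode("overwrite").saveAsTable("analytics.events")
```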

Set up required Hadoop environments for clusters to run MapReduce jobs, and monitored Hadoop cluster connectivity and security using tools such as ZooKeeper and Hive.

Developed ETL systems using Python and an in-memory computing framework (Apache Spark), scheduling and maintaining data pipelines at regular intervals in Apache Airflow.
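
A minimal Airflow DAG sketch for that kind of scheduling; the DAG id, interval, and spark-submit path are assumptions.

```python
# Hedged sketch: scheduling a Spark ETL script with Apache Airflow (2.x API).
# DAG id, schedule, and script path are illustrative assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="spark_etl_daily",                        # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",                      # regular-interval maintenance run
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    catchup=False,
) as dag:
    run_etl = BashOperator(
        task_id="run_spark_etl",
        bash_command="spark-submit /opt/etl/transform.py",   # hypothetical script
    )
```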

Developed Python scripts for data transformations on large datasets in Azure Databricks and Azure Stream Analytics.

Designed and deployed AI-powered data validation scripts using Python and Azure Databricks, leveraging machine learning to predict and correct inconsistent data patterns.
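
One plausible shape for such a validation step, sketched with scikit-learn's IsolationForest; the input, feature columns, and contamination rate are assumptions rather than the production model.

```python
# Hedged sketch: flag inconsistent rows with an IsolationForest before loading.
# Input path, feature columns, and contamination rate are illustrative.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_parquet("batch.parquet")                       # hypothetical batch
features = df[["amount", "quantity", "unit_price"]].fillna(0)

model = IsolationForest(contamination=0.01, random_state=0)
df["anomaly"] = model.fit_predict(features)                 # -1 marks suspect rows

clean = df[df["anomaly"] != -1]                             # pass downstream
suspect = df[df["anomaly"] == -1]                           # route for review/correction
```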

Integrated Gen AI APIs (like OpenAI and Azure OpenAI) into data pipelines for generating insights, summarizing large datasets, and enhancing reporting efficiency.
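
A minimal sketch of one such integration point using the OpenAI Python client; the model name and prompt wording are assumptions, and an API key is expected in the environment.

```python
# Hedged sketch: summarize a batch of pipeline records via a Gen AI API.
# Model choice and prompt are illustrative; requires OPENAI_API_KEY to be set.
from openai import OpenAI

client = OpenAI()

def summarize(records: list[str]) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",                        # hypothetical model choice
        messages=[
            {"role": "system",
             "content": "Summarize these dataset records for a status report."},
            {"role": "user", "content": "\n".join(records)},
        ],
    )
    return response.choices[0].message.content
```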

Worked with HBase and MySQL for data optimization, and with SequenceFile, Avro, and Parquet file formats.

Created data frames in Azure Databricks using Apache Spark to perform business analysis.

Optimized and implemented data storage formats (e.g., Parquet, Delta Lake) through Databricks, applying effective partitioning strategies.
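
For example, a partitioned Delta write on Databricks can look like the sketch below; source path, target path, and partition column are assumptions.

```python
# Hedged sketch: persist a dataset as a partitioned Delta table on Databricks.
# Source path, target path, and partition column are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DeltaWrite").getOrCreate()

df = spark.read.parquet("/mnt/datalake/bronze/orders")   # hypothetical source

(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("order_date")            # enables partition pruning on date filters
   .save("/mnt/datalake/silver/orders"))
```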

Built complex data pipelines using PL/SQL scripts, Cloud REST APIs, Python scripts, Azure Data Factory, Azure Data Lake.

Developed and fine-tuned domain-specific Gen AI models to generate business intelligence dashboards and actionable insights, leveraging data from Snowflake and Azure Data Lake.

Monitored Azure Data Factory, Databricks, and Dataflow jobs via Azure Monitor.

Performed data wrangling to clean, transform, and reshape data using pandas library.
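
A small pandas sketch of that kind of wrangling; the file and column names are hypothetical.

```python
# Hedged sketch: clean, transform, and reshape tabular data with pandas.
# File and column names are illustrative assumptions.
import pandas as pd

df = pd.read_parquet("claims.parquet")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")    # coerce bad values to NaN
df = df.dropna(subset=["claim_id"]).drop_duplicates("claim_id")

# Reshape: one row per member, one column per claim type
wide = df.pivot_table(index="member_id", columns="claim_type",
                      values="amount", aggfunc="sum", fill_value=0)
```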

Used digital signage software to display real-time metrics for monitoring migration progress.

Analyzed data using SQL, Scala, Python, and Apache Spark, and presented analytical reports to management and technical teams.

Designed and executed Apache Spark jobs within Databricks to perform complex data transformations, utilizing Scala and Python for enhanced analytics.

Implemented data wrangling processes in Databricks, employing the pandas library to clean, transform, and reshape data efficiently.

Optimized and implemented storage formats like Parquet and Delta Lake in Databricks, incorporating effective partitioning strategies for streamlined data processing.

Created firewall rules to access Azure Databricks from other machines.

Used Python for data ingestion into Azure Data Factory.

Ported on-premises Hive code to Azure HDInsight.

Monitored Hadoop cluster connectivity and security using tools like Zookeeper and Hive.

Managed and reviewed Hadoop log files.

Followed AGILE (Scrum) development methodology for application development.

Managed backup and disaster recovery for Hadoop data.

FINYS – Troy, MI

Senior Data Engineer December 2022 – February 2023

Key Contributions:

Managed cluster responsibilities and optimized performance for large-scale data processing.

Collaborated on gathering requirements for scalable distributed data solutions within a Hadoop cluster environment, leveraging Azure HDInsight for cloud-based big data processing.

Developed Spark scripts using PySpark and Scala for data transformations and actions on RDDs, integrating with Azure Databricks for seamless cloud-based Spark processing.

Designed and implemented Oozie workflow engine for scheduling multiple Hive and Pig jobs, orchestrated through Azure Data Factory for better pipeline management.

Worked with different file formats and converted Hive/SQL queries into Spark transformations, integrating with Azure Data Lake for efficient data storage and retrieval.

Participated in analyzing business requirements and designed database systems on SSMS using T-SQL, with integration to Azure SQL Database for scalable, cloud-based data management.

Developed and implemented database triggers, stored procedures, and functions in T-SQL, optimizing query performance within Azure SQL Database and other cloud databases.

Handled data in batches through ETL (Talend) and Unix shell scripting, migrating data to Azure Blob Storage and ensuring smooth integration with Azure Data Lake.

Migrated data using Talend jobs and implemented complex business rules for processing in Azure environments.

Analyzed ETL process failures, data anomalies, and data warehousing issues, leveraging Azure Monitor and Log Analytics for proactive troubleshooting.

Automated data loading through batch job design and implementation, optimizing data flow using Azure Data Factory for orchestration.

Transformed data based on business requirements, cleaning, aggregating, and modifying Slowly Changing Dimensions (SCD) data, integrating with Azure Synapse Analytics for high-performance analytics.

Loaded transformed data into target systems like Azure Synapse Analytics, Data Warehouses, and Data Marts.

Integrated DataStage metadata with Informatica metadata for streamlined data processing, utilizing Azure Data Catalog for centralized metadata management.

Integrated Talend transformation jobs with orchestration jobs in Azure Data Factory for seamless cloud-based data workflows.

Configured report subscriptions and snapshots in SSRS, leveraging Azure Report Server for scalable reporting solutions.

Implemented custom user roles and security measures in Power BI, integrating with Azure Active Directory for secure access management.

Automated data refresh in Power BI, scheduled, and maintained SQL Server Agent jobs for Talend orchestration jobs, utilizing Azure Automation for cloud job management.

Performed performance tuning of Spark jobs on Azure Databricks for improved processing speed and resource utilization.

Troubleshot and debugged issues using shell scripting, integrated with Azure DevOps for CI/CD pipeline management and version control.

Built data pipelines using Python frameworks like Apache Airflow, integrated with Azure Data Factory for cloud orchestration and scheduling.

Developed scripts for ETL processes, managing and scheduling jobs using Apache Airflow, deploying on Azure Kubernetes Service for efficient scaling.

Developed machine learning models using libraries like scikit-learn and TensorFlow, deploying models on Azure Machine Learning for model training and management.

Migrated data from on-premises systems to the AWS cloud, leveraging Amazon S3 for storage and AWS Glue for data transformation and loading.

Utilized Amazon Redshift for scalable data warehousing solutions, optimizing query performance and integrating with other AWS analytics services.

Integrated AWS Lambda with data processing workflows to enable serverless computing and automate tasks such as data extraction and transformation.
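
A minimal Lambda handler sketch for that pattern, reacting to an S3 event and staging the new object for downstream transformation; the bucket layout and prefixes are assumptions.

```python
# Hedged sketch: S3-triggered Lambda that stages new objects for ETL.
# Assumes the trigger is filtered to a raw/ prefix so the copy cannot re-trigger it;
# the staging prefix is an illustrative assumption.
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:                  # standard S3 event structure
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        s3.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": key},
            Key=f"staging/{key}",                    # hand off to the transform step
        )
```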

Developed and deployed Spark jobs on Amazon EMR for distributed data processing, optimizing performance and scalability in the AWS environment.

OPTUM – Eden Prairie, MN

Data Engineer / Software Engineer January 2022 – December 2022

Key Contributions:

Developed Python-based delivery date estimation models using historical data and predictive algorithms, improving accuracy and efficiency.

Automated delivery date predictions, reducing manual efforts and enhancing operational efficiency.

Applied machine learning techniques (scikit-learn) to analyze trends, seasonality, and delays for accurate estimations.
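
One plausible minimal version of such an estimator with scikit-learn; the dataset and feature columns are assumptions.

```python
# Hedged sketch: regression model for transit-time estimation with scikit-learn.
# Dataset and feature columns are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

history = pd.read_csv("deliveries.csv")              # hypothetical historical data
X = history[["distance_km", "weekday", "month", "carrier_id"]]
y = history["transit_days"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor().fit(X_train, y_train)
print("holdout R^2:", model.score(X_test, y_test))   # quick accuracy check
```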

Built custom algorithms considering order characteristics, location, and external factors like weather and holidays.

Integrated API data sources (e.g., traffic, weather) to dynamically adjust delivery estimates in real time.

Optimized algorithms with Python’s Pandas and NumPy for efficient data processing and reduced computation time.

Collaborated with teams to integrate delivery estimation functionality via Python APIs and front-end features in Vue.js.

Used GraphQL for efficient data fetching, Kafka for event-driven messaging, and MongoDB for scalable data storage.
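
On the messaging side, a minimal producer sketch with kafka-python; the broker address, topic, and payload fields are assumptions.

```python
# Hedged sketch: publish delivery events to Kafka with kafka-python.
# Broker address, topic, and payload fields are illustrative assumptions.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("delivery-events", {"order_id": 123, "eta_days": 4})
producer.flush()                    # block until the event is actually delivered
```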

Integrated AWS S3 for data storage, Lambda for serverless execution, and Redshift for analytics on delivery patterns.

Implemented ETL processes with Talend to transform and load data from various sources into the system.

Leveraged AWS Glue for seamless data integration and SQL for querying structured data.

Automated ETL workflows using Talend and AWS Data Pipeline for scheduling and execution.

Utilized Amazon SNS for real-time notifications and AWS CloudWatch for monitoring delivery prediction performance.

VIRTUSA – Hyderabad, INDIA

Data Engineer June 2019 – July 2021

Key Contributions:

Worked in an Agile SCRUM SDLC process to implement 2-week sprints.

Developed and optimized ETL pipelines using Azure Data Factory, Databricks, and Spark for efficient data integration and processing across multiple sources.

Managed and optimized data storage solutions in Azure SQL Database and SQL Server, including schema design, indexing strategies, and partitioning for high performance.

Implemented and managed a Data Lake using Azure Data Lake for storing large unstructured datasets, ensuring data accessibility and scalability for analytics.

Developed automated data validation and cleansing scripts to ensure data integrity and quality before ingestion into databases and analytics systems.

Optimized complex queries in Entity Framework and LINQ for improved performance and efficiency in large-scale data retrieval operations.

Integrated real-time data processing using Kafka and Azure Stream Analytics to enable up-to-date business insights and reporting.
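
A minimal consumer-side sketch of that pattern with Spark Structured Streaming; the broker, topic, and sink paths are assumptions.

```python
# Hedged sketch: consume a Kafka topic with Spark Structured Streaming and land
# it in Delta. Broker, topic, and paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("KafkaIngest").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load())

query = (stream.select(col("value").cast("string").alias("payload"))
         .writeStream.format("delta")
         .option("checkpointLocation", "/chk/orders")   # offset/progress bookkeeping
         .start("/lake/bronze/orders"))
query.awaitTermination()
```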

Created and managed Power BI dashboards by developing complex data models and ensuring seamless data flow from databases to reporting systems for real-time KPI tracking.

Collaborated with backend and frontend teams to ensure data was accurately integrated into business logic, supporting dynamic dashboards and UI elements.

Implemented CI/CD pipelines for data workflows using Azure DevOps, automating the deployment and continuous delivery of data updates and processing workflows.

Developed and containerized microservices for data-related tasks using Docker and Kubernetes, ensuring scalable and efficient data processing and deployment.

Ensured data security by implementing encryption and access control measures in line with industry standards and compliance regulations.

Leveraged cloud infrastructure to deploy and scale data services, ensuring high availability, fault tolerance, and resource optimization on Azure.

Collaborated with cross-functional teams to support data-driven decision-making by providing reliable, accurate, and real-time data for business analytics.

Microsoft – Hyderabad, INDIA

Data Engineer – Associate July 2017 – May 2019

Key Contributions:

Managed cluster responsibilities and optimized performance for large-scale data processing.

Collaborated on gathering requirements for scalable distributed data solutions within a Hadoop cluster environment, leveraging Azure HDInsight for cloud-based big data processing.

Developed Spark scripts using PySpark and Scala for data transformations and actions on RDDs, integrating with Azure Databricks for seamless cloud-based Spark processing.

Designed and implemented Oozie workflow engine for scheduling multiple Hive and Pig jobs, orchestrated through Azure Data Factory for better pipeline management.

Worked with different file formats and converted Hive/SQL queries into Spark transformations, integrating with Azure Data Lake for efficient data storage and retrieval.

Participated in analyzing business requirements and designed database systems on SSMS using T-SQL, with integration to Azure SQL Database for scalable, cloud-based data management.

Developed and implemented database triggers, stored procedures, and functions in T-SQL, optimizing query performance within Azure SQL Database and other cloud databases.

Handled data in batches through ETL (Talend) and Unix shell scripting, migrating data to Azure Blob Storage and ensuring smooth integration with Azure Data Lake.

Migrated data using Talend jobs and implemented complex business rules for processing in Azure environments.

Analyzed ETL process failures, data anomalies, and data warehousing issues, leveraging Azure Monitor and Log Analytics for proactive troubleshooting.

Automated data loading through batch job design and implementation, optimizing data flow using Azure Data Factory for orchestration.

Transformed data based on business requirements, cleaning, aggregating, and modifying Slowly Changing Dimensions (SCD) data, integrating with Azure Synapse Analytics for high-performance analytics.

Loaded transformed data into target systems like Azure Synapse Analytics, Data Warehouses, and Data Marts.

Integrated DataStage metadata with Informatica metadata for streamlined data processing, utilizing Azure Data Catalog for centralized metadata management.

Integrated Talend transformation jobs with orchestration jobs in Azure Data Factory for seamless cloud-based data workflows.

Configured report subscriptions and snapshots in SSRS, leveraging Azure Report Server for scalable reporting solutions.

Implemented custom user roles and security measures in Power BI, integrating with Azure Active Directory for secure access management.

Automated data refresh in Power BI, scheduled, and maintained SQL Server Agent jobs for Talend orchestration jobs, utilizing Azure Automation for cloud job management.

Performed performance tuning of Spark jobs on Azure Databricks for improved processing speed and resource utilization.

Troubleshot and debugged issues using shell scripting, integrated with Azure DevOps for CI/CD pipeline management and version control.

Built data pipelines using Python frameworks like Apache Airflow, integrated with Azure Data Factory for cloud orchestration and scheduling.

Developed scripts for ETL processes, managing and scheduling jobs using Apache Airflow, deploying on Azure Kubernetes Service for efficient scaling.

Developed machine learning models using libraries like scikit-learn and TensorFlow, deploying models on Azure Machine Learning for model training and management.

Education

Master of Science in Computer Science (Data Science) – University of Cincinnati – December 2022

Bachelor of Technology in Computer Science – Bhoj Reddy Engineering College – May 2017


