
Senior Data Engineer

Location:
Frisco, TX
Posted:
September 10, 2025

Rishitha Baddam

Email- *******************@*****.***

Mobile #: +1-469-***-****

PROFESSIONAL SUMMARY:

Around 10 years of experience in Big Data environments, with expertise in data acquisition, ingestion, modeling, storage, analysis, integration, and processing.

Proficient with AWS services such as EMR, S3, Lambda, EC2, AWS Glue, and CloudWatch for running and monitoring Hadoop and Spark jobs on Amazon Web Services (AWS).

Experience working with the Azure cloud platform (HDInsight, Data Lake, Azure Databricks, Blob Storage, Data Factory, Synapse, SQL Database, SQL Data Warehouse, and Data Storage Explorer).

Developed ETL solutions for GCP migration using GCP Dataflow, GCP Composer, Apache Airflow, and GCP BigQuery.

Migrated legacy ETL jobs to Spring Batch-based cloud-native workflows, reducing processing time and infrastructure costs.

Leveraged AWS-native services (S3, Lambda, Step Functions, CloudWatch) to orchestrate scalable and fault-tolerant AI pipelines.

Skilled in building robust ETL workflows, managing large datasets, and enabling end-to-end machine learning pipelines on AWS using tools like Spark, Airflow, Docker, and SageMaker.

Experience in software development in Python (libraries used: Beautiful Soup, NumPy, SciPy, Matplotlib, python-twitter, pandas DataFrames, network, urllib2, MySQL for database connectivity).

Wrote complex HiveQL queries to extract required information from Hive tables and composed Hive User Defined Functions (UDFs) as needed.

Strong knowledge of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize query performance.
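For illustration, a minimal PySpark sketch of the Hive table layout described above, creating a partitioned, bucketed external table; the table, column, and path names are hypothetical placeholders.

    from pyspark.sql import SparkSession

    # Hypothetical table and path names, for illustration only.
    spark = (SparkSession.builder
             .appName("hive-layout-sketch")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales_ext (
            order_id BIGINT,
            amount   DOUBLE
        )
        PARTITIONED BY (order_date STRING)          -- prune scans by date
        CLUSTERED BY (order_id) INTO 32 BUCKETS     -- bucket for join and sampling efficiency
        STORED AS PARQUET
        LOCATION 'hdfs:///data/warehouse/sales_ext'
    """)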

Deep experience creating real-time data streaming solutions using Apache Spark/Spark Streaming and Kafka.

Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets.

Extensive experience with text analytics, developing statistical machine learning solutions for various business problems, and producing data visualizations using Python and R.

Expertise in feature engineering, model data preparation, and integrating ML workflows into production using PySpark, Delta Lake, Kafka, and cloud-native tools on AWS and Azure.

Adept at containerizing ML data pipelines with Docker and orchestrating them using Airflow and CI/CD tools.

Specializing in Java Spring frameworks, AWS Batch, and data orchestration for mission-critical systems.

Strong knowledge of using Amazon EC2 to provide a complete solution for computing, query processing, and storage across a wide range of applications.

Ingested data into the Snowflake cloud data warehouse using Snowpipe. Extensive experience with micro-batching to ingest millions of files into Snowflake as they arrive in the staging area.
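A minimal sketch of Snowpipe-based micro-batch ingestion, assuming the Python Snowflake connector and an external stage with auto-ingest notifications; connection parameters and object names are placeholders, not the actual environment.

    import snowflake.connector

    # Connection parameters and object names are placeholders.
    conn = snowflake.connector.connect(account="my_account", user="etl_user", password="***",
                                       warehouse="LOAD_WH", database="RAW", schema="STAGING")

    # A pipe that micro-batches files from an external stage as they land.
    conn.cursor().execute("""
        CREATE PIPE IF NOT EXISTS raw.staging.events_pipe
        AUTO_INGEST = TRUE
        AS COPY INTO raw.staging.events
        FROM @raw.staging.events_stage
        FILE_FORMAT = (TYPE = 'JSON')
    """)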

Proven ability to collaborate with data scientists and MLOps teams to deliver reliable, production-ready data for model training, inference, and monitoring.

Extensive knowledge in programming with Resilient Distributed Datasets (RDDs).

Solid working experience with Tableau; enabled JDBC/ODBC connectivity from Tableau to Hive tables.

TOOLS AND TECHNOLOGIES:

Bigdata Ecosystem

HDFS, Yarn, MapReduce, Spark, Kafka, Kafka Connect, Hive, Airflow, Stream Sets, Sqoop, HBase, Flume, Oozie

Hadoop Distributions

Apache Hadoop, Cloudera CDP, Hortonworks HDP, Amazon AWS (EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, SQS, DynamoDB, Redshift, ECS, QuickSight, Kinesis), Microsoft Azure (Databricks, Data Lake, Blob Storage, Azure Data Factory, SQL Database, SQL Data Warehouse, Cosmos DB, Azure Active Directory).

Scripting Languages

Python, Java, Scala, R, PowerShell Scripting, HiveQL.

Cloud Environment

Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP).

NoSQL Database

Cassandra, Redis, MongoDB, Neo4j.

Database

MySQL, Oracle, Teradata, MS SQL SERVER, PostgreSQL, DB2.

ETL/BI

Snowflake, Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI.

Operating systems

Linux (Ubuntu, Centos, RedHat), Windows (XP/7/8/10).

Version Control

Git, SVN, Bitbucket.

Others

Machine learning, NLP, Spring Boot, Jupyter Notebook, Terraform, Docker, Kubernetes, Jenkins, Ansible, Splunk, Jira.

PROFESSIONAL EXPERIENCE:

Humana (Plano, USA) Apr 2024 - Present

Data Engineer

Responsibilities:

Designed and deployed scalable ETL pipelines using AWS Glue, Lambda, and Step Functions, processing over 2 TB of data daily with fault-tolerant execution.
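As a sketch of this pattern (not the production code), a hypothetical Lambda handler that starts an AWS Glue job for each file landing in S3, with Step Functions and CloudWatch assumed to handle retries and monitoring around it; the job and bucket names are placeholders.

    import boto3

    glue = boto3.client("glue")

    def handler(event, context):
        """Illustrative Lambda handler: start a Glue job for the S3 object
        that triggered the event. Job and bucket names are placeholders."""
        record = event["Records"][0]["s3"]
        run = glue.start_job_run(
            JobName="daily-claims-etl",  # hypothetical Glue job
            Arguments={"--source_path": f"s3://{record['bucket']['name']}/{record['object']['key']}"},
        )
        return {"JobRunId": run["JobRunId"]}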

Migrated legacy on-prem ETL jobs to serverless AWS Glue and S3, reducing infrastructure costs by 70%.

Automated end-to-end data workflows using AWS Step Functions and CloudWatch Events, improving data availability SLA from 95% to 99.9%.

Integrated AWS Lambda with DynamoDB and SNS to build a real-time data ingestion and alerting system.

Collaborated with cross-functional content owners, enabling alignment between documentation strategy, metadata design, and AI-powered retrieval systems.

Migrated historical datasets from AWS S3 and on-premises Hadoop clusters to Azure Data Lake, implementing schema evolution handling for minimal downtime during ingestion.

Built streaming pipelines combining Kafka and Databricks Structured Streaming for near real-time ingestion and processing of healthcare transactions and patient interactions.

Collected data from AWS S3 buckets in near real time using Spark Streaming and performed the necessary transformations.

Developed custom ETL solutions, batch processing, and a real-time data ingestion pipeline to transport data into and out of Hadoop using PySpark and shell scripting.

Performed multiple types of ETL testing, such as extracting data from databases with the necessary queries, transforming it, and loading it into data warehouse servers.

Collaborated with ML engineers to deploy models using MLflow and SageMaker Pipelines, automating training-to-deployment lifecycles.

Implemented CI/CD pipelines using Jenkins and Terraform for automated deployment, reducing release time from days to hours.

Designed & implemented scalable documentation ingestion pipelines using Amazon Q Business and Bedrock, enabling AI-powered enterprise search and knowledge discovery.

Developed real-time ingestion pipeline using Kafka + Spark Streaming to deliver low-latency data to ML models in production.
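A minimal PySpark Structured Streaming sketch of the Kafka-to-model-serving pattern described above; the broker address, topic, schema, and output paths are illustrative assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    # Topic, broker, schema, and output paths are placeholders.
    schema = StructType().add("member_id", StringType()).add("score", DoubleType())

    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "claims-events")
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    (events.writeStream
           .format("parquet")
           .option("checkpointLocation", "s3://bucket/chk/claims-events")
           .option("path", "s3://bucket/features/claims-events")
           .start())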

Assessed LLM response quality via ground truth datasets and automated evaluation pipelines, ensuring outputs aligned with compliance and business standards.

Containerized legacy applications using Docker, enabling seamless deployment across environments.

Collaborated in Agile teams (SCRUM, Kanban) to deliver incremental features, ensuring rapid feedback loops and improved customer satisfaction.

Implemented Retrieval-Augmented Generation (RAG) solutions using Amazon Bedrock, improving accuracy and trustworthiness of LLM responses by integrating structured/unstructured enterprise data.

Worked on creating Hive tables and writing Hive queries for data analysis to meet business requirements; experienced in using Sqoop to import and export data from Oracle and MySQL.

Extensive experience with Hadoop big data integration and ETL for extracting, loading, and transforming ERP data.

Wrote SQL and PL/SQL scripts to extract data from databases and for testing purposes.

Strong hands-on experience in creating and modifying SQL stored procedures, functions, views, indexes, and triggers.

Experience in using various Python libraries such as Pandas, SciPy, TensorFlow, Keras, Scikit-learn.

Worked on both waterfall and agile methodologies in a fast-paced environment.

Environment: Hadoop, Hive, Spark, AWS, ECS, Snowflake, Python, Java, Scala, SQL, Kafka, Airflow, HBase, Teradata, Cassandra, Talend, Tableau, Git.

Meta (Remote, USA) May 2022 - Apr 2023

Data Engineer

Responsibilities:

Improved data warehouse query performance by 30% by rewriting ETL logic in Presto/Scuba and introducing smarter partitioning, materialized views, and Z-order clustering for better scan efficiency.

Built and maintained real-time and batch data pipelines processing over 5 PB of daily events using Spark, Presto, and Scuba, ensuring SLA compliance and data freshness for 30+ internal product teams.

Collaborated with ML engineers to build and serve real-time features to Meta’s Feature Store, reducing model training latency by 25%.

Worked with internal Facebook data platforms such as ODS, Scribe, Scuba, and Airflow to build resilient and observable data pipelines supporting ML feature engineering and model evaluation.

Constructed a data warehouse from scratch using Python, SQL, and AWS tools, and worked with the DevOps team to ensure seamless integration of ETL pipelines into the warehouse.

Designed and maintained normalized and denormalized data models for core entities such as user, post, event, and interaction graphs, ensuring consistency across thousands of internal use cases.

Implemented end-to-end logging, error handling, and recovery logic in Spring Batch flows to support transactional integrity and data consistency.

Designed a modular ETL framework in Cloud Composer (Airflow) for 200+ DAGs, enabling standardized pipeline deployment across teams and reducing onboarding time by 30%.

Optimized BigQuery storage and compute costs by implementing partitioning, clustering, and materialized views, reducing monthly query costs by 40% without performance degradation.
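A hedged sketch of the BigQuery cost-optimization techniques mentioned above (partitioning, clustering, and a materialized view), using the google-cloud-bigquery client; the dataset and table names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()  # project, dataset, and table names below are placeholders

    # Partition by event date and cluster by user_id so queries scan less data.
    client.query("""
        CREATE TABLE IF NOT EXISTS analytics.events_partitioned
        PARTITION BY DATE(event_ts)
        CLUSTER BY user_id
        AS SELECT * FROM analytics.events_raw
    """).result()

    # Pre-aggregate a hot query path as a materialized view.
    client.query("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.daily_active_users AS
        SELECT DATE(event_ts) AS day, APPROX_COUNT_DISTINCT(user_id) AS dau
        FROM analytics.events_partitioned
        GROUP BY day
    """).result()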

Led schema evolution and backfill strategy for GCS and BigQuery datasets supporting changing product analytics requirements, ensuring consistency across 10+ teams with zero data loss.

Established content quality metrics & monitoring dashboards, driving continuous improvements in documentation ingestion and AI response reliability.

Designed DAGs in Apache Airflow to automate and monitor data workflows across AWS S3, Redshift, and Hive, reducing data delivery latency by 60%.
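A minimal Airflow DAG sketch of this kind of orchestration; the DAG id, schedule, and task callables are hypothetical stubs standing in for the real S3, Redshift, and Hive steps.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Task callables are stubs; in a real pipeline they would call S3,
    # Redshift, and Hive clients. DAG id and schedule are illustrative.
    def extract_from_s3(**_):  ...
    def load_to_redshift(**_): ...
    def refresh_hive(**_):     ...

    with DAG(dag_id="s3_to_redshift_daily",
             start_date=datetime(2023, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:
        extract = PythonOperator(task_id="extract_from_s3", python_callable=extract_from_s3)
        load    = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)
        refresh = PythonOperator(task_id="refresh_hive", python_callable=refresh_hive)

        extract >> load >> refresh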

Engineered distributed ETL pipelines using Apache Spark on AWS EMR, processing over 10TB of clickstream and engagement data daily to power real-time user analytics.

Reduced streaming job lag by 60% through batch size tuning, parallelism adjustment, and optimized join strategies for high throughput use cases.

Designed and maintained scalable ETL pipelines using Apache Spark and Airflow to process structured and semi-structured data across multiple data sources.

Implemented Spark scripts using SparkSession, Python, and Spark SQL to access Hive tables in Spark for faster data processing.
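For example, a short PySpark sketch of querying a Hive table through a SparkSession with Hive support enabled; the database, table, and output path are illustrative.

    from pyspark.sql import SparkSession

    # Table, column, and path names are illustrative.
    spark = (SparkSession.builder
             .appName("hive-access-sketch")
             .enableHiveSupport()          # lets Spark SQL read the Hive metastore
             .getOrCreate())

    daily_events = spark.sql("""
        SELECT user_id, event_type, COUNT(*) AS cnt
        FROM warehouse.user_events
        WHERE ds = '2023-01-01'
        GROUP BY user_id, event_type
    """)

    daily_events.write.mode("overwrite").parquet("s3://bucket/agg/daily_events/")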

Built data pipelines using Spark and Airflow to transform and load data into Redshift, supporting stakeholder-facing dashboards built in Tableau and Power BI.

Delivered clean, well-modeled datasets for BI dashboards by working closely with stakeholders and analysts, reducing reporting turnaround time by 50%.

Environment: Hadoop, Hive, Spark, AWS, EC2, S3, Lambda, Glue, Elasticsearch, RDS, DynamoDB, Redshift, ECS, Snowflake, Python, Java, Scala, SQL, Kafka, Airflow, Teradata, Cassandra, Talend, Tableau, Git.

Wells Fargo (Charlotte, USA) Sept 2019 - Apr 2022

Data Engineer

Responsibilities:

Worked with Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlled and granted database access; and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.

Designed and implemented Spring Batch jobs to automate high-volume data ingestion and transformation workflows, reducing manual processing by 80%.

Created and tuned complex SQL queries (Oracle, Snowflake) to support reporting and analytics, improving query performance by 40%.

Automated ETL pipelines using Python and Control-M, achieving 99% on-time job completion with zero data loss incidents.

Created ADF pipelines to load data from on-premises storage and databases into Azure cloud storage and databases.

Implemented role-based access control and encryption for data-at-rest and in-transit across Azure platforms.

Leveraged Delta Lake’s time travel and data versioning features for auditing and compliance use cases, enabling reproducible reporting and historical trend analysis.
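A brief Delta Lake time-travel sketch along the lines described above, reading a table as of an earlier version and as of a timestamp; the storage path, version number, and date are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-time-travel-sketch").getOrCreate()

    path = "abfss://lake@account.dfs.core.windows.net/claims_delta"   # placeholder path

    # Read the table as it existed at an earlier version / point in time.
    as_of_version = spark.read.format("delta").option("versionAsOf", 42).load(path)
    as_of_date    = spark.read.format("delta").option("timestampAsOf", "2022-01-01").load(path)

    # Compare snapshots to reproduce a historical report or audit a change.
    print(as_of_version.count(), as_of_date.count())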

Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure Synapse) and processed the data in Azure Databricks.

Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.

Created Databricks notebooks using SQL and Python, and automated notebook runs using jobs.

Designed and deployed data pipelines using Data Lake, Databricks, and Apache Airflow.

Migrated data from Oracle using SSIS and Azure Data Factory (ADF).

Responsible for migrating critical components from on-premises systems to Azure cloud services; wrote SnowSQL programs and Snowflake SQL queries.

Built scalable data warehouse solutions in Azure Synapse, using dedicated SQL pools for analytics and partitioning strategies to improve query performance by 45%.

Implemented CI/CD workflows for ML pipelines using Azure DevOps and MLflow, enabling automated training, testing, and deployment across dev/stage/prod environments.

Proficient in writing Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows using Airflow and Apache NiFi.

Led data quality initiatives by building preprocessing workflows in Python, incorporating chunking, vectorization, and metadata strategies to improve downstream model performance.

Worked with Kafka to process data ingested in real time from flat files and APIs.

Generated reports on predictive analytics using Python and Tableau, including visualizations of model performance and prediction results.

Experience tuning Spark applications (batch interval time, level of parallelism, memory) to improve processing time and efficiency.

Developed a Spark-based ingestion framework for ingesting data into HDFS, creating tables in Hive, and executing complex computations and parallel data processing.

Troubleshot and updated existing PowerApps and Power Automate flows.

Enhanced scripts of existing Python modules. Worked on writing APIs to load the processed data to HBase tables.

Participated in daily scrums and other design-related meetings while working in an Agile development environment.

Environment: Azure, Azure Databricks, Data Lake (ADLS), MySQL, Snowflake, MongoDB, PowerBI, Azure AD, Git, Blob Storage, Data Factory, Data Storage Explorer, Scala, Spark, Airflow, Python, SQL, Java, Teradata, Oracle, Tableau, SVN, Jira.

Unicommerce Pvt Ltd (Bangalore, India) Sept 2017 - May 2019

Big Data Developer

Responsibilities:

Using Hadoop/Big Data concepts, loaded and transformed large sets of structured, semi-structured, and unstructured data.

Defined Spark Streaming APIs to create a learner data model that collects real-time data from Kafka and links it to the Cassandra sink.

Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.

Software development in Python (libraries used: Beautiful Soup, NumPy, SciPy, Matplotlib, pandas DataFrames, network, urllib2, MySQL for database connectivity).

Developed SQL queries to search the Repository DB for deviations from the Company's ETL Standards for items such as Sources, Targets, Transformations, Log Files, Mappings, Sessions, and Workflows that were produced by users.

Analyzed and improved relevant data stored in Snowflake using PySpark and Spark SQL.

Optimized Hive queries to extract the customer information from HDFS.

Worked on a Scala log producer that monitors application logs, transforms incremental logs, and transmits them to a Kafka-based log collecting platform.

Built scalable and fault-tolerant batch processing pipelines with Spring Batch and Java, orchestrating complex data ingestion and transformation tasks across multiple data sources.

Imported data into Spark RDDs from sources such as HDFS and HBase.

Responsible for SQL development and SSIS/SSAS/SSRS development.

Hands-on experience developing SQL scripts for automation purposes.

Used Talend for big data integration with Spark and Hadoop.

Read, processed, and generated Spark SQL tables using the Scala API.

Worked on BI reporting with OLAP for big data.

Worked in an Agile methodology and used Jira to maintain project user stories.

Documented batch job architecture, data lineage, and dependency flows using tools like Lucidchart and Confluence for audit and team collaboration.

Environment: Python, PySpark, Spark, Spark MLlib, Spark SQL, PowerBI, YARN, Hive, Pig, Scala, NiFi, Hadoop, NoSQL, Sqoop, SSIS, MySQL.

Dunzo Ltd (Hyderabad, India) June 2016 - July 2017

Data Analyst

Responsibilities:

Developed custom reporting dashboards, reports, and scorecards, mainly focusing on Tableau and its components, from multiple sources including Oracle databases, SQL Server, Teradata, MS Access, DB2, Excel, flat files, and CSV.

Performed the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

Performed data collection, cleaning, wrangling, and analysis on data sets in Python.

Performed statistical data analysis and data visualization using Python and R.

Implemented Power BI as a front-end BI tool and MS SQL Server as a back-end database to design and develop dashboards, workbooks, and complicated aggregate computations.

Used Python (NumPy, SciPy, pandas, scikit-learn, seaborn) to develop a variety of models and algorithms for analytic purposes.
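As a representative (hypothetical) example of this kind of modeling work, a small scikit-learn pipeline with pandas; the file name, feature columns, and target are illustrative assumptions.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    # File name, feature columns, and target are placeholders.
    df = pd.read_csv("orders.csv")
    X, y = df[["order_value", "items", "delivery_minutes"]], df["churned"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))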

Responsible for building data analysis infrastructure to collect, analyze, and visualize data.

Developed advanced SQL queries for data analysis using multi-table joins, group functions, subqueries, and set operations, as well as stored procedures and user-defined functions (UDFs).

Performed data quality analytics, communicated and executed data validation, analyzed data quality results, measured and audited large volumes of data for quality issues and improvements, and performed root cause analysis of data anomalies.

Worked with other team members to analyze and evaluate data.

Optimized data collection procedures and generated reports on a weekly, monthly, and quarterly basis.

Used Python and R libraries such as NumPy, pandas, Matplotlib, SciPy, TensorFlow, Keras, scikit-learn, OpenCV, and ggplot2 to develop a variety of models and methods for analytic purposes.

Collected, analyzed, and extracted data from a variety of sources to create reports, dashboards, and analytical solutions, and assisted with debugging Tableau dashboards.

Worked in an Agile methodology and used Jira to maintain project user stories.

Environment: Python, GitLab, Tableau, Snowflake SQL, Oracle, MySQL, PowerBI, SQL Server Management Studio, Excel, Agile Methodology.

WinWire (Hyderabad, India) June 2015 - May 2016

Software Engineer

Responsibilities:

Developed and maintained scalable backend services using Java (Spring Boot), reducing API response times by 40%.

Led the migration of a monolithic Java application to microservices architecture, improving deployment speed and scalability.

Designed a multi-threaded batch processing system in Java, increasing data throughput by 3x.

Integrated third-party APIs using Java HTTP Clients, with robust error handling and retry mechanisms.

Wrote unit and integration tests using JUnit and Mockito, achieving 95% test coverage on critical services.

Worked in an Agile methodology and used Jira to maintain project user stories.

Environment: Java/J2EE, Spring, Oracle, Linux, JDBC, Maven, Git, Jira, HTML, CSS, Angular, Splunk, Log4j, Servlets, Struts, JSP, WebLogic/SQL, Agile Methodology.

EDUCATION DETAILS

Master's in Data Science (University of Dayton)

Bachelor's in Computer Science (JNTU)


