Sridevi Bommidi
Sr. Data Engineer
Email: *******.****@*****.***
Contact: 616-***-****
LinkedIn: https://www.linkedin.com/in/sridevi-bommidi-42945040/
PROFESSIONAL SUMMARY
Innovative and results-driven Senior Data Engineer with 10+ years of experience architecting high-performance data platforms across AWS, Azure, and GCP, specializing in real-time streaming and big data solutions built with Spark, Kafka, and Databricks.
Adept at modernizing data ecosystems, driving cloud migrations, and enabling business insights through scalable ETL pipelines and advanced analytics. Collaborative, delivery-focused engineer with a strong foundation in data architecture, cloud engineering, and analytics.
Over 10 years of IT experience as a Data Engineer with strong expertise in building enterprise-scale data platforms, cloud data pipelines, and real-time analytics solutions.
Skilled in Big Data technologies including Hadoop, Apache Spark (PySpark & Spark SQL), Spark Streaming, MapReduce, Hive, Pig, Sqoop, Flume, Kafka, Beam, and data processing frameworks like Databricks and Snowflake.
Experienced with major cloud platforms:
AWS: EMR, EC2, S3, Lambda, Glue, Redshift, RDS, DynamoDB, Athena, SNS, SQS, Kinesis, Step Functions, CloudWatch
Azure: Blob Storage, ADLS Gen2, Azure Data Factory, Synapse Analytics, Azure SQL DB, Azure Functions, Event Hubs, Stream Analytics, Key Vault
GCP: Dataflow (Beam), Dataproc, BigQuery, Composer, Cloud Functions, Pub/Sub, Cloud Storage
Proficient in coding with Python, Scala, and basic Java for data transformation, orchestration, automation, and ML/ETL development.
Strong experience in querying and managing data using SQL, PL/SQL, T-SQL; worked with databases including SQL Server, Oracle, MySQL, Postgres, Teradata, DB2, and Netezza.
Familiar with DevOps tools including Jenkins for CI/CD automation, Docker for containerization, and basic knowledge of Kubernetes for container orchestration.
Applied machine learning techniques such as regression, classification, clustering, and ensemble models using Spark MLlib and TensorFlow in PoC pipelines.
Experience with orchestration and scheduling tools including Airflow, Cloud Composer, Step Functions, Oozie, Autosys, Control-M, and Cron jobs.
Hands-on experience working with on-premises Hadoop platforms like Cloudera and Hortonworks for large-scale distributed data environments.
Expertise in ETL development using tools such as Informatica, SSIS, and Azure Data Factory; skilled in dimensional data modeling (Star/Snowflake schema), SCD, and data warehousing using Erwin.
Proficient in handling various data formats including CSV, JSON, AVRO, ORC, Parquet, XML, and TXT for ingestion, transformation, and analytics.
Built dashboards and reports using BI tools such as Tableau, Power BI, and Looker with filters, drilldowns, and live data connections to BigQuery and Synapse.
Strong documentation and design abilities using tools such as Confluence, SharePoint, and Lucidchart for architecture diagrams and data flow specifications.
Worked with OLAP tools such as SSAS and AAS to create multi-dimensional cubes and integrate with Excel and BI platforms.
Experience with version control and collaboration tools including Git, GitHub, GitLab, Bitbucket, and SVN.
Deep involvement in SDLC activities including requirement gathering, data analysis, solution design, pipeline development, testing, deployment, and support.
Familiar with project management and ticketing tools such as JIRA, Rally, TFS, Azure Boards, and ServiceNow for sprint planning, tracking, and incident resolution.
Hands-on experience in building and deploying machine learning models for churn prediction, flight delay forecasting, and anomaly detection using Spark MLlib, TensorFlow, and Azure ML Studio.
Integrated real-time ML predictions into data warehouses such as BigQuery, Synapse, and Redshift to support operational analytics and business intelligence initiatives.
Skilled in feature engineering, model training, batch scoring, and real-time prediction streaming across financial, aviation, and public sector domains.
Self-motivated and collaborative team player with a quick learning curve, strong communication skills, and a commitment to continuous improvement.
TECHNICAL SKILLS:
Big Data Technologies
Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Zookeeper, Apache Flume, Apache Airflow, Cloudera, HBase, Kafka
Programming Languages
Python, PL/SQL, SQL, Scala, C, C#, C++, T-SQL, PowerShell scripting, JavaScript, Perl scripting.
Cloud Technologies
AWS, Microsoft Azure, GCP, Databricks, Snowflake
Cloud Services
Azure Data Lake Storage Gen2, Azure Data Factory, Blob Storage, Azure SQL DB, Azure Event Hubs, Azure Synapse Analytics, AWS Redshift, AWS RDS, Amazon SQS, Amazon S3, AWS EMR, Lambda, AWS SNS, Dataflow, BigQuery, Cloud Functions, Compute Engine (VM), Databricks, Delta Tables, Clusters
Databases
MySQL, SQL Server, IBM DB2, Postgres, Oracle, MS Access, Teradata
NoSQL Databases
MongoDB, Cassandra, HBase
Development Strategies
Agile, Lean Agile, Pair Programming, Waterfall, and Test-Driven Development.
ETL, Visualization & Reporting
Tableau, Power BI, Informatica, Talend, SSIS, and SSRS
Frameworks
Django, DBT, Spark MLlib
Version Control & Containerization Tools
Jenkins, Git, CircleCI, SVN, Docker
Operating Systems
Unix, Linux, Windows, Mac OS
Tools
PyCharm, Eclipse, Visual Studio, SQL*Plus, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistant, Postman
Machine Learning Techniques
Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Association Rules, NLP and Clustering, Pandas, NumPy, Matplotlib, TensorFlow, PyTorch
WORK EXPERIENCE:
Client: U.S. Bank, Minneapolis, Minnesota | Senior Data Engineer | Duration: Dec 2023 – Present
Roles & Responsibilities:
Led the architectural design and implementation of a highly scalable enterprise-grade data lake solution using Microsoft Azure, supporting complex and high-volume airline operations data for real-time analytics and historical reporting across various departments.
Collaborated with data analysts, architects, and business stakeholders to gather technical requirements and convert them into optimized ETL workflows, ensuring smooth data transformation across raw, staging, and curated layers.
Designed and developed end-to-end ETL pipelines using Azure Data Factory (ADF) and SSIS for ingesting, cleansing, and transforming structured and semi-structured datasets including flight schedules, crew data, and customer feedback.
Built PySpark-based distributed processing frameworks in Azure Databricks to handle large-scale aviation and telemetry datasets, improving compute efficiency and reducing overall ETL job runtimes by over 30%.
Engineered ingestion workflows to automate the extraction and processing of JSON, CSV, and TXT files from various third-party vendors, APIs, and on-prem legacy systems into Azure Blob Storage.
Integrated Apache Kafka to enable near real-time ingestion of flight telemetry and IoT data, powering operational dashboards with low-latency streaming data for mission-critical alerts and analytics.
Led a complete data migration initiative from on-premises SQL Server data warehouses to Azure Synapse and Blob Storage, modernizing the enterprise data platform and increasing scalability and availability.
Established a custom data quality framework in PySpark, implementing validations, null checks, schema enforcement, and automated alerts to ensure the integrity of ingested and transformed data (see the illustrative sketch at the end of this section).
Implemented secure data access strategies using Azure IAM and custom RBAC policies, combined with row-level security and encryption to enforce privacy controls and meet organizational compliance requirements.
Delivered highly optimized OLAP data models within Azure Synapse and exposed curated datasets to business stakeholders via live Power BI dashboards, enabling real-time decision making.
Managed end-to-end job orchestration using ADF pipeline dependencies and Apache Airflow DAGs, handling both time-based and event-triggered ETL workflows with alerting and retries.
Enabled continuous integration and deployment (CI/CD) using Jenkins pipelines, automating deployment of ADF artifacts and Databricks notebooks across dev, test, and production environments.
Maintained complete source control, versioning, and peer reviews using GitHub, ensuring proper auditability and rollback in multi-developer environments.
Wrote and optimized T-SQL queries and stored procedures for data validation, audit reporting, and pipeline debugging in SQL Server environments.
Developed and maintained Linux Shell scripts to automate file validations, environment health checks, and system-level tasks within batch ETL pipelines.
Actively participated in Agile ceremonies using Jira, including sprint planning, daily stand-ups, retrospectives, and backlog grooming, contributing to the delivery of data products in a timely manner.
Enforced data privacy best practices by implementing column masking and encryption at rest and in transit, in compliance with organizational and GDPR security standards.
Environment: Azure, Azure Data Factory, Databricks, Azure Synapse, Blob Storage, SSIS, PySpark, Kafka, Power BI, GitHub, Jenkins, Jira, JSON, CSV, TXT, Oracle, SQL Server
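The following is a minimal, illustrative PySpark sketch of the data quality pattern described above (schema enforcement, null checks, and a simple alert hook); it is not production code from this engagement, and the storage path, column names, and threshold are hypothetical.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("dq_checks").getOrCreate()

# Hypothetical schema for an incoming telemetry feed.
expected_schema = StructType([
    StructField("flight_id", StringType(), False),
    StructField("departure_ts", TimestampType(), True),
    StructField("status", StringType(), True),
])

# Enforce the expected schema at read time; in default PERMISSIVE mode,
# malformed values surface as nulls and are caught by the checks below.
df = spark.read.schema(expected_schema).json("abfss://raw@<account>.dfs.core.windows.net/flights/")

# Null-rate check on a required column; the threshold and alert channel are placeholders.
null_rate = df.filter(F.col("flight_id").isNull()).count() / max(df.count(), 1)
if null_rate > 0.01:
    print(f"ALERT: flight_id null rate {null_rate:.2%} exceeds threshold")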
Client: Delta Airlines, Atlanta, GA | Senior Data Engineer | Duration: Oct 2020 – Nov 2023
Roles & Responsibilities:
Designed and implemented a highly scalable and secure cloud-native data platform using Google Cloud Platform (GCP), enabling seamless ingestion, transformation, and consumption of high-volume banking and insurance analytics data to support enterprise reporting and machine learning use cases.
Worked closely with business analysts, product owners, and architects to gather requirements and design robust data engineering workflows for batch and real-time financial data processing with clear lineage and transformation logic.
Led the adoption and integration of Ascend.io, replacing legacy Python scripts and complex Cloud Composer DAGs with declarative, autonomous data pipelines integrated with BigQuery, Cloud Storage, and Cloud Functions, reducing orchestration overhead by over 40%.
Developed and optimized ETL pipelines in Python and Shell scripting to support ingestion and transformation of structured data from transactional banking systems into analytical layers for reporting and regulatory compliance.
Migrated legacy Oracle and Teradata warehouses into BigQuery, orchestrating schema design, parallel extraction, transformation logic, and loading workflows, ensuring zero data loss and significantly improved query performance.
Built high-performance Spark-Scala batch jobs for large-scale analytical workloads that included data enrichment, deduplication, customer scoring, and time-based aggregation of sensitive financial transactions.
Designed ingestion pipelines to process semi-structured data such as JSON, CSV, and AVRO, automating delivery into Google Cloud Storage from both internal batch exports and external APIs.
Integrated Apache Kafka for real-time streaming ingestion of transaction and user behavior data, enabling near real-time analytics and fraud detection dashboards for compliance teams.
Led the end-to-end cloud migration of legacy on-prem datasets to GCP Cloud Storage and BigQuery, including sensitive workloads from Teradata and Oracle, with strict SLA adherence and rollback mechanisms.
Utilized Snowflake for specialized warehousing and cost-efficient analytical queries, bridging multi-cloud data needs across insurance data and marketing analytics domains.
Built and implemented data quality frameworks including automated validation checks, anomaly detection, and exception handling logic using Spark and Python, integrated with monitoring tools.
Enforced organizational security and privacy compliance by implementing row-level security, column masking, and encryption at rest and in transit, fully aligned with GDPR and internal policies.
Applied GCP IAM roles and custom access control logic to ensure fine-grained access management across data sets based on user roles, departments, and sensitivity labels.
Developed and managed semantic OLAP data models in Power BI, connected live to BigQuery datasets for multi-dimensional analysis and interactive visualizations by leadership and operational teams.
Built modular orchestration workflows using Apache Airflow, managing dependencies across ingestion, transformation, and export tasks, with alerting, retries, and SLA monitoring (see the illustrative sketch at the end of this section).
Developed and deployed Dockerized Spark jobs across environments for consistent batch ETL execution and explored Kubernetes clusters for orchestrating Spark applications with auto-scaling capabilities.
Delivered curated data marts and analytical outputs directly to business users through Power BI dashboards and direct BigQuery interfaces, enabling self-service and ad-hoc exploration of churn, revenue, and claim metrics.
Enabled CI/CD pipelines using Jenkins for version-controlled deployment of Spark jobs and pipeline artifacts, and used GitHub for collaborative code development and governance.
Implemented file format optimization strategies using Parquet and AVRO for storing processed datasets in cloud storage, ensuring efficient compression and compatibility with analytics engines.
Played a key role in Agile sprint cycles, actively contributing to sprint planning, retrospectives, and Jira task tracking, while collaborating with cross-functional teams to deliver high-impact data solutions.
Environment: GCP, Spark (Scala), BigQuery, Cloud Storage, Snowflake, Kafka, Ascend.io, Docker, Airflow, Jenkins, GitHub, Power BI, Oracle, Teradata, JSON, CSV, AVRO, Parquet, Kubernetes, Agile, Jira
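The following is a minimal, illustrative Airflow sketch of the orchestration pattern described above (retries, an SLA, and a scheduled load task); it is not code from this engagement, and the DAG id, bucket, dataset, and load logic are hypothetical placeholders.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,                       # automatic retry on transient failures
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),          # an SLA miss triggers Airflow's SLA alerting
}

def load_to_bigquery(**context):
    # Placeholder for the real load step (e.g., a google-cloud-bigquery load job).
    print("loading gs://<bucket>/transactions/*.avro into <project>.<dataset>.transactions")

with DAG(
    dag_id="transactions_daily_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)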
Client: State of Ohio, Columbus, OH | Sr. Data Engineer | Duration: Jan 2018 – Aug 2020
Roles & Responsibilities:
Designed and deployed a robust hybrid cloud architecture on AWS to modernize the state’s public sector analytics infrastructure, enabling centralized reporting and data sharing across various state agencies and departments.
Collaborated with stakeholders across public health, education, and administrative domains to gather requirements and design ETL workflows that ensured secure, consistent, and accurate data transformation using Informatica PowerCenter and Python.
Developed and maintained large-scale ETL pipelines to ingest and process government data from Oracle and SQL Server databases into AWS S3 and Redshift, enabling a unified data lake for reporting and analysis.
Built distributed Spark-Scala batch processing jobs to cleanse, enrich, and transform citizen and operational datasets, ensuring scalability and performance for regulatory and internal reporting.
Engineered ingestion workflows for structured and semi-structured data, including CSV, TXT, and AVRO formats, sourced from internal departments and third-party government contractors, and loaded into S3 for downstream processing (see the illustrative sketch at the end of this section).
Integrated Kafka streaming pipelines to support real-time ingestion of IoT data from sensors embedded in utility grids, street infrastructure, and emergency response systems, enabling real-time monitoring and alerts.
Led the migration of legacy on-premises Oracle data warehouses to AWS Redshift and S3, applying best practices for schema conversion, validation, and job replatforming to minimize business disruption.
Utilized Apache Hive for ad hoc querying of large volumes of raw historical data stored in S3 and established partitioning strategies to optimize performance.
Built analytical data marts and OLAP datasets using partitioned and clustered models in BigQuery and Amazon Redshift to support fast querying by state analysts and departmental dashboards.
Designed and deployed data quality frameworks that included rule-based profiling, anomaly detection, validation checks, and automated notifications, ensuring integrity and traceability across pipelines.
Enforced comprehensive data privacy controls by applying encryption, row-level security, and policy-based access in accordance with state privacy regulations and federal compliance mandates.
Applied AWS IAM policies and custom role-based access control mechanisms to securely isolate sensitive citizen data and enforce strict audit trails for data usage.
Enabled self-service data consumption through BI tools like Power BI, while also providing curated API endpoints for inter-agency reporting systems and third-party partners.
Orchestrated complex pipeline dependencies using a hybrid setup of Apache Oozie and Apache Airflow, ensuring workflow reliability, error handling, and SLA adherence across batch and streaming ETL workloads.
Integrated Jenkins for CI/CD automation of data pipeline deployments and used GitHub for version control, branching strategies, and peer reviews to ensure code quality and governance.
Participated in Agile ceremonies including sprint planning, backlog grooming, and retrospectives, using Jira for user story management, release tracking, and reporting progress to technical leads and stakeholders.
Environment: AWS, S3, Redshift, Spark (Scala), Kafka, Hive, Informatica PowerCenter, Talend, Oracle, SQL Server, BigQuery, Oozie, Airflow, Power BI, Jenkins, GitHub, JSON, CSV, AVRO, TXT, Encryption, IAM, Agile, Jira
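The following is an illustrative PySpark sketch of the ingest, cleanse, and partition pattern described above (the production batch jobs in this role were written in Spark with Scala); the bucket names, business key, and partitioning column are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("agency_batch_cleanse").getOrCreate()

# Read raw agency exports landed in S3 (hypothetical bucket and layout).
raw = spark.read.option("header", True).csv("s3a://<raw-bucket>/agency_exports/")

cleansed = (
    raw.dropDuplicates(["record_id"])               # deduplicate on an assumed business key
       .filter(F.col("record_id").isNotNull())      # basic validity filter
       .withColumn("load_date", F.current_date())   # enrichment column used for partitioning
)

# Writing partitioned Parquet keeps downstream Hive/Athena-style scans pruned.
cleansed.write.mode("overwrite").partitionBy("load_date").parquet("s3a://<curated-bucket>/agency/")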
Client: Merck, St. Louis, MO | ETL Developer | Duration: Dec 2016 – Dec 2017
Roles & Responsibilities:
Designed and implemented a secure and scalable Azure-based data integration architecture to support clinical trials, regulatory reporting, and pharmaceutical data pipelines across multiple therapeutic areas within the organization.
Collaborated with data analysts, clinical researchers, and compliance stakeholders to gather business requirements and define ETL transformation logic for consolidating heterogeneous datasets into a centralized analytics platform.
Developed complex ETL workflows using Informatica PowerCenter to extract, cleanse, and load research data from various systems into Azure SQL Database and Azure Synapse for enterprise-wide accessibility.
Built batch data transformation pipelines using Spark (Scala) within Azure HDInsight clusters to process large volumes of clinical trial data, medical records, and patient responses from multiple external labs and internal systems.
Implemented data ingestion workflows for structured and semi-structured files, including CSV and TXT formats, sourced from laboratory systems, partner portals, and clinical trial platforms into Azure Data Lake Storage (ADLS) (see the illustrative sketch at the end of this section).
Migrated legacy Informatica-based workflows and flat file ingestion routines into Azure Data Factory (ADF) pipelines for improved automation, modularity, and scheduling flexibility.
Scheduled and orchestrated data transformation pipelines using ADF triggers integrated with Apache Airflow for job dependency management, error recovery, and SLA compliance.
Applied automated data validation rules and built checkpoints to ensure clinical data accuracy and adherence to study protocols, significantly reducing downstream reporting errors and rework.
Enforced data access security using Azure Active Directory (Azure AD) integration and custom role-based access controls (RBAC) to ensure HIPAA-compliant protection of sensitive patient information.
Implemented end-to-end encryption mechanisms to secure sensitive research and trial datasets both at rest and in transit, aligning with internal data governance and industry privacy regulations.
Designed and published SSAS-based OLAP cubes to facilitate interactive data slicing and drill-down reporting for auditors and regulatory affairs teams using Excel and Power BI.
Delivered curated data sets for consumption via Power BI dashboards and direct Azure Synapse SQL queries, enabling advanced analytics on trial performance, site compliance, and participant metrics.
Enabled CI/CD automation of ETL components and PowerCenter deployment scripts using Jenkins, with version control managed via SVN to maintain transparency, rollback, and team collaboration.
Participated in internal Agile ceremonies including sprint planning and retrospectives, and worked closely with QA and compliance teams to prepare deployment documentation and validation scripts.
Documented metadata, data dictionaries, and data flow mappings for clinical regulatory audits and internal data governance processes.
Environment: Azure, Azure Data Factory, Azure Data Lake, Synapse, HDInsight, Informatica PowerCenter, Spark (Scala), Azure SQL DB, Power BI, SSAS, Jenkins, SVN, Airflow, CSV, TXT, RBAC, Encryption, Agile, Azure AD
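The following is an illustrative Python sketch of a small ADLS helper of the kind that might support the ingestion workflow described above, authenticating through Azure AD consistent with the RBAC approach; the storage account, container, and folder names are hypothetical, and this is not code from the actual engagement.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Authenticate with Azure AD (DefaultAzureCredential), in line with RBAC-based access.
service = DataLakeServiceClient(
    account_url="https://<storageaccount>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

fs = service.get_file_system_client("raw")

# Collect lab extract files landed for this cycle; downstream validation checkpoints
# then enforce study-protocol rules before loading to Synapse.
landed_files = [
    p.name for p in fs.get_paths(path="lab_extracts/")
    if not p.is_directory and p.name.endswith((".csv", ".txt"))
]
print(f"{len(landed_files)} files staged for ingestion")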
Client: Biocon, India | SQL Developer | Duration: Aug 2014 – Sep 2016
Roles & Responsibilities:
Designed and maintained an enterprise-grade SQL Server-based data architecture to support pharmaceutical research, clinical data management, and business operations reporting across drug development workflows.
Collaborated with data architects and scientists to define database models and schemas optimized for high-volume transactional data, integrating key research KPIs across functional units.
Developed complex T-SQL scripts, stored procedures, window functions, and CTEs to handle advanced data transformations, statistical computations, and dynamic reporting views used by compliance and research teams (see the illustrative sketch at the end of this section).
Built and automated robust ETL pipelines using SSIS, enabling seamless ingestion of research, clinical trial, and operational datasets from MS Excel, CSV, and SQL Server instances into centralized Azure SQL Databases for unified access and analysis.
Led the on-prem to cloud migration of legacy SQL Server 2008/2012 databases to Azure SQL Database and Blob Storage, applying indexing strategies and performance tuning for optimized query execution and reduced storage costs.
Created and deployed SSAS-based OLAP cubes for pharmaceutical R&D, enabling cross-dimensional analysis and drill-through visualizations used by clinical operations teams and senior scientists.
Ensured data privacy and governance by implementing encryption, access control policies, and row-level security to comply with data classification and protection policies around patient and compound data.
Used SVN for version control and Visual Studio to develop, debug, and deploy database projects, while maintaining modular and reusable components for standardized ETL packages.
Participated in continuous improvement and CI/CD pipeline setup using Jenkins for scheduled deployments and monitoring of critical database and ETL assets.
Produced technical documentation including entity-relationship diagrams, data dictionaries, and job runbooks to support knowledge transfer, audit readiness, and SOP development.
Environment: SQL Server 2012, T-SQL, SSIS, SSRS, SSAS, Azure SQL Database, Azure Blob Storage, Hadoop (HDFS), MS Excel, SVN, Visual Studio, Jenkins, Encryption, SQL Server Agent, Access Control
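The following is an illustrative Python/pyodbc sketch of how a T-SQL validation stored procedure of the kind described above might be invoked from an automation script; the server, database, procedure, parameter, and result columns are hypothetical, not objects from the actual engagement.

import pyodbc

# Hypothetical connection to the research reporting database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<sql-server-host>;DATABASE=<research_db>;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# The stored procedure is assumed to return one row per failed validation rule.
cursor.execute("EXEC dbo.usp_ValidateTrialBatch @BatchId = ?", 20160815)
for row in cursor.fetchall():
    print(f"rule {row.RuleName}: {row.FailedRowCount} rows failed")

conn.close()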
EDUCATION:
Bachelor of Science | Kakatiya University, Warangal, India | May 2011 – May 2014