Sai Siddipeta
Data Engineer
************.****@*****.***
Data Engineer with 6+ years' experience across Azure, AWS, and GCP, specializing in scalable ETL/ELT pipelines, data modeling, and streaming. Proficient in SQL, Python, Spark, and Airflow, with domain expertise in healthcare, finance, and pharma. Skilled in CI/CD, IaC, and governance frameworks to deliver secure, compliant data solutions.
PROFILE SUMMARY:
• Experienced Data Engineer with a strong background in designing, implementing, and optimizing data solutions on leading cloud platforms, including Azure, AWS, and GCP. Proficient in leveraging Azure Data Factory, AWS Glue, and Google Cloud Dataflow to build scalable ETL processes that transform raw data into actionable insights. Skilled in managing large datasets using services such as Azure Synapse Analytics, Amazon Redshift, and Google BigQuery, ensuring high availability, performance, and security.
• Adept at integrating diverse data sources, automating data workflows with Python, and deploying CI/CD practices to streamline data operations.
• Hands-on experience with infrastructure as code (IaC) tools such as Terraform and Azure Resource Manager (ARM) templates, enabling consistent and repeatable deployment of cloud resources.
• Experience in writing complex SQL queries and optimizing databases such as Azure SQL Database, Amazon RDS, and Google Cloud SQL.
• Implemented CI/CD pipelines using Azure DevOps, enhancing the automation of data workflows and reducing deployment times.
• Experienced in setting up and managing Azure Data Lake Storage, ensuring efficient and secure data storage and retrieval.
• Developed serverless architectures using AWS Lambda and AWS Glue, streamlining ETL processes and reducing infrastructure overhead.
• Strong command over Amazon Redshift for data warehousing solutions, optimizing query performance and cost.
• Engineered data processing pipelines with Google Cloud Pub/Sub and Dataflow, ensuring low-latency and high-throughput data streaming.
• Applied IAM best practices in GCP to secure data access and ensure compliance with industry standards.
• Worked with Pentaho Data Integration (PDI) for visual ETL workflows in the earlier stages of data pipeline development, simplifying onboarding and enabling rapid transformation of healthcare and finance data.
• Advanced proficiency in Python and SQL for data manipulation, automation, and development.
• Implemented machine learning models within cloud ecosystems, leveraging cloud-native tools like Azure Machine Learning, AWS SageMaker, and Google AI Platform.
• Strong advocate for data governance and best practices, ensuring data quality and integrity.
• Backed by a strong foundation in computer science fundamentals including algorithms, data structures, and systems design.
TECHNICAL SKILLS:
Cloud Platforms:
Azure: Azure Data Factory, Azure Synapse Analytics, Azure Databricks, Azure Data Lake Storage, Azure SQL Database.
AWS: AWS Glue, Amazon Redshift, Amazon S3, AWS Lambda, Amazon RDS, AWS Step Functions, Amazon DynamoDB.
GCP: Google BigQuery, Google Cloud Storage, Google Cloud Dataflow, Google Cloud Pub/Sub, Google Cloud SQL.
Data Engineering & ETL Tools:
Apache Spark, Apache Kafka, Hadoop, Airflow, Snowflake.
Programming Languages:
Python, SQL, Java, Scala, Shell Scripting, Go, C++, C#, TypeScript, Rust.
Infrastructure as Code (IaC):
Terraform, AWS CloudFormation, Azure Resource Manager (ARM) Templates.
CI/CD & DevOps:
Azure DevOps, AWS CodePipeline, Jenkins, Git, Docker, Kubernetes.
Monitoring & Logging:
Azure Monitor, AWS CloudWatch, Google Stackdriver, ELK Stack.
Machine Learning & AI:
Azure Machine Learning, AWS SageMaker, Google AI Platform.
Data Visualization & Reporting:
Power BI, Tableau, Looker, Microsoft Excel, QuickSight.
Data Security & Governance:
Azure Security Center, AWS IAM.
EDUCATION:
University of Central Missouri, Master's in Technology, USA.
WORK EXPERIENCE:
TSYS - Columbus, Georgia, USA May 2024 - Present
Azure Data Engineer
Description: Total System Services, Inc. (commonly referred to as TSYS) is an American financial technology company. I design, build, and maintain scalable, robust data pipelines and ETL processes to handle large volumes of payment and transaction data, and I develop and manage data platforms and frameworks that enable efficient data processing and analytics.
Responsibilities:
• Built incremental ETL pipelines in SQL Server, ADF, and Databricks using Change Data Capture, reducing batch load windows by 60%.
• Designed curated marts and dimensional models in Snowflake, supporting HR, Finance, and compliance dashboards in Power BI and Looker.
• Developed dbt models with automated tests for freshness, uniqueness, and referential integrity, improving financial dataset accuracy by 30%.
• Leveraged Snowflake Streams, Tasks, and Snowpipe for incremental ingestion and near real-time reporting.
• Automated schema validation and anomaly detection with Python and dbt, reducing manual interventions by 35%.
• Implemented row-level and column-level security in Snowflake to strengthen data governance and compliance.
• Partnered with external vendors to establish data contracts and delivery standards, improving ingestion quality by 25%.
• Optimized Snowflake queries with clustering, partitioning, and result caching, lowering compute costs by 30%.
• Orchestrated complex pipelines in Airflow and ADF, improving SLA adherence and reducing downtime by 25%.
• Built standardized KPI layers in dbt and Snowflake, enabling consistency across dashboards and reducing reconciliation efforts.
• Developed and automated large-scale ETL pipelines with Informatica IICS, integrating payment transaction data into Azure/Snowflake for risk and compliance analytics.
• Developed Python frameworks for error handling and automated recovery, ensuring resilient production pipelines.
• Configured monitoring with Azure Monitor and Splunk to track SLA performance, data anomalies, and pipeline health.
• Partnered with ML engineers to integrate structured Snowflake datasets into natural language query workflows, accelerating insights.
• Designed and optimized ETL/ELT pipelines to integrate data from payment systems into finance reporting marts.
• Developed and tuned complex SQL queries and stored procedures for reconciliation, settlement, and compliance reporting.
• Implemented job scheduling and monitoring solutions (Control-M, Airflow) to ensure reliability of daily financial data feeds.
• Collaborated with finance stakeholders and business users to deliver BI-ready datasets supporting accounting and audit workflows.
• Developed ETL pipelines using Hive and Spark for structured and unstructured datasets, optimizing query performance by 30%.
• Built and optimized data pipelines using Microsoft Fabric integrated with Azure Data Factory and Synapse for unified analytics.
• Automated data workflows with Unix Shell scripting to improve reliability of ingestion jobs.
• Developed and optimized ETL/ELT pipelines using Apache Spark with Java and PySpark, ensuring performance and scalability for large-scale transactional data.
• Monitored and troubleshot data pipelines using Airflow and Azure Data Factory, proactively resolving performance bottlenecks and data quality issues.
• Documented lineage, business definitions, and runbooks for Snowflake and dbt models, improving audit readiness.
• Automated ingestion of ERP, HR, and payment data sources into Snowflake, reducing manual integration by 40%.
• Collaborated with compliance and finance teams to design secure Snowflake marts, ensuring audit-ready reporting and traceability.
Environment: Azure, Azure Analysis Services, CI/CD, Docker, ETL, Hive, Java, Jenkins, Jira, Kafka, Azure Data Lake, MySQL, PySpark, Python, Scala, Snowflake, Spark, Spark SQL, SQL, Sqoop.
AstraZeneca PLC – Chennai, India July 2022 - June 2023
Data Engineer
Description: AstraZeneca plc is a British-Swedish multinational pharmaceutical and biotechnology company. I built and optimized scalable data pipelines to ingest and process clinical trial, real-world evidence, and genomics data using platforms such as Azure Data Factory, Databricks, and Synapse Analytics, enabling faster insights for R&D and drug discovery.
Responsibilities:
• Automated ingestion of Oracle, PostgreSQL, and API feeds into Databricks and Snowflake, reducing manual integration effort by 30% and accelerating access to clinical and HR datasets.
• Designed and deployed dbt semantic models to standardize transformations, enabling AI-ready HR and clinical dashboards in Power BI and Tableau.
• Implemented Unity Catalog and Azure Purview lineage to enforce data governance, security, and auditability across clinical and HR pipelines.
• Applied medallion architecture (bronze, silver, gold) to build scalable transformation pipelines that improved data consistency and maintainability.
• Integrated Great Expectations tests into Databricks pipelines, preventing low-quality or incomplete data from propagating into curated datasets.
• Partnered with R&D analysts to deliver clinical outcomes marts, reducing ad-hoc reporting by 25% and accelerating insights for drug development.
• Optimized Snowflake queries using result caching and clustering, resulting in a 30% reduction in compute costs.
• Automated error recovery workflows with Airflow to improve SLA adherence and reduce pipeline failures by 25%.
• Built semantic data models in Snowflake and Databricks, aligning with knowledge graph concepts for interoperability across HR and clinical systems.
• Ingested and transformed relational datasets from PostgreSQL, MySQL, and SQL Server into Databricks and Snowflake, enabling standardized analytics marts for clinical reporting.
• Documented data models, lineage, and ingestion workflows, ensuring compliance, transparency, and reusability across teams.
• Applied strong debugging skills and attention to detail to resolve complex data integration issues, maintaining accuracy in downstream analytics.
• Integrated Azure OpenAI and AWS Bedrock into data pipelines, enabling LLM-powered analytics and enterprise AI use cases.
• Implemented advanced Snowflake features including Dynamic Tables, Streams, Tasks, and Snowpipe to support incremental ingestion and near real-time analytics on clinical and operational datasets.
• Extracted and transformed ERP/finance datasets from Oracle Fusion/FAW into Snowflake using IICS and Databricks, enabling reliable downstream analytics for clinical and financial reporting.
• Implemented automation and query optimization techniques that reduced data pipeline runtime by 25%.
• Partnered with business teams to deliver product analytics insights from clinical and marketing data.
• Implemented data audit, archival, and restoration frameworks ensuring compliance with regulatory and internal policies.
• Performed daily monitoring, error resolution, and troubleshooting of ingestion pipelines for clinical, operational, and finance data.
• Built scalable pipelines on Azure Data Factory integrating heterogeneous sources into a governed lakehouse architecture.
• Documented lineage, KPIs, and definitions, ensuring consistent use of metrics across multiple BI tools.
• Mentored junior engineers on SQL query optimization, orchestration best practices, and dbt usage.
• Delivered secure Snowflake marts with row-level access controls, ensuring compliance with healthcare data standards.
Environment: Azure Data Factory, Azure Data Lake, Azure Databricks, Azure Synapse Analytics, PySpark, SQL, Azure Purview, Git, Power BI, JIRA, Hadoop, Hive, MapReduce, Spark.
Bank of Maharashtra, Mumbai, India April 2021 - May 2022
GCP Data Engineer
Description: Bank of Maharashtra is a prominent public sector bank that provides a wide range of retail banking services. I designed and implemented efficient data pipelines using GCP services such as Dataflow, Dataproc, Pub/Sub, and Cloud Composer to ingest, transform, and load data from various sources, and managed data storage solutions on GCP, including BigQuery, Cloud Storage, and Cloud SQL, ensuring data availability, reliability, and scalability.
Responsibilities:
• Built streaming Dataflow pipelines to process millions of daily transactions and engagement events into BigQuery with near real-time delivery.
• Designed partitioned and clustered BigQuery warehouses, reducing query costs by 40% and improving performance for compliance analytics.
• Developed dbt transformations in BigQuery to produce curated marts for compliance and HR dashboards, improving SLA adherence across reporting.
• Configured Data Catalog metadata and lineage, enhancing dataset discoverability, auditability, and governance.
• Automated deployments with GitLab CI/CD pipelines and Terraform, reducing release times by 50% and ensuring consistent infrastructure provisioning.
• Implemented real-time anomaly detection pipelines with Pub/Sub and Python, enabling proactive detection of workforce and campaign issues.
• Implemented vector database pipelines (Pinecone/Weaviate) to support semantic search and AI-driven insights.
• Integrated marketing campaign and media performance datasets into data marts, supporting analytics on customer engagement, ad spend efficiency, and channel ROI.
• Delivered regulatory-ready marts to compliance teams, ensuring audit-readiness and alignment with financial reporting standards.
• Engineered finance and accounting data pipelines powering BI dashboards for audits, forecasting, and operational reporting.
• Built and optimized dimensional models and ETL workflows to consolidate banking transactions into enterprise reporting warehouses.
• Developed high-performance data transformation utilities in Rust, improving pipeline efficiency and reducing latency.
• Contributed to Rust-based microservices for log parsing and streaming ingestion frameworks.
• Built dashboards in Tableau and Looker to provide product analytics insights into customer acquisition and long-term trends.
• Applied statistical analysis to identify anomalies and prevent common pitfalls in financial reporting.
• Designed and maintained SQL Server stored procedures to streamline reporting for compliance and risk analysis.
• Authored continuity and disaster recovery documentation, strengthening long-term pipeline resilience.
• Designed customer segmentation marts integrating CRM and financial data, boosting campaign ROI by 20%.
• Implemented schema evolution validation checks to prevent pipeline failures due to upstream source changes.
• Monitored SLAs and SLOs with Composer and Stackdriver, improving system reliability and reducing missed refreshes.
• Developed knowledge-graph aligned marts linking HR, CRM, and financial datasets, enabling advanced customer and workforce analytics.
Environment: Google Cloud Dataflow, Apache Beam, Google Cloud Pub/Sub, Google Cloud Composer, Apache Airflow, Snowflake, Apache Kafka, Apache Spark, Hadoop, HDFS, Hive, MapReduce, Oracle, Impala, Apache Flume, Terraform, Jenkins, GitLab CI, Docker, Kubernetes, Apache Atlas, Apache NiFi, Kafka Connect, Looker, QlikView, Power BI, Python, SQL, BigQuery, Google AI Platform, scikit-learn.
GlaxoSmithKline Pharmaceuticals Ltd, Mumbai, India June 2018 – Mar 2021
Data Engineer
Description: GlaxoSmithKline Pharmaceuticals Ltd (GSK) is a global healthcare company with a strong presence in pharmaceuticals, vaccines, and consumer healthcare products. I selected appropriate data storage solutions, optimized database systems, and designed data models to support analytics and machine learning initiatives, and implemented data validation checks to ensure data accuracy and adherence to data governance policies.
Responsibilities:
• Modernized ingestion workflows with AWS Glue and EMR Spark, cutting infrastructure costs by 25% and reducing pipeline complexity.
• Designed and optimized Redshift marts and staging layers in S3, improving HR and compliance reporting by reducing query runtimes by 30%.
• Automated freshness, lineage, and SLA monitoring in CloudWatch, boosting trust in HR and compliance datasets by 30%.
• Implemented IAM and KMS encryption with role-based access controls, ensuring compliance with HIPAA and SOC2 standards.
• Delivered self-service BI dashboards for HR, finance, and compliance teams using Tableau, QuickSight, and QlikView.
• Automated infrastructure provisioning with CloudFormation templates, embedding cost-tracking tags for transparency.
• Migrated legacy ETL pipelines into serverless AWS Glue jobs, reducing operational overhead and increasing agility.
• Partnered with compliance teams to design audit-ready curated datasets, enabling smooth regulatory reviews.
• Built Snowflake and Redshift views optimized for BI reporting, improving query efficiency for HR and financial analysis.
• Standardized KPI definitions across Tableau, QuickSight, and Looker, eliminating reporting inconsistencies.
• Deployed and managed containerized workloads on AWS Fargate and ECS, reducing infrastructure overhead and ensuring cost-optimized scaling.
• Implemented autoscaling and HA strategies across clusters, improving system reliability.
• Integrated Fargate services with S3, IAM, and CloudWatch for secure and monitored execution.
• Enhanced data quality checks and error-handling frameworks across ingestion and transformation processes.
• Partnered with business analysts to deliver self-service BI data marts supporting finance, operations, and commercial teams.
• Delivered HR compliance and financial data products that supported external audits and regulatory inspections.
• Developed serverless AWS Lambda functions to support micro-ingestions and automated validation checks.
Environment: AWS S3, Amazon RDS, Amazon Redshift, Amazon DynamoDB, AWS Glue, Amazon EMR (Spark), AWS Athena, AWS Lambda, AWS Step Functions, OpenShift, AWS CloudFormation, AWS IAM, AWS KMS, AWS GuardDuty, AWS CloudWatch, Amazon QuickSight, Redshift Spectrum, Python, SQL.