
Data Engineer Big

Location:
West Haven, CT
Posted:
September 10, 2025


Resume:

United States **************@*****.*** +1-203-***-****

PROFESSIONAL SUMMARY

●Results-driven Data Engineer with 6+ years of experience designing, developing, and optimizing complex data pipelines and architectures on GCP, AWS, and Azure. Skilled in applying Big Data technologies, Python, C#, SQL, and REST APIs to manage and analyze large datasets, ensuring data quality, integrity, scalability, and performance across cloud platforms.

●Experience working in Linux/Unix environments for scripting, data pipeline management, performance monitoring, and automation.

●Expertise in building batch and real-time data processing systems leveraging AWS services (S3, Redshift, EMR, Lambda, DynamoDB, Glue, IAM) and Azure services (ADLS, ADF, Databricks, Synapse Analytics).

●Highly skilled in integrating big data components such as Hadoop, Spark, Apache Beam, and Kafka with Google Cloud Platform (GCP) services, including BigQuery, Dataflow, Dataproc, and Pub/Sub.

●Proven expertise in designing and implementing scalable data architectures, optimizing data processing and analysis workflows, and leveraging GCP tools to enhance performance and efficiency.

●Hands-on experience with the Hadoop ecosystem, including Spark, Kafka, HBase, Scala, Impala, Sqoop, Oozie, Flume, and Storm.

●Reduced data pipeline latency by 40% using Apache Beam and GCP Dataflow for batch and real-time workloads.

●Improved query performance by 30% through indexing, partitioning, and materialized views in Snowflake and Azure SQL.

●Decreased deployment times by 50% via CI/CD automation using Terraform, Azure DevOps, and GitHub Actions.

●Enabled 99.9% data availability for mission-critical healthcare dashboards through resilient pipeline design on Azure.

●Enhanced infrastructure security and compliance by implementing RBAC across GCP and AWS Kubernetes clusters.

●Experience in developing Spark applications using Spark RDD APIs, DataFrames, Spark SQL, and Spark Streaming APIs.

●Designed and implemented a scalable data architecture on AWS using Kubernetes, Terraform, and Snowflake, enabling seamless data integration and processing across multiple data sources.

●Expert in creating microservices using Kafka and Pub/Sub for seamless data streaming in the public cloud.

●Highly proficient in building ETL solutions using SSIS, Informatica, or Alteryx, depending on the requirements and data complexity.

●Architected and implemented medium- to large-scale BI solutions on Azure, leveraging Azure Data Platform services such as Azure Data Lake, Databricks, Azure Data Factory, and Data Lake Analytics to drive data-driven decision-making.

●Experience in creating, debugging, scheduling, and monitoring jobs using Airflow, Control-M, and Oozie, with hands-on use of the most common Airflow operators: Python Operator, Bash Operator, Google Cloud Storage Download Operator, Google Cloud Storage Object Sensor, and Google Cloud Storage to S3 Operator (a minimal DAG sketch follows this summary).

●Proficient in implementing Infrastructure as Code (IaC) using Terraform, AWS CloudFormation, Azure Resource Manager (ARM) templates, and Google Cloud Deployment Manager to automate and manage cloud infrastructure deployments, ensuring consistency, scalability, and efficient resource management across AWS, Azure, and Google Cloud environments.

●In-depth knowledge of Data Warehousing (gathering requirements, design, development, implementation, testing, and documentation), Data Modeling (analysis using Star Schema and Snowflake for FACT and Dimension tables), Data Processing, Data Acquisition, and Data Transformations (mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting Hadoop clusters).

●Highly proficient in crafting and optimizing advanced SQL queries to extract, transform, and analyze large datasets.

●Skilled in designing efficient database schemas (Star, Snowflake) and OLAP/OLTP cubes, ensuring data integrity, and implementing complex joins and aggregations to support robust data engineering solutions and decision-making processes.

●Highly skilled in writing robust, scalable scripts using Unix, PowerShell, and Python to automate data pipelines, streamline ETL processes, and enhance data management efficiency.

●Experience working with cross-functional teams using Agile, Scrum, and Waterfall development methodologies, Test-Driven Development (TDD), and Git for version control, fostering iterative development and high-quality software delivery.
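
Illustrative Airflow DAG (a minimal sketch under assumed names, not taken from any client project): the snippet below shows how the operators listed in the summary, PythonOperator, BashOperator, the GCS object sensor, and the GCS-to-S3 transfer, fit together in a daily DAG. It assumes Airflow 2.4+ with the Google and Amazon provider packages installed; the buckets, object paths, and validation step are hypothetical.

```python
# Minimal Airflow DAG sketch; assumes Airflow 2.4+ with
# apache-airflow-providers-google and apache-airflow-providers-amazon installed.
# Bucket names, object paths, and the validation step are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.transfers.gcs_to_s3 import GCSToS3Operator
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor


def validate_extract(**context):
    # Placeholder validation step run once the source object lands in GCS.
    print("extract validated for", context["ds"])


with DAG(
    dag_id="gcs_to_s3_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",          # schedule_interval on Airflow < 2.4
    catchup=False,
) as dag:
    wait_for_file = GCSObjectExistenceSensor(
        task_id="wait_for_file",
        bucket="example-landing-bucket",          # hypothetical GCS bucket
        object="exports/{{ ds }}/claims.json",    # hypothetical object path
    )
    validate = PythonOperator(task_id="validate", python_callable=validate_extract)
    copy_to_s3 = GCSToS3Operator(
        task_id="copy_to_s3",
        gcs_bucket="example-landing-bucket",      # named `bucket` in older provider releases
        prefix="exports/{{ ds }}/",
        dest_s3_key="s3://example-analytics-bucket/exports/{{ ds }}/",
        replace=True,
    )
    notify = BashOperator(task_id="notify", bash_command="echo 'transfer complete'")

    wait_for_file >> validate >> copy_to_s3 >> notify
```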

EDUCATION

Master of Science in Computer Science, University of Bridgeport, USA

Certifications: Azure Data Fundamentals Certified by Microsoft, Data Engineer Certified by AWS

SKILLS

Big Data / Hadoop Components: HDFS, Hue, MapReduce, Pig, Hive, Solr, HCatalog, HBase, Sqoop, Impala, Apache Beam, Zookeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, Trino, PySpark, Airflow, Ab Initio, Snowflake, Spark

Databases: Oracle, SQL Server, S4 HANA, MySQL, PostgreSQL, DB2, Teradata

NoSQL Databases: HBase, Cosmos, Cassandra, MongoDB, Bigtable

Programming Languages: Java, Scala, Impala, Python, C#

Machine Learning: Linear & Logistic Regression, Clustering, Random Forest, KNN, K-Means, Decision Tree, SVM, ARIMA

AWS Services: S3, DynamoDB, Glue, ECS, EMR, Redshift, RDS, Lambda, Kinesis, SNS, SQS, AMI, IAM, CloudFormation

GCP Services: Artifact Registry, BigQuery, Bigtable, Cloud Build, Cloud Storage, Cloud Functions, Dataflow, Dataproc, Pub/Sub

Azure Services: Azure Data Lake Storage, Azure Databricks, Azure Data Factory, Logic Apps, Azure Synapse, Functions, ARM

IDE: Eclipse, Dreamweaver, Visual Studio, Visual Studio Code, Jupyter Notebooks

Methodologies: Agile (Scrum), Waterfall, UML, Design Patterns, SDLC

Automation Tools & Scripting Languages: Ansible, PowerShell, Bash, Shell scripting, Terraform, Linux/Unix, Jenkins, Control-M

Operating Systems: Linux, Unix, Windows, MacOS

Reporting and ETL Tools: Tableau, DAX, Alteryx, Power Query, Power BI, MS Excel, AWS GLUE, SSIS, SSRS, Informatica, Talend

Others: Ansible, Git, Bitbucket, Agile development, Kanban, Data Structures, SDLC, OLAP, OLTP, Microservices, GitHub, Jenkins, Kubernetes, DevOps, CI/CD, IaC, IaaS, SaaS, PaaS, JIRA, Apache Flink, Drill, MS Excel, JSON, XML, Parquet, HIPAA, Pandas

EXPERIENCE

Client: Interactive Brokers, Connecticut, USA December 2023 - Present

Role: Azure Data Engineer

Description: Interactive Brokers is a global electronic brokerage firm known for offering trading platforms and technology solutions for a wide range of financial instruments. Developed and implemented data pipelines, ETL processes, and real-time analytics solutions to optimize data ingestion, transformation, and visualization across cloud and on-premises environments, ensuring high performance and accuracy in financial data processing.

Responsibilities:

Designed and optimized scalable ETL pipelines using Azure Data Factory, Databricks, Apache Spark, and Synapse Analytics, improving data processing efficiency by 30% and enabling seamless ingestion and transformation of healthcare claims and clinical data.

Developed real-time fraud detection pipelines using Kafka, Spark Streaming, and AWS Kinesis, processing 1M+ events per second and reducing detection latency by 50% (see the streaming sketch after these responsibilities).

Created interactive Power BI dashboards with Snowflake and Azure SQL, leveraging DAX and Power Query to reduce reporting lag by 50% and improve KPI tracking for fraud and healthcare analytics.

Built cloud-native data warehouses using Snowflake, Redshift, and BigQuery, utilizing features like Time Travel, Cloning, and Materialized Views to improve query performance by 15%.

Engineered robust data models (3NF and Star Schema) in Snowflake to support population health, risk adjustment, and OMOP-compliant analytics, ensuring data standardization and improved query scalability.

Automated CI/CD pipelines with Azure DevOps, Terraform, and Git, reducing deployment times by 40% and streamlining cloud provisioning across AWS and Azure environments.

Enhanced SQL performance by 40% using indexing, partitioning, and advanced optimization techniques in Azure SQL, PostgreSQL, and MySQL.

Integrated DBT into the CI/CD pipeline using Azure DevOps, enabling automated testing, version control, and documentation of data models, which improved deployment speed and reduced data defects by 30%.

Deployed machine learning models using TensorFlow, BigQuery ML, and MLflow within Databricks, achieving a 35% reduction in prediction latency for real-time analytics.

Leveraged AKS for deploying microservices, managed secure data transfers (SCP/SFTP), and applied system monitoring with Linux/Unix tools (top, vmstat, netstat, etc.) to ensure reliable and secure data operations.

Integrated RESTful APIs and GraphQL microservices for cross-cloud data exchange (AWS S3, Azure Blob, GCS), reducing API response time by 25%, and implemented web automation testing using Selenium WebDriver.

Scheduled and monitored batch jobs using Autosys, ensuring efficient ETL operations and data quality KPIs in production healthcare workflows.

Led Agile/Scrum ceremonies, improved team efficiency by 20%, collaborated with healthcare domain experts, and actively pursued Snowflake SnowPro Certification to solidify cloud data warehousing expertise.
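
Illustrative PySpark Structured Streaming job (a hedged sketch, not the production fraud pipeline): the shape of the Kafka-to-Spark streaming flow referenced in the fraud-detection bullet above. Broker addresses, the topic name, the event schema, the flagging rule, and the console sink are placeholders; it assumes the spark-sql-kafka connector is on the classpath.

```python
# Sketch of a Kafka -> Spark Structured Streaming job; requires the
# spark-sql-kafka-0-10 connector on the classpath. All names and the
# flagging rule are placeholders, not the production logic.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fraud-stream-sketch").getOrCreate()

event_schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read raw transaction events from Kafka and parse the JSON payload.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # hypothetical brokers
    .option("subscribe", "transactions")                  # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Naive stand-in for the real detection model: flag unusually large amounts
# per account within one-minute windows, tolerating 5 minutes of lateness.
flagged = (
    events.withWatermark("event_time", "5 minutes")
    .groupBy(F.window("event_time", "1 minute"), "account_id")
    .agg(F.max("amount").alias("max_amount"))
    .where(F.col("max_amount") > 10000)
)

query = (
    flagged.writeStream.outputMode("update")
    .format("console")   # stand-in sink; a real job would alert or persist results
    .option("truncate", "false")
    .start()
)
query.awaitTermination()
```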

Environment: Azure Data Lake, Azure Data Factory, Logic Apps, Azure Synapse Analytics, Azure CLI, CI/CD, C#, Elasticsearch, ETL, HBase, Hive, Java, Jenkins, Kafka, MySQL, .NET, Oracle, PowerShell, Power BI, PySpark, Python, SAS, Scala, Shell Scripting, Spark, Spark SQL, SQL, YAML, Git, DAX, Power Query, Linux, Unix

Client: McKesson Corporation, Connecticut, USA March 2023 - November 2023

Role: Sr Data Engineer

Description: McKesson Corporation, a leading healthcare services and pharmaceutical distribution company, provides innovative solutions to improve healthcare delivery and patient outcomes. Contributed to enhancing data processing workflows, developing automated data pipelines, and implementing cloud-based solutions to improve system scalability, reliability, and real-time analytics capabilities.

Responsibilities:

Architected and deployed data solutions on Google Cloud Platform (GCP) using BigQuery, Bigtable, Cloud Storage, Dataflow, and Dataproc, optimizing storage, retrieval, and analytics scalability.

Built and optimized large-scale ETL pipelines using Apache Beam, Dataflow, and Apache Spark, supporting both real-time and batch processing for analytics and ML workloads (see the Beam sketch after these responsibilities).

Led RDBMS migrations from Oracle and Teradata to GCP, utilizing Debezium for Change Data Capture (CDC), Dataproc for transformations, and BigQuery for data warehousing, reducing latency by 30%.

Integrated third-party loyalty partner data into Snowflake tables, improving accuracy and automation of tracking and reporting for membership activities.

Created self-service analytics platforms with Looker Studio and BigQuery, enabling business teams to access key metrics and monitor KPIs independently.

Migrated legacy ETL workflows to Azure Data Factory and deployed scalable reporting with Azure Synapse Analytics, enhancing automation and cloud performance.

Deployed containerized ETL components via Azure Kubernetes Service (AKS) and supported hybrid cloud strategies using Windows Azure for secure, scalable data architecture.

Implemented end-to-end CI/CD pipelines using Tekton and Bitbucket, integrated with Prometheus and Grafana for observability, and configured alerting and dashboards.

Developed scalable data ingestion pipelines using Cloud Functions, Pub/Sub, MongoDB, and Bigtable, supporting real-time ingestion of high-volume NoSQL data streams.

Deployed machine learning models using BigQuery ML and integrated them into Looker Studio, increasing forecast accuracy for customer behavior insights by 15%.

Enhanced infrastructure security by enforcing Kubernetes RBAC policies across GCP and AWS, enabling secure and compliant access control.

Maintained metadata management, data governance, and data quality monitoring practices aligned with CDMP principles using industry-standard tools.

Used Bazel, Cloud Build, and Bamboo to automate testing and deployment workflows for high-performance, scalable data pipelines across GCP and AWS.

Collaborated in Agile/Scrum environments using JIRA, led sprint planning, and mentored junior engineers in Python scripting, query optimization, and engineering best practices.

Configured and optimized Elasticsearch mappings, improving indexing efficiency and query performance for analytics integration.
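
Illustrative Apache Beam pipeline (a minimal sketch under assumed names): a streaming Pub/Sub-to-BigQuery flow of the kind described in the ETL and ingestion bullets above, runnable locally on the DirectRunner or on Dataflow. The project, subscription, table, and schema are hypothetical.

```python
# Sketch of a streaming Beam pipeline (Pub/Sub -> BigQuery); project,
# subscription, table, and schema names are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


def parse_event(message: bytes) -> dict:
    # Pub/Sub delivers raw bytes; decode and keep only the fields being loaded.
    record = json.loads(message.decode("utf-8"))
    return {"order_id": record.get("order_id"), "amount": record.get("amount")}


def run():
    options = PipelineOptions()  # pass --runner=DataflowRunner etc. on the command line
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/example-project/subscriptions/orders-sub"
            )
            | "Parse" >> beam.Map(parse_event)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "example-project:analytics.orders",
                schema="order_id:STRING,amount:FLOAT",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```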

Environment: Apache Flume, RESTful APIs, Apache Beam, Dataflow, BigQuery, Bitbucket, CI/CD, Dataproc, EC2, EMR, ETL, GCP, HDFS, JS, Jira, Python, RDBMS, Spark, SQL, Teradata, Pub/Sub, Cloud Storage, gcloud CLI, Hadoop, BigQuery ML, Kubernetes

Client: Providence Limited, Hyderabad, India January 2021 - November 2022

Role: GCP Data Engineer

Description: Providence Limited is a technology-driven company specializing in providing innovative IT and business solutions across various industries. Contributed to developing and optimizing data pipelines, automating ETL workflows, and implementing cloud-based data solutions to enhance operational efficiency, data accuracy, and scalability.

Responsibilities:

Designed and developed large-scale data solutions across Azure (Data Lake, Synapse, Databricks, Data Factory) and Google Cloud Platform (GCP) (BigQuery, Bigtable, Dataproc, Pub/Sub, Cloud Storage, Cloud Functions), enabling seamless data integration and scalable analytics workflows.

Built scalable ETL pipelines using Apache Beam, Apache Spark, PySpark, Pandas, and Spark-SQL, optimizing data transformations from structured and semi-structured sources to support behavioral analytics.

Migrated legacy SQL Server, Oracle, and Teradata warehouses to BigQuery, leveraging Debezium for Change Data Capture (CDC), streaming ingestion, and external tables to boost query performance by 22%.

Developed and enforced custom data quality, metadata management, and data governance frameworks using Python, SQL, and industry best practices aligned with CDMP principles.

Designed secure analytics platforms using row-level security in Tableau, Looker Studio, and BigQuery authorized views, enabling compliance and controlled access across cloud platforms.

Implemented Kafka message partitioning and replication tuning for scalable, fault-tolerant event-driven processing in distributed pipelines.

Orchestrated real-time data ingestion and processing pipelines using Pub/Sub, Cloud Functions, MongoDB, and NoSQL solutions for serverless event-driven architectures (see the Cloud Function sketch after these responsibilities).

Created batch ingestion workflows using Apache Sqoop and Dataproc, automating ingestion into GCS and reducing latency in Dataflow pipelines.

Automated and orchestrated complex data workflows using Apache Airflow, integrated with BigQuery, Cloud Storage, and Azure Synapse Analytics to eliminate manual intervention.

Implemented CI/CD pipelines using Azure DevOps, Tekton, Bitbucket, and Cloud Build, integrated with observability tools like Prometheus, Grafana, Bamboo, and Bazel for proactive monitoring.

Deployed and monitored microservices and ETL workloads using Azure Kubernetes Service (AKS), applying RBAC policies to govern containerized environments across hybrid cloud.

Delivered cost-optimized data solutions on Windows Azure Cloud Services, focusing on scalability and security for enterprise analytics use cases.

Participated in Agile and Scrum ceremonies, using JIRA for sprint planning, and collaborated cross-functionally to deliver data-driven products iteratively.

Migrated legacy SAS-based metrics to Snowflake on GCP, improving query execution efficiency and aligning analytics pipelines with cloud-native data warehousing best practices.
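
Illustrative Cloud Function (a minimal sketch, assuming a 1st-gen Python background function): the serverless Pub/Sub-to-MongoDB ingestion pattern referenced above. The connection string, database, and collection names are hypothetical.

```python
# Sketch of a Pub/Sub-triggered Cloud Function (1st gen, Python runtime).
# The connection string, database, and collection names are placeholders.
import base64
import json
import os

from pymongo import MongoClient

# Created at module scope so the connection is reused across warm invocations.
_client = MongoClient(os.environ.get("MONGO_URI", "mongodb://localhost:27017"))
_events = _client["analytics"]["behavior_events"]  # hypothetical db/collection


def ingest_event(event, context):
    """Entry point: decode the Pub/Sub message and persist it to MongoDB."""
    payload = base64.b64decode(event["data"]).decode("utf-8")
    doc = json.loads(payload)
    doc["_message_id"] = context.event_id  # keep the Pub/Sub id for idempotency checks
    _events.insert_one(doc)
```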

Environment: BigQuery, Apache Airflow, Microsoft Azure, Cloud Build, Cloud Run, Databricks, Dataform, Cloud Spanner, Cloudera, Azure Data Factory, ETL, GCP, PL/SQL, PySpark, Python, Spark, SQL, SAS, Snowflake, Sqoop, Tableau, VPC

Client: Swiss Re Limited, Hyderabad, India April 2019 - December 2020

Role: Data Engineer

Description: Swiss Re Limited is a leading global reinsurance company providing risk management solutions across multiple sectors. Contributed to designing and optimizing data pipelines, automating ETL processes, and deploying cloud-based analytics solutions to improve data reliability, scalability, and decision-making capabilities.

Responsibilities:

Designed and implemented scalable real-time ETL pipelines using Kubernetes and Docker, ensuring efficient data integration, transformation, and deployment automation in production environments.

Automated AWS infrastructure provisioning with Terraform, CloudFormation, and CI/CD pipelines, significantly reducing deployment times and operational complexity.

Built and tuned Spark Streaming applications using Scala and PySpark to process high-throughput data from Kafka and AWS S3, enabling robust event-driven analytics.

Created advanced SQL queries in Trino, reducing query runtime by 30% through execution plan optimization for large-scale distributed datasets.

Implemented Apache Sqoop for high-performance data migration between Teradata, Oracle, and Hadoop, streamlining forecasting and analytics workflows.

Designed and deployed Snowflake Snowpipes for near-real-time ingestion and successfully migrated legacy Teradata pipelines to modern Snowflake architecture.

Orchestrated complex ETL pipelines using Apache Airflow DAGs, automating data movement across Amazon Redshift, S3, and other cloud-based storage systems.

Managed containerized services and scheduled data jobs using Azure Kubernetes Service (AKS) to support scalable and secure workload deployment.

Built analytics dashboards and reporting solutions hosted on Azure Cloud, enhancing real-time decision-making through business insights.

Designed data lake solutions using Azure Databricks and Delta Lake, improving data storage, processing efficiency, and supporting large-scale analytics.

Transformed semi-structured JSON data into Apache Parquet format using Spark DataFrames, enhancing compression, storage, and query performance (see the PySpark sketch after these responsibilities).

Improved observability across data pipelines by integrating AWS CloudWatch, Prometheus, and Grafana, reducing incident response times and improving system reliability.

Participated in Agile sprints, retrospectives, and requirement refinements, ensuring alignment with project timelines and data engineering deliverables.
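
Illustrative PySpark conversion (a minimal sketch, not the project code): the JSON-to-Parquet transformation described above, with hypothetical S3 paths and a derived partition column.

```python
# Minimal PySpark sketch of a JSON -> Parquet conversion; paths and the
# partition column are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("json-to-parquet-sketch").getOrCreate()

# Read semi-structured, newline-delimited JSON and derive a partition column.
events = spark.read.json("s3a://example-raw-bucket/events/")
events = events.withColumn("event_date", F.to_date("event_time"))

# Write columnar Parquet with Snappy compression, partitioned for pruning.
(
    events.write.mode("overwrite")
    .option("compression", "snappy")
    .partitionBy("event_date")
    .parquet("s3a://example-curated-bucket/events_parquet/")
)
```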

Environment: AWS, CI/CD, ADF, Docker, EC2, ETL, HDFS, Java, Jenkins, JS, Kubernetes, Delta Lake, Lambda, Oracle, PySpark

Saipavanteja B

Data Engineer


