Mahesh Suryadevara
Email: ******.*******@*****.***
Phone: 862-***-****
Summary:
Senior Data Engineer with 8+ years of expertise in Big Data, GCP, and end-to-end data pipeline orchestration. Proven track record in migrating on-prem Hadoop to GCP, building scalable ETL/ELT workflows, and leveraging Google Cloud tools (BigQuery, Dataflow, Dataproc, Composer, Doc AI, Vertex AI) for data processing and machine learning.
• Expertise in Big Data tools (Spark, Hadoop, Kafka) and DevOps practices (CI/CD, Terraform, Kubernetes).
• Worked with Airflow 1.8 (Python 2) and Airflow 1.9 (Python 3) for orchestration; familiar with building custom Airflow operators and orchestrating workflows with cross-cloud dependencies (see the illustrative operator sketch at the end of this summary).
• Experience using tools such as Sqoop, Flume, Kafka, and Pig to ingest structured, semi-structured, and unstructured data into the cluster.
• Hands-on experience with version control using Bitbucket.
• Used dbt as an open-source transformation tool to build SQL-based models, enabling teams to author their own data pipelines.
• Proficient with the Apache Spark ecosystem, including Spark Core and Spark Streaming, using Scala and Python.
• Developed highly optimized Spark applications to perform data cleansing, validation, transformation, and summarization activities per requirements; experienced with the RDD architecture, implementing Spark operations on RDDs, and optimizing transformations and actions.
• Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
• Strong experience implementing data warehouse solutions in Confidential Redshift.
• Worked on various projects migrating data from on-premises databases to Confidential Redshift, RDS, and S3.
• Implemented DevOps practices (CI/CD, Terraform, Kubernetes) with JIRA-Jenkins integration for automated deployments.
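A minimal sketch of the custom Airflow operator pattern mentioned above, assuming Airflow 2.x and the google-cloud-bigquery client; the operator name, table, and threshold are illustrative placeholders rather than artifacts from any specific project.

```python
# Illustrative custom Airflow operator: fail the task when a BigQuery table
# has fewer rows than expected. Names and thresholds are placeholders.
from airflow.models import BaseOperator
from google.cloud import bigquery


class RowCountCheckOperator(BaseOperator):
    """Hypothetical data-quality operator used inside a DAG."""

    def __init__(self, table: str, min_rows: int, **kwargs):
        super().__init__(**kwargs)
        self.table = table          # fully qualified table, e.g. project.dataset.table
        self.min_rows = min_rows    # minimum acceptable row count

    def execute(self, context):
        client = bigquery.Client()
        query = f"SELECT COUNT(*) AS n FROM `{self.table}`"
        row = list(client.query(query).result())[0]
        if row.n < self.min_rows:
            raise ValueError(
                f"{self.table} has {row.n} rows, expected at least {self.min_rows}"
            )
        self.log.info("Row count check passed for %s (%d rows)", self.table, row.n)
```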
TECHNICAL SKILLS
Cloud Platforms: GCP (BigQuery, Dataflow, Dataproc, Composer, Vertex AI, Doc AI), AWS (Glue, Lambda, Redshift, EMR, S3, Athena), Azure (Data Factory, Databricks, Synapse)
Big Data & ETL: Apache Spark (PySpark/Scala), Kafka, Hadoop (HDFS, Hive), Airflow, DBT, Snowflake
DevOps & Automation: Terraform, Docker, Kubernetes (EKS/GKE), Jenkins, CI/CD (GitOps, Blue/Green), Ansible
Databases: BigQuery, Hive, Teradata, PostgreSQL, SQL, Snowflake, Cosmos DB, MongoDB, Cassandra, DynamoDB
Languages: Python, SQL, Scala, Shell Script, Java
Data Processing: Doc AI (OCR, claims processing), BigQuery ML, Vertex AI (ML workflows)
DevOps & APIs: Postman, Cloud Functions, IAM, Service Accounts, Jenkins, Terraform, Docker, Kubernetes (EKS/GKE), Ansible, Helm
Version Control Systems: Git, GitHub, Bitbucket, GitLab
IDEs: PyCharm, Eclipse, Visual Studio, IntelliJ
Operating Systems: UNIX, Linux (Red Hat, Ubuntu), Windows
RDBMS & NoSQL Databases: Teradata, Fivetran, HBase, RDS, Kibana, Oracle 9i/10g/11g/12c
Data Visualization: Power BI, Tableau
Automation & Monitoring: SQL alerts, Airflow scheduling, cost-optimized pipelines
Collaboration & Project Tools: JIRA, Confluence, Bitbucket, GitHub, GitLab
Methodologies: Agile, Scrum
Professional Experience:
Aetna/CVS – Hartford, CT Nov 2023 – present
Role: Sr. Data Engineer (GCP Focus)
• Built a large-scale patient data indexing and retrieval system using Elasticsearch to provide near real-time search over clinical records, lab results, and treatment histories.
• Implemented data mining and ML-based patient risk scoring models using SparkML / PyTorch, predicting readmission likelihood and chronic condition progression.
• Designed distributed data pipelines processing high-volume healthcare data (structured + semi-structured) with Spark and cloud data lakes.
• Created NoSQL-backed feature stores (MongoDB/Cassandra) to store patient-level embeddings and model signals for fast retrieval in downstream clinical decision systems.
• Enabled physicians and case managers to perform faster patient stratification, improving proactive care outreach and reducing preventable high-risk events.
• Designed and developed scalable data pipelines on GCP to ingest, normalize, and curate EHR/EMR data from multiple hospital and claims systems, enabling unified clinical analytics.
• Built dimensional data models in BigQuery (Star and Snowflake schemas) to support patient history, provider performance, and care outcome dashboards.
• Implemented synchronous & asynchronous data integration workflows using Pub/Sub for real-time HL7/FHIR data and Dataflow/Spark (Dataproc) for large batch ETL pipelines.
• Ensured HIPAA-compliant storage and processing via Cloud IAM, CMEK encryption, and DLP masking policies for PHI protection.
• Developed interactive clinical and operational reporting dashboards using Looker / Data Studio / Power BI, reducing manual analytics cycles by ~40%.
• Partnered with business analysts, clinical teams, and compliance to guarantee accurate metric interpretation and regulatory alignment.
• Designed and implemented scalable data pipelines on Google Cloud Platform using Dataproc, BigQuery, and GCS, reducing batch processing time by 40%.
• Automated end-to-end data ingestion, transformation, and validation workflows using Apache Airflow, improving data availability SLAs.
• Developed distributed Spark jobs for processing multi-terabyte datasets in real-time, achieving up to 50% reduction in latency for key business reports.
• Built and optimized data models (star/snowflake schemas) and partition strategies in BigQuery and Hive for efficient querying and cost optimization.
• Integrated Kafka to support real-time data streaming into the data lake; implemented checkpointing and delivery semantics for fault tolerance.
• Instrumented data quality checks and unit tests using PyTest and TDD practices to ensure 99.9% data accuracy in production.
• Used Shell and Perl scripts to automate data archival, cleanup, and GCS bucket management tasks.
• Monitored job performance and resource utilization using Stackdriver, implementing optimizations that reduced Dataproc costs by 30%.
• Automated pipeline setup using configuration sheets (storage/compute projects, encryption, BigQuery load type) and triggered Airflow DAGs via the Postman API.
• Built end-to-end ETL pipelines in Airflow (GCP Composer) using the BigQueryInsertJobOperator, the PythonVirtualenvOperator, and custom functions via the DIY Portal (illustrative DAG sketch at the end of this role).
• Optimized costs by refactoring legacy code and reducing pipeline runtime by 30%.
• Implemented SQL-based monitoring with automated email alerts for table refresh failures and record-count deviations (<5%).
• Used the Cloud Shell SDK in GCP to configure Dataproc, Cloud Storage, and BigQuery services.
• Worked on Spark streaming using Apache Kafka for real-time data processing.
• Orchestrated streaming/batch pipelines with Dataflow and Spark (PySpark).
• Used Spark (PySpark) to load datasets from source CSV files into Cassandra and Hive.
• Created Python Cloud Functions to process source JSON files and load them into BigQuery.
• Involved in ETL, Data Integration, and Migration by writing Pig scripts.
• Created a POC for project migration from the on-premise Hadoop MapR system to GCP.
• Used Snowflake for storing structured and semi-structured data from various sources.
• Used the Spark API over Cloudera Hadoop YARN to perform analytics on data.
• Utilized Airflow DAGs to automate the execution of BigQuery SQL scripts for data transformation, aggregation and analysis, making them reusable across different environments, enhancing the efficiency of ETL processes.
• Developed persistent SQL and Python functions in GCP, which played a crucial role in maintaining code consistency and abstraction.
• Designed and implemented incremental data loading using surrogate keys in ETL pipelines, ensuring efficient change tracking and data deduplication.
• Optimized BigQuery partitioning and clustering to augment query efficiency and curtail costs.
• Executed performance benchmarking between Teradata and BigQuery to substantiate performance enhancements.
• Designed and deployed distributed storage systems (e.g., Ceph, GlusterFS) on Kubernetes, ensuring scalable and highly available storage solutions for large datasets.
• Examined existing SQL scripts and developed the PySpark implementation plan.
• Wrote Python scripts to perform data transformation and migration from various data sources, building separate databases to store raw and filtered data.
• Developed Spark SQL and BigQuery transformations to unify clinical and operational datasets.
• Implemented Dataflow (Apache Beam) pipelines for streaming claims validation, improving data quality.
• Performed statistical analysis of healthcare data using Python and related technologies; leveraged Hadoop to construct scalable distributed data solutions.
• Configured JIRA-Jenkins automation to trigger pipeline deployments and update tickets in real time.
Technologies: SparkML, PyTorch/TensorFlow, Elasticsearch, NoSQL (MongoDB/Cassandra), GCP (BigQuery, Dataflow, Looker/Data Studio, Dataproc, Composer, Pub/Sub), Airflow, PySpark, SQL, Postman, Power BI, Python, Hadoop
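A hedged sketch of the Composer DAG pattern described in this role (a BigQueryInsertJobOperator running a partitioned transformation, with email alerting on failure); the project, dataset, schedule, and alert address are placeholders, not actual Aetna/CVS objects.

```python
# Illustrative Cloud Composer (Airflow) DAG; all identifiers are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

DEFAULT_ARGS = {
    "owner": "data-engineering",
    "email": ["de-alerts@example.com"],  # placeholder alert address
    "email_on_failure": True,
    "retries": 1,
}

with DAG(
    dag_id="claims_curation_daily",       # illustrative DAG name
    default_args=DEFAULT_ARGS,
    schedule_interval="0 6 * * *",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:

    transform = BigQueryInsertJobOperator(
        task_id="transform_claims",
        configuration={
            "query": {
                "query": """
                    CREATE OR REPLACE TABLE `my-project.curated.claims_daily`
                    PARTITION BY claim_date AS
                    SELECT *
                    FROM `my-project.raw.claims`
                    WHERE claim_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
                """,
                "useLegacySql": False,
            }
        },
    )
```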
BCBS – Delaware (Remote) Feb 2022 – Oct 2023
Role: Sr. Data Engineer
● Built feature engineering and model training pipelines in Dataproc + BigQuery to generate patient risk scores for chronic disease progression and readmission prediction.
● Designed production-grade ML workflows in Vertex AI / BigQuery ML, enabling reproducible training, model versioning, and deployment aligned to clinical validation cycles (illustrative BigQuery ML sketch at the end of this role).
● Created curated semantic layers and BigQuery feature stores to support Data Science teams and downstream analytic use cases.
● Integrated real-time patient monitoring signals (diagnostics, lab results, medication adherence) using Pub/Sub streaming into BigQuery for up-to-date risk scoring.
● Delivered care optimization dashboards in Looker / Power BI enabling physicians and care coordinators to identify high-risk patients earlier and intervene proactively.
● Worked with privacy/security teams to implement RBAC, audit logging, and data retention policies ensuring HIPAA and HITRUST compliance.
● Experience in building and architecting multiple Data pipelines, end-to-end ETL, and ELT processes for Data ingestion and transformation in GCP.
● Used the Cloud Shell SDK in GCP to configure Dataproc, Cloud Storage, and BigQuery services.
● Worked on Spark Streaming using Apache Kafka for real-time data processing.
● Used Spark Streaming to perform real-time analytics on streaming data: ingested data from sources such as Kafka, Flume, and Kinesis, processed it with complex algorithms and SQL queries, and stored the results in databases and data lakes for further analysis and visualization.
● Applied Spark Streaming for near-real-time ETL, ingesting data from multiple sources, transforming it, and loading it into data warehouses and other storage systems.
● Built data pipelines in Airflow (Cloud Composer) on GCP for ETL jobs using various Airflow operators.
● Analyzed existing SQL scripts and designed the PySpark implementation approach.
● Gained experience writing Spark applications in Scala and Python.
● Used BigQuery, Cloud Functions, and GCP Dataproc alongside HDFS, MapReduce, Kafka, Spark, HBase, Hive, and Hive UDFs to analyze large, business-critical datasets; used the Scala Kafka consumer API to retrieve data from Kafka topics.
● Created a POC for project migration from the on-premises Hadoop MapR system to GCP.
● Built SSIS packages to transfer data from flat files, Excel, and SQL Server using Business Intelligence Development Studio.
● Used Spark (PySpark) to load datasets from source CSV files into Cassandra and Hive.
● Created Python Cloud Functions to process source JSON files and load them into BigQuery.
● Created BigQuery authorized views for row-level security and for exposing data to other teams.
● Downloaded BigQuery data into pandas and Spark DataFrames for advanced ETL capabilities.
● Completed data extraction, aggregation, and analysis in HDFS using PySpark and stored the required data in Hive.
● Wrote Python scripts to perform data transformation and migration from various data sources, building separate databases to store raw and filtered data.
● Wrote Spark applications to perform data cleansing, transformations, aggregations, and to produce datasets per downstream team requirements.
● Involved in event enrichment, data aggregation, de-normalization, and data preparation needed for downstream model training and reporting.
● Developed complex ETL mappings to extract data from different file formats to load the data into Teradata Database.
● Involved in ETL, Data Integration, and Migration by writing Pig scripts.
● Developed MapReduce programs to cleanse the data in HDFS obtained from heterogeneous data sources.
● Designed and implemented MongoDB and associated RESTful web service.
● Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
Environment: Dataproc (Spark), Vertex AI, BigQuery, BigQuery ML / ML Engine, Cloud Composer, GCS, Python, Jupyter, Looker / Power BI, GCP, Hadoop, ETL, Informatica, Cloud Shell, Apache Kafka, MapReduce, MongoDB, Spark API, Hive, Scala, Spark
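A hedged sketch of the readmission risk-scoring workflow described in this role, using the google-cloud-bigquery Python client to train and score a BigQuery ML logistic regression model; dataset, table, and column names are placeholders rather than actual BCBS objects.

```python
# Illustrative BigQuery ML training + scoring; all identifiers are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression model; assumes a 0/1 label column 'readmitted_30d'.
train_sql = """
CREATE OR REPLACE MODEL `my-project.analytics.readmission_risk`
OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['readmitted_30d']) AS
SELECT age, num_prior_admissions, chronic_condition_count, readmitted_30d
FROM `my-project.features.patient_features`
WHERE admit_date < '2023-01-01'
"""
client.query(train_sql).result()  # wait for training to finish

# Score recent patients and keep the positive-class probability as the risk score.
score_sql = """
SELECT patient_id,
       (SELECT prob FROM UNNEST(predicted_readmitted_30d_probs)
        WHERE label = 1) AS risk_score
FROM ML.PREDICT(
  MODEL `my-project.analytics.readmission_risk`,
  (SELECT * FROM `my-project.features.patient_features`
   WHERE admit_date >= '2023-01-01'))
"""
risk_scores = client.query(score_sql).to_dataframe()
print(risk_scores.head())
```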
RJ Reynolds, Hyderabad, India March 2020 – Nov 2021
Role: Data Engineer
Responsibilities:
*Led end-to-end data migration project from Teradata to Google Cloud Platform (GCP), ensuring secure, accurate, and high-volume data transfer with minimal downtime.
*Designed and implemented scalable ETL pipelines using Google Cloud Composer (Apache Airflow) and Dataproc (PySpark) for automated batch data processing and orchestration (illustrative PySpark sketch at the end of this role).
*Developed robust Python scripts to extract, transform, and load complex datasets into BigQuery, optimizing data transformation workflows and ensuring schema compatibility.
*Built and optimized BigQuery datasets and views for analytical use cases, achieving significant improvement in query performance and cost efficiency.
*Utilized PL/SQL and advanced SQL techniques to cleanse, validate, and manipulate large-scale datasets during the Teradata to GCP migration phase.
*Integrated Cloud SQL with BigQuery and DataProc jobs, enabling seamless real-time and batch data processing between transactional and analytical systems.
*Created reusable Python modules and DAG templates for workflow orchestration in Cloud Composer, improving code maintainability and pipeline consistency.
*Performed performance tuning and cost optimization of BigQuery queries and storage, reducing overall project cost by 30%.
*Executed thorough data validation and reconciliation post-migration to ensure complete data integrity between Teradata and GCP targets.
*Collaborated with cross-functional teams, including analysts and data scientists, to understand business requirements and translate them into technical solutions.
*Deployed monitoring and logging solutions using Stackdriver and custom Python scripts to proactively track job status and failures in Cloud Composer and DataProc.
*Migrated historical and incremental datasets leveraging GCS (Google Cloud Storage), Cloud SQL, and BigQuery with custom staging and transformation logic.
*Built dynamic parameterized SQL queries and scripts for data aggregation, filtering, and reporting across multiple data sources and environments.
*Documented pipeline architecture, data flow diagrams, and technical specifications, enabling easier onboarding and knowledge transfer.
*Demonstrated strong communication skills by leading stakeholder meetings, preparing status reports, and presenting architecture designs to leadership and non-technical teams.
Technologies: GCP, BigQuery, Composer, Python, SQL, Teradata, GCS, Dataproc, PySpark, Airflow, Cloud SQL, Spark, Hive
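A hedged sketch of one Dataproc PySpark step from the Teradata-to-GCP migration described above, assuming Teradata extracts staged in GCS as CSV and the spark-bigquery connector available on the cluster; bucket, table, and column names are placeholders.

```python
# Illustrative Dataproc PySpark job: GCS-staged Teradata extract -> BigQuery.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("teradata_extract_to_bq").getOrCreate()

# Teradata extracts are assumed to have been staged in GCS by an earlier step.
raw = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("gs://my-staging-bucket/teradata_exports/orders/*.csv")
)

# Light normalization before loading the curated table.
curated = (
    raw.dropDuplicates(["order_id"])
    .withColumn("load_ts", F.current_timestamp())
)

(
    curated.write.format("bigquery")
    .option("table", "my-project.curated.orders")
    .option("temporaryGcsBucket", "my-temp-bucket")  # staging bucket for indirect write
    .mode("overwrite")
    .save()
)
```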
Home Depot, Bangalore, India Nov 2018 – Feb 2020
Role: Data Engineer / Python Developer
Responsibilities:
● Participated in the development of the application utilizing the Scrum (Agile) technique for requirements gathering, design, and deployment.
● Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems.
● Used Azure Databricks for data visualization, providing graphical presentation of query results.
● Worked on Databricks Machine Learning to manage services such as experiment tracking, model training, and feature development and management.
● Used Azure Databricks to build Spark clusters and set up high-concurrency clusters to speed up the preparation of high-quality data.
● Created and provisioned different Databricks clusters, notebooks, jobs, and autoscaling.
● Experience in Migration project from on-premises Data warehouse to Azure Cloud.
● Provided technical solutions on MS Azure HDInsight, Hive, HBase, MongoDB, Telerik, Power BI, Spotfire, Tableau, Azure SQL Data Warehouse data migration techniques using BCP and Azure Data Factory, and fraud prediction using Azure Machine Learning.
● Practical knowledge of creating Spark and Kafka streaming programs using Java
● Split streaming data into micro-batches and fed each batch into the Spark engine for processing (illustrative Structured Streaming sketch at the end of this role).
● Engineered Apache Spark batch processing jobs to run more quickly than MapReduce jobs.
● Created Spark apps for bespoke aggregation, data purification, and transformation.
● Participated in planning the Oozie workflow engine to run multiple Pig, Sqoop, and HiveQL jobs.
● Designed HBase row keys and data models using lookup-table and staging-table concepts to insert data into HBase tables.
● Built a data pipeline with Sqoop to analyze customer behavior and purchase histories as they are ingested into HDFS.
● Involved in designing interactive web pages for the frontend of the web application using HTML, JavaScript, and jQuery, and implemented CSS for improved look and feel.
● Deployed a Django web application on an Apache web server and via Carpathia cloud deployment.
● Created Spark APIs that facilitate inserting and updating data in Hive tables using JSON data gathered from an HTTP source.
● Created shell scripts to execute Hive scripts in Impala and Hive.
● Participated in setting up an Oozie task to import real-time data into Hadoop on a daily basis.
● Automated tasks using ADF's event, scheduled, and tumbling-window triggers.
● Stored catalog data and event sourcing in order processing pipelines using Cosmos DB.
● Made user-defined triggers, stored procedures, and functions for Cosmos Database.
● Developed DA requirements and mapped data flows, then provided developers with the mapping information and high-level designs (HLDs).
● Used Docker automation with the Amazon Elastic Container Service (ECS) scheduler to automate application deployment in the cloud.
● Implemented the ability to restart ETL for a specific date or date range, from a failure point, or from the beginning.
● Used Pentaho DI to extract data from various social analytics platforms (Facebook, Google+, Twitter, YouTube, iTunes, Google Analytics, and Sony games).
● Performed statistical analysis of healthcare data using Python and other technologies.
● Used the Agile development methodology to create a big data web application in Scala, combining functional and object-oriented programming.
● Responsible for logical and physical data modeling for numerous data sources on Confidential.
● ETL jobs were designed and created to extract data from the Salesforce replica and load it into the Redshift data mart.
● Incorporated Cassandra as a distributed, persistent metadata store to provide metadata resolution for network entities.
● Quickly searched, sorted, and grouped data from DataStax Cassandra using queries and analysis.
● Interconnected all jobs using complex mappings and workflows in Informatica PowerCenter Designer and Workflow Manager.
● Created multiple Databricks Spark jobs with PySpark to perform table-to-table operations.
● Manage MongoDB upgrades from 2.6 to 3.2, including hardware migrations.
● Set up DevOps pipelines for CI/CD using Git, Jenkins, and a Nexus repository.
Technologies: Agile, Azure, Azure Data Factory, Azure Databricks, Java, Python, PySpark, Scala, Spark, Spark SQL, Hive, HiveQL, Oozie, MapReduce, FTP, Sqoop, Pig, HDFS, Cosmos DB, ETL scripts, MySQL, Kafka, Cassandra, T-SQL, U-SQL, Teradata, EBS, ELB, RDS, Informatica, MongoDB, Jenkins
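A hedged sketch of the Kafka-to-Spark micro-batch pattern described above, using Spark Structured Streaming with checkpointing; the broker, topic, schema, and storage paths are placeholders, and the spark-sql-kafka package is assumed to be available on the cluster.

```python
# Illustrative Structured Streaming job: Kafka topic -> micro-batches -> data lake.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("orders_stream").getOrCreate()

# Placeholder event schema for the JSON messages on the topic.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("sku", StringType()),
    StructField("amount", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "orders")                       # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), order_schema).alias("o"))
    .select("o.*")
)

# Each micro-batch is written to the lake with checkpointing for recovery.
query = (
    events.writeStream.format("parquet")
    .option("path", "/mnt/datalake/orders")
    .option("checkpointLocation", "/mnt/checkpoints/orders")
    .trigger(processingTime="1 minute")
    .start()
)
```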
HSBC, Bengaluru, India Jan 2017 – Oct 2018
Role: Data Engineer
Responsibilities:
● Designed ARM templates in Azure and custom-built PowerShell scripts to automate resource creation and deployment, saving 140 hours of effort for each new environment.
● Created pipelines in Azure Data Factory using linked services/datasets to extract, transform, and load data between different storage systems like Azure SQL, Blob storage, Azure DW and Azure Data Lake
● Migrated data from on-premises SQL databases to cloud Azure SQL DB using SSIS packages.
● Created Databricks notebooks to perform ETL operations to stage business data based on requirements.
● Involved in landing different source datasets into Azure Data Lake Storage (ADLS) as Parquet files.
● Extensively used Agile methodology as the Organization Standard to implement the data Models.
● Performed regression testing for golden test cases from State (end-to-end test cases) and automated the process using Python scripts.
● Responsible for the design and development of Python programs/scripts to prepare, transform, and harmonize datasets for modeling.
● Decommissioning nodes and adding nodes in the clusters for maintenance.
● Adding new users and groups of users as per the requests from the client.
● Built logistic regression and linear regression models in Python to determine the accuracy of each model.
● Performed migration of customer and employee databases to Snowflake database from on-prem SQL Database.
● As part of the data migration, wrote numerous SQL scripts to identify data mismatches and loaded historical data from Teradata into Snowflake (illustrative reconciliation sketch at the end of this role).
● Cleaned and transformed data by executing SQL queries in Snowflake worksheets.
● Performed data quality issue analysis using SnowSQL by building analytical warehouses on Snowflake.
● Experience in data warehouse technical architectures, ETL/ELT, reporting/analytic tools, and data security.
● Developed SQL scripts to upload, retrieve, manipulate, and handle sensitive data (National Provider Identifier data, i.e., name, address, PAN, Aadhaar, phone number) in Teradata, SQL Server Management Studio, and Snowflake databases for the project.
● Worked on retrieving the data from FS to S3 using spark commands.
● Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
● Imported Legacy data from SQL Server and Teradata into Amazon S3
● Worked with stakeholders to communicate campaign results, strategy, issues, and needs.
Technologies: Azure, PowerShell scripting, Azure SQL, SQL, SSIS, ETL, Agile, regression testing, Python scripting, Snowflake, Spark, Teradata
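A hedged sketch of the post-migration reconciliation described above, comparing row counts between Teradata and Snowflake with their DB-API connectors; connection parameters and table names are placeholders, and both connector packages are assumed to be installed.

```python
# Illustrative Teradata vs. Snowflake row-count reconciliation (placeholders only).
import snowflake.connector
import teradatasql

TABLES = ["customer", "employee", "policy"]  # illustrative table names

sf_conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="MIGRATION_WH", database="MIGRATED_DB", schema="PUBLIC",
)
td_conn = teradatasql.connect(host="td-host", user="my_user", password="***")


def row_count(conn, table):
    """Return COUNT(*) for a table using a DB-API cursor."""
    cur = conn.cursor()
    try:
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]
    finally:
        cur.close()


for table in TABLES:
    td_rows = row_count(td_conn, table)
    sf_rows = row_count(sf_conn, table)
    status = "OK" if td_rows == sf_rows else "MISMATCH"
    print(f"{table}: Teradata={td_rows} Snowflake={sf_rows} [{status}]")
```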
Education:
• Master's in Information Technology, Wilmington University, New Castle, DE (2023)
• Bachelor's in Computer Science, Acharya Nagarjuna University, India (2017)