Akula Chandu Swamy
****************@*****.***
PROFESSIONAL SUMMARY:
Cloud Data Engineer with 9+ years of experience designing scalable, cloud-based data architectures and ETL pipelines on GCP, AWS, and Azure, driving data-driven decision-making across healthcare, finance, and enterprise domains.
Specialized in GCP services such as BigQuery, DataProc, Dataflow, Cloud Storage, Pub/Sub, Cloud Functions, and Cloud Composer for serverless and event-driven data processing.
Experienced with Google Cloud AlloyDB, including provisioning, configuration, query optimization, capacity planning, and high availability design.
Led database migration efforts from Oracle and PostgreSQL to AlloyDB using GCP Database Migration Service and custom data validation scripts.
Proficient in AWS services including Glue, Redshift, Lambda, S3, EC2, RDS, Athena, IAM, and CloudFormation for scalable data engineering workloads.
Experienced with Azure Data Factory, Databricks, ADLS, and Azure SQL Database for enterprise-grade data integration and processing.
Proficient in building advanced ETL workflows using tools like Informatica PowerCenter, AWS Glue, Azure Data Factory, and Apache Airflow for both batch and real-time data integration.
Expertise in enterprise data warehousing and modeling with Snowflake, BigQuery, and Redshift, including schema design, data partitioning, clustering, and optimization for high-performance analytics.
Strong background in distributed data processing using Spark, PySpark, Hadoop, Hive, MapReduce, and Kafka. Built scalable data pipelines for batch and streaming use cases, including fraud detection, campaign analytics, and trend forecasting.
Skilled in developing and tuning SQL and PL/SQL scripts for RDBMS (Oracle, SQL Server, Teradata, MySQL) and NoSQL (MongoDB, HBase, Cassandra) systems.
Hands-on with data lake and lakehouse implementations, data partitioning, complex schema evolution, CDC (Change Data Capture), and performance tuning techniques to support large-scale ingestion and analytical workloads.
Used Informatica to develop parameterized mappings, data quality rules, CDC logic, and automate complex ETL jobs across hybrid cloud platforms.
Automated infrastructure provisioning and CI/CD pipelines using Terraform, Jenkins, GitLab CI, and AWS CloudFormation. Managed containerized deployments using Docker and Kubernetes (GKE), ensuring reliable, reproducible, and scalable environments across dev, test, and prod.
Experienced with metadata management, data lineage, governance, and orchestration using tools like Glue Catalog, Apache NiFi, Oozie, and Azure Monitor.
Built dashboards and ML workflows using Tableau, Power BI, BigQuery ML, Vertex AI, and AI Platform for predictive analytics and executive reporting.
Led cloud migration and modernization initiatives by transitioning legacy ETL and database systems to GCP and AWS, improving scalability, performance, and cost efficiency.
Strong communication and collaboration skills with proven ability to work independently or in agile cross-functional teams.
Technical Skills:
Cloud Platforms:
Google Cloud Platform (GCP): BigQuery, DataProc, Dataflow, Cloud Storage, Cloud Composer (Airflow), Cloud Functions, Pub/Sub, Stackdriver
Amazon Web Services (AWS): S3, EC2, RDS, Redshift, DynamoDB, Glue, IAM, CloudFormation, Lambda, CloudWatch, Elastic Beanstalk
Azure: Data Factory, Databricks, ADLS, Event Hubs
Big Data Technologies:
Hadoop Ecosystem: HDFS, YARN, MapReduce, Hive, Sqoop, Spark, Spark Streaming, Kafka, Flume, Apache Beam, Zookeeper
Data Warehousing: Snowflake, BigQuery, Redshift
Programming & Scripting:
Python, SQL, Java, Scala, Unix Shell Scripting, HiveQL
Data Processing & Workflow Orchestration:
Apache Airflow, Alteryx, Apache NiFi, Terraform, Oozie
ETL & Data Pipelines:
PySpark, Pandas, Informatica PowerCenter, Apache Beam, Cloud Dataflow, AWS Glue
Databases:
AlloyDB, Cloud SQL, MySQL, PostgreSQL, SQL Server, Teradata, MongoDB, HBase
Data Visualization & Analytics:
Tableau, Power BI, BigQuery ML, AI Platform
Containerization & Automation:
Docker, Kubernetes (GKE), Terraform
Monitoring & Logging:
Stackdriver (GCP), CloudWatch (AWS), Azure Monitor
Version Control & CI/CD:
Git, GitLab CI/CD, Jenkins, Chef
Data Storage & Formats:
Parquet, ORC, Avro, JSON
Education:
Bachelor of Technology in Computer Science Engineering, GITAM University, May 2011 – Apr 2015
Related coursework: Object-Oriented Programming with Java, Advanced Databases, Cybersecurity, Data Science and Big Data, Data Mining, and Visualization.
Project Experience:
CVS Health, Dallas, TX Nov 2022 – Present
GCP Data Engineer
Description:
CVS Health is a prominent healthcare solutions provider focused on improving patient outcomes through advanced analytics and innovative technology. As a Data Engineer working with the GCP stack, I design, develop, and manage cloud-based systems that support key business functions, including pharmacy analytics, patient insights, and supply chain management. By leveraging Google Cloud technologies, I drive enhancements in data processing efficiency, scalability, and accessibility, enabling informed, data-driven decisions.
Responsibilities:
Migrated on-premises ETL pipelines to Google Cloud Platform (GCP), leveraging tools like Cloud Dataproc, BigQuery, and Cloud Storage.
Provisioned, configured, and maintained Google Cloud AlloyDB clusters with high availability and automated failover mechanisms.
Designed database schemas, tables, indexes, and partition strategies tailored to healthcare data ingestion and query performance.
Led the migration of Oracle and PostgreSQL workloads to AlloyDB using GCP Database Migration Service, reducing latency and operational overhead.
Built scalable data pipelines using Apache Airflow in Cloud Composer, using operators such as BashOperator, PythonOperator with Python callables, and branching operators for Hadoop job workflows.
Designed DAGs to load incremental data from on-prem CSV files to BigQuery (a representative DAG sketch follows this list).
Configured Snowpipe to automatically load data from Google Cloud Storage (GCS) into Snowflake tables.
Automated event-driven workflows with Cloud Functions for file ingestion and processing into BigQuery.
Migrated Oracle database structures and data to PostgreSQL (AlloyDB) using GCP Database Migration Service.
Designed and optimized database indexing strategies to improve query execution times in AlloyDB.
Configured and tuned autovacuum settings in PostgreSQL to optimize performance and storage management.
Developed PL/pgSQL stored procedures to enhance data processing efficiency in PostgreSQL.
Implemented disaster recovery strategies for AlloyDB, ensuring minimal downtime and high availability.
Integrated AlloyDB with BigQuery, Firestore, Memorystore, and Spanner for hybrid cloud analytics.
Developed Spark applications using Python on Cloud Dataproc for processing large datasets.
Used BigQuery ML to create predictive models for advanced analytics and trend forecasting.
Optimized query performance and analytics capabilities using BigQuery for distributed data processing.
Collaborated with security teams to enforce IAM roles, VPC Service Controls, and audit logging for AlloyDB compliance (HIPAA, CCPA).
Optimized big data workflows by leveraging Cloud Storage and BigQuery for analytics and reporting.
Streamlined CI/CD workflows with GitLab CI/CD and integrated deployments with Jenkins for GCP services.
Automated infrastructure provisioning using Terraform, ensuring consistent and scalable deployment of data pipelines.
Enhanced system reliability by containerizing applications with Docker and managing services using Kubernetes (GKE).
Monitored data pipelines using Cloud Monitoring and set up alerts for performance optimization.
Ensured compliance with GDPR and CCPA by managing access policies with IAM and implementing robust security practices.
Designed ETL workflows using Python, SQL, and BigQuery to transform raw data into actionable insights.
Built real-time streaming solutions using Spark Streaming and integrated with Kafka for ingestion pipelines.
Configured GCS buckets and implemented event-driven data processing using Cloud Functions.
Developed GCP Cloud Functions to handle data transformations for arriving CSV files in GCS.
Integrated Cloud Dataproc with Spark for batch and streaming data processing workflows.
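A minimal sketch of the kind of Cloud Composer (Airflow) DAG used for the incremental CSV-to-BigQuery loads described above. It assumes Airflow 2.x with the Google provider package installed; the bucket, dataset, table, and schedule values are hypothetical placeholders, not actual project resources.
```python
# Hypothetical sketch: incremental load of CSV files landed in GCS into BigQuery.
# Bucket, dataset, table, and schedule are placeholder values.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="incremental_csv_to_bigquery",   # hypothetical DAG name
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    load_csv = GCSToBigQueryOperator(
        task_id="load_daily_csv",
        bucket="example-landing-bucket",                      # placeholder bucket
        source_objects=["incoming/{{ ds }}/*.csv"],           # incoming files grouped by run date
        destination_project_dataset_table="example_ds.staging_table",  # placeholder dataset.table
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_APPEND",   # append each incremental batch
        autodetect=True,
    )
```
In practice such a DAG would sit alongside downstream tasks (validation, MERGE into curated tables), but the single operator above captures the core incremental-load step.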
Environment: GCP (BigQuery, Cloud Dataproc, Cloud Storage, Cloud Composer, Cloud Functions, IAM, Cloud Monitoring), Python, SQL, AlloyDB, Terraform, Docker, Kubernetes (GKE), Apache Airflow, Spark, Kafka, Snowflake, Jenkins, GitLab CI/CD.
Thomson Reuters, Dallas, TX Aug 2020 – Nov 2022
AWS Data Engineer
Description:
Thomson Reuters is a global leader in providing information, technology, and expertise to legal, tax, and corporate professionals. As a Data Engineer on the Data Engineering team, I was responsible for architecting, designing, and managing cloud-based data infrastructure to support advanced analytics and data-driven solutions for legal, tax, and financial services. Leveraging big data technologies, I contributed to improving operational efficiency, data accuracy, and scalability across data storage and processing pipelines, enabling actionable insights and enhanced customer experience.
Responsibilities:
Designed and implemented data processing pipelines using AWS services such as AWS Glue, AWS Lambda, AWS Step Functions, and AWS EMR, efficiently extracting, transforming, and loading (ETL) large volumes of data into Amazon S3 and Redshift.
Developed and organized HDFS staging areas to optimize data storage, ensuring smooth and scalable data flow across systems.
Managed data migration from legacy systems, including SQL Server and Teradata, to Amazon S3, transforming and validating data for seamless integration into the cloud environment.
Automated data workflows by creating AWS Lambda functions and UNIX shell scripts, reducing manual intervention and significantly improving process efficiency.
Wrote complex SQL scripts for data reconciliation and loading, ensuring consistency across systems such as Teradata, SQL Server, and Snowflake.
Developed and maintained ETL mappings and workflows using Informatica PowerCenter to extract data from SQL Server and Teradata into AWS S3 and Redshift.
Integrated Informatica with AWS Glue for downstream transformations, improving reusability of metadata and standardizing ETL pipelines.
Implemented data quality checks and validation rules in Informatica to ensure consistency and accuracy of campaign data before loading into analytics environments.
Leveraged Pandas and PySpark to perform data cleaning, transformation, and analytics, improving data quality and consistency for easier analysis.
Engineered ETL pipelines using AWS Glue to ingest campaign data stored in S3 as ORC, Parquet, and text files into AWS Redshift, optimizing data storage and analytics workflows (see the sketch after this list).
Built data validation and quality checks in AWS Data Pipeline, AWS Glue, and Lambda to ensure data accuracy and consistency across ETL processes.
Worked with AWS Redshift, S3, Athena, RDS, and DynamoDB to store, query, and retrieve data for various use cases, ensuring secure and scalable data processing.
Automated Hadoop and Spark job deployments using AWS EMR and developed external Hive tables over data in S3, organizing and optimizing large datasets for analytics.
Migrated and loaded data into Snowflake using Amazon S3 as the source, streamlining ETL processes and ensuring data consistency.
Applied best practices in data orchestration and management, leveraging tools such as AWS EC2, IAM, RDS, and CloudFormation to automate resources and workflows.
Conducted data quality assurance by comparing datasets across multiple databases, identifying discrepancies, and correcting data corruption during transformations.
Designed and managed real-time data ingestion and processing pipelines using AWS Kinesis and Kafka, enabling near-real-time analytics and improved decision-making.
Worked extensively with Spark, Hive, and MapReduce to handle large-scale data transformations, performing performance tuning to optimize job execution.
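The Glue-to-Redshift path referenced above can be illustrated with a short, simplified PySpark Glue job. It assumes a Glue PySpark job with a pre-configured Redshift catalog connection; the bucket, connection, database, table, and column names (e.g., campaign_id) are hypothetical placeholders.
```python
# Simplified sketch of a Glue job loading campaign Parquet files from S3 into Redshift.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read campaign files (Parquet) from a placeholder S3 location
campaign_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/campaign/parquet/"]},
    format="parquet",
)

# Basic data quality gate: drop records with a missing key before loading
cleaned_dyf = Filter.apply(frame=campaign_dyf, f=lambda row: row["campaign_id"] is not None)

# Write to Redshift through a pre-defined Glue catalog connection (placeholder names)
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=cleaned_dyf,
    catalog_connection="example-redshift-connection",
    connection_options={"dbtable": "analytics.campaign_facts", "database": "examplewh"},
    redshift_tmp_dir="s3://example-bucket/tmp/redshift/",
)

job.commit()
```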
Environment: AWS (S3, Lambda, Glue, Redshift, Athena, EMR, Kinesis, RDS, DynamoDB), Snowflake, Informatica PowerCenter, HDFS, Hive, MapReduce, Spark (PySpark), Sqoop, Teradata, SQL Server, Python, Java, SQL, UNIX, Tableau, Git, Agile.
Amtrak, New York, NY Aug 2019 – July 2020
Data Engineer
Description:
Amtrak is a leading provider of intercity passenger rail services, leveraging data-driven solutions to enhance operational efficiency, passenger experience, and infrastructure management. As a Data Engineer, I designed, developed, and optimized cloud-based data pipelines and analytics platforms using AWS and Azure. By integrating big data technologies such as Spark, Kafka, and Snowflake, I enabled scalable ETL workflows, real-time analytics, and predictive maintenance models. My work supported critical business functions, including ridership insights, scheduling optimization, and operational intelligence, ensuring a seamless and efficient rail network.
Responsibilities:
Developed Python ETL pipelines for ingesting and transforming high-volume datasets from structured and unstructured sources.
Used Pandas, NumPy, SciPy, and Matplotlib for data processing, analysis, and visualization.
Created a Python/Django-based web application for real-time analytics and reporting.
Designed and maintained AWS-based big data solutions using S3, Lambda, EC2, RDS, Redshift, Glue, and DynamoDB.
Migrated on-prem SQL databases to Azure SQL DB and Snowflake, optimizing query performance and storage costs.
Created Azure Data Factory Pipelines for automated data ingestion and transformation.
Deployed Databricks Spark jobs for large-scale data processing and machine learning workflows.
Built PySpark-based ETL workflows for processing large datasets in Hadoop, Hive, and Spark clusters.
Optimized Spark jobs, reducing processing times by 30% across batch and streaming workloads.
Used Kafka with Spark Streaming for real-time event processing and fraud detection models (a simplified streaming sketch follows this list).
Optimized SQL queries, stored procedures, and indexing strategies, improving database performance.
Designed and managed Snowflake schema structures and partitioning strategies to optimize data storage, reduce data scanned, and improve query performance in the data warehouse environment.
Configured and enforced access control policies using Apache Ranger to ensure secure and compliant data access across the system.
Utilized Trino for distributed SQL querying across multiple data sources to efficiently access and analyze large datasets.
Implemented data security best practices and monitoring using CloudWatch and the ELK Stack.
Automated ETL deployment workflows using Jenkins, GitLab, and Terraform.
Designed Dockerized applications for scalable and containerized ETL processes.
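A simplified Structured Streaming sketch of the Kafka-based streaming work described above. The broker address, topic, schema fields, and output paths are hypothetical placeholders, and the simple amount threshold stands in for the actual fraud-scoring logic.
```python
# Hypothetical sketch: Kafka -> Spark Structured Streaming -> Parquet, with checkpointing.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-event-stream").getOrCreate()

# Expected shape of each event payload (illustrative only)
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read raw events from Kafka and parse the JSON value column
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
    .option("subscribe", "transactions")                  # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Flag unusually large transactions as a stand-in for the fraud-detection step
flagged = events.filter(col("amount") > 10000)

# Write flagged events to Parquet with checkpointing for fault tolerance
query = (
    flagged.writeStream.format("parquet")
    .option("path", "s3a://example-bucket/flagged-events/")             # placeholder output path
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/")  # placeholder checkpoint path
    .start()
)
query.awaitTermination()
```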
Environment: Python (Pandas, NumPy, PySpark, Django), AWS (S3, Lambda, Glue, Redshift, EMR, RDS, DynamoDB), Azure (Data Factory, Databricks, ADLS), Snowflake, HDFS, Hive, Spark, Kafka, SQL Server, Teradata, SQL, UNIX, Git, Jenkins, Docker, Kubernetes, Agile.
Publicis Sapient, India Aug 2017 – Jul 2019
Big Data Developer
Responsibilities:
Designed and implemented scalable big data pipelines to process and analyze large-scale datasets using tools such as Hadoop, Spark, and Hive, ensuring optimal performance and reliability.
Developed and maintained data ingestion frameworks using Sqoop, Kafka, and Flume, enabling seamless integration of structured and unstructured data into HDFS for analytics.
Engineered and automated ETL workflows using PySpark, AWS Glue, and Spark Streaming, delivering real-time and batch analytics to support critical business needs.
Migrated on-premises data systems to cloud platforms such as AWS and Google Cloud Platform, leveraging EMR, S3, BigQuery, and DataProc to enhance scalability and reduce costs.
Created and managed Hive external tables with optimized partitions, improving query execution speed and ensuring cost-effective data processing (see the sketch after this list).
Designed complex data models and schemas in Snowflake and Redshift, supporting advanced analytics, reporting, and decision-making processes.
Developed advanced Python scripts for data transformation, validation, and cleansing, ensuring high-quality and consistent data across systems.
Automated data pipeline orchestration and job scheduling using tools such as Apache Airflow and Oozie, optimizing workflows with robust UNIX shell scripts.
Designed and implemented real-time data processing pipelines using Kafka and Spark Streaming, enabling time-sensitive analytics for mission-critical operations.
Tuned SQL queries and optimized MapReduce workflows, utilizing advanced indexing and query optimization techniques for processing massive datasets.
Integrated big data solutions with business intelligence tools like Tableau and Power BI, delivering interactive dashboards and actionable insights.
Implemented data governance frameworks, including metadata management, data lineage, and auditing mechanisms, ensuring compliance with regulatory standards such as GDPR and HIPAA.
Established CI/CD pipelines for data infrastructure deployment using Jenkins, Terraform, and Docker, accelerating delivery and reducing operational overhead.
Monitored and optimized cluster performance using tools like Ambari, ensuring high availability, reliability, and efficiency of enterprise data pipelines.
Supported streaming data architectures for high-velocity data sources using frameworks like Apache Flink and Kinesis Data Streams, enhancing real-time data processing.
Recommended and implemented columnar file formats such as Parquet and ORC, improving storage efficiency and query performance.
Collaborated with data scientists, business analysts, and DevOps teams to design and deliver end-to-end solutions aligned with organizational goals.
Mentored junior engineers, conducted code reviews, and established best practices for big data development, ensuring scalability and efficiency.
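A brief sketch of the partitioned Hive external table pattern referenced above, using PySpark with Hive support. The HDFS paths, database, table, and column names are hypothetical placeholders chosen for illustration.
```python
# Hypothetical sketch: write partitioned Parquet to HDFS and expose it as a Hive external table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-external-table-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Example input: a cleansed DataFrame of click events that includes an event_date column
events = spark.read.parquet("hdfs:///data/staging/click_events/")  # placeholder path

# Partition by event_date so queries that filter on date prune partitions
(events.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("hdfs:///data/warehouse/click_events/"))

# Register a matching external table; Hive tracks only metadata, the data stays in place
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.click_events (
        user_id STRING,
        page    STRING,
        ts      TIMESTAMP
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/warehouse/click_events/'
""")

# Discover the partitions written above so they become queryable
spark.sql("MSCK REPAIR TABLE analytics.click_events")
```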
Environment: Hadoop, Spark, Hive, Sqoop, Kafka, Flume, HDFS, AWS (S3, EMR, Redshift, Glue), GCP (BigQuery, DataProc), Snowflake, PySpark, Python, UNIX, Tableau, Power BI, Apache Airflow, Oozie, Terraform, Docker, Jenkins, Ambari, Git.
ICICI Bank, India June 2015 – August 2017
SQL Developer
Responsibilities:
Developed, optimized, and maintained SQL queries, scripts, and stored procedures for data retrieval, manipulation, and reporting using SQL and PL/SQL.
Collaborated with business analysts and stakeholders to understand and document database requirements.
Participated in the design and modeling of relational databases using tools like Microsoft SQL Server Management Studio and ER modeling tools, ensuring data accuracy, integrity, and performance.
Created and modified database objects, including tables, views, indexes, and triggers using SQL Data Definition Language (DDL) and Data Manipulation Language (DML).
Implemented data validation and cleansing procedures using SQL functions and expressions to enhance data quality.
Tuned SQL queries and optimized database performance by creating indexes, optimizing query execution plans, and applying query hints. Utilized query profiling and performance tuning tools compatible with the chosen database platform.
Worked with database administrators to ensure efficient data storage and access in Oracle Database and Microsoft SQL Server.
Collaborated with cross-functional teams to integrate databases into Java-based applications and systems.
Supported data migration and ETL processes using Talend and SSIS, including data extraction, transformation, and loading.
Designed and maintained ETL workflows using Informatica PowerCenter to extract, transform, and load data from Oracle and SQL Server into enterprise data warehouses.
Developed Informatica mappings using transformations such as Lookup, Aggregator, Filter, and Router to implement business logic and improve data quality.
Scheduled and monitored Informatica workflows and sessions, resolving failures and ensuring timely delivery of financial reports.
Created and maintained documentation for database schemas, processes, and best practices using the team's documentation tools and version control systems.
Stayed up-to-date with the latest database technologies, trends, and best practices to continuously improve SQL development processes.
Assisted in troubleshooting and resolving database-related issues and errors, working closely with the support team and using diagnostic and monitoring tools.
Collaborated with software developers to optimize SQL queries within applications, ensuring compatibility with application frameworks and libraries.
Monitored database performance and implemented maintenance tasks as necessary, using database-specific tools and scripts.
Worked with version control systems to track changes and updates to SQL scripts and database objects.
Environment: Informatica PowerCenter, Microsoft SQL Server, Oracle Database, SQL, PL/SQL, Microsoft SQL Server Management Studio, Talend, SSIS, Java.