
Data Engineer Cloud Platform

Location:
Kissimmee, FL
Posted:
April 23, 2025


Swapna Kasireddy

Data Engineer

Email: *********@*****.***

Contact no: +1-223-***-****

LinkedIn: https://www.linkedin.com/in/swapna-de/

PROFESSIONAL SUMMARY:

Big Data and Cloud data engineering professional with 9+ years of experience building distributed data solutions, data analytics applications, and ETL and streaming pipelines, leveraging the Hadoop ecosystem, the Databricks platform, and AWS and GCP cloud services.

Extensive experience building data pipelines with Apache Spark, Hive, Kafka, and orchestration using Apache Airflow/Cloud Composer.

Expertise with Google Cloud Platform (GCP): BigQuery, Dataproc, Cloud Storage, Dataflow, Pub/Sub, and Cloud Composer (Airflow), supporting real-time and batch processing at scale.

4+ years of GCP experience in designing scalable ETL pipelines with BigQuery, Dataproc, and Cloud Composer.

Proficient in Google Cloud Platform (GCP) services, including Vertex AI, BigQuery, Dataflow, and Dataproc, with expertise in designing frameworks for machine learning (ML) and large language model (LLM) operationalization. Skilled in AI governance and compliance, data governance, and metadata management, with strong programming expertise in Python, Spark, and SQL.

Skilled in designing and implementing scalable data pipelines for structured, semi-structured, and unstructured data across cloud platforms, particularly Google Cloud Platform (GCP), including BigQuery, Dataproc, Cloud Spanner, and Dataflow.

Hands-on experience with Google's IAM APIs, security policies, and real-time data streaming.

Proficient in building distributed data solutions, optimizing Databricks workflows, and creating scalable ETL pipelines using advanced data engineering techniques.

Demonstrated hands-on experience with REST API development, IAM policies, and infrastructure development on AWS, with a strong focus on security and scalability.

Designed and implemented ETL pipelines using Snowflake to process and load large-scale datasets for unified data warehousing.

Hands-on experience with Airflow DAG creation, execution, and monitoring.
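For illustration, a minimal sketch of the kind of Airflow DAG referenced above (DAG, task, and callable names are hypothetical placeholders, not taken from any specific project):

# Minimal Airflow 2.x DAG sketch: a daily extract -> transform sequence.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_data(**context):
    # Placeholder extract step; a real task would pull from a source system.
    print("extracting data for", context["ds"])

def transform_data(**context):
    # Placeholder transform step; a real task would run Spark/SQL logic.
    print("transforming data for", context["ds"])

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_data", python_callable=extract_data)
    transform = PythonOperator(task_id="transform_data", python_callable=transform_data)
    extract >> transform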

Extensive knowledge of CI/CD pipelines, Git workflows, and infrastructure-as-code tools such as Terraform.

Strong understanding of dimensional data modeling, encompassing Star and Snowflake schemas.

Proficient in CI/CD practices using Jenkins and container orchestration with Kubernetes.

Experience in the Extraction, Transformation, and Loading (ETL) of data from multiple sources into data warehouses and data marts, including developing ETL data pipelines using PySpark.
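As a rough illustration of this kind of PySpark ETL step (paths, column names, and formats are hypothetical placeholders, not from any specific engagement):

# PySpark batch ETL sketch: read raw CSV, clean and enrich, write partitioned Parquet.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

raw = (spark.read
       .option("header", True)
       .csv("s3a://example-bucket/raw/orders/"))

cleaned = (raw
           .dropDuplicates(["order_id"])
           .withColumn("order_ts", F.to_timestamp("order_ts"))
           .withColumn("order_date", F.to_date("order_ts"))
           .filter(F.col("amount").cast("double") > 0))

(cleaned.write
 .mode("overwrite")
 .partitionBy("order_date")  # partition the curated layer by date for efficient downstream reads
 .parquet("s3a://example-bucket/curated/orders/"))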

Architected, deployed, and managed cloud infrastructure components on Google Cloud Platform (GCP) to support data processing and analytics workloads.

Architected scalable ETL workflows using Dataproc (Apache Spark) for processing terabyte-scale datasets, optimizing performance and reducing data processing time by 30%.

Expert in designing and managing ETL/ELT pipelines using tools like Cloud Dataflow, BigQuery, and Cloud Storage.

Experienced in dimensional data modeling using ER/Studio, Erwin, and Sybase PowerDesigner, including Star and Snowflake schemas, fact and dimension tables, and conceptual, logical, and physical data models.

Expertise with AWS services including Glue, RDS, DynamoDB, Kinesis, CloudFront, CloudFormation, S3, Athena, SNS, SQS, X-Ray, Elastic Load Balancing (ELB), and Amazon Redshift.

Data Pipeline Design & Optimization: 5+ years of experience building and optimizing scalable data pipelines using tools like Google Cloud Composer, Dataflow, and Apache Beam.

ETL & Big Data Processing: Skilled in ETL processes and the Apache Hadoop and Spark ecosystems for large-scale data handling and transformation.

Experienced in building, deploying, and managing SSIS packages with SQL Server Management Studio, including creating and configuring SQL Server Agent jobs, configuring data sources, and scheduling packages through SQL Server Agent.

Adept at using BI tools such as Power BI and QlikView to enhance reporting capabilities and develop BI applications per client requirements.

Advanced Python and Spark programming for large-scale data solutions.

Proficient in cloud services: AWS (Redshift, Glue), GCP (BigQuery, Dataflow, Pub/Sub), and Azure.

Strong experience in SQL, including advanced functions and query optimization.

Experience working with the Apache Hadoop open-source distribution and related technologies, including HDFS, MapReduce, Python, Pig, Hive, YARN, Hue, HBase, Sqoop, Oozie, Zookeeper, Spark, Spark Streaming, Storm, Kafka, Cassandra, Impala, Snappy, Greenplum, MongoDB, and Mesos.

Well-versed in Tableau Desktop and Tableau Online, with a strong understanding of their functionalities.

Experienced in using agile methodologies, including Extreme Programming (XP), Scrum, and Test-Driven Development (TDD).

TECHNICAL SKILLS:

Hadoop/Spark Ecosystem

Hadoop, MapReduce, Pig, Hive/impala, YARN, Kafka, Flume, Oozie, Zookeeper, Spark, Airflow, Apache Beam

Cloud Platforms

AWS: Amazon EC2, S3, RDS, IAM, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, EMR, Redshift, DynamoDB

GCP: BigQuery, Cloud Storage, Dataflow, Vertex AI, Dataproc, Pub/Sub, Cloud Data Fusion, Cloud Composer, Cloud Functions, AI Platform, Cloud SQL/Cloud Spanner, Looker, Cloud Run

ETL/BI Tools

Informatica, SSIS, Tableau, Power BI, SSRS

CI/CD

Jenkins, Kubernetes, Helm, Docker, Splunk, Ant, Maven, Gradle.

Ticketing Tools

JIRA, Service Now, Remedy

Database

Oracle, SQL Server, Cassandra, Teradata, PostgreSQL, Snowflake, HBase, MongoDB

Programming Languages

Java, Scala, PL/SQL, R; Hibernate (ORM framework)

Scripting

Python, Shell Scripting, JavaScript, jQuery, HTML, JSON, XML.

Web/Application server

Apache Tomcat, WebLogic, WebSphere; Tools/IDEs: Eclipse, NetBeans

Version Control

Git, Subversion, Bitbucket, TFS.

Scripting Languages

Python, Scala, R, PL/SQL, Shell Scripting

DevOps Tools

Jenkins, Docker, Kubernetes, Bitbucket

Platforms

Windows, Linux (Ubuntu), Mac OS, CentOS (Cloudera)

Machine Learning & MLOps

Vertex AI, AutoML, MLOps for GenAI, real-time CDC ingestion

PROFESSIONAL EXPERIENCE:

Rappahannock Electric Cooperative, Fredericksburg, VA Sep 2021 – Present

Data Engineer

The project focused on migrating and optimizing the organization's data infrastructure on Google Cloud Platform (GCP) to enhance scalability, security, and real-time analytics capabilities. It involved designing data pipelines, ETL processes, and data lakes to ensure seamless data integration and processing. The objective was to modernize legacy data systems, improve data accessibility, and support AI/ML workloads for advanced analytics. The project emphasized cost optimization, security compliance, and high availability. Additionally, it aimed at implementing CI/CD pipelines for automated deployments and enhancing observability with monitoring and logging solutions. The solution was built to support real-time and batch processing of mission-critical operational data.

Responsibilities:

Designed and developed scalable, secure, and high-performance data pipelines on Google Cloud Platform (GCP).

Built ETL/ELT workflows using Cloud Dataflow, Apache Beam, and BigQuery for large-scale data processing.
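A minimal sketch of such a Beam pipeline (project, subscription, and table names are placeholders; runner options would be supplied at deploy time):

# Apache Beam sketch: stream JSON events from Pub/Sub into BigQuery (runs on Dataflow).
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # add --runner=DataflowRunner, --project, etc. when deploying

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           subscription="projects/example-project/subscriptions/events-sub")
     | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "example-project:analytics.events",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))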

Migrated on-premises data warehouses to BigQuery and Cloud Storage, optimizing query performance and cost.

Implemented data ingestion pipelines from various sources, including IoT devices, relational databases, and streaming services.

Designed and optimized ETL pipelines using Dataflow, Dataproc, and BigQuery to support large-scale data processing.

Implemented Delta Live Tables for real-time data ingestion and transformation.

Automated workflows with Apache Airflow (Cloud Composer) to support advanced analytics.

Designed and implemented feature engineering pipelines for ML workflows using Vertex AI and BigQuery ML, enabling advanced analytics and AI model deployment.

Automated workflows using Cloud Composer (Apache Airflow) to ensure efficient data orchestration and reduced manual intervention by 30%.

Collaborated with AI teams to enable ML model training and inference using Vertex AI.

Ensured compliance with data governance and security standards through robust IAM policies, encryption, and auditing.

Built and maintained Star Schema models to enhance query performance and storage efficiency.

Automated data pipeline deployments using Terraform and Kubernetes.

Troubleshot and resolved data quality issues, ensuring 99.9% data accuracy.

Developed real-time data processing solutions using Pub/Sub, Dataflow, and Bigtable to support mission-critical analytics.

Designed data lake architecture leveraging Cloud Storage, BigQuery, and Dataproc for structured and unstructured data.

Automated data pipeline deployments using Terraform, Cloud Composer (Apache Airflow), and CI/CD tools.

Ensured data governance, security, and compliance by implementing IAM roles, encryption, and auditing.

Developed data transformation logic using SQL, Python, and Spark on Dataproc for batch and real-time workloads.

Monitored data pipelines using Cloud Logging, Cloud Monitoring, and Prometheus/Grafana dashboards.

Optimized BigQuery queries to enhance performance and reduce cloud costs.

Collaborated with data scientists and analysts to enable AI/ML model deployment in production environments.

Worked closely with business stakeholders to understand data requirements and improve decision-making capabilities.

Ensured disaster recovery and high availability strategies were in place using multi-region GCP services.

Implemented data validation and anomaly detection mechanisms to maintain high data quality.

Designed data retention and lifecycle management strategies for optimized storage usage.

Managed GCP networking configurations such as VPC, Firewall Rules, and Cloud NAT for secure data transfers.

Worked in Agile (Scrum) environments, participating in sprint planning, daily stand-ups, and retrospectives.

Automated metadata management and lineage tracking using Data Catalog and Looker for analytics.

Troubleshot and resolved performance bottlenecks, system failures, and data inconsistencies.

Assisted in evaluating new GCP technologies to enhance the organization's cloud data platform.

Provided technical support, training, and mentorship to junior engineers.

Designed and implemented data partitioning and clustering strategies to optimize BigQuery performance and reduce costs.
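A minimal sketch of this approach using the BigQuery Python client (dataset, table, and column names are hypothetical):

# Partitioned + clustered BigQuery table: partition pruning and clustering reduce scanned bytes and cost.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE TABLE IF NOT EXISTS analytics.events (
  event_date DATE,
  customer_id STRING,
  event_type STRING,
  payload JSON
)
PARTITION BY event_date
CLUSTER BY customer_id, event_type
"""

client.query(ddl).result()  # wait for the DDL job to complete

Queries that filter on event_date and customer_id then scan only the relevant partitions and clustered blocks rather than the full table.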

Optimized SQL queries and analytical workloads for cost efficiency in BigQuery and Cloud SQL.

Deployed machine learning models using Vertex AI and TensorFlow for predictive analytics.

Developed CI/CD pipelines using Cloud Build and GitHub Actions to automate deployments.

Environment: Hadoop/Big Data ecosystem (Spark, Kafka, Hive, HDFS, Sqoop, Oozie, Cassandra, MongoDB), SQL, Cloud Spanner, Google Cloud Storage, Python 3.x, PySpark, data warehousing, GCP (BigQuery, Dataproc, Dataflow, Pub/Sub), Vertex AI, TensorFlow, Terraform, Jenkins, Kubernetes, Looker, Databricks, Delta Live Tables, Jira, Agile/Scrum

Capital One, Richmond, VA May 2018 – Aug 2021

Sr. Data Analyst

The project involved building scalable data pipelines to ingest, process, and unify customer data from multiple sources using PySpark, Snowflake, and Kafka. Real-time and batch ETL workflows were implemented on AWS, leveraging services like Glue, Lambda, and S3 for efficient data transformation and storage. Automation with tools like Airflow, Cloud Composer, and Terraform improved orchestration, reducing manual intervention by 30%. Advanced data modeling in Snowflake and Databricks optimized query performance and supported machine learning workflows. The project delivered robust, low-latency data solutions, enabling actionable insights and supporting analytics across the organization.

Responsibilities:

Designed and built scalable, distributed data pipelines to ingest and process customer data from multiple sources.

Developed and managed data solutions within a Hadoop Cluster environment using Hortonworks distribution.

Created persistent customer keys to unify customer data across various accounts and systems.

Designed and implemented scalable ETL pipelines on AWS, using PySpark for data transformation and Snowflake for data warehousing.

Built and optimized distributed data pipelines using PySpark and Snowflake for large-scale datasets.

Integrated and processed streaming data using Kafka and AWS Glue for real-time analytics.

Developed and maintained ETL workflows using Airflow and Cloud Composer.

Improved query performance by 30% through data model optimization and clustering strategies.

Designed and implemented Star Schema models for enhanced reporting and analytics.

Collaborated with business intelligence teams to support AI/BI Genie integrations.

Developed distributed data processing solutions in PySpark to handle large-scale datasets in AWS EMR.

Utilized AWS S3 as the main data storage solution, integrating with Snowflake for efficient data loading and querying.

Managed data ingestion from various sources into Snowflake via AWS Glue and Apache Kafka for real-time streaming data processing.

Built real-time data pipelines with Kafka to process and stream large volumes of data, ensuring low-latency data delivery to Snowflake.
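A minimal Spark Structured Streaming sketch of this kind of Kafka consumer (broker, topic, and paths are placeholders; in the pipeline described above, the landed data would subsequently be loaded into Snowflake):

# Structured Streaming sketch: consume a Kafka topic and land micro-batches as Parquet.
# Requires the spark-sql-kafka package on the classpath.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka_stream").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "customer-events")
          .option("startingOffsets", "latest")
          .load()
          .select(F.col("key").cast("string"),
                  F.col("value").cast("string"),
                  "timestamp"))

query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/stream/customer_events/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/customer_events/")
         .trigger(processingTime="30 seconds")
         .start())

query.awaitTermination()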

Built and monitored real-time streaming pipelines with Kafka and Pub/Sub, achieving low-latency data delivery.

Designed and executed workflows with Cloud Composer (Airflow), improving orchestration efficiency.

Improved query performance by 30% through optimized data models and partitioning strategies.

Implemented real-time data ingestion and transformation workflows in Snowflake using Kafka, achieving a 20% reduction in data processing latency.

Designed data models using Neptune DB for unified customer views across multiple data sources.

Developed real-time data streaming solutions using Kafka and AWS Glue, reducing data processing latency by 20%.

Automated ETL workflows using Cloud Composer, enhancing orchestration efficiency and reducing manual intervention.

Optimized Snowflake schemas and queries, resulting in a 30% improvement in query performance and cost efficiency.

Collaborated with data science teams to design and deploy ML models using advanced feature engineering techniques.

Created CI/CD pipelines for continuous deployment and monitoring of ETL processes.

Developed data quality checks with Airflow/Cloud Composer, enhancing accuracy by 80%.

Automated pipeline monitoring using Cloud Composer and optimized ETL processes with BigQuery for high-quality data analysis.

Designed and optimized complex SQL queries and stored procedures in Snowflake for data extraction and reporting, achieving a 20% reduction in processing time.

Created technical solution designs and unit test cases to support UAT and functional testing, ensuring high data accuracy.

Developed and optimized complex SQL queries in Snowflake for data extraction, transformation, and loading.

Developed and monitored ETL workflows using Airflow and Cloud Composer to support scalable data pipelines.

Automated real-time data streaming using Kafka and Pub/Sub, enabling low-latency analytics in Snowflake.

Deployed Airflow DAGs to orchestrate end-to-end data processes, reducing manual intervention by 30%.

Collaborated with stakeholders to enhance data models and schemas in BigQuery for seamless data retrieval.

Designed and deployed data solutions using AWS Glue, DynamoDB, and Elasticsearch, reducing query response times by 30%.

Implemented RESTful APIs with AWS Lambda and secured integrations with IAM roles and policies.
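A minimal sketch of such a Lambda-backed REST endpoint (table name, payload shape, and route are hypothetical; access would be scoped by an IAM execution role):

# AWS Lambda handler sketch for a GET endpoint behind API Gateway (proxy integration).
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("customer_profiles")  # hypothetical table name

def lambda_handler(event, context):
    # API Gateway proxy integration supplies path parameters on the event.
    customer_id = (event.get("pathParameters") or {}).get("customer_id")
    if not customer_id:
        return {"statusCode": 400, "body": json.dumps({"error": "customer_id is required"})}

    item = table.get_item(Key={"customer_id": customer_id}).get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}

    return {"statusCode": 200, "body": json.dumps(item, default=str)}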

Enhanced system security by defining IAM roles, policies, and permissions, ensuring compliance with AWS standards.

Extracted and processed data from multiple file formats, including XML, JSON, CSV, and other compressed file formats.

Developed data partitioning and clustering strategies in Snowflake to enhance query performance and optimize storage costs.

Automated data processing workflows using AWS Lambda and AWS Step Functions for event-driven ETL processes.

Monitored the health and performance of ETL pipelines and Kafka streams using AWS CloudWatch and Kafka Monitoring Tools, ensuring reliable data flows.

Built scalable data pipelines on Databricks using PySpark for data ingestion, transformation, and validation from various sources.

Integrated Snowflake with Databricks to support analytics and machine learning workflows.

Automated data workflows with Databricks and Cloud Composer, achieving a 30% increase in efficiency.

Implemented data partitioning and clustering strategies in Snowflake, optimizing query performance and storage efficiency.

Automated data ingestion and transformation processes using Python and Snowpipe, reducing data processing time by 20%.

Designed and implemented Customer 360 solutions to unify patient and provider data across multiple systems.

Collaborated with cross-functional teams to integrate clinical, claims, and behavioral data for a single customer view.

Leveraged Apache Beam for distributed data processing in real time.

Managed and configured the CDP to capture, clean, and unify patient/member data from various touchpoints.

Developed data pipelines to ingest first-party, third-party, and multi-source data for deeper customer insights.

Collaborated with cross-functional teams to design and implement scalable data models, leveraging Star and Snowflake schemas.

Automated workflows using Apache Airflow and improved data ingestion latency by 20%.

Supported data science initiatives by delivering clean, processed data from Kafka and PySpark pipelines into Snowflake for advanced analytics and machine learning.

Environment: PySpark, ETL, Kafka, Snowflake, Hadoop, MapReduce, SQL Server, SQL scripting, PL/SQL, Python, AWS (EC2, S3, Athena, Lambda, Glue, Elasticsearch, RDS, DynamoDB, Redshift, ECS), Hive v2.3.1, Spark v2.1.3, Java, Scala, SQL, Sqoop v1.4.6, Airflow v1.9.0, Customer 360, Customer Data Platform (CDP), Oracle, Databricks, Tableau, Docker, Maven, Git, Jira

Anthem, Richmond, VA Feb 2016 – Jan 2018

Data Analyst

Description:

The project aimed to design and implement a scalable cloud-based data platform on Google Cloud Platform (GCP) to support banking operations with enhanced data processing, analytics, and decision-making capabilities. It focused on migrating on-premises data pipelines to GCP, ensuring efficient data ingestion, transformation, and storage using BigQuery, Dataflow, and Cloud Storage. The objective was to modernize data infrastructure, optimize ETL workflows, and enable real-time data streaming with Pub/Sub and Apache Kafka. Security and compliance were prioritized by implementing IAM roles, encryption, and data governance policies. The project also aimed at cost optimization and performance tuning while ensuring high availability and reliability. Additionally, it supported machine learning and business intelligence initiatives by integrating with Looker and AI/ML services.

Responsibilities:

Integrated data from various sources, including HDFS and HBase, into Spark RDDs to enable distributed data processing and analysis.

Integrated disparate data sources (HDFS, HBase) into a unified platform using GCP services such as Cloud Storage and Dataproc.

Designed, developed, and deployed end-to-end data pipelines on GCP to support real-time and batch processing.

Implemented ETL/ELT frameworks using Cloud Dataflow, Apache Beam, and BigQuery for efficient data transformation.

Developed cloud-native data warehousing solutions using BigQuery, ensuring high performance and cost optimization.

Optimized Snowflake queries and warehouse configurations to enhance performance and reduce costs for large-scale banking datasets.

Developed and maintained ETL/ELT processes using Snowflake’s Snowpipe for continuous data ingestion from diverse sources (e.g., banking transactions, customer profiles).
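A minimal sketch of Snowpipe-based continuous ingestion (connection parameters, stage, and table names are placeholders; the external stage would normally reference a preconfigured storage integration):

# Snowpipe sketch: external stage + auto-ingest pipe for continuously loading JSON files.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account", user="etl_user", password="***",
    warehouse="ETL_WH", database="EXAMPLE_DB", schema="RAW",
)

setup_sql = [
    # Stage pointing at the landing bucket (credentials/storage integration configured separately).
    "CREATE STAGE IF NOT EXISTS txn_stage URL = 's3://example-bucket/transactions/'",
    # A single VARIANT column keeps the raw JSON intact for downstream modeling.
    "CREATE TABLE IF NOT EXISTS raw_transactions (record VARIANT)",
    # AUTO_INGEST lets cloud storage event notifications trigger loads as files arrive.
    "CREATE PIPE IF NOT EXISTS txn_pipe AUTO_INGEST = TRUE AS "
    "COPY INTO raw_transactions FROM @txn_stage FILE_FORMAT = (TYPE = 'JSON')",
]

cur = conn.cursor()
for stmt in setup_sql:
    cur.execute(stmt)
cur.close()
conn.close()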

Conducted performance tuning by analyzing query execution plans, clustering key optimizations, and partitioning large datasets.

Led data modeling initiatives, including Star Schema and Snowflake Schema design for efficient querying and reporting.

Implemented data security and governance measures including IAM roles and encryption.

Created API Wrappers for seamless data integration across services.

Deployed and managed Kubernetes clusters to ensure high availability and scalability.

Automated workflow orchestration using Cloud Composer (Apache Airflow) to manage complex data dependencies.

Integrated Pub/Sub messaging systems to enable real-time streaming and event-driven architectures.

Optimized data ingestion from multiple sources such as on-prem databases, APIs, and third-party services into GCP.

Ensured data security, encryption, and compliance with banking regulations (PCI DSS, GDPR, etc.).

Built and maintained Cloud Storage-based data lakes for scalable and cost-effective data storage.

Developed and maintained Terraform-based Infrastructure as Code (IaC) for automated cloud provisioning.

Implemented logging and monitoring solutions using Stackdriver, Cloud Logging, and Cloud Monitoring.

Designed and enforced data governance strategies, including role-based access control and audit logging.

Created materialized views and partitioned tables in BigQuery for efficient query performance.

Migrated on-premises databases to GCP using Dataproc, Data Fusion, and Database Migration Services.

Set up CI/CD pipelines using Cloud Build, GitHub Actions, and Terraform for seamless deployment.

Developed custom Python and SQL scripts for data transformation and validation.

Utilized Data Catalog to improve metadata management and data lineage tracking.

Enhanced data quality and integrity by implementing data validation checks and anomaly detection.

Collaborated with cross-functional teams to gather requirements and provide cloud-based solutions.

Optimized cloud resource usage to minimize costs while maintaining performance benchmarks.

Created technical documentation for data pipelines, architecture diagrams, and troubleshooting guides.

Conducted knowledge-sharing sessions and provided training on GCP best practices to teams.

Participated in Agile/Scrum methodologies, attending daily stand-ups, sprint planning, and retrospectives.

Resolved performance bottlenecks and improved query execution time using indexing and caching strategies.

Environment: Hadoop, Spark, ETL, Python, SQL, Apache Airflow, Cloud Pub/Sub, Cloud Scheduler, Cloud Composer, GCP (Cloud Storage, Dataflow, Data Fusion, Dataproc), Data Catalog, Looker, Snowflake, PySpark, Terraform, Data Integration & Data Migration


