Data Engineer Senior

Location:
Fuquay-Varina, NC, 27526
Salary:
65
Posted:
May 07, 2025

Resume:

Sri Upputuri

Senior Data Engineer

Phone number: +1-312-***-****

Email: ***********@*****.***

LinkedIn: https://www.linkedin.com/in/sri-upputuri-336283167/

Professional Summary

Versatile and results-driven Data Engineer with 8+ years of experience in designing, implementing, and optimizing ETL/ELT pipelines, data integration, and cloud infrastructure for scalable data processing and analytics.

Extensive experience working with Google Cloud Platform (GCP) and AWS, including services such as BigQuery, Snowflake, Redshift, Dataproc, Dataflow, AWS Glue, and Databricks.

Proven track record of data migrations from on-premises systems to cloud platforms, including the transition of Oracle 19c to Snowflake and BigQuery.

Specialized in real-time data processing using Apache Kafka, AWS Kinesis, Google Cloud Pub/Sub, Apache Flink, and Spark Streaming for Change Data Capture (CDC) and streaming analytics.

Skilled in developing and deploying machine learning models using TensorFlow, scikit-learn, and PySpark for applications like recommendations, fraud detection, and sales forecasting.

Expert in building and optimizing data lakes and data warehousing solutions, ensuring high performance, data consistency, and ACID compliance in platforms like Snowflake, BigQuery, and Redshift.

Extensive experience with data transformation and feature engineering using Python, SQL, PySpark, Pandas, and Spark SQL.

Proficient in cloud infrastructure automation using Terraform, CloudFormation, and Azure DevOps, improving efficiency and scalability.

Solid understanding of data governance, security, and compliance, implementing access control using IAM, KMS, and Secrets Manager.

Expertise in monitoring, observability, and log management using tools like Datadog, CloudWatch, Stackdriver, and Dynatrace.

Skilled in building interactive business intelligence dashboards with Power BI, Tableau, Looker, and Amazon QuickSight for data-driven decision-making.

Experienced in working in Agile environments and collaborating with cross-functional teams for end-to-end project delivery.

Notable Achievements:

Successfully led data migrations from Oracle to Snowflake and BigQuery, ensuring smooth data transition with minimal downtime.

Designed and implemented real-time CDC pipelines using Apache Kafka, AWS Kinesis, and Spark Streaming, enabling real-time analytics for business-critical applications.

Built and deployed machine learning models using TensorFlow and scikit-learn, enhancing business capabilities in recommendations, fraud detection, and inventory forecasting.

Optimized ETL workflows for large-scale data processing using PySpark, Apache NiFi, and AWS Glue, significantly improving pipeline efficiency.

Automated cloud infrastructure provisioning using Terraform, ensuring scalable and reproducible environments for data pipelines and cloud-based solutions.

Established comprehensive data governance and security policies, enforcing compliance and secure data access across cloud platforms.

Developed interactive dashboards with Power BI, Tableau, and Looker, providing real-time insights and data visualizations for stakeholders.

CERTIFICATIONS:

Google Cloud Certified Professional Data Engineer

Course Specialization Certificate - Data Engineering, Big Data, and Machine Learning on GCP

WORK EXPERIENCE:

GSK Pharmaceutical – USA Jan 2023 – Present

GCP Data Engineer

Responsibilities:

Led successful on-premises data migrations to GCP, moving data from Oracle 19c to the Snowflake Data Warehouse using Apache NiFi.

Defined optimized Snowflake virtual warehouse sizing for various workloads.

Designed migration plans and selected GCP services for hosting Oracle databases.

Developed and tested data extraction scripts for smooth transfers.

Utilized Apache NiFi for diverse data ingestion workflows into GCP.

Built data integration solutions with Oracle Data Integrator (ODI) and customized Talend components.

Set up a GCP Data Lake with Google Cloud Storage, BigQuery, and BigTable.

Automated workflows with Apache Airflow for Change Data Capture (CDC) services.
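
As an illustration of the Airflow-driven CDC automation described above, here is a minimal sketch of such a DAG; the DAG id, schedule, and extract/load callables are hypothetical placeholders, not the actual GSK pipeline.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_changed_rows(**context):
    # Placeholder: pull rows changed since the last watermark from the source database.
    pass

def load_delta_to_warehouse(**context):
    # Placeholder: append the extracted delta to the target warehouse table.
    pass

with DAG(
    dag_id="oracle_cdc_to_snowflake",      # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/15 * * * *",      # poll for changes every 15 minutes
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_changed_rows",
                             python_callable=extract_changed_rows)
    load = PythonOperator(task_id="load_delta_to_warehouse",
                          python_callable=load_delta_to_warehouse)
    extract >> load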

Configured GCP services using the Cloud Shell SDK, including Dataproc, Cloud Storage, and BigQuery.

Implemented scalable data solutions with Hadoop, including MapReduce programs and ETL pipelines in Databricks using Python.

Successfully migrated data to Azure HDInsight and Azure Blob Storage.

Optimized Apache NiFi pipelines that utilized Kafka and Spark Streaming, and designed data ingestion pipelines into Druid on GCP.

Created Databricks Spark Jobs with PySpark for various operations.

Extracted and analyzed data from data lakes and the EDW, and validated data flows using SQL and PySpark.

Designed Talend jobs, ODI mappings, and Sqoop scripts for data movement.

Conducted data transformation and cleansing using SQL and PySpark.

Implemented data pipelines, flows, and transformations with Azure Data Factory and Databricks.

Automated cloud infrastructure provisioning using Infrastructure as Code (IaC) and performed deployments through cloud shell environments.

Applied the complete SDLC process, including code reviews, source code management, and the build process.

Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming for streaming analytics in Databricks.
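
A minimal sketch of that mini-batch pattern, assuming a socket text source and a simple per-batch aggregation purely for illustration (the source host/port and field layout are hypothetical).

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="MiniBatchDemo")
ssc = StreamingContext(sc, batchDuration=10)            # 10-second mini-batches

lines = ssc.socketTextStream("localhost", 9999)         # hypothetical source
events = lines.map(lambda line: line.split(","))        # RDD transformation per mini-batch
counts = events.map(lambda cols: (cols[0], 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()                                         # emit results for each mini-batch

ssc.start()
ssc.awaitTermination()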

Developed and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.

Loaded data into BigQuery using Google Dataproc, GCS buckets, Hive, Spark, Scala, Python, gsutil, and shell scripts.

Developed Terraform scripts and deployed them via Cloud Deployment Manager to spin up resources such as cloud virtual networks.

Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, handling structured data with Spark SQL.
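
A minimal sketch of that JSON-to-Hive flow using the DataFrame API; the bucket path and table names are hypothetical.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("JsonToHive")
         .enableHiveSupport()
         .getOrCreate())

# Schema is inferred from the JSON; the path and table names are hypothetical.
events = spark.read.json("gs://example-bucket/raw/events/*.json")
events.createOrReplaceTempView("events")

cleaned = spark.sql("""
    SELECT event_id, user_id, CAST(event_ts AS timestamp) AS event_ts
    FROM events
    WHERE event_id IS NOT NULL
""")

cleaned.write.mode("overwrite").saveAsTable("analytics.events_cleaned")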

Set up the required Hadoop environments for clusters to run MapReduce jobs, and monitored Hadoop cluster connectivity and security using tools such as ZooKeeper and Hive.

Developed ETL systems using Python and the in-memory computing framework Apache Spark, and scheduled and maintained data pipelines at regular intervals in Apache Airflow.

Wrote Python scripts for data transformations on large datasets in Azure Stream Analytics and Azure Databricks.

Worked with HBase and MySQL for data optimization, and with SequenceFile, Avro, and Parquet file formats.

Created data frames in Azure Databricks using Apache Spark to perform business analysis.

Optimized and implemented data storage formats (e.g., Parquet, Delta Lake) through Databricks with effective partitioning strategies.

Built complex data pipelines using PL/SQL scripts, Cloud REST APIs, Python scripts, GCP Composer, and GCP Dataflow.
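
A minimal Apache Beam sketch of the kind of Python pipeline typically submitted to GCP Dataflow (with Composer scheduling the launch); the paths, record layout, and filter rule are hypothetical.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_record(line):
    # Hypothetical CSV layout: id,amount
    record_id, amount = line.split(",")
    return {"id": record_id, "amount": float(amount)}

options = PipelineOptions()  # pass --runner=DataflowRunner, --project, etc. at launch time

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://example-bucket/input/*.csv")    # hypothetical path
     | "Parse" >> beam.Map(parse_record)
     | "FilterPositive" >> beam.Filter(lambda r: r["amount"] > 0)
     | "Write" >> beam.io.WriteToText("gs://example-bucket/output/result"))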

Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver, and opened an SSH tunnel to Dataproc to access the YARN manager for monitoring Spark jobs.

Submitted Spark jobs using gsutil and spark-submit for execution on the Dataproc cluster.

Performed data wrangling to clean, transform, and reshape data using the pandas library.
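
A minimal pandas sketch of this kind of clean/transform/reshape pass; the file and column names are hypothetical.

import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date"])

df = df.dropna(subset=["customer_id"])                   # clean: drop rows missing a key
df["amount"] = df["amount"].fillna(0).astype(float)      # transform: normalize the amount column
monthly = (df
           .assign(month=df["order_date"].dt.to_period("M"))
           .pivot_table(index="month", columns="region",  # reshape: long to wide
                        values="amount", aggfunc="sum"))
print(monthly.head())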

Used digital signage software to display real-time metrics for monitoring migration progress.

Analyzed data using SQL, Scala, Python, Apache Spark, and presented analytical reports to management and technical teams.

Designed and executed Apache Spark jobs within Databricks to perform complex data transformations, utilizing Scala and Python for enhanced analytics.

Implemented data wrangling processes using Databricks, employing pandas library for cleaning, transforming, and reshaping data in an efficient manner.

Created firewall rules to access Google DataProc from other machines.

Used Python for data ingestion into BigQuery.
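
One way such an ingestion script could look with the official google-cloud-bigquery client; the project, dataset, table, and file names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client(project="example-project")           # hypothetical project id

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

with open("events.json", "rb") as fh:
    load_job = client.load_table_from_file(
        fh, "example_dataset.events", job_config=job_config)  # hypothetical table

load_job.result()  # wait for the load job to finish
print(f"Loaded {load_job.output_rows} rows")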

Ported on-premises Hive code to GCP (Google Cloud Platform) BigQuery.

Managed and reviewed Hadoop log files.

Followed AGILE (Scrum) development methodology for application development.

Involved in managing backup and disaster recovery for Hadoop data.

Environment: Snowflake, Apache NiFi, Oracle 19c, Oracle Data Integrator (ODI), Talend, Sqoop, Shell Script, Gsutil, Cloud REST APIs, Apache Spark, PySpark, Scala, Spark SQL, GCP, BigQuery, BigTable, GCS, DataProc, Cloud Composer, Cloud Shell SDK, Dataflow, Azure, ADF, Databricks, Azure Stream Analytics, HIVE, MapReduce, RDD, Spark Streaming, Hadoop, Apache Airflow, HBase, MySQL, Hive Tables, AVRO, Parquet, Delta Lake, Python, PL/SQL, SQL, Bash, Pandas, DataFrames, Terraform, Cloud Shell, Stackdriver, Zookeeper, SSH Tunnel, Yarn Manager, Hadoop Logs, Signage Software, Agile (Scrum)

Cisco – San Jose, CA Apr 2020 – Dec 2022

AWS Data Engineer

Responsibilities:

Designed, implemented, and optimized scalable ETL/ELT pipelines using Python and SQL to process structured and semi-structured data, ensuring reliable and fast data processing.

Designed and deployed cloud infrastructure using AWS services such as AMI and CloudFormation for scalable and efficient deployment.

Integrated diverse data sources, including MySQL, PostgreSQL, and MongoDB, into Google Cloud Storage for scalable and reliable storage.

Leveraged PySpark for distributed processing and transformation of high-volume telemetry data, enabling real-time analytics for critical business insights.

Built robust data lake solutions on Databricks, implementing Delta Lake architecture to ensure data reliability, version control, and ACID compliance.

Designed REST and SOAP API integrations to facilitate seamless data ingestion and system connectivity.

Engineered real-time streaming pipelines using Apache Kafka, Kafka Connect, and Google Cloud Pub/Sub, enabling Change Data Capture (CDC) for real-time analytics.
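
As a minimal sketch of consuming CDC events from such a pipeline, here is a kafka-python consumer; the topic, broker, group id, and event fields are hypothetical, not the actual Cisco implementation.

import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders.cdc",                                   # hypothetical CDC topic
    bootstrap_servers=["broker1:9092"],             # hypothetical broker
    group_id="cdc-analytics",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    change = message.value
    # Route inserts/updates/deletes to the downstream sink (e.g., a staging table).
    print(change.get("op"), change.get("after"))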

Loaded structured and semi-structured data from Google Cloud Storage into BigQuery for centralized data analysis and querying.

Optimized data ingestion workflows using Cloud Dataflow and Cloud Pub/Sub, enabling seamless and efficient analytics processing.

Engineered automated AWS infrastructure deployment with AWS CloudFormation and integrated it with Jenkins to create continuous deployment pipelines, enhancing operational efficiency.

Designed and implemented AWS-based data pipelines using AWS Glue for serverless ETL operations, automating data transformation and integration across multiple sources.

Integrated AWS Redshift for data warehousing and optimized analytical query performance by applying distribution keys, sort keys, and compression techniques.
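
A minimal sketch of applying distribution and sort keys on a Redshift table via psycopg2; the table definition, columns, and cluster endpoint are hypothetical.

import psycopg2

ddl = """
CREATE TABLE IF NOT EXISTS sales_fact (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2) ENCODE az64
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
"""

conn = psycopg2.connect(
    host="example-cluster.redshift.amazonaws.com",   # hypothetical endpoint
    dbname="analytics", user="etl_user", password="...", port=5439)
with conn, conn.cursor() as cur:
    cur.execute(ddl)   # the connection context manager commits on success
conn.close()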

Implemented scalable event-driven architecture with AWS Lambda and AWS Step Functions, automating workflows and reducing latency in data processing tasks.

Utilized Amazon RDS for high-performance relational databases, ensuring reliable and secure database management for transactional workloads.

Built monitoring solutions using Amazon CloudWatch for performance metrics and log monitoring to ensure operational health across the AWS environment.

Administered Snowflake for data warehousing, optimizing query performance with partitioning, clustering, and materialized views.

Supported data quality through automated labeling pipelines and enforced data privacy and provenance policies for reliable data usage.

Managed hybrid workloads across AWS and Google Cloud Platform (GCP), ensuring seamless integration and optimized performance.

Automated infrastructure provisioning with Terraform, managed code with GitHub, and implemented CI/CD pipelines with Jenkins, streamlining deployment processes.

Managed and optimized ELK Stack (Elasticsearch, Logstash, and Kibana) for log management, ensuring visibility and ease of troubleshooting.

Implemented robust monitoring and observability using Google Cloud Monitoring, Google Cloud Logging, and Stackdriver, ensuring system health and reliability.

Enforced data governance and compliance using Google Cloud Key Management Service (KMS) and Google Cloud Identity and Access Management (IAM) to manage data security and access control.

Built interactive business intelligence dashboards using Power BI and Tableau to deliver real-time insights.

Developed and optimized dashboards in Looker to enable data-driven decision-making across the organization.

Developed visualizations in Google Data Studio, enabling data-driven decision-making across the organization.

Enhanced data security by implementing AWS Identity and Access Management (IAM) roles and policies to control access and ensure compliance with data security regulations.

Enforced data security and governance frameworks using AWS KMS (Key Management Service) for data encryption and AWS IAM for managing secure access and permissions.

Administered Splunk Enterprise Security, configuring SOAR, Search Head Clustering, and Indexer Clustering, and maintained secure configurations across platforms.

Administered secure Linux-based environments with UNIX/Linux administration and automated routine tasks through shell scripting for efficient operations.

Environment: Python, SQL, Shell scripting, PySpark, Apache Kafka, Kafka Connect, Databricks, Delta Lake, Dataflow, Pub/Sub, AWS Glue, AWS Lambda, AWS Step Functions, MySQL, PostgreSQL, MongoDB, GCS, BigQuery, Amazon S3, Amazon RDS, Snowflake, AWS Redshift, Terraform, Jenkins, GitHub, AMI, Power BI, Tableau, Looker, Google Data Studio, Splunk, ELK Stack, Elasticsearch, Logstash, Kibana, AWS KMS, AWS IAM, CloudWatch, Cloud Run, Stackdriver, Google Cloud IAM, KMS.

First Citizens Bank, Raleigh, NC Dec 2018 – Mar 2020

AWS Data Engineer

Responsibilities:

Migrated on-prem Hadoop and OpenStack data environments to AWS EMR.

Deployed Kubernetes-based Spark clusters via Amazon EKS for distributed data processing.

Facilitated collaborative data operations using JIRA for sprint planning, Confluence for documentation, and Microsoft Teams for team coordination across pipelines and business data reporting workflows.

Applied ITIL and SDLC frameworks along with modern data engineering best practices to deliver reliable, scalable, and governed data pipelines for enterprise analytics and reporting.

Streamlined provisioning of data platforms and workflows using Terraform and Azure DevOps, automating infrastructure for ETL orchestration tools like Apache Airflow and AWS Glue, reducing manual configuration.

Built a multi-cloud data pipeline infrastructure using Terraform, supporting hybrid workloads across AWS, enabling reproducible, version-controlled pipeline environments for batch and stream processing.

Created technical documentation for data architecture, ETL workflows, monitoring protocols, and access control configurations using Visio, Excel, and Word, ensuring knowledge retention and handover readiness.

Developed Python and SQL-based data transformation scripts, automating ingestion, transformation, and validation processes across S3, Redshift, and DynamoDB, improving pipeline execution time and data reliability.
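
A minimal sketch of one such Python ingestion/validation step using boto3; the bucket, object key, DynamoDB table, and validation rule are hypothetical.

import csv
import io

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
status_table = dynamodb.Table("pipeline_run_status")        # hypothetical status table

# Read a raw CSV extract from S3 (hypothetical bucket/key).
obj = s3.get_object(Bucket="example-raw-bucket", Key="orders/2020-01-01.csv")
rows = list(csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8"))))

# Simple validation rule: keep only rows that carry a primary key.
valid = [r for r in rows if r.get("order_id")]

# Record the run outcome for downstream monitoring.
status_table.put_item(Item={
    "run_id": "orders-2020-01-01",
    "rows_read": len(rows),
    "rows_valid": len(valid),
})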

Developed reusable Python modules to support ETL pipelines, including custom logging, error handling, and alerting mechanisms for increased reliability and maintainability.

Used Python with Pandas and PySpark to perform complex data wrangling, feature engineering, and integration of structured and semi-structured datasets for downstream analytics.

Implemented REST API data ingestion and batch processing scripts in Python, enabling scalable data acquisition from third-party systems and internal services.

Wrote advanced SQL queries for data aggregation, transformation, and cleansing across Redshift, Snowflake, and BigQuery to support data modeling and analytics.

Optimized SQL queries and database performance by analyzing execution plans, indexing strategies, and applying partitioning and clustering techniques.

Created and maintained views, stored procedures, and UDFs in SQL to support business logic encapsulation and simplify complex transformations within ETL workflows.

Automated orchestration workflows using Apache Airflow and AWS Step Functions, reducing manual data handoffs and improving process continuity.

Integrated Snowflake as a cloud-native data warehouse for high-performance querying, enabling scalable analytics and seamless data sharing across the AWS environment.

Automated data ingestion from S3 and Redshift into Snowflake using Snowpipe and custom ETL scripts, enabling near real-time analytics and reducing data pipeline latency.

Optimized Snowflake performance using clustering keys, result caching, and materialized views, significantly improving query efficiency and dashboard responsiveness.
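
A minimal sketch of the clustering-key and materialized-view tuning described above, issued through the Snowflake Python connector; the account, credentials, and table/view names are hypothetical.

import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account", user="etl_user", password="...",   # hypothetical credentials
    warehouse="ANALYTICS_WH", database="SALES", schema="PUBLIC")

cur = conn.cursor()
try:
    # Cluster the large fact table on the columns most dashboards filter by.
    cur.execute("ALTER TABLE sales_fact CLUSTER BY (sale_date, region)")
    # Precompute a daily rollup so dashboards avoid scanning the raw fact table.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS daily_sales_mv AS
        SELECT sale_date, region, SUM(amount) AS total_amount
        FROM sales_fact
        GROUP BY sale_date, region
    """)
finally:
    cur.close()
    conn.close()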

Cut data infrastructure costs by 40% through automated CI/CD deployment of ETL jobs using AWS CodePipeline, AWS CodeBuild, and dynamic scaling of EMR clusters and EC2-based data workloads.

Managed Dockerized PySpark and Kafka stream processing jobs on Amazon EKS, optimizing resource use for real-time ingestion pipelines. Implemented Blue/Green deployments for Spark jobs and Airflow DAGs, ensuring zero-downtime job updates.

Secured sensitive datasets and metadata by integrating AWS Secrets Manager, responding to GuardDuty data threats, and conducting vulnerability scans on data systems using Qualys.

Enhanced observability and data pipeline visibility using Datadog, AWS CloudWatch, and LogicMonitor, monitoring real-time data ingestion, transformation, and processing across AWS Glue, EMR, and Redshift environments.

Used Datadog, Dynatrace, and AWS CloudWatch to monitor ETL pipeline performance, BigQuery/Redshift latency, and data quality metrics, achieving a 30% reduction in mean time to pipeline recovery.

Improved incident handling for data ingestion failures and pipeline anomalies using enhanced monitoring tools, data lineage tracking (via AWS Glue Data Catalog), and cross-team resolution workflows.

Environment: AWS (EMR, S3, Redshift, DynamoDB, EKS, Glue, CloudWatch, Secrets Manager, CodePipeline, CodeBuild, Step Functions, Glue Data Catalog, Snowpipe, GuardDuty), Azure DevOps, OpenStack, Apache Spark, Apache Airflow, Kafka, Hadoop, Snowflake, BigQuery, Terraform, Python (Pandas, PySpark, REST APIs), SQL, Datadog, Dynatrace, LogicMonitor, JIRA, Confluence, Microsoft Teams, Microsoft Word, Excel, Visio, Qualys

Adidas America – Portland, OR Mar 2017 – Nov 2018

AWS Data Engineer

Responsibilities:

Leveraged AWS Kinesis and Apache Kafka for real-time ingestion of customer clickstream data.

Used PySpark, AWS Glue, and Redshift for transforming the raw data into features for recommendation models.

Delivered personalized product recommendations using Machine Learning models (e.g., TensorFlow or scikit-learn) based on user behavior and preferences.

Automated data pipelines to forecast sales using historical data, leveraging AWS Redshift and Snowflake as centralized data warehouses.

Built ETL workflows using Apache Airflow to move data from transaction logs, product catalogs, and warehouse systems into cloud data lakes (e.g., Amazon S3).

Used Spark and Python to develop and deploy predictive models that help businesses forecast demand and optimize inventory.

Used AWS Glue and Snowflake to create centralized customer profiles by transforming raw transactional data into structured datasets.

Leveraged SQL and Python to segment customers by purchasing behavior, demographics, and browsing history, enabling personalized marketing campaigns.

Created automated reporting solutions with Power BI or Tableau, connecting directly to data warehouses like Redshift or Snowflake for real-time business insights.

Used AWS Glue and ETL scripts in Python to collect, clean, and transform order data, inventory levels, and shipment tracking information from multiple sources.

Built dashboards using Amazon QuickSight or Power BI to monitor key supply chain metrics, including delivery times, stock levels, and order fulfillment rates.

Implemented AWS Step Functions and Apache Airflow to automate order processing workflows, from purchase to delivery.

Implemented AWS Kinesis and Apache Flink for real-time fraud detection by analyzing transaction patterns and behavior.

Used AWS Lambda and PySpark to transform raw transaction data and trigger alerts in case of suspicious activity.

Built and deployed fraud detection models using scikit-learn and TensorFlow to continuously monitor incoming transactions for anomalous behavior.
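
A minimal scikit-learn sketch of this kind of anomaly-based fraud screening using an Isolation Forest; the feature layout and sample values are hypothetical.

import numpy as np
from sklearn.ensemble import IsolationForest

# Each row: [amount, seconds_since_last_txn, distance_from_home_km] (hypothetical features)
transactions = np.array([
    [25.0,   3600, 2.1],
    [40.0,   7200, 1.4],
    [35.0,   5400, 3.0],
    [9800.0,   30, 850.0],   # looks anomalous
])

model = IsolationForest(contamination=0.1, random_state=42)
model.fit(transactions)

flags = model.predict(transactions)        # -1 = anomaly, 1 = normal
for txn, flag in zip(transactions, flags):
    if flag == -1:
        print("Review transaction:", txn)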

Used AWS Comprehend or PySpark to analyze customer reviews, chat logs, and social media data for sentiment analysis.

Stored unstructured data such as customer feedback in Amazon S3 and transformed it using AWS Glue for analysis.

Integrated sentiment and feedback data into Redshift or Snowflake to analyze customer satisfaction trends, providing actionable insights for product teams.

Used AWS Glue to aggregate competitive pricing data and historical sales data for products.

Leveraged PySpark and AWS Lambda for real-time adjustments to pricing strategies based on market trends, competitor pricing, and demand elasticity.

Implemented AWS CloudWatch to monitor the pricing model's effectiveness and create dashboards for key stakeholders.

Environment: Apache Kafka, Apache Spark, PySpark, Apache Flink, Apache Airflow, Snowflake, Redshift, AWS Kinesis, AWS Glue, Amazon S3, AWS Lambda, AWS Step Functions, Amazon QuickSight, AWS CloudWatch, AWS Comprehend, TensorFlow, scikit-learn, Power BI, Tableau, Python, SQL

EDUCATION:

Master’s in Computer Science – Governors State University.

Bachelor’s in Computer Science – PSCMR College of Engineering and Technology.


