
Data Engineer Azure

Location:
Dallas, TX
Posted:
July 17, 2025

Tarun Teja Pasupuleti

Frisco, TX 940-***-**** *****************@*****.*** LinkedIn

SUMMARY

Data Engineer with over 5 years of hands-on experience designing and implementing data pipelines and ETL/ELT frameworks across multi-cloud platforms (AWS, Azure, and GCP). Proficient in Python, SQL, and PySpark and in Big Data tools such as Hive, Kafka, Spark, and Snowflake, with a record of building robust data lakes and lakehouses. Experienced in infrastructure automation and orchestration with Docker, Kubernetes, and Terraform, in cloud migrations, and in data visualization with Power BI and Tableau. Focused on data engineering solutions that provide seamless data access and strengthen system efficiency.

KEY SKILLS

•GCP Services: BigQuery, Cloud Storage (GCS), Cloud Functions, Cloud SQL, Cloud Pub/Sub, Cloud Dataflow, Cloud Shell SDK, Google Cloud SDK, Bigtable, Cloud Dataproc, Apache Beam

•AWS Services: S3, EC2, EMR, Redshift, RDS, Lambda, Kinesis, SNS, SQS, AMI, IAM, CloudFormation

•Azure Services: Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Azure Data Lake Storage, Azure SQL Database, Azure Analysis Services, Azure Synapse Pipelines, Synapse Studio, Azure Data Lake Analytics

•Hadoop / Big Data: HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, Zookeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Airflow, Snowflake, Spark Components

•Databases: Oracle, Microsoft SQL Server, MySQL, DB2, Teradata

•Programming Languages: Java, Scala, Python, T-SQL, U-SQL, Spark SQL

•Web Servers: Apache Tomcat, WebLogic

•IDEs: Eclipse, Dreamweaver, Visual Studio

•NoSQL Databases: HBase, Cassandra, MongoDB, Bigtable

•Methodologies: Agile (Scrum), Waterfall, UML, Design Patterns, SDLC

•Currently Exploring: Apache Flink, Apache Drill, Tachyon (now Alluxio)

•Cloud Services: AWS, Azure (Azure Data Factory for ETL/ELT, SSIS, Azure Data Lake Storage, Azure Databricks), GCP

•ETL Tools: Talend Open Studio, Talend Enterprise Platform, CI/CD, Apache Beam, Snowpipe

•Reporting and ETL Tools: Tableau, Power BI, AWS Glue, SSIS, SSRS

•DevOps & Containerization: Docker, Kubernetes, Jenkins, Terraform, Bamboo, Bitbucket, Git, Gitlab

•Monitoring & Analytics: Elasticsearch, Kibana, Logstash, Filebeat

WORK HISTORY

Western Union Feb 2024 - Present

GCP Data Engineer Austin, Texas, USA

•Developed and maintained ETL data pipelines, performing analytics on Hive data with the Spark API over Hortonworks Hadoop YARN, validating data, and generating detailed reports with Power BI, while integrating Dataproc with BigQuery, GCS, and Cloud Pub/Sub to support near real-time processing and improve data accessibility and decision-making.

•Engineered scalable data ingestion processes by implementing Apache Sqoop for efficient bulk data transfers between Hadoop and Oracle, and built a Python and Apache Beam program for rigorous data validation between raw source files and BigQuery tables (see the Beam validation sketch after this list).

•Streamlined data processing and storage using GCP Dataproc, GCS, Cloud Functions, Cloud SQL, and BigQuery, enhancing overall system efficiency and reliability.

•Optimized query performance in BigQuery by configuring the Cloud Shell SDK in GCP to manage services like Dataproc and by applying partitioning and clustering on high-volume tables (see the BigQuery partitioning sketch after this list).

•Automated data delivery by developing PySpark scripts to push data from GCP to third-party vendor APIs.

•Enhanced performance monitoring by estimating cluster size and tracking Spark Databricks clusters using the Spark DataFrame API, improving data manipulation within Spark sessions.

•Executed and deployed data validation jobs in Cloud Dataflow through a Python and Apache Beam program, while configuring essential services (Cloud Dataproc, Google Cloud Storage, Cloud BigQuery) using Cloud Shell SDK.

•Managed Databricks clusters with autoscaling and spot instances to reduce costs and optimize performance, resulting in significant cost savings and improved resource utilization.

•Collaborated with cross-functional teams to design and implement robust data pipelines using Apache Flink, thereby enhancing overall data processing capabilities.

•Integrated Python ORM classes with SQLAlchemy to update database records (see the SQLAlchemy sketch after this list) and monitored GCP services using the Google Cloud SDK for expedited incident resolution and system reliability.

•Optimized ETL processes to load and refresh Oracle Data Warehouse data, thereby enhancing accessibility and reliability.

•Coordinated data ingestion from Azure services by leveraging Azure Data Factory, T-SQL, Spark SQL, and U-SQL with Azure Data Lake Analytics, and migrated an entire Oracle database to BigQuery for effective reporting with Power BI.

•Integrated NoSQL databases such as MongoDB, Bigtable, and Cassandra to handle high-velocity and unstructured data, improving scalability and flexibility.

•Facilitated efficient data processing by utilizing Sqoop for importing/exporting raw data into Google Cloud Storage via Cloud Dataproc clusters.

•Implemented Airflow, Apache Beam, and BigQuery to orchestrate data processing and analytics, reinforcing system performance and efficiency.
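
The Beam validation job referenced above can be illustrated with a minimal, hypothetical sketch that reconciles per-date row counts between raw GCS files and the loaded BigQuery table. The project, bucket, table, and column names below are placeholders, not details from the actual engagement.

```python
# Hypothetical row-count reconciliation between raw GCS files and a BigQuery table.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    opts = PipelineOptions(runner="DataflowRunner", project="my-project",
                           region="us-central1", temp_location="gs://my-bucket/tmp")
    with beam.Pipeline(options=opts) as p:
        # Count records per business date in the raw files (header rows assumed excluded upstream).
        file_counts = (
            p
            | "ReadRaw" >> beam.io.ReadFromText("gs://my-bucket/raw/orders-*.csv")
            | "KeyByDate" >> beam.Map(lambda line: (line.split(",")[0], 1))
            | "CountFile" >> beam.CombinePerKey(sum))

        # Count rows per business date in the loaded BigQuery table.
        bq_counts = (
            p
            | "ReadBQ" >> beam.io.ReadFromBigQuery(
                query="SELECT CAST(order_date AS STRING) AS order_date, "
                      "COUNT(*) AS cnt FROM `my-project.sales.orders` GROUP BY 1",
                use_standard_sql=True)
            | "KeyBQ" >> beam.Map(lambda row: (row["order_date"], row["cnt"])))

        # Join the two counts and keep only dates where they disagree.
        ({"file": file_counts, "bq": bq_counts}
         | "Join" >> beam.CoGroupByKey()
         | "Mismatches" >> beam.Filter(lambda kv: sum(kv[1]["file"]) != sum(kv[1]["bq"]))
         | "Log" >> beam.Map(lambda kv: f"count mismatch for {kv[0]}: {kv[1]}")
         | "Write" >> beam.io.WriteToText("gs://my-bucket/validation/mismatches"))

if __name__ == "__main__":
    run()
```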
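
The BigQuery partitioning and clustering work can likewise be sketched with the Python client library, assuming a date-partitioned fact table clustered on a frequent filter column; all identifiers here are illustrative.

```python
# Illustrative creation of a partitioned, clustered BigQuery table.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

table = bigquery.Table("my-project.analytics.transactions", schema=schema)
# Partition by day on the event timestamp so queries can prune whole partitions.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_ts")
# Cluster within each partition on the most common filter/join column.
table.clustering_fields = ["customer_id"]

client.create_table(table, exists_ok=True)
```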
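
For the SQLAlchemy ORM updates, a minimal sketch of the load-mutate-commit pattern might look like the following; the Customer model and connection string are assumptions for illustration only.

```python
# Minimal SQLAlchemy ORM update sketch; table, columns, and DSN are placeholders.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    status = Column(String(20))

engine = create_engine("postgresql+psycopg2://user:pass@host:5432/crm")
Session = sessionmaker(bind=engine)

def mark_inactive(customer_id: int) -> None:
    """Load a row through the ORM, mutate it, and commit the change."""
    with Session() as session:
        customer = session.get(Customer, customer_id)
        if customer is not None:
            customer.status = "inactive"
        session.commit()
```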

Frito-Lay Apr 2023 - Jan 2024

AWS Data Engineer Plano, Texas, USA

•Designed and maintained Airflow DAGs for scheduling data ingestion and ETL jobs, which streamlined data processing and enhanced production-level orchestration across pipelines (see the Airflow DAG sketch after this list).

•Monitored system metrics and logs to diagnose and resolve issues in Hadoop Cluster management, ensuring optimal performance and stability.

•Accelerated EMR cluster launch by 70% and optimized Hadoop job processing by 60% through effective use of Boto3 for seamless file transfers to S3 (see the boto3 sketch after this list), improving overall system efficiency.

•Integrated Big Data technologies including Hadoop, SOLR, PySpark, Kafka, and Storm to support robust analytics and enhanced data processing workflows.

•Enhanced SQL performance by analyzing and redesigning scripts with Spark SQL to achieve faster query execution.

•Managed the full Software Development Lifecycle (SDLC) by gathering requirements, designing, developing, deploying, and analyzing applications, ensuring project alignment with business goals.

•Developed T-SQL scripts to manage instance-level objects and optimized performance, while implementing a CI/CD pipeline with Docker and Jenkins for custom application image deployments in the cloud.

•Facilitated data movement by utilizing Sqoop to transfer data from multiple relational databases to Hadoop Distributed File System, and optimized both batch and streaming processes with Dataflow and Pub/Sub to achieve near real-time synchronization.

•Implemented automated monitoring and alerting systems using Kubernetes and Docker to proactively identify and resolve data pipeline exceptions, and designed Elasticsearch index schemas for scalable search and analytics over structured and unstructured data (see the index-mapping sketch after this list).

•Migrated legacy system metrics by developing SAS scripts and transferring them to Snowflake through AWS S3, improving data accessibility and analytical capabilities.

•Captured clickstream data by managing Flume configurations, thereby enhancing real-time data insights.

•Leveraged a suite of technologies including Airflow, AWS, Docker, EMR, Glue, Hive, Jenkins, Kafka, Kubernetes, Lambda, PySpark, Redshift, S3, SAS, Snowflake, Spark, and SQL to streamline comprehensive data engineering processes.
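
The Airflow orchestration mentioned in this role can be illustrated with a bare-bones ingestion-then-transform DAG; the DAG id, schedule, and Python callables are placeholders rather than the production definitions.

```python
# Skeleton of an ingestion-then-transform Airflow DAG (Airflow 2.x style).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_raw_files(**context):
    """Placeholder: pull the day's raw files into the landing zone."""
    ...

def run_etl(**context):
    """Placeholder: transform landed data and load the warehouse tables."""
    ...

with DAG(
    dag_id="daily_sales_ingestion",
    start_date=datetime(2023, 5, 1),
    schedule_interval="0 3 * * *",   # run every day at 03:00
    catchup=False,
    default_args={"retries": 2},
) as dag:
    ingest = PythonOperator(task_id="ingest_raw_files", python_callable=ingest_raw_files)
    transform = PythonOperator(task_id="run_etl", python_callable=run_etl)
    ingest >> transform   # transform only runs after ingestion succeeds
```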
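
The Boto3-based S3 transfers might resemble this minimal upload helper; the bucket, prefix, and local path are hypothetical, and credentials are assumed to come from the environment or an instance profile.

```python
# Minimal boto3 helper that mirrors a local directory into an S3 prefix.
import os
import boto3

s3 = boto3.client("s3")

def upload_directory(local_dir: str, bucket: str, prefix: str) -> None:
    """Upload every file under local_dir to s3://bucket/prefix/, preserving names."""
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            key = f"{prefix}/{os.path.relpath(path, local_dir)}"
            # upload_file streams large objects as multipart uploads automatically.
            s3.upload_file(path, bucket, key)

upload_directory("/data/exports", "my-landing-bucket", "incoming/sales")
```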
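
An Elasticsearch index schema of the kind described could be created as below (elasticsearch-py 8.x style); the index name, fields, and shard counts are assumptions.

```python
# Illustrative index mapping mixing structured (keyword/date) and full-text fields.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="pipeline-events",
    settings={"number_of_shards": 3, "number_of_replicas": 1},
    mappings={
        "properties": {
            "pipeline":  {"type": "keyword"},   # exact-match filtering and aggregation
            "timestamp": {"type": "date"},
            "level":     {"type": "keyword"},
            "message":   {"type": "text"},      # full-text search over log text
        }
    },
)
```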

Mizuho Jun 2021 - Jul 2022

Azure Data Engineer Mumbai, India

•Developed ETL processes (Data Stage Open Studio) to load data from multiple data sources into HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.

•Ensured data quality and accuracy with custom SQL and Hive scripts and created data visualizations using Python and Tableau for improved insights and decision-making.

•Designed, developed, and maintained scalable and reliable data integration and ETL/ELT pipelines using Microsoft Azure data services, including Azure Data Factory, Azure Databricks, and Azure Synapse Pipelines, to ingest and process data from diverse sources, improving data flow and accessibility.

•Leveraged Apache Spark (using PySpark or Scala) within Azure Databricks and Azure Synapse Analytics (Spark pools) to process and transform large datasets, facilitating advanced analytics and machine learning initiatives (see the Databricks/ADLS sketch after this list).

•Developed and executed complex SQL queries against Azure SQL Database and Azure Synapse Analytics to extract, transform, and load data, ensuring data quality and integrity for various downstream applications and users, improving data accessibility.

•Developed Spark Streaming programs to process real-time data from Kafka, applying both stateless and stateful transformations, which enhanced data processing efficiency (see the Kafka streaming sketch after this list). Assessed infrastructure needs for each application and deployed them on the Azure platform.

•Created Databricks job workflows that extract data from SQL Server and upload the files to SFTP using PySpark and Python. Analyzed data using the Hadoop components Hive and Pig.

•Worked with Spark Core, Spark ML, Spark Streaming, and Spark SQL on Databricks, optimizing data processing. Managed Kafka message partitioning and set up replication factors in the Kafka cluster, enhancing data reliability.

•Created data tables using PyQt to display customer and policy information and to add, delete, and update customer records.

•Used Python for data validation and analysis purposes. Successfully completed a POC for Azure implementation, with the larger goal of migrating on-premises servers and data to the cloud.

•Participated in all phases of the Software Development Life Cycle (SDLC), from implementation to deployment, ensuring seamless integration and delivery of software solutions.

•Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS Packages). Developed and maintained Azure Analysis Services models, creating measures, dimensions, and hierarchies to enhance business intelligence and data analytics capabilities

•Used Bitbucket for source control to push code and Bamboo as the deployment tool to build the CI/CD pipeline.

•Dockerized applications by building Docker images from Dockerfiles and collaborated with the development support team to set up a continuous deployment environment using Docker. Optimized Elasticsearch cluster performance through shard tuning, heap memory management, refresh interval adjustments, and query profiling.

•Utilized Elasticsearch and Kibana for indexing and visualizing the real-time analytics results, enabling stakeholders to gain actionable insights quickly.

•Built data models and curated datasets in Synapse Studio for enterprise-level reporting and machine learning use cases.

•Applied Snowflake utilities, SnowSQL, and Snowpipe along with Big Data modeling techniques in Python to streamline data processing and improve job development efficiency, using stages such as Transformer, Aggregator, Merge, Join, Lookup, Sort, Remove Duplicates, Funnel, Filter, and Pivot.

•Leveraged technologies such as Azure Analysis Services, Docker, and Spark SQL to optimize data processing workflows and enhance data analysis capabilities.
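
The Kafka stream processing described in this role can be sketched with PySpark Structured Streaming (one of several possible APIs); the broker, topic, message schema, and the windowed count standing in for the stateful step are illustrative assumptions.

```python
# Minimal Kafka-to-windowed-aggregate sketch with PySpark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("account_id", StringType()),
    StructField("event_type", StringType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "transactions")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Stateful transformation: count events per account over 5-minute windows.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window(col("event_time"), "5 minutes"), col("account_id"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```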
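
The Databricks/Synapse Spark transformations over ADLS Gen2 might follow a pattern like the one below, reading raw Parquet and writing a curated, partitioned Delta table; the storage account, containers, and columns are placeholders.

```python
# Illustrative Databricks-style PySpark transform: raw ADLS Parquet -> curated Delta.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.getOrCreate()

raw = spark.read.format("parquet").load(
    "abfss://raw@mystorageacct.dfs.core.windows.net/trades/")

curated = (raw
           .where(col("status") == "SETTLED")
           .withColumn("trade_date", to_date(col("trade_ts")))
           .select("trade_id", "account_id", "trade_date", "notional"))

# Write the cleaned data as a partitioned Delta table for downstream Synapse/BI use.
(curated.write
 .format("delta")
 .mode("overwrite")
 .partitionBy("trade_date")
 .save("abfss://curated@mystorageacct.dfs.core.windows.net/trades/"))
```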

API Holdings Jan 2020 - May 2021

Data Engineer Mumbai, India

•Developed a reusable framework, to be leveraged for future migrations, that automates ETL from RDBMS systems to the data lake using Spark Data Sources and Hive data objects.

•Automated ingestion and transformation of streaming and batch data with an AWS Lambda + Glue + S3 pipeline (see the Lambda/Glue sketch after this list).

•Implemented partitioning, dynamic partitions, and buckets in Hive, optimizing data retrieval and storage efficiency (see the Hive partitioning sketch after this list).

•Designed and developed data ingestion, aggregation, and integration processes in the Hadoop environment, improving data accessibility and integration.

•Utilized Kafka as a messaging system to implement real-time streaming solutions with Spark Streaming, enhancing data processing capabilities.

•Created several Databricks Spark jobs with PySpark to perform table operations, improving data processing efficiency and accuracy.

•Developed Spark code in Python (PySpark), Scala, and Spark SQL, enhancing data processing speed and reliability.

•Designed the business requirement collection approach based on project scope and SDLC methodology, ensuring comprehensive understanding and alignment with project goals.

•Developed dashboards and visualizations using SQL Server Reporting Services (SSRS) and Power BI, enabling business users to analyze data effectively and providing actionable insights to upper management.

•Worked with Docker to enhance the Continuous Delivery (CD) framework, streamlining release processes and improving deployment efficiency.

•Implemented Apache Airflow for workflow automation and task scheduling, and created DAGs and their tasks.

•Developed and automated data migration pipelines using Python, Apache Airflow, and GCP services, ensuring data consistency and minimizing downtime during cutover. Conducted performance tuning and optimization of Kubernetes and Docker deployments to improve overall system performance.

•Built robust data ingestion pipelines using Logstash, Filebeat, and Kafka Connect to stream real-time logs and events into Elasticsearch clusters. Automated and monitored AWS infrastructure with Terraform for high availability and reliability, reducing infrastructure management time by 90% and improving system uptime.

•Conducted query optimization and performance tuning tasks, such as query profiling, indexing, and utilizing Snowflake's automatic clustering to improve query response times and reduce costs. Performed end-to-end Architecture & implementation assessment of various AWS services like Amazon EMR, Redshift, S3.
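
The Lambda + Glue + S3 pipeline noted above can be sketched as an S3-triggered Lambda handler that starts a Glue job run; the job name, argument key, and bucket layout are assumptions for illustration.

```python
# Hypothetical Lambda handler: when a new object lands in S3, start a Glue ETL job on it.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Triggered by an S3 ObjectCreated notification."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Pass the new object's location to the Glue job as a job argument.
        glue.start_job_run(
            JobName="raw_to_curated_etl",
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
    return {"status": "started"}
```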
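
The Hive partitioning and bucketing work can be illustrated with a dynamic-partition insert executed through a Hive-enabled Spark session; database, table, and column names are placeholders.

```python
# Illustrative dynamic-partition and bucketing pattern via Hive-enabled Spark SQL.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-demo")
         .enableHiveSupport()
         .getOrCreate())

# Allow Hive to create partitions on the fly from the data being inserted.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders_part (
        order_id BIGINT,
        customer_id STRING,
        amount DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS ORC
""")

# Dynamic-partition insert: Hive derives order_date partitions from the SELECT.
spark.sql("""
    INSERT OVERWRITE TABLE sales.orders_part PARTITION (order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM sales.orders_staging
""")
```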

EDUCATION

University of North Texas, Texas, USA

Sep 2022 - Dec 2023

Master's, Computer Science


