Data Engineer

Location:
Houston, TX
Salary:
65
Posted:
October 29, 2025

Resume:

Sumanth.B

Data Engineer

Email: **************@*****.***

Mobile: +1-713-***-****

LinkedIn: LinkedIn

PROFESSIONAL SUMMARY:

Highly skilled Azure Data Engineer with 8+ years of experience in designing, developing, and maintaining data pipelines and cloud-based architectures across diverse domains.

Extensive hands-on experience with the Microsoft Azure ecosystem, including Azure Cosmos DB, Azure Data Factory (ADF), Synapse Analytics, Azure Data Lake, Azure Monitor, and Azure Automation.

Extensive experience working with Base SAS (Macros, PROC SQL, ODS), SAS/ACCESS, and SAS/GRAPH in UNIX, Windows, and Mainframe environments.

Hands-on experience with Google Cloud Platform (GCP) services including BigQuery, DataFlow, DataStream, Pub/Sub, Cloud Functions, Cloud Run, and Cloud Storage for large-scale data engineering solutions.

Proficient in PowerShell scripting and Azure CLI for automating deployments, resource provisioning, data pipeline monitoring, and infrastructure management.

Experience in Data Analysis, Data Profiling, Data Integration, Data Migration, Data Governance, Metadata Management, MDM, and Configuration Management.

Strong expertise in writing complex SQL queries, developing Spark (Java/Scala) applications, and automating workflows using Unix/Shell scripting.

Proficient in Databricks for large-scale data processing, and experienced with AWS (EMR, S3, Glue) and Azure (Data Factory, Data Lake, Synapse).

Experience in UNIX Korn Shell scripting for process automation.

Thorough knowledge in SAS Programming, using Base SAS, Macro Facility, Proc SQL, SAS Procedures, SAS Functions, SAS Formats, and ODS facility in data scrubbing, manipulation, and preparation to produce summary datasets and reports.

Developed and consumed REST-based web services, integrating with relational databases such as MySQL and SQL Server.

Experienced in architecting and implementing globally distributed data solutions using Azure Cosmos DB, optimizing partitioning, indexing, and Request Unit (RU) consumption for cost efficiency and high performance.

Worked extensively with PySpark, leveraging SparkContext, the DataFrame API, Spark Streaming, and Pair RDDs to improve the efficiency and optimization of existing Hadoop workloads.

Good understanding of Spark Architecture including Spark Core, Spark SQL, Data Frames, Spark Streaming, Driver Node, Worker Node, Stages, Executors, and Tasks.

Strong background in data partitioning, clustering, and IAM-based security models in BigQuery.
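
As a minimal illustration of the BigQuery partitioning and clustering pattern referenced above, the sketch below creates a date-partitioned, clustered table with the google-cloud-bigquery client; the project, dataset, table, and column names are hypothetical placeholders, not details from any specific engagement.

# Illustrative sketch: create a date-partitioned, clustered BigQuery table.
# Project, dataset, table, and schema are assumptions for demonstration only.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

table = bigquery.Table(
    "example-project.analytics.transactions",
    schema=[
        bigquery.SchemaField("txn_id", "STRING"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
        bigquery.SchemaField("event_date", "DATE"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(field="event_date")  # prune scans by date
table.clustering_fields = ["customer_id"]                                # co-locate rows per customer

client.create_table(table, exists_ok=True)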

Experience writing SQL queries, as well as in data integration and performance tuning.

Hands-on experience with AWS components such as EMR, EC2, S3, RDS, IAM, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, Redshift, and DynamoDB, ensuring a secure environment within the AWS public cloud.

Expertise in Azure infrastructure management (Azure Web Roles, Worker Roles, SQL Azure, Azure Storage).

Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Good understanding of Spark architecture with Databricks and Structured Streaming, including setting up Databricks on AWS and Microsoft Azure.

Adept at integrating PowerShell automation with Azure DevOps pipelines to streamline CI/CD processes and reduce manual effort.

Experienced in building Snowpipe pipelines, with in-depth knowledge of Snowflake Data Sharing and Snowflake database, schema, and table structures.

Collaborative and detail-oriented professional with proven ability to lead data initiatives, mentor junior engineers, and deliver high-quality cloud solutions aligned with business goals.

Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, and SQL Server.

Professional Experience

PayPal, Austin, TX Jul 2024 – Present

Sr. Cloud Data Engineer

Designed and implemented globally distributed data solutions using Azure Cosmos DB, optimizing partitioning, indexing, and RU consumption for high-performance analytics.
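
A minimal sketch of the kind of Cosmos DB container provisioning described here, using the azure-cosmos Python SDK; the account URL, database and container names, partition-key path, indexing paths, and throughput value are illustrative assumptions rather than details from this role.

# Illustrative sketch: provision a Cosmos DB container with an explicit partition
# key, a scoped indexing policy, and provisioned RU throughput. All names and
# values are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists(id="payments")

indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/transactionDate/?"}, {"path": "/merchantId/?"}],
    "excludedPaths": [{"path": "/*"}],  # index only queried paths to cut RU cost
}

container = db.create_container_if_not_exists(
    id="transactions",
    partition_key=PartitionKey(path="/merchantId"),  # spreads writes across logical partitions
    indexing_policy=indexing_policy,
    offer_throughput=4000,                            # provisioned RUs; tune to workload
)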

Built and optimized Spark (Scala/PySpark) jobs in Azure Databricks for ETL and analytics workflows, integrating with AWS S3 and Azure Data Lake.
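
The following is a hedged PySpark sketch of this S3-to-ADLS ETL pattern in a Databricks-style environment; the bucket, storage account, container, column names, and the choice of Delta as the output format are assumptions for illustration.

# Illustrative PySpark ETL sketch: read raw events from S3, apply a simple
# transformation, and write curated Delta data to ADLS. Paths and columns are
# hypothetical; Delta support is assumed (e.g., on Databricks).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("s3_to_adls_etl").getOrCreate()

raw = spark.read.json("s3a://example-raw-bucket/events/")

curated = (
    raw.filter(F.col("event_type") == "payment")
       .withColumn("event_date", F.to_date("event_ts"))
       .dropDuplicates(["event_id"])
)

(curated.write
        .format("delta")
        .mode("overwrite")
        .partitionBy("event_date")
        .save("abfss://curated@exampleadls.dfs.core.windows.net/payments/"))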

Built and deployed data pipelines on GCP using DataFlow and Pub/Sub for real-time ingestion of transaction data.
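
As a small sketch of the Pub/Sub side of such real-time ingestion, the snippet below publishes a transaction message that a downstream Dataflow job could consume; the project, topic, attribute, and message fields are hypothetical.

# Illustrative sketch: publish transaction events to a Pub/Sub topic for
# downstream streaming consumers. Project, topic, and payload are placeholders.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "transactions")

def publish_transaction(txn: dict) -> None:
    # Pub/Sub payloads are bytes; a downstream Dataflow job can decode and load them.
    data = json.dumps(txn).encode("utf-8")
    future = publisher.publish(topic_path, data=data, source="payments-api")
    future.result()  # block until the message is accepted by Pub/Sub

publish_transaction({"txn_id": "t-123", "amount": 42.50, "currency": "USD"})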

Automated deployment, monitoring, and scaling of Cosmos DB collections and stored procedures using PowerShell scripts and Azure CLI.

Implemented real-time data ingestion workflows using Snowpipe, enabling near-instantaneous processing of chargeback transactions and enhancing decision-making capabilities.

Built PowerShell-based scripts for automated data validation, schema synchronization, and log analysis, improving deployment efficiency by 40%.

Developed real-time data ingestion pipelines using Azure Data Factory and Snowflake to streamline transactional data processing from Braintree’s payment systems.

Built microservices on AWS ECS integrated with Kafka for event-driven architecture, enabling near real-time data flow.

Implemented Cosmos DB multi-region replication and backup strategies ensuring high availability and fault tolerance.

Extracted, transformed, and loaded data from source systems to Snowflake Data Warehouse using a combination of Snowflake's native functionalities, SQL scripts, and Snowpipe for continuous data loading.
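
The sketch below shows a staged PUT/COPY INTO load through the Snowflake Python connector, which is the same COPY pattern Snowpipe automates for continuous loading; the connection parameters, stage, file, and table names are placeholder assumptions.

# Illustrative sketch: stage a local file and load it into a Snowflake table
# with COPY INTO. Credentials, stage, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",
    user="etl_user",
    password="***",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="RAW",
)

try:
    cur = conn.cursor()
    cur.execute("PUT file:///tmp/chargebacks.csv @raw_stage AUTO_COMPRESS=TRUE")
    cur.execute("""
        COPY INTO raw.chargebacks
        FROM @raw_stage
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
finally:
    conn.close()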

Designed and implemented effective semi-structured data models using Snowflake, ensuring optimal performance and scalability for various analytical requirements.

Collaborated with Data Scientists to operationalize machine learning models by integrating them into production pipelines using AWS SageMaker.

Collaborated with Data Science team to incorporate Machine Learning with Snowpark into the pipeline, reducing latency and improving the efficiency of the overall pipeline.

Implemented Snowflake batch jobs using Snowflake's tasks and event notifications, integrated with Azure Monitor and Slack, ensuring timely issue resolution and performance monitoring of ETL pipelines.

Integrated PowerShell with Azure DevOps pipelines for automated build and deployment workflows across environments.

Developed SQL queries using SnowSQL and generated reports in Splunk based on Snowflake connections, facilitating data visualization and analysis.

Hands-on experience in designing, building, and maintaining ETL pipelines using Fivetran, Azure Data Lake, and Azure Data Factory.

Deployed and managed distributed Apache Spark clusters in Azure Databricks to process terabytes of data, ensuring high availability and fault tolerance for mission-critical data operations.

Utilized the Azure Service Fabric SDK for developing, deploying, and managing microservice applications, interacting with the Service Fabric cluster through APIs for monitoring and troubleshooting.

Set up threshold-based alerts in Grafana to monitor critical system metrics, reducing incident response time by 40% through timely notifications.

Configured role-based access control policies in Saviynt, streamlining permissions management and improving compliance with corporate security standards.

Developed and maintained ETL pipelines leveraging Azure Data Lake Storage (ADLS) to efficiently store, process, and analyze unstructured and structured data, supporting real-time analytics and machine learning models.

Designed and maintained scalable ETL pipelines using Python and optimized complex SQL queries to support high-volume data ingestion and transformation processes.

Developed reusable Python modules for data validation, anomaly detection, and schema enforcement to streamline ETL workflows.
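
As a hedged example of such a reusable validation module, the sketch below enforces a simple required schema and flags numeric outliers before loading; pandas is assumed as the in-memory layer, and the column names and z-score threshold are hypothetical.

# Illustrative validation helpers for an ETL step: enforce required columns
# and types, and flag amount outliers. Columns and thresholds are placeholders.
import pandas as pd

REQUIRED_COLUMNS = {"txn_id", "amount", "event_date"}

def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    out = df.copy()
    out["amount"] = pd.to_numeric(out["amount"], errors="raise")
    out["event_date"] = pd.to_datetime(out["event_date"], errors="raise")
    return out

def flag_amount_outliers(df: pd.DataFrame, z_threshold: float = 3.0) -> pd.DataFrame:
    # Flag rows whose amount deviates more than z_threshold std devs from the mean.
    z = (df["amount"] - df["amount"].mean()) / df["amount"].std(ddof=0)
    return df.assign(is_outlier=z.abs() > z_threshold)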

Deployed and configured Azure Virtual Machines (VMs) using ARM templates and Azure portal to meet project-specific requirements.

Designed and implemented backup strategies using Azure Backup and Azure Site Recovery to ensure high availability and disaster recovery for critical workloads running on Azure Virtual Machines.

Used Azure Data Migration Service (DMS) for assistance in migrating PL/SQL code to Azure SQL.

Leveraged Fivetran capabilities to maintain real-time data synchronization across multiple data sources, facilitating immediate insights and decision-making for business intelligence initiatives.

Deployed Fivetran to perform incremental data updates, reducing the load on source systems and optimizing the data replication process to ensure minimal latency and resource usage.

Created PowerShell scripts for managing cloud storage solutions, such as uploading, archiving, and versioning files in Azure Blob Storage.

Reduced mean time to detect (MTTD) and mean time to resolve (MTTR) by implementing actionable alerting mechanisms in Microsoft Insights.

Integrated Microsoft Insights with Azure Logic Apps and Azure Functions for automated data-driven workflows based on monitoring triggers.

Designed and implemented data solutions using NoSQL databases, accommodating diverse data types and structures.

Played a crucial role in the deployment of code across System Integration Testing (SIT), User Acceptance Testing (UAT), and Production platforms, adhering to DevOps practices and utilizing tools like Jira, Confluence, and Bitbucket for streamlined collaboration and version control.

Designed pipelines to convert raw data into Parquet, enabling efficient processing and analytics on large datasets.

Developed multiple Python DAGs and integrated all ETL jobs in Airflow, ensuring efficient workflow management.
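
A minimal Airflow DAG sketch of the extract-transform-load wiring described here; the DAG id, schedule, and task bodies are placeholder assumptions rather than the production jobs.

# Illustrative Airflow DAG: three Python tasks chained in sequence.
# Task bodies, schedule, and names are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    print("pull raw files from the landing zone")

def transform(**context):
    print("convert raw files to Parquet")

def load(**context):
    print("load curated Parquet into the warehouse")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load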

American Airlines, Austin, Texas Jul 2023 – Jun 2024

Cloud Data Engineer

Developed and maintained PowerShell scripts for provisioning Azure resources (Data Lake, SQL DB, and Key Vault), optimizing infrastructure automation.

Developed and scheduled Unix/Shell scripts to automate data ingestion pipelines and monitor ETL job health.

Implemented automated workflows in Azure Data Factory (ADF) integrating with Azure SQL Database, Synapse, and Power BI.

Designed Cosmos DB data models to store semi-structured retail transaction data, enabling low-latency access for analytics.

Designed and implemented ingestion pipelines using GCP Cloud Functions and Cloud Run, enabling serverless processing for supply chain metrics.
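
The snippet below is a hedged sketch of an HTTP-triggered Cloud Function for this kind of serverless ingestion; landing the payload in a Cloud Storage bucket is an assumed sink, and the bucket and object path are placeholders.

# Illustrative Cloud Function (Python, functions-framework): accept a JSON
# payload of supply-chain metrics and land it in Cloud Storage. Names are
# placeholders.
import json
from datetime import datetime, timezone

import functions_framework
from google.cloud import storage

@functions_framework.http
def ingest_metrics(request):
    payload = request.get_json(silent=True) or {}
    bucket = storage.Client().bucket("example-supply-chain-landing")
    blob_name = f"metrics/{datetime.now(timezone.utc):%Y/%m/%d/%H%M%S}.json"
    bucket.blob(blob_name).upload_from_string(
        json.dumps(payload), content_type="application/json"
    )
    return ("ok", 200)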

Architected and optimized data using Snowflake, ensuring high performance and scalability to handle large volumes of transactional data related to grocery delivery and pickup.

Collaborated with business analysts and retail teams to streamline data models in Snowflake, focusing on supply chain and inventory management for smaller store formats, improving decision-making.

Leveraged Snowflake's built-in features for data ingestion, such as Snowpipe for real-time data loading and Snowflake Connector for Azure Blob Storage, facilitating efficient data ingestion and processing workflows.

Leveraged PowerShell and Azure Monitor APIs to generate real-time system performance dashboards and alerts.

Deployed and monitored ML models using SageMaker endpoints, ensuring low-latency inference and high availability.

Replicated SAP tables into BigQuery using SAP SLT + GCP DataStream, streamlining downstream analytics.

Automated data loading and transformation pipelines using Azure Functions to trigger Snowpipe for real-time ingestion and transformation in Snowflake.

Utilized Snowflake Secure Data Sharing to provide real-time access to data across multiple Azure accounts and external partners without data replication.

Utilized Azure Data Share for secure and real-time access to data across multiple Azure subscriptions and external partners without data duplication.

Utilized Entra ID for user and group role management, ensuring seamless integration with Azure AD for identity governance and access management.

Leveraged Azure Virtual Machines (VMs), Azure Blob Storage, and Azure SQL Database to deploy, scale, and manage web applications and services.

Configured and managed Azure Managed Identities and Role-Based Access Control (RBAC) to ensure secure, streamlined authentication and fine-grained access control across Azure services.

Implemented data workflows using Azure Logic Apps and Data Factory, integrating Python-based scripts to automate data processing pipelines.

Created serverless APIs using Azure API Management and Azure Functions, integrating with Python back-end logic for scalable and cost-effective API solutions.
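
A small sketch of the Azure Functions back end for such an API, using the Python v1 programming model; the route parameters and response shape are assumptions, and API Management would sit in front of the function in the pattern described.

# Illustrative Azure Functions HTTP trigger (Python programming model v1).
# Query parameter and response shape are placeholders.
import json
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    item_id = req.params.get("id")
    if not item_id:
        return func.HttpResponse("missing 'id' query parameter", status_code=400)

    # In the real pipeline this would query a backing store; here we echo the input.
    body = json.dumps({"id": item_id, "status": "ok"})
    return func.HttpResponse(body, mimetype="application/json", status_code=200)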

Utilized Azure Event Hubs with Apache Kafka for streamlined event streaming and cluster management.

Created automation workflows using PowerShell to clean and transform log data prior to ingestion into Azure Synapse.

Integrated Azure Data Catalog with Azure Databricks and Azure SQL Database for metadata management and ETL workflows.

Integrated Python applications with Azure Monitor, Log Analytics, and Application Insights for monitoring, logging, and alerting.

Integrated Power BI with Azure SQL Database, Azure Synapse Analytics, and Azure Data Lake to visualize large datasets, providing stakeholders with dynamic reports and actionable insights.

Involved in the development of real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.

Automated ETL processes using Perl and PowerShell, reducing manual intervention and improving data accuracy.

Developed and maintained stored procedures, functions, and triggers in T-SQL to automate data transformation tasks.

Implemented data storage and retrieval solutions using Delta Lake format for optimized storage and faster query execution.

Deployed Microsoft Insights solutions to monitor and analyze performance across multiple Azure regions, ensuring high availability and compliance.

Used Microsoft Insights to monitor audit logs, track security events, and ensure compliance with standards such as GDPR and HIPAA.

Built Bash shell scripts to monitor and schedule data pipeline jobs, improving reliability and reducing downtime.

Orchestrated job scheduling and management using Autosys, ensuring timely execution of ETL processes to meet business SLAs, while also collaborating with stakeholders to gather requirements, analyze data integration needs, and ensure alignment with organizational objectives and data quality standards.

Created and scheduled Bash scripts for data ingestion pipelines using tools like Cron, improving job reliability and timeliness.

ACCOLITE, Hyderabad, India Jun 2020 – Dec 2022

Data Engineer

Performed migration of several databases, application servers, and web servers from on-premises environments to the Microsoft Azure cloud.

Created large datasets by combining individual datasets using various inner and outer joins in SAS/SQL and dataset merging techniques of SAS/BASE.

Developed Spring Boot microservices in Java for data ingestion pipelines, leveraging Kafka and RabbitMQ for event-driven messaging.

Created unit and integration tests for Java-based data engineering workflows to ensure reliability and maintainability.

Proficient in NoSQL databases including HBase, Cassandra, and MongoDB, and their integration with Hadoop clusters.

Created and optimized Spark applications in Java and Scala to process large-scale structured and semi-structured datasets.

Configured Spark Streaming to receive real-time data from Kafka and store it in HDFS, integrating with Snowflake.

Developed REST API ingestion pipelines on GCP to integrate external data sources with enterprise data lakes.

Leveraged Terraform to automate provisioning of GCP services for data engineering workflows.

Used SnowSQL to analyze data quality issues and improve the Snowflake schema and data modeling.

Worked with Data ingestion using Spark Streaming and Kafka.
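
As a hedged sketch of this Kafka ingestion pattern, the snippet below uses Spark Structured Streaming to consume a topic and land the stream as Parquet on HDFS; the broker, topic, paths, and trigger interval are illustrative assumptions.

# Illustrative Structured Streaming sketch: consume a Kafka topic and write the
# stream as Parquet files. Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka_ingest").getOrCreate()

events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("subscribe", "orders")
         .option("startingOffsets", "latest")
         .load()
         .select(F.col("key").cast("string"), F.col("value").cast("string"), "timestamp")
)

query = (
    events.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/raw/orders/")
          .option("checkpointLocation", "hdfs:///checkpoints/orders/")
          .trigger(processingTime="1 minute")
          .start()
)
query.awaitTermination()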

Implemented SQL queries and MySQL for backend data validation, ensuring data consistency across distributed systems.

Experienced in handling Terabytes of records and manipulating them through SAS.

Analyzed existing databases, tables and other objects to prepare to migrate to Azure Synapse.

Analyzed various databases containing sales, titles, account, and region data for the releases of various titles. Developed new or modified SAS programs and used SQL pass-through and LIBNAME methods to extract data from the Teradata database and create study-specific SAS datasets, which were used as source datasets for report-generating programs.

Automated deployment and monitoring of Java-based data pipelines using CI/CD tools and frameworks.

Used Azure Data Factory extensively for ingesting data from disparate source systems.

Designed and implemented pipelines in Azure Synapse/ADF to extract, transform, and load data from several sources, including Azure SQL and Azure SQL Data Warehouse.

Created Python notebooks on Azure Databricks to process datasets and load them into Azure SQL databases. Created a Data Lake Storage Gen2 storage account and file system.
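
A minimal PySpark sketch of this load path: aggregate curated data and write it to Azure SQL Database over JDBC. The storage paths, server, table, credentials, and aggregation are placeholder assumptions, and the SQL Server JDBC driver is assumed to be available on the cluster.

# Illustrative sketch: read curated Parquet, aggregate, and write to Azure SQL
# Database via JDBC. All names and credentials are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("curated_to_azure_sql").getOrCreate()

curated = (
    spark.read.parquet("abfss://curated@exampleadls.dfs.core.windows.net/sales/")
         .groupBy("region", "sale_date")
         .agg(F.sum("amount").alias("total_amount"))
)

(curated.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://example-server.database.windows.net:1433;database=analytics")
        .option("dbtable", "dbo.daily_sales")
        .option("user", "etl_user")
        .option("password", "***")
        .mode("overwrite")
        .save())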

Experienced in managing Azure Data Lake Storage (ADLS) and Databricks Delta Lake, and in integrating them with other Azure services.

Experience with best practices of Web services development and Integration (both REST and SOAP).

Created dynamic pipelines, datasets, and linked services in Azure Data Factory for data movement and data transformations.

Collaborated with cross-functional teams to analyze requirements and deliver efficient Java-based data solutions.

Designed cloud-based solutions in Azure by creating Azure SQL databases, setting up elastic pool jobs, and designing tabular models in Azure Analysis Services.

Ingested the data into Azure Blob Storage and processed it using Databricks, and was involved in writing data warehouse solutions using PolyBase/external tables on Azure Synapse/Azure SQL Data Warehouse (Azure DW), with Azure Data Lake as the source. Rewrote existing SSAS cubes for Azure Synapse/Azure SQL Data Warehouse (Azure DW).

Implemented an ETL procedure utilizing SQLizer, SSIS, and SQL Azure Database to transfer data from Cosmos to the database.

Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Blob Storage, SQL Server, Windows Remote Desktop, Azure PowerShell, Databricks, Python, Kubernetes, Azure SQL Server, Azure Data Warehouse.

FIRSTSOURCE, Hyderabad, India Dec 2018 – May 2020

Cloud Data Platform Engineer

Created Spark jobs by writing RDDs in Python and created DataFrames in Spark SQL to perform data analysis, storing results in Azure Data Lake.

Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data in HDFS using Scala.

Created REST APIs in Java to integrate AWS Glue ETL workflows with downstream applications.

Developed Spark Applications by using Kafka and Implemented Apache Spark data processing project to handle data from various RDBMS and Streaming sources.

Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3, in ORC and Parquet formats.

Performed data extraction, aggregation, and consolidation of Adobe data with AWS Glue using PySpark.
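
The following is a hedged AWS Glue job sketch for this kind of PySpark aggregation: it reads from the Glue Data Catalog, aggregates, and writes Parquet back to S3. The database, table, columns, and bucket names are hypothetical placeholders.

# Illustrative AWS Glue (PySpark) job: catalog read, aggregate, Parquet write.
# All names are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

campaigns = glue_context.create_dynamic_frame.from_catalog(
    database="marketing", table_name="adobe_campaign_events"
).toDF()

daily = campaigns.groupBy("campaign_id", "event_date").agg(F.count("*").alias("events"))

daily.write.mode("overwrite").parquet("s3://example-analytics-bucket/campaign_daily/")

job.commit()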

Designed batch processing jobs using Apache Spark to increase speed compared to that of MapReduce jobs.

Wrote Spark SQL and embedded the SQL in Scala files to generate JAR files for submission to the Hadoop cluster.

Developed data pipeline using Flume to ingest data and customer histories into HDFS for analysis.

Used Hive as an ETL tool for event joins, filters, transformations, and pre-aggregations.

Extracted real-time data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, and processing and storing the results.

Solid understanding of NoSQL databases (MongoDB and Cassandra).

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL, and Scala; extracted large datasets from Cassandra and Oracle servers into HDFS and vice versa using Sqoop.

Developed analytical component using Scala, Spark and Spark Streaming.

Environment: Hive, Redshift, Snowflake, Spark, Scala, HBase, Cassandra, JSON, XML, UNIX Shell Scripting, Cloudera, MapReduce, Power BI, ETL, MySQL, NoSQL

RYTHMOS, Hyderabad, India Aug 2017 – Dec 2018

Cloud Data Engineer

Implemented Spark SQL queries that combined Hive queries with programmatic Python data manipulations supported by RDDs and DataFrames.

Configured Spark Streaming with Kafka streams to ingest data and store it in HDFS.

Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the results in HDFS.

Developed Spark scripts and UDFs using Spark SQL for data aggregation and querying, and wrote data back into RDBMS through Sqoop.

Installed and configured Pig and wrote Pig Latin scripts.

Wrote MapReduce jobs using Pig Latin.

Worked on analyzing Hadoop clusters using different big data analytic tools including HBase database and Sqoop.

Created Hive tables and dynamically inserted data into them using partitioning and bucketing for EDW tables and historical metrics.
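
As a hedged sketch of this partitioned-and-bucketed warehouse load, the snippet below uses Spark's native partitionBy/bucketBy writer against a Hive metastore; the staging table, target table, columns, and bucket count are placeholder assumptions, and Spark-native bucketing is used rather than Hive's own bucketing.

# Illustrative sketch: load a partitioned, bucketed warehouse table from Spark
# with Hive metastore support. Names and bucket count are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("edw_load").enableHiveSupport().getOrCreate()

sales = spark.table("staging.sales_raw")   # assumed staging table

(sales.write
      .partitionBy("sale_date")            # one directory per day for partition pruning
      .bucketBy(32, "order_id")            # cluster rows by order_id into 32 buckets
      .sortBy("order_id")
      .format("parquet")
      .mode("overwrite")
      .saveAsTable("edw.sales_history"))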

Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, effective and efficient joins, and transformations during the ingestion process itself.

Created ETL packages with different data sources (SQL Server, Oracle, Flat files, Excel, DB2, and Teradata) and loaded the data into target tables by performing different kinds of transformations using SSIS.

Designed and developed data integration programs in an environment with the NoSQL data store Cassandra for data access and analysis.

Created partitions and bucketing across the state in Hive to handle structured data using Elasticsearch.

Performed Sqoop-based file transfers through HBase tables for processing data into several NoSQL databases: Cassandra and MongoDB.

Certifications

Microsoft Certified: Azure Data Engineer Associate

Microsoft Certified: Power BI Data Analyst Associate

Microsoft Certified: Azure Fundamentals

AWS Certified Solutions Architect – Associate

Microsoft Certified: Azure Administrator Associate

Microsoft PowerShell for Automation and Administration

Six Sigma Certification (Green Belt)

Completed extensive hands-on training in PowerShell Automation and Azure CLI scripting.

Skills

Programming & Scripting: Core Java, J2EE (Spring, Hibernate, MVC, Spring Boot), REST APIs, SQL, Scala, Python, C, C++, Shell Scripting, Apache Spark (Scala/PySpark)

Databases: Relational: MySQL, Oracle, Teradata, MS SQL Server, PostgreSQL, DB2; NoSQL: HBase, DynamoDB, MongoDB, Cassandra

ETL / BI Tools: Informatica, Tableau, Power BI, Microsoft Excel

Cloud Platforms: Google Cloud Platform (BigQuery, DataFlow, DataStream, Pub/Sub, Cloud Functions, Cloud Run, Cloud Storage, IAM), AWS (EMR, S3, Glue, Kinesis, RDS, VPC, IAM, CloudWatch), Azure (Data Factory, Data Lake, Databricks, SQL Database, Synapse Analytics)

Big Data Technologies: Databricks, Hadoop, MapReduce, HDFS, YARN, Spark MLlib, Hive, Pig, Sqoop, HBase, Flume, Kafka, Zookeeper, Oozie, NiFi, StreamSets, Apache Airflow

DevOps & CI/CD Tools: Terraform, Docker, Kubernetes, Jenkins, Ansible

Version Control: Git, GitHub, SVN

Application Servers: Apache Tomcat (5.x, 6.0), JBoss (4.0)

Monitoring & Others: Splunk, Jupyter Notebook, Jira, Kerberos.

Education

Master of Science in Engineering Management Jan 2023 – Dec 2023

University of Houston Clear Lake, USA

Bachelor of Technology in Electronics & Communication Engineering Jun 2014 – May 2018

Lovely Professional University, India


