
Data engineer

Location:
Houston, TX
Salary:
65
Posted:
January 30, 2025


Resume:

Name: Sumanth Borra

Email: **************@*****.***

PH: +1-713-***-****

DATA ENGINEER

Professional Summary:

Over 10 years of IT experience as a Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem and Big Data analytics, cloud data engineering (AWS, Azure), data visualization, data warehousing, reporting, and data quality solutions.

Hands-on expertise with the Hadoop ecosystem and its Big Data technologies, including HDFS, Spark, YARN, MapReduce, Apache Cassandra, HBase, ZooKeeper, Hive, Oozie, Impala, Pig, and Flume.

Extensive experience working with Base SAS (Macros, PROC SQL, ODS), SAS/ACCESS, and SAS/GRAPH in UNIX, Windows, and mainframe environments.

Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance, Metadata Management, MDM and Configuration Management.

Experience in UNIX Korn Shell scripting for process automation.

Thorough knowledge of SAS programming, using Base SAS, the Macro facility, PROC SQL, SAS procedures, SAS functions, SAS formats, and the ODS facility for data scrubbing, manipulation, and preparation to produce summary datasets and reports.

Worked extensively with PySpark, using SparkContext, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs to improve the efficiency and optimization of existing Hadoop workloads.

Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.

In-depth understanding and experience with real-time data streaming technologies such as Kafka and Spark Streaming.

Experience in writing SQL queries, along with experience in data integration and performance tuning.

Hands-on experience with AWS components such as EMR, EC2, S3, RDS, IAM, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, Redshift, and DynamoDB, used to maintain a secure environment for the organization in the AWS public cloud.
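
As a rough illustration of this kind of AWS work (all names, regions, and S3 paths below are hypothetical), a small boto3 sketch that runs an Athena query over S3-backed data:

    # Illustrative sketch only: bucket, database, and table names are hypothetical.
    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Kick off an Athena query against a Glue-catalogued table backed by S3.
    response = athena.start_query_execution(
        QueryString="SELECT txn_id, amount FROM transactions LIMIT 10",
        QueryExecutionContext={"Database": "analytics_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    query_id = response["QueryExecutionId"]

    # Poll until the query finishes, then fetch the result rows.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        for row in rows:
            print([col.get("VarCharValue") for col in row["Data"]])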

Expertise in Azure infrastructure management (Azure Web Roles, Worker Roles, SQL Azure, Azure Storage).

Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Good understanding of Spark architecture with Databricks and Structured Streaming, including setting up Databricks on AWS and Microsoft Azure.

Experience in integrating Hive queries into Spark environment using Spark SQL.

Experienced in building Snowpipe pipelines, with in-depth knowledge of Data Sharing in Snowflake and of Snowflake database, schema, and table structures.

Experienced in configuring and administering the Hadoop Cluster using major Hadoop Distributions like Apache Hadoop and Cloudera.

Hands-on experience handling database issues and connections with SQL and NoSQL databases such as SQL Server, MongoDB, and HBase.

Technical Skills:

AWS Ecosystem

Amazon EC2, S3, EMR, Kinesis, Amazon RDS, VPC, Amazon Elastic Load Balancing, IAM, CloudWatch.

AZURE Ecosystem

Azure Data Factory, Azure Databricks, Azure Data Lake, Azure SQL Database, Azure SQL Data Warehouse.

Big Data Ecosystems

Hadoop, MapReduce, HDFS, YARN, Spark, Hive, Airflow, StreamSets, Sqoop, HBase, Flume, ZooKeeper, Pig, Oozie, NiFi, Kafka, Spark MLlib.

Programming & Scripting Languages

C, C++, Java, Python, Spark, SQL, Shell Scripting.

NoSQL Database

HBase, DynamoDB.

Database

MySQL, Oracle, Teradata, MS SQL Server, PostgreSQL, DB2, MongoDB, Cassandra.

Version Control

Git, SVN, GitHub.

ETL & BI Tools

Tableau, Microsoft Excel, Informatica, Power BI.

Application Server

Apache Tomcat 5.x/6.0, JBoss 4.0.

Others

Jupyter Notebook, Terraform, Docker, Kubernetes, Jenkins, Ansible, Splunk, Jira, Apache Airflow, Kerberos.

Certifications:

1. Azure Data Engineer

2. AWS Solution Architect

3. Power BI Data Analyst Associate

4. Microsoft Azure Fundamentals

5. Six Sigma certifications

Education Details:

Master’s degree, University of Houston-Clear Lake, Jan 2020 to May 2021

Bachelor’s degree, Lovely Professional University, June 2009 to May 2013

Professional Experience

Title: Sr. Data Engineer.

PayPal, Austin, TX Jan 2024 to Present

Responsibilities:

Developed ETL pipelines to extract, transform, and load digital wallet transaction data into a centralized Snowflake data warehouse at Braintree Payments, optimizing data accessibility for business intelligence.

Implemented real-time data ingestion workflows using Snowpipe, enabling near-instantaneous processing of chargeback transactions and enhancing decision-making capabilities.

Tuned and optimized SQL queries within Snowflake to improve performance and reduce latency for chargeback reporting, leading to faster insights and enhanced operational efficiency.

Developed real-time data ingestion pipelines using Azure Data Factory and Snowflake to streamline transactional data processing from Braintree’s payment systems.

Extracted, transformed, and loaded data from source systems to the Snowflake data warehouse using a combination of Snowflake's native functionality, SQL scripts, and Snowpipe for continuous data loading.
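
A minimal sketch of this kind of Snowpipe setup using the Snowflake Python connector (stage, pipe, and table names are hypothetical, and the Azure notification integration needed for auto-ingest is omitted for brevity):

    # Hypothetical names throughout; connection parameters would come from a secure config.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="***",
        warehouse="ETL_WH", database="PAYMENTS", schema="RAW",
    )
    cur = conn.cursor()

    # External stage over the cloud storage location where transaction files land.
    cur.execute("""
        CREATE STAGE IF NOT EXISTS txn_stage
        URL = 'azure://exampleaccount.blob.core.windows.net/transactions/'
        CREDENTIALS = (AZURE_SAS_TOKEN = '***')
        FILE_FORMAT = (TYPE = JSON)
    """)

    # Snowpipe that continuously copies newly staged files into the raw table
    # (on Azure, AUTO_INGEST also requires a notification integration, not shown).
    cur.execute("""
        CREATE PIPE IF NOT EXISTS txn_pipe AUTO_INGEST = TRUE AS
        COPY INTO raw_transactions
        FROM @txn_stage
        FILE_FORMAT = (TYPE = JSON)
    """)

    cur.close()
    conn.close()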

Designed and implemented effective semi-structured data models using Snowflake, ensuring optimal performance and scalability for various analytical requirements.

Collaborated with Data Science team to incorporate Machine Learning with Snowpark into the pipeline, reducing latency and improving the efficiency of the overall pipeline.

Implemented Snowflake batch jobs using Snowflake's tasks and event notifications, integrated with Azure Monitor and Slack, ensuring timely issue resolution and performance monitoring of ETL pipelines.

Configured Snowflake external tables to query data directly from an external storage system (Azure Blob Storage) without needing to ingest the data first.

Developed SQL queries using SnowSQL and generated reports in Splunk based on Snowflake connections, facilitating data visualization and analysis.

Deployed and managed distributed Apache Spark clusters in Azure Databricks to process terabytes of data, ensuring high availability and fault tolerance for mission-critical data operations.

Utilized the Azure Service Fabric SDK for developing, deploying, and managing microservice applications, interacting with the Service Fabric cluster through APIs for monitoring and troubleshooting.

Set up threshold-based alerts in Grafana to monitor critical system metrics, reducing incident response time by 40% through timely notifications.

Configured role-based access control policies in Saviynt, streamlining permissions management and improving compliance with corporate security standards.

Developed and maintained ETL pipelines leveraging Azure Data Lake Storage (ADLS) to efficiently store, process, and analyze unstructured and structured data, supporting real-time analytics and machine learning models.

Deployed and configured Azure Virtual Machines (VMs) using ARM templates and Azure portal to meet project-specific requirements.

Designed and implemented backup strategies using Azure Backup and Azure Site Recovery to ensure high availability and disaster recovery for critical workloads running on Azure Virtual Machines.

Used Azure Data Migration Service (DMS) for assistance in migrating PL/SQL code to Azure SQL.

Leveraged Fivetran capabilities to maintain real-time data synchronization across multiple data sources, facilitating immediate insights and decision-making for business intelligence initiatives.

Deployed Fivetran to perform incremental data updates, reducing the load on source systems and optimizing the data replication process to ensure minimal latency and resource usage.

Created PowerShell scripts for managing cloud storage solutions, such as uploading, archiving, and versioning files in Azure Blob Storage.

Reduced mean time to detect (MTTD) and mean time to resolve (MTTR) by implementing actionable alerting mechanisms in Microsoft Insights.

Integrated Microsoft Insights with Azure Logic Apps and Azure Functions for automated data-driven workflows based on monitoring triggers.

Designed and implemented data solutions using NoSQL databases, accommodating diverse data types and structures.

Played a crucial role in the deployment of code across System Integration Testing (SIT), User Acceptance Testing (UAT), and Production platforms, adhering to DevOps practices and utilizing tools like Jira, Confluence, and Bitbucket for streamlined collaboration and version control.

Designed pipelines to convert raw data into Parquet, enabling efficient processing and analytics on large datasets.

Developed multiple Python DAGs and integrated all ETL jobs in Airflow, ensuring efficient workflow management.
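
A simplified sketch of what one such Airflow DAG might look like (the DAG id, task names, and callables are hypothetical placeholders):

    # Hypothetical DAG sketch; task logic is reduced to placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull raw transactions from the source system")

    def transform():
        print("clean records and convert them to Parquet")

    def load():
        print("load curated data into the warehouse")

    with DAG(
        dag_id="wallet_txn_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        # Linear dependency chain: extract -> transform -> load.
        t_extract >> t_transform >> t_load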

Title: Sr Data Engineer

Wells Fargo, Charlotte, NC Jan 2023 to Dec 2023

Responsibilities:

Implemented real-time data ingestion and processing using Azure Event Hubs and Azure Stream Analytics, enhancing monitoring of grocery delivery and pickup metrics in real time.

Architected and optimized data storage in Snowflake, ensuring high performance and scalability to handle large volumes of transactional data related to grocery delivery and pickup.

Collaborated with business analysts and retail teams to streamline data models in Snowflake, focusing on supply chain and inventory management for smaller store formats, improving decision-making.

Leveraged Snowflake's built-in features for data ingestion, such as Snowpipe for real-time data loading and Snowflake Connector for Azure Blob Storage, facilitating efficient data ingestion and processing workflows.

Automated data loading and transformation pipelines using Azure Functions to trigger Snowpipe for real-time ingestion and transformation in Snowflake.
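
One simplified way such a trigger could be wired up, sketched with the Azure Functions Python v2 model and the Snowflake connector (names and secrets are placeholders; this variant issues the COPY directly rather than calling the Snowpipe REST API):

    # Hypothetical sketch: a blob trigger that loads a newly landed file into Snowflake.
    import azure.functions as func
    import snowflake.connector

    app = func.FunctionApp()

    @app.blob_trigger(arg_name="newfile",
                      path="landing/{name}",
                      connection="LandingStorageConnection")
    def load_on_arrival(newfile: func.InputStream):
        # Connect to Snowflake and copy the newly staged file into the raw table.
        conn = snowflake.connector.connect(
            account="my_account", user="etl_user", password="***",
            warehouse="ETL_WH", database="RETAIL", schema="RAW",
        )
        try:
            conn.cursor().execute(
                "COPY INTO raw_orders FROM @landing_stage/" + newfile.name
            )
        finally:
            conn.close()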

Utilized Snowflake Secure Data Sharing to provide real-time access to data across multiple Azure accounts and external partners without data replication.

Utilized Azure Data Share for secure and real-time access to data across multiple Azure subscriptions and external partners without data duplication.

Utilized Entra ID for user and group role management, ensuring seamless integration with Azure AD for identity governance and access management.

Leveraged Azure Virtual Machines (VMs), Azure Blob Storage, and Azure SQL Database to deploy, scale, and manage web applications and services.

Configured and managed Azure Managed Identities and Role-Based Access Control (RBAC) to ensure secure, streamlined authentication and fine-grained access control across Azure services.

Implemented data workflows using Azure Logic Apps and Data Factory, integrating Python-based scripts to automate data processing pipelines.

Created serverless APIs using Azure API Management and Azure Functions, integrating with Python back-end logic for scalable and cost-effective API solutions.

Utilized Azure Event Hubs with Apache Kafka for streamlined event streaming and cluster management.

Developed ETL processes using Python and Azure Synapse Analytics, ensuring data quality and consistency for advanced analytics.

Integrated Azure Data Catalog with Azure Databricks and Azure SQL Database for metadata management and ETL workflows.

Integrated Python applications with Azure Monitor, Log Analytics, and Application Insights for monitoring, logging, and alerting.

Integrated Power BI with Azure SQL Database, Azure Synapse Analytics, and Azure Data Lake to visualize large datasets, providing stakeholders with dynamic reports and actionable insights.

Involved in developing real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.
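
A minimal PySpark Structured Streaming sketch of this kind of Kafka consumption (broker address, topic, schema, and paths are hypothetical; it assumes the Spark Kafka connector package is available on the cluster):

    # Hypothetical broker, topic, and paths; schema simplified for illustration.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StringType, DoubleType

    spark = SparkSession.builder.appName("order-stream").getOrCreate()

    schema = (StructType()
              .add("order_id", StringType())
              .add("store_id", StringType())
              .add("amount", DoubleType()))

    # Read the raw Kafka stream and parse the JSON payload.
    orders = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "orders")
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("o"))
              .select("o.*"))

    # Continuously append parsed records to a Hive-compatible Parquet location.
    query = (orders.writeStream
             .format("parquet")
             .option("path", "/data/curated/orders")
             .option("checkpointLocation", "/data/checkpoints/orders")
             .outputMode("append")
             .start())

    query.awaitTermination()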

Automated ETL processes using Perl and PowerShell, reducing manual intervention and improving data accuracy.

Developed and maintained stored procedures, functions, and triggers in T-SQL to automate data transformation tasks.

Implemented data storage and retrieval solutions using Delta Lake format for optimized storage and faster query execution.
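
A short sketch of writing and reading data in the Delta Lake format with PySpark (paths and column names are hypothetical):

    # Hypothetical paths; assumes a Databricks/Spark session with Delta Lake available.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-demo").getOrCreate()

    # Write curated transactions as a Delta table partitioned by date.
    txns = spark.read.parquet("/mnt/raw/transactions")
    (txns.write
         .format("delta")
         .mode("overwrite")
         .partitionBy("txn_date")
         .save("/mnt/curated/transactions"))

    # Query the Delta table back; the format also supports time travel and MERGE.
    curated = spark.read.format("delta").load("/mnt/curated/transactions")
    curated.createOrReplaceTempView("transactions")
    spark.sql("SELECT txn_date, COUNT(*) AS cnt FROM transactions GROUP BY txn_date").show()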

Deployed Microsoft Insights solutions to monitor and analyze performance across multiple Azure regions, ensuring high availability and compliance.

Used Microsoft Insights to monitor audit logs, track security events, and ensure compliance with standards such as GDPR and HIPAA.

Built Bash shell scripts to monitor and schedule data pipeline jobs, improving reliability and reducing downtime.

Created and scheduled Bash scripts for data ingestion pipelines using tools like Cron, improving job reliability and timeliness.

Orchestrated job scheduling and management using Autosys, ensuring timely execution of ETL processes to meet business SLAs; collaborated with stakeholders to gather requirements, analyze data integration needs, and ensure alignment with organizational objectives and data quality standards.

Title: Azure Data Engineer

ACCOLITE DIGITAL, Irving, TX July 2021 to Dec 2022

Responsibilities:

Performed migration of several databases, application servers, and web servers from on-premises environments to the Microsoft Azure cloud environment.

Created large datasets by combining individual datasets using various inner and outer joins in SAS/SQL and dataset merging techniques of SAS/BASE.

Created unit and integration tests for Java-based data engineering workflows to ensure reliability and maintainability.

Proficient in NoSQL databases, including HBase, Cassandra, and MongoDB, and their integration with the Hadoop cluster.

Configured Spark Streaming to receive real-time data from Kafka and store it in HDFS for downstream loading into Snowflake.

Used SnowSQL to analyze data quality issues and improve the Snowflake schema and data modeling.

Worked with Data ingestion using Spark Streaming and Kafka.

Analyzed various databases containing sales, title, account, and region data for various title releases. Developed new and modified existing SAS programs, using SQL pass-through and LIBNAME methods to extract data from the Teradata database and create study-specific SAS datasets used as source datasets for report-generating programs.

Experienced in handling terabytes of records and manipulating them with SAS.

Analyzed existing databases, tables and other objects to prepare to migrate to Azure Synapse.

Automated deployment and monitoring of Java-based data pipelines using CI/CD tools and frameworks.

Used Azure Data Factory extensively for ingesting data from disparate source systems.

Designed and implemented pipelines in Azure Synapse/ADF to extract, transform, and load data from several sources, including Azure SQL and Azure SQL Data Warehouse.

Created Python notebooks on Azure Databricks to process datasets and load them into Azure SQL databases; also created a Data Lake Storage Gen2 storage account and file system.
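
A simplified sketch of such a Databricks notebook cell loading processed data into Azure SQL over JDBC (storage account, server, table, and credentials are hypothetical; secrets would normally come from a secret scope):

    # Hypothetical connection details and paths.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read raw files from ADLS Gen2 and apply a simple cleanup transformation.
    df = (spark.read.option("header", "true")
          .csv("abfss://raw@examplelake.dfs.core.windows.net/sales/"))
    cleaned = df.dropDuplicates(["order_id"]).filter("amount IS NOT NULL")

    # Write the result into an Azure SQL Database table over JDBC.
    jdbc_url = "jdbc:sqlserver://example-sql.database.windows.net:1433;database=salesdb"
    (cleaned.write
            .format("jdbc")
            .option("url", jdbc_url)
            .option("dbtable", "dbo.sales_curated")
            .option("user", "etl_user")
            .option("password", "***")
            .mode("append")
            .save())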

Experienced in managing Azure Data Lake Storage (ADLS), Databricks Delta Lake and understanding of how to integrate with other Azure Services.

Experience with best practices of Web services development and Integration (both REST and SOAP).

Created dynamic pipelines, datasets, and linked services in Azure Data Factory for data movement and data transformations.

Collaborated with cross-functional teams to analyze requirements and deliver efficient Java-based data solutions.

Designed cloud-based solutions in Azure by creating Azure SQL databases, setting up Elastic Pool jobs, and designing tabular models in Azure Analysis Services.

Ingested data into Azure Blob Storage and processed it using Databricks; wrote Spark Scala scripts and UDFs to perform transformations on large datasets.

Built data warehouse solutions using PolyBase external tables on Azure Synapse/Azure SQL Data Warehouse (Azure DW), with Azure Data Lake as the source, and rewrote existing SSAS cubes for Azure Synapse/Azure SQL Data Warehouse.

Implemented an ETL procedure utilizing SQLizer, SSIS, and Azure SQL Database to transfer data from Cosmos to the database.

Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Blob Storage, SQL Server, Windows Remote Desktop, Azure PowerShell, Databricks, Python, Kubernetes, Azure SQL Server, Azure Data Warehouse.

Title: Big Data Engineer

FIRSTSOURCE, Hyderabad, India Jan 2018 to Nov 2019

Responsibilities:

Created Spark jobs by writing RDDs in Python and created data frames in Spark SQL to perform data analysis and stored in Azure Data Lake.

Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS using Scala.

Developed Spark Applications by using Kafka and Implemented Apache Spark data processing project to handle data from various RDBMS and Streaming sources.

Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 in ORC and Parquet formats.

Performed data extraction, aggregation, and consolidation of Adobe data with AWS Glue using PySpark.
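
A trimmed sketch of an AWS Glue PySpark job of this general shape (the catalog database, table, and output bucket are hypothetical):

    # Hypothetical Glue job sketch; database, table, and path names are placeholders.
    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read campaign data registered in the Glue Data Catalog (backed by S3).
    campaigns = glue_context.create_dynamic_frame.from_catalog(
        database="marketing", table_name="campaign_raw")

    # Aggregate with DataFrame operations, then write consolidated output as Parquet.
    agg = campaigns.toDF().groupBy("campaign_id").count()
    agg.write.mode("overwrite").parquet("s3://example-bucket/curated/campaigns/")

    job.commit()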

Designed batch processing jobs using Apache Spark to increase speed compared to that of MapReduce jobs.

Wrote Spark SQL and embedded the SQL in Scala files to generate JAR files for submission to the Hadoop cluster.

Developed data pipeline using Flume to ingest data and customer histories into HDFS for analysis.

Used Hive as an ETL tool for event joins, filters, transformations, and pre-aggregations.

Extracted real-time data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, processing them, and storing the results in HDFS.

Solid understanding of NoSQL databases (MongoDB and Cassandra).

Converted Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL, and Scala; extracted large datasets from Cassandra and Oracle servers into HDFS and vice versa using Sqoop.

Developed analytical component using Scala, Spark and Spark Streaming.

Environment: Hive, Redshift, Snowflake, Spark, Scala, HBase, Cassandra, JSON, XML, UNIX Shell Scripting, Cloudera, MapReduce, Power BI, ETL, MySQL, NoSQL

Title: Data Engineer

RYTHMOS, Hyderabad June 2013 to Dec 2017

Responsibilities:

Implemented Spark SQL queries that combined Hive queries with programmatic data manipulation in Python, supported by RDDs and DataFrames.
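
A brief sketch of mixing a Hive query with DataFrame-level manipulation in PySpark (database, table, and column names are hypothetical):

    # Hypothetical table and column names; assumes Hive support is enabled on the cluster.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, upper

    spark = (SparkSession.builder
             .appName("hive-plus-dataframe")
             .enableHiveSupport()
             .getOrCreate())

    # Start from a Hive query, then continue with programmatic DataFrame transformations.
    orders = spark.sql("SELECT order_id, customer_id, amount FROM sales_db.orders")
    enriched = (orders.filter(col("amount") > 0)
                      .withColumn("customer_id", upper(col("customer_id"))))

    enriched.write.mode("overwrite").saveAsTable("sales_db.orders_clean")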

Configured Spark Streaming to consume data from Kafka streams and store it in HDFS.

Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the results in HDFS.

Developed Spark scripts and UDFs using Spark SQL for data aggregation and querying, and wrote data back into the RDBMS through Sqoop.

Installed and configured Pig and wrote Pig Latin scripts.

Wrote MapReduce jobs using Pig Latin.

Worked on analyzing Hadoop clusters using different big data analytic tools including HBase database and Sqoop.

Created Hive tables and dynamically inserted data into them using partitioning and bucketing for EDW tables and historical metrics.
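
An illustrative snippet of the dynamic-partitioning pattern behind such EDW loads, expressed through Spark's Hive support (table names are hypothetical; bucketing would be added with a CLUSTERED BY clause in the Hive DDL):

    # Hypothetical EDW table names; dynamic partitioning settings shown explicitly.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    # Partitioned target table for historical metrics.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS edw.metrics_hist (
            metric_name STRING,
            metric_value DOUBLE
        )
        PARTITIONED BY (load_date STRING)
        STORED AS ORC
    """)

    # Dynamic insert: the partition value comes from the data itself.
    spark.sql("""
        INSERT INTO TABLE edw.metrics_hist PARTITION (load_date)
        SELECT metric_name, metric_value, load_date
        FROM staging.metrics_raw
    """)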

Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, and effective and efficient joins and transformations during the ingestion process itself.

Created ETL packages with different data sources (SQL Server, Oracle, Flat files, Excel, DB2, and Teradata) and loaded the data into target tables by performing different kinds of transformations using SSIS.

Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.

Created partitions and bucketing by state in Hive to handle structured data using Elasticsearch.

Performed Sqoop transfers of various files through HBase tables to process data into several NoSQL databases, including Cassandra and MongoDB.

Environment: Hadoop, MapReduce, HDFS, Hive, Python, Kafka, HBase, Sqoop, NoSQL, Spark 1.9, PL/SQL, Oracle, Cassandra, MongoDB, ETL, MySQL


