
Data Engineer

Location:
Richardson, TX
Posted:
October 22, 2024

Resume:

Name: Hemanth Batchu, Data Engineer

Email: *********@*****.***

Ph#: 469-***-****

Professional Summary:

Over 6 years of experience in Data Engineering and Data Pipeline design, development, and implementation as a Data Engineer, Data Developer, and Data Modeler.

Experience building distributed high-performance systems using Spark and Scala.

Experience with the Spark ecosystem, including the Core, SQL, and Streaming modules.

Experienced in writing Spark Applications in Scala and Python (PySpark).

Experience developing Kafka producers and consumers that stream millions of events per minute using Python (PySpark) and Spark Streaming.
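
A minimal PySpark Structured Streaming consumer of that shape might look like the sketch below (broker address, topic, and event schema are placeholder assumptions; the spark-sql-kafka connector package is assumed to be available):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("event-consumer").getOrCreate()

# Hypothetical event schema; real payloads would define their own fields.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("ts", LongType()),
])

# Read the Kafka topic as a streaming DataFrame.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "events")                       # placeholder topic
       .load())

# Kafka values arrive as bytes; cast to string and parse the JSON payload.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Write parsed events out (console sink here, purely for the sketch).
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()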

Experience with Data Lake implementation using Apache Spark; developed data pipelines and applied business logic in Spark.

Experience writing MapReduce programs for data cleansing and preprocessing.

Hands-on experience developing and deploying enterprise applications using major Hadoop ecosystem components such as MapReduce, YARN, Hive, HBase, Flume, Spark MLlib, Spark GraphX, Spark SQL, and Kafka.

Experience in dimensional modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).

Good experience with the Snowflake data warehouse, including developing data extraction queries and automating ETL loads from the data lake.

Extensive experience with Informatica (ETL tool) for data extraction, transformation, and loading; experience building data warehouses/data marts with Informatica PowerCenter.

Experience with data visualization in Tableau, creating line and scatter plots, bar charts, histograms, pie charts, dot charts, box plots, time series, error bars, multiple chart types, multiple axes, subplots, etc.

Experience working with Job scheduling tools (Airflow).

Hands-on experience with Azure cloud services (PaaS & IaaS): Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.

Experience working with GitHub/Git source and version control systems.

Experience with Continuous Integration and Continuous Deployment (CI/CD) using Jenkins and containerization with Docker.

Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase.

Experience in coding SQL for developing Procedures, Triggers, and Packages.

Experience with AGILE Methodologies.

Good communication skills, a strong work ethic, and the ability to work efficiently in a team, along with leadership skills.

Certified Applied Machine Learning Specialist Dec 2023 - Present

This certification provided a foundation in statistics and data-driven communication to extract meaningful business insights. It also covered machine learning, including text mining and NLP, for actionable insights from practical data.

Technical Skills:

Databases: PostgreSQL, MySQL, Microsoft SQL Server, Snowflake, AWS RDS, Teradata, Oracle

Cloud Technologies: AWS, Azure, GCP

Programming Languages: Python, SQL, Scala, MATLAB

NoSQL Databases: MongoDB, HBase, Apache Cassandra

Scalable Data Tools: Hadoop, Hive, Apache Spark, MapReduce

Integration Tools: Jenkins, Docker

Data Formats: CSV, JSON

Reporting & Visualization: Tableau, Matplotlib

Operating Systems: Red Hat Linux, Unix, Windows, macOS

Client: Delta Airlines

Role: Data Engineer Jun 2023 – Current

Responsibilities:

Worked closely with business users to gather, define business requirements, and analyze potential technical solutions.

Extracted, transformed, and loaded data from multiple source systems into Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL Azure Data Lake Analytics. Handled data ingestion into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data further in Azure Databricks.

Implemented Kafka functionalities such as distribution, partitioning, and maintaining replicated commit log services for messaging systems to manage real-time data feeds effectively.

Actively involved in developing Spark applications using Spark-SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, which provided insights into customer usage patterns through data analysis.
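
A simplified sketch of such a Databricks extraction-and-aggregation job (paths, column names, and the output table are illustrative assumptions, not the actual pipeline):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Extract from multiple file formats (paths and columns are placeholders).
events_json = spark.read.json("/mnt/raw/app_events/")
bookings_csv = spark.read.option("header", True).csv("/mnt/raw/bookings/")

# Transform: join events to bookings and aggregate usage per customer.
usage = (events_json.join(bookings_csv, on="customer_id", how="left")
         .groupBy("customer_id", "event_type")
         .agg(F.count("*").alias("event_count"),
              F.max("event_ts").alias("last_seen")))

# Load: persist as a managed table for downstream analysis.
usage.write.mode("overwrite").saveAsTable("analytics.customer_usage_patterns")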

Developed multiple Spark jobs in Scala and Python for data cleaning and preprocessing, preparing the data for further ingestion and analysis in Hive tables.

Migrated existing MapReduce programs into Spark transformations using both Spark and Scala, after initially developing them with Python (PySpark).

Used sbt to build Scala-based Spark projects and executed them with spark-submit.

Created and configured Spark clusters, including high-concurrency clusters using Azure Databricks, enabling faster and more efficient data preparation.

Worked on setting up and maintaining various Azure Data services, including Azure SQL Database, Azure Analysis Service, Azure SQL Data Warehouse, and Azure Data Factory, to streamline data storage and processing.

Automated resulting scripts and workflows using Apache Airflow and shell scripting to ensure that daily production jobs were executed consistently and without errors.

Installed and configured Apache Airflow for integration with S3 buckets and Snowflake Data Warehouse, creating Directed Acyclic Graphs (DAGs) to run and monitor workflow tasks.
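
A stripped-down Airflow 2.x DAG in that spirit (the connection ID, stage, and COPY statement are hypothetical, and the Snowflake provider package is assumed to be installed):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

def check_s3_landing(**context):
    # Placeholder check; real code might use an S3KeySensor instead.
    print("verify s3://my-landing-bucket/daily/ has today's files")

with DAG(
    dag_id="s3_to_snowflake_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    check_landing = PythonOperator(
        task_id="check_s3_landing",
        python_callable=check_s3_landing,
    )

    load_to_snowflake = SnowflakeOperator(
        task_id="copy_into_snowflake",
        snowflake_conn_id="snowflake_default",
        sql="COPY INTO analytics.daily_events FROM @landing_stage/daily/",
    )

    # Only load once the landing check has succeeded.
    check_landing >> load_to_snowflake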

Designed and developed data marts using dimensional data modeling techniques such as star and snowflake schemas to optimize data storage and reporting efficiency.
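
As a toy illustration of the star-schema layout behind such a mart, assuming a Databricks/Delta environment with an active spark session (table and column names are invented):

# Dimension table: one row per customer, surrogate key as the join column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS mart.dim_customer (
        customer_key BIGINT,
        customer_id  STRING,
        segment      STRING,
        effective_dt DATE
    ) USING delta
""")

# Fact table: measures plus foreign keys to the surrounding dimensions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS mart.fact_booking (
        booking_id   STRING,
        customer_key BIGINT,
        date_key     INT,
        route_key    BIGINT,
        fare_amount  DECIMAL(10,2)
    ) USING delta
""")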

Played an integral role in data modeling using star schema methodologies, transforming logical data models into dimensional models for building efficient data marts.

Involved in extensive ETL testing, including running jobs, extracting data using complex SQL queries, transforming data, and loading it into the data warehouse servers for business use.

Developed Tableau reports with complex calculations and performed ad-hoc reporting using Power BI, working closely with business users to meet reporting requirements.

Integrated Hive queries with Tableau to perform complex data analysis and published these reports to Tableau Server for broader business consumption.

Extensively used Git commands for code version control, ensuring that code changes were properly tracked, and collaborated with the team through version management.

Developed and modified database objects such as Tables, Views, Indexes, Constraints, Stored Procedures, Packages, Functions, and Triggers using SQL and PL/SQL to meet business requirements and improve database performance.

Performed analysis and developed queries using Hive to analyze data and generate reports for business users, ensuring that data was clean, accurate, and aligned with business needs.

Worked closely within an Agile framework, attending daily scrum meetings, participating in sprint planning sessions, and contributing to iterative review meetings to track progress and resolve issues as they arose.

Actively provided feedback during weekly iterative review meetings, offering constructive insights to improve each iterative development cycle.

Environment: Spark, Scala, MapReduce, Python, PySpark, Apache Spark, Kafka, Apache Airflow, Star Schema, Snowflake, ETL, Tableau, Power BI, Azure, Git, Jira, MongoDB, SQL, PL/SQL, Agile, Windows.

Client: Amazon

Role: Risk Specialist (Data Engineering) Feb 2019 – Jun 2022

Responsibilities:

Gathered business requirements and performed business analysis to design various data products.

Developed EMR Spark code using Scala and Spark-SQL/Streaming for faster data processing and testing.

Utilized Amazon EMR and Glue for Spark-SQL and data processing; stored data in S3 and used Athena for queries.
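
A small boto3 sketch of the Athena side of that workflow (region, database, table, and result bucket are assumptions):

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit a query against data stored in S3 and cataloged in Glue.
resp = athena.start_query_execution(
    QueryString="SELECT order_date, COUNT(*) AS orders FROM sales GROUP BY order_date",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])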

Implemented AWS Lambda functions with PySpark to perform real-time data aggregation and validation.
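
A plain-Python stand-in for that kind of Lambda, shown here with a Kinesis event trigger and a DynamoDB aggregate table rather than PySpark (field names and the table are hypothetical):

import base64
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("event_aggregates")  # hypothetical target table

def lambda_handler(event, context):
    counts = {}
    for record in event.get("Records", []):
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Validation: skip records missing required fields.
        if "event_type" not in payload or "event_id" not in payload:
            continue
        counts[payload["event_type"]] = counts.get(payload["event_type"], 0) + 1

    # Aggregation: persist per-type counts for this batch.
    for event_type, count in counts.items():
        table.update_item(
            Key={"event_type": event_type},
            UpdateExpression="ADD event_count :c",
            ExpressionAttributeValues={":c": count},
        )
    return {"processed_types": len(counts)}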

Worked on data encryption using KMS encryption and hashing algorithms as per client specifications.

Used Apache Airflow (Managed Workflows for Apache Airflow on AWS) for orchestrating, scheduling, and monitoring data pipelines.

Designed DAGs (Directed Acyclic Graph) to automate ETL pipelines, optimizing for performance and scalability.

Created direct query-based reports using Amazon QuickSight for data comparison between legacy and current data, generating real-time dashboards.

Designed and developed dashboards and reports using QuickSight, integrating data from S3, Redshift, and RDS.

Involved in optimizing Redshift queries and tables to improve cost performance.

Built 3NF data models for ODS and OLTP systems and implemented dimensional modeling using Star and Snowflake schemas.

Worked on data streaming from Kinesis Data Streams to S3 for real-time data ingestion into HDFS.

Developed and deployed migration strategies to AWS, using AWS Data Migration Service (DMS), Glue, and AWS Data Lake solutions.

Involved in loading data from REST API endpoints using AWS API Gateway and transferring data to Kinesis for streaming.

Managed version control using AWS CodeCommit for check-in and checkout of code changes.

Wrote HiveQL scripts to transform data within Redshift for analytics purposes.

Created Stored Procedures in Redshift and RDS for transforming data as part of ETL workflows.

Environment: Amazon EMR, AWS Glue, AWS Lambda, S3, Kinesis, Redshift, Athena, AWS API Gateway, Amazon QuickSight, Star Schema, Git, HiveQL, HDFS, Confluence, Jira, SQL, PL/SQL, Agile, Windows

Client: TechTide

Role: Data Engineer Jan 2018 – Feb 2019

Responsibilities:

Developed near-real-time data pipelines using Spark.

Designed and developed Scala workflows to pull data from cloud-based systems and apply transformations to it.

Developed Spark Scala scripts for mining information and transformed large datasets to support ongoing insights and reporting.

Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores.

Participated in data modeling sessions to develop models for Hive tables.

Applied Tableau type conversion functions when connecting to relational data sources.

Worked with a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.

Wrote Python scripts to parse JSON documents and load the data into the database.
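
A minimal example of that JSON-to-database pattern, using the standard library's sqlite3 purely for the sketch (the real target database, input file, and schema would differ):

import json
import sqlite3

def load_documents(json_path: str, db_path: str) -> int:
    # Assumes the file contains a JSON array of objects.
    with open(json_path) as f:
        documents = json.load(f)

    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS orders (
            order_id TEXT PRIMARY KEY,
            customer TEXT,
            amount   REAL
        )
    """)
    rows = [(d.get("order_id"), d.get("customer"), d.get("amount"))
            for d in documents]
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()
    return len(rows)

if __name__ == "__main__":
    print(load_documents("orders.json", "warehouse.db"))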

Used PySpark for data ingestion and to perform complex transformations.

Developed MapReduce programs to remove irregularities and aggregate the data.

Used SQL queries and other tools to perform data analysis and profiling.

Followed Agile methodology and was involved in daily Scrum meetings, sprint planning, showcases, and retrospectives.

Actively participated in weekly iterative review meetings, providing constructive feedback to track progress for each cycle and surface issues.

Environment: Spark, Scala, ETL, Tableau, Python, PySpark, MapReduce, Kafka, Pig, Hive, HDFS, Snowflake, SQL and Windows.

Education:

The University of Texas at Dallas, Richardson, TX, USA

Master of Science, Business Analytics

Sri Siddhartha Institute of Technology, KA, India

Bachelor of Engineering, Mechanical

Certifications:

AWS Certified Solutions Architect - Professional

Applied Machine Learning

Google Data Analytics Professional


