Data Engineer Senior

Location:

United States

Posted:

July 10, 2024

Resume:

Rahul Sathya Gunti

Phone : 704-***-****

Email : *****.******@*****.***

PROFESSIONAL SUMMARY:

Experienced in the full development lifecycle including requirements analysis, system design, development, testing, deployment, and documentation. Specialized in data management solutions, focusing on restructuring, simplifying, and maintaining data processes and technical metadata documentation.

•Insights-driven professional with over 6 years of cross-functional experience in Data Engineering, analysis, and BI development.

•Proven record of delivering innovative data solutions and actionable real-time insights to stakeholders across diverse domains.

•Expert in developing and maintaining over 1100 efficient data pipeline workflows using Apache Airflow, StreamSets, and Spark.

•Enhanced data processing and analysis capabilities for large-scale datasets, resulting in a 50% reduction in processing times.

•Proficient in data visualization and reporting tools such as Microsoft Power BI and Tableau, creating insightful, real-time dashboards and reports to support data-driven strategies.

•Extensive experience with Azure Cloud services, including Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, and Azure Cosmos DB, as well as Snowflake and Databricks.

•Extensive experience with AWS Cloud services, including implementing comprehensive monitoring and logging solutions with AWS CloudWatch to ensure system reliability and performance.

•Developed custom CloudWatch metrics and dashboards to provide real-time insights into system health and performance.

•Configured CloudWatch alerts and integrated them with SNS to notify stakeholders about critical system events and anomalies.

•Developed and managed scalable ETL jobs using AWS Glue to automate the extraction, transformation, and loading of data from diverse sources.

•Implemented Lambda functions for real-time data processing, transformation, and analysis.

•Managed large-scale data storage solutions using Amazon S3, ensuring high availability and durability of data.

•Orchestrated data ingestion pipelines to load data into S3 from various sources and utilized S3 for data retrieval in analytics workflows.

•Developed and deployed Azure Functions for data preprocessing and enrichment.

•Integrated Azure Logic Apps for orchestrating complex data workflows and implemented data archiving and retention strategies using Azure Blob Storage.

•Applied advanced analytics and machine learning workflows using Azure ML and created real-time customer behavior analysis workflows for dynamic data processing using Azure Databricks.

•Implemented data processing workflows in Azure Databricks for large-scale transformations.

•Strong capabilities in exploratory data analysis, statistical analysis, and visualization using R, Python, SQL, Power BI, and integration of Azure Synapse with Power BI for interactive dashboards.

•Engineered scalable Azure Synapse schemas for complex reporting requirements.

•Established cloud-based data warehouse solutions on Azure using Snowflake for rapid analysis of real-time customer data.

•Engineered and fine-tuned Snowflake schemas, tables, and views using SnowSQL and Snowpipe for optimized storage and retrieval efficiency.

•Collaborated closely with stakeholders to implement tailored data models and structures in Snowflake for effective analysis.

•Implemented partitioning, indexing, and caching strategies in Snowflake for enhanced query performance.

•Developed, supported, and maintained ETL processes using Informatica.

•Created complex mappings, reusable transformations, sessions, and workflows with Informatica ETL tools to extract data from various sources and load into targets.

•Proficient in multiple databases including Cosmos DB, MongoDB, Cassandra, MySQL, Oracle, and MS SQL Server.

•Developed Spark applications using PySpark and SparkSQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, uncovering insights into customer usage patterns.

•Familiar with libraries like PySpark, NumPy, Pandas, and Matplotlib in Python.

•Skilled in writing complex SQL queries using joins, group by, and nested queries and experienced with HBase for data loading using connectors and querying with NoSQL.

•Established robust data lineage and metadata management solutions for real-time data tracking and implemented data governance practices and data quality checks.

•Designed and created Hive external tables using a shared metastore with static and dynamic partitioning, bucketing, and indexing.

•Improved performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.

•Experienced in Big Data development using Hive and Spark, relational database and SQL development, and CI/CD tools such as Git, Jenkins, and Ansible. Deployed Apache Oozie for Hadoop task scheduling and management.

TECHNICAL SKILLS:

•Programming Languages: Java, Python, C programming, C++.

•Data Engineering and Analytics: Data Structures and Algorithms, Big Data Technologies, Data Science, Data Analytics, Data Visualization.

•Azure Services: Azure Data Factory, Azure Databricks, Azure Synapse, Blob Storage, Data Lake Storage (ADLS Gen2), Azure Logic Apps, Azure DevOps, Azure Repos, Key Vault, Azure Functions, Azure Event Hubs, Azure HDInsight, Azure Service Fabric.

•Machine Learning & AI: Machine Learning, Deep Learning Algorithms, Pandas, NumPy, Scikit-Learn, TensorFlow, Keras, PyTorch, Matplotlib.

•Data Visualization Tools: Power BI, Tableau, Tibco Jasper.

•Databases: PostgreSQL, MySQL, Microsoft SQL Server, MongoDB, Cassandra.

•Cloud Computing Platforms: Amazon Web Services (AWS), Microsoft Azure.

•Hadoop Ecosystem: HDFS, MapReduce, Hive, Sqoop.

•Big Data Technologies: Snowflake (Schemas, Tables, Views), Apache Spark, Spark SQL, MapReduce, Hadoop, HDFS, YARN, Hadoop Common, Hive, Kafka, Sqoop, Oozie, Flume, HBase, ZooKeeper, Tez, Composer.

•Azure Stack: Azure Data Factory, Azure Databricks, StreamSets, Azure Synapse.

•AWS Stack: S3, CloudWatch, Lambda, EC2, Glue, Redshift.

•Version Control & Build Automation Tools: Ant, Maven, Git, GitHub, Jenkins, Kubernetes, Docker, JIRA, Informatica PowerCenter, Apache Airflow.

WORK EXPERIENCE:

Client: Johnson and Johnson, New Brunswick, NJ Oct’22 – Present

Role: Azure Snowflake Data Engineer

Responsibilities:

•Successfully implemented a real-time customer behavior analysis solution using Azure Data Factory, Azure Databricks, Snowflake, Kafka, and Spark Streaming.

•Enabled the client to gain immediate insights into customer interactions, resulting in a 35% increase in targeted marketing effectiveness and a 40% improvement in customer retention rates.

•Understood business requirements, analyzed them, and translated them into application and operational requirements.

•Extracted, transformed, and loaded data from source systems to Azure Data Storage services using Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics. Ingested data into one or more Azure services (Azure Data Lake, Azure Blob Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

•Developed Spark applications with Azure Data Factory and Spark-SQL for data extraction, transformation, and aggregation from different file formats to uncover insights into customer usage patterns.

•Orchestrated dynamic data processing workflows using Azure Databricks and Spark to evaluate and understand client behavior in real-time.

•Collaborated on ETL tasks to maintain data integrity and verify the stability of real-time customer behavior analysis pipelines.

•In the initial phase of the project, leveraged a relational database and later migrated to cloud databases.

•Migrated data from on-premises SQL Server to cloud databases (Azure Synapse Analytics and Azure SQL DB).

•Designed and modified Snowflake schemas, tables, and views for optimized retrieval and storage efficiency, ensuring a robust foundation for real-time analytics and reporting.

•Analyzed data quality issues using SnowSQL and built analytical warehouses on Snowflake.

•Incorporated Slowly Changing Dimension (SCD) and Change Data Capture (CDC) approaches into microservices to preserve data integrity and effectively capture small changes, interfacing smoothly with SQL databases, CSV files, and REST APIs.

•Integrated big data processing and analytics capabilities with Azure Synapse Analytics for effortless exploration and generation of real-time insights from customer behavior data.

•Used Snowpipe to automatically ingest and process streaming data from Kafka into Snowflake, enabling real-time analysis of high-volume streaming data for immediate insights into customer behavior.

•Configured Snowpipe to load data from Azure Data Lake Storage (ADLS GEN2) into Snowflake, ensuring seamless integration between the data lake and the data warehouse for efficient data processing and analysis.

•Developed UDFs in Scala and PySpark to meet specific business requirements. Analyzed large data sets using Hive queries.

•Created a framework for data profiling, cleansing, automatic restartability of batch pipelines, and rollback handling.

•Implemented masking and encryption techniques to protect sensitive data.

•Used Kafka, Spark Streaming, and Hive to design and implement a robust data pipeline for real-time ingestion, transformation, and analysis of customer behavior data.

•Developed real-time data processing solutions using Kafka and Spark Streaming enabling continuous ingestion, transformation, and analysis of high-volume streaming data for immediate insights into customer behavior.

•Analyzed, developed, and built modern data solutions with Azure PaaS services to enable data visualization.

•Innovated and optimized microservices for Spark jobs using Scala for scripting to perform real-time data transformations, aggregations, and machine learning tasks on extensive datasets, resulting in swift insights into customer behavior.

•Led the development of a comprehensive CI/CD framework for real-time data pipelines using Jenkins, collaborating with DevOps engineers to meet client requirements.

•Facilitated the execution of Hive scripts through Hive on Spark and SparkSQL, ensuring real-time data processing for comprehensive customer behavior analysis.

•Used JIRA to report on projects and created sub-tasks for development, QA, and partner validation, ensuring a streamlined and agile approach to real-time customer behavior analysis.

•Experienced in the full spectrum of Agile ceremonies, from daily stand-ups to internationally coordinated PI Planning, to support iterative and responsive development of real-time customer behavior analysis solutions.

Environment: Azure, Azure Data Factory, Snowflake, Snowpipe, Azure Databricks, Spark, Hive, Spark SQL, Python, Java, Scala, SQL, Azure Data Lake Storage (ADLS Gen2), Shell scripting, Git, Jenkins, Azure Logic Apps, Azure Service Fabric, MS SQL, Oracle, HDFS, MapReduce, YARN, JIRA, Power BI, Kafka

Client: Thermo Fisher Scientific, Sunnyvale, CA Jan’22 - Sep’22

Role: Big Data Engineer

Responsibilities:

•Orchestrated and managed data workflows using Azure Data Factory, Sqoop, Flume, and Kafka, facilitating the ingestion, transformation, and processing of customer behavioral data for subsequent analysis.

•Implemented data quality checks and monitoring solutions within Azure Data Factory pipelines, ensuring the accuracy and integrity of ingested data throughout the workflow.

•Integrated Azure Databricks with Scala for data aggregation and analysis on extensive datasets, enhancing business insights.

•Developed custom transformations and UDFs (User Defined Functions) in Scala and PySpark within Azure Databricks notebooks to address specific business logic requirements, improving data processing capabilities.

•Leveraged big data ecosystems, including Azure HDInsight with Hadoop, Spark, and other big data technologies, for loading and transforming diverse sets of structured, semi-structured, and unstructured data.

•Integrated Azure Cosmos DB with Hive within the Analytics Zone, optimizing data storage and retrieval processes.

•Applied Hive queries and Spark SQL on Azure HDInsight to meet specific business requirements, using MapReduce-like functionalities for data analysis and processing.

•Implemented Azure DevOps Pipelines for deployment automation, leading to expedited and streamlined build and release processes.

•Designed and implemented data partitioning strategies in Azure Cosmos DB to optimize query performance and reduce latency in data retrieval operations.

•Successfully migrated data from RDBMS (Oracle) to Azure SQL Database using Sqoop, enhancing data management and processing capabilities.

•Conducted performance tuning and optimization on Hive queries and Spark jobs to improve processing efficiency and reduce execution times. Implemented data modeling solutions using Azure Data Lake Storage and Azure SQL Data Warehouse, ensuring scalability and performance of analytical workloads.

•Utilized Azure DevOps Boards for issue management and project workflow, improving overall project organization and efficiency.

•Collaborated with team members to identify and resolve JVM-related issues, enhancing system performance and stability on Azure Virtual Machines.

•Employed Azure Repos and Git/GitHub for version control, maintaining the code repository and ensuring effective code management and meticulous tracking of changes.

Environment: Azure Data Factory, Azure Databricks, Sqoop, MySQL, HDFS, Apache Spark, Scala, Hive, Hadoop, Cloudera, HBase, Kafka, MapReduce, ZooKeeper, Oozie, Data Pipelines, Python, PySpark, Shell scripting, JIRA.

Client: APLIHS Software Solutions, India Jun’19 - Dec’21

Role: Data Engineer

Project: DHnA Data Hydration to Azure ADLS Gen2 using the StreamSets, Airflow, and Spark data stack.

Responsibilities:

•Architected, curated, and implemented over 1100 batch and stream pipeline workflows to efficiently move data in file formats such as Parquet, Avro, and CSV. Optimized data pipelines by performing statistical analysis to eliminate bottlenecks.

•Designed, developed, and tested the architectural pipelines across multiple cloud providers (Azure and AWS).

•Built ETL pipelines using Airflow, ADF, and Glue to load, transform, and curate data per business logic.

•Onboarded datasets using Airflow, explored data, and automated tasks using a modern Spark data stack.

•Utilized Kafka pub-sub model for tracking real-time events in the data records to trigger processes for data orchestration.

•Leveraged Spark jobs to process large-scale datasets with 10-15 billion records, resulting in a 50% reduction in processing time.

•Achieved over 85% code coverage using Python's unittest library to automate the approval process.

•Designed data governance reports to facilitate the development of data pipelines for moving data.

•Parsed and analyzed business data sets to understand them. Performed data reconciliations, validation, and quality checks. Identified and developed new processes within the data request process to enrich data.

•Utilized HBase for logging and for updating Hive tables. Curated raw datasets into business-ready data using HQL queries, which involved handling data skew across several columns.

•Designed DevOps pipelines for build and release operations utilizing Azure Repos as the version control system.

•Created reports to visualize real-time metrics and monitor data pipeline performance using Prometheus and Grafana.

Environment: AWS, Azure Data Factory, Azure Databricks, Sqoop, MySQL, HDFS, Apache Spark, Scala, Hive, Hadoop, Cloudera, HBase, Kafka, MapReduce, ZooKeeper, Oozie, Data Pipelines, Python, PySpark, Shell scripting, JIRA.

Client: TECH MAHINDRA, India July’18 - May’19

Role: BI Developer

Project: Development of future state analytical reporting of CCA Data Analytics using Microsoft Power BI, Tableau.

Responsibilities:

•Designed, developed, and maintained many self-service Power BI and Tableau reports.

•Built visually compelling, comprehensive trend views and 50+ KPIs in the Consumer Health Care domain for 363 products across 120 business operation centers.

•Served as a key participant in the migration from an on-premises warehouse to Azure Cloud solutions.

•Applied permissions and row-level security to address data privacy concerns, and implemented data transformations using DAX measures, columns, and the M language.

•Interacted actively with SMEs to understand the requirements and develop relevant visuals.

EDUCATION:

UNIVERSITY OF NORTH CAROLINA, GREENSBORO - Master’s in Computer Science (Jan 2022 – Dec 2023)

Relevant Courses: Analysis of Algorithms, Advanced Data Structures, Distributed Operating Systems, Data Science, Database System Implementation, Database System Architecture, Software Engineering, Big Data and Machine Learning

OSMANIA UNIVERSITY, HYDERABAD - Bachelor’s Degree in Computer Science (Sep 2014 – June 2018)

CERTIFICATIONS & PAPERS:

•Microsoft Certified: Azure Fundamentals

•Microsoft Certified: Azure Data Fundamentals

•Microsoft Certified: Azure Data Analyst (Power BI)

•Machine Learning A to Z: Hands-on Python in Data Science (Udemy)
