Bharath M
Email: ************@*****.***
Cell: +1-469-***-****
PROFESSIONAL SUMMARY:
* ***** ** ********** ** data engineering, specializing in SQL, Python, PySpark, and Spark SQL for pipeline design, development, and maintenance.
Databricks Certified Professional and Databricks Certified Associate.
Proficient with AWS, Azure, and GCP for scalable data processing and analytics, using tools such as AWS Glue and Azure Data Factory.
Skilled in optimizing Big Data workflows with Databricks, Delta Live Tables, and Apache Spark for reliable real-time data processing.
Developed interactive dashboards and visualizations with Power BI and Tableau, providing actionable insights from complex data sets.
Transformed Alteryx workflows into Databricks for improved integration and performance.
Well-versed in both relational (MySQL, PostgreSQL) and NoSQL databases (MongoDB), focusing on efficient data storage and retrieval.
Expert in Git for version control and collaboration, enhancing team productivity and project management in agile environments.
Proficient with JIRA for managing data engineering projects and ensuring effective team communication.
Integrated Asset Bundles with Git repositories, automating the tracking of changes across different components of the data pipeline, thereby improving traceability, version control, and rollback capabilities.
Knowledgeable in CI/CD pipeline automation to streamline deployment and improve software quality.
Designed and implemented Lakehouse architecture on Databricks, integrating Delta Lake to support a seamless blend of structured and semi-structured data processing and analytics.
Leveraged Databricks SQL Analytics and Delta Lake to unify data lakes and warehouses, optimizing storage and query performance for high-volume, scalable analytics.
Designed and implemented data governance frameworks to maintain data quality, security, and compliance.
Deep understanding of Data Warehousing principles and cloud storage solutions such as AWS S3 and Azure Storage.
Developed ETL workflows using SQL Server Integration Services (SSIS) to extract, transform, and load data between SQL Server and cloud data lakes, enhancing data integration across hybrid environments.
Configured Change Data Capture (CDC) in SQL Server to efficiently track and manage real-time data changes, enabling incremental data updates for data warehousing and reporting solutions (see the merge sketch at the end of this summary).
Familiar with Spark architecture, including its core components, Spark SQL, DataFrames, and streaming capabilities.
Implemented Unity Catalog in Databricks for enhanced metadata management and organization.
Documented data engineering processes, including data flow diagrams, technical specifications, and standard operating procedures.
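For illustration, the CDC-driven incremental updates referenced above follow the same pattern as a Delta Lake upsert. The sketch below is a minimal example with assumed catalog, table, and key names (not production objects), and it assumes `spark` is the ambient Databricks session.

```python
# Illustrative only: merge a batch of CDC-style changes into a Delta Lake table.
# Table names and the join key are assumptions, not real production objects.
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "lakehouse.customers")    # existing Delta table
changes = spark.read.table("staging.customer_changes")       # assumed CDC change feed

(target.alias("t")
 .merge(changes.alias("c"), "t.customer_id = c.customer_id") # match on business key
 .whenMatchedUpdateAll()                                     # apply updates
 .whenNotMatchedInsertAll()                                  # insert new rows
 .execute())
```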
EDUCATION: Master’s in Data Science, Jackson State University, 2023
Bachelor’s in Computer Science, Nagarjuna University, 2016
TECHNICAL SKILLS:
Data Processing and Analysis: Apache Spark, Databricks, Apache Kafka, Snowflake, Hadoop, Databricks SQL, AWS EMR
Cloud Technologies: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)
Data Storage and Management: Azure Data Lake Storage (ADLS), Azure Blob Storage, Azure Synapse, Amazon S3, Unity Catalog, Auto Loader
Data Warehousing: Azure Synapse Analytics, BigQuery, AWS Redshift
Data Orchestration: Azure Data Factory, AWS Glue, Apache Airflow, ETL Pipelines, Delta Live Tables Pipelines, Databricks Workflows
Visualization: Tableau, Power BI, Databricks SQL Dashboards
Programming Languages: Python, PySpark, Spark SQL, SQL, T-SQL, JavaScript
NoSQL Databases: MongoDB, Neo4j, Redis
Relational Databases: MySQL, PostgreSQL, SQL Server, Oracle, Snowflake
Version Control and Methodology: Git, Agile
WORK EXPERIENCE:
Thrivent, MN Jul 2024 - Present
Sr Databricks/Data Engineer
Project: Migration to Unity Catalog for Improved Health Insurance Data Management
This project aimed to migrate and optimize Thrivent Financial’s health insurance data, specifically customer appointments, promotional campaigns, promo history, and appointment promo history, by leveraging Unity Catalog. The migration enabled seamless integration and management of key health insurance datasets, enhancing Thrivent’s ability to make data-driven decisions.
Responsibilities:
•Provisioned AWS EC2 and AWS S3 for scalable, secure storage to support the migration of large volumes of data on appointments, promotional campaigns, promo history, and appointment promo history from Thrivent’s legacy system to Databricks, optimizing for cost-efficiency and performance.
•Developed ETL pipelines to migrate and organize data related to appointments, campaign promotions, and promo history into Databricks. Leveraged DBT in Snowflake to handle data transformations, standardizing schemas and ensuring accurate migration.
•Configured Snowpipe to enable continuous ingestion from AWS S3 into Snowflake, providing near real-time access to updated campaign and appointment data, essential for analyzing promotion effectiveness as data transitioned to Databricks.
•Implemented Snowflake Streams to monitor changes in appointment and promotional campaign data, triggering updates as needed and adding data validation checks to maintain consistency and reliability during migration.
•Built data integration and transformation pipelines with Apache Spark on Databricks to consolidate data on appointments, campaigns, and promo history, optimizing resource usage for efficient processing in the new environment.
•Used Spark Streaming on Databricks for continuous updates, ensuring that insights on promotional campaigns and appointment engagement remained timely and relevant (a minimal ingestion sketch follows this list).
•Managed datasets in Snowflake with micro-partitioning and clustering, enhancing query performance for appointment and campaign data, with deduplication strategies to minimize storage costs.
•Created predictive models using Spark ML on Databricks to analyze customer engagement and the effectiveness of various health insurance promotions, keeping models up to date with migrated data for accurate insights.
•Utilized DBT for modular data transformations within Snowflake, ensuring consistency and high data quality for appointment, campaign, and promo history data.
•Developed interactive dashboards in Databricks SQL and Power BI, providing real-time insights into appointment trends, promotion engagement, and historical promo data, tailored to Thrivent’s needs for campaign effectiveness analysis.
•Established CI/CD pipelines with Azure DevOps and Git to streamline deployments of data transformations across Databricks, Snowflake, and AWS, supporting regular updates for campaign and appointment data workflows.
•Enforced robust security and governance protocols using Unity Catalog in Databricks, implementing role-based access control and regular audits to safeguard sensitive data on customer appointments, campaign details, and historical promo information.
•Collaborated with Thrivent’s analysts, campaign managers, and engineers to ensure the Databricks data architecture supported specific requirements for tracking appointments, promotions, and historical engagement data.
•Deployed secure data exchange platforms to facilitate collaboration with external partners, enabling a broader analysis of promo effectiveness and customer engagement for health insurance offerings.
•Promoted strong data governance through Unity Catalog, maintaining quality and compliance across Databricks and Snowflake environments for data on appointments, campaign promotions, and promo history.
•Leveraged Databricks Asset Bundles to automate packaging and versioning of data pipelines, streamlining CI/CD workflows and reducing errors for appointment and promotional data.
•Enabled Power BI integration with Databricks clusters, allowing Thrivent to analyze and visualize data on appointment trends and promotional campaign performance, supporting data-driven decisions on health insurance campaigns.
•Integrated Asset Bundles with Git for enhanced version control, traceability, and rollback capabilities, ensuring reliability as Thrivent’s appointment and promotional data transitioned to Databricks.
•Documented the data pipeline architecture, detailing ingestion processes, transformation logic, and governance practices to ensure transparency and facilitate knowledge transfer within Thrivent’s team, especially for appointment, promotion, and promo history data management.
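A minimal sketch of the streaming ingestion pattern referenced above, using Databricks Auto Loader from S3 into a bronze Delta table. The bucket paths, file format, and table names are assumptions for illustration, not Thrivent’s actual objects; `spark` is the ambient Databricks session.

```python
# Illustrative sketch: continuous ingestion of appointment/promo files from S3
# into a bronze Delta table with Databricks Auto Loader. Paths and names are assumed.
from pyspark.sql import functions as F

bronze_stream = (
    spark.readStream.format("cloudFiles")                      # Auto Loader source
    .option("cloudFiles.format", "json")                       # assumed file format
    .option("cloudFiles.schemaLocation", "s3://example-bucket/_schemas/appointments")
    .load("s3://example-bucket/raw/appointments/")
    .withColumn("ingest_ts", F.current_timestamp())            # audit column
)

(bronze_stream.writeStream
    .option("checkpointLocation", "s3://example-bucket/_checkpoints/appointments_bronze")
    .trigger(availableNow=True)                                # process available files, then stop
    .toTable("raw.appointments_bronze"))                       # bronze Delta table
```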
Technical Stack: Unity Catalog, Databricks, Asset Bundles, Snowflake, Amazon RDS, Auto Loader, Databricks SQL Analytics, Power BI, AWS Glue, Git, AWS S3, PySpark and SQL.
GE HealthCare, NC Dec 2022 – Jun 2024
Databricks/Data Engineer
Project: Enhanced Patient Treatment Insights
The project leveraged Medallion architecture within Delta Live Tables (DLT) to assess and enhance treatment plans across healthcare systems. It aimed at identifying key factors leading to improved patient outcomes, optimizing treatment protocols, and reducing time-to-recovery, thereby elevating overall patient care quality and operational efficiency.
Responsibilities:
Built streaming live tables to consolidate diverse datasets, including patient health records, ongoing treatment data, and patient outcome metrics, establishing the foundational bronze layer with real-time and historical data. Used Azure Databricks for DLT pipeline construction, SQL for table configuration, and Auto Loader for efficient data ingestion (a minimal DLT sketch follows this list).
Enhanced the silver layer by seamlessly integrating primary data tables, employing complex filtering, and executing transformative operations to cleanse, enrich, and consolidate patient data for deeper analytical insights.
Led the development of an advanced analytical model in the gold layer, designed to closely monitor and evaluate patient progress, identifying patterns and discrepancies in treatment outcomes.
Orchestrated data integration efforts from wearable devices and IoT sensors in patient monitoring, enhancing the bronze layer with a broader spectrum of health data.
Designed a feedback loop mechanism within the data architecture to continually refine and optimize machine learning models based on new patient data and treatment outcomes.
Played a pivotal role in driving the adoption of predictive analytics in patient care planning, presenting model findings to executive teams and securing buy-in for broad implementation.
Orchestrated data integration from various sources into Azure Data Factory (ADF) and Azure Databricks, enriching the bronze layer with a comprehensive dataset that combines both real-time and historical patient interaction data for unified insights.
Advocated for secure data democratization within GE HealthCare by implementing Azure Active Directory (AAD)-based access protocols, ensuring that stakeholders could easily access patient insights while maintaining compliance with privacy standards.
Collaborated with IT security teams to enforce stringent data protection protocols and compliance with healthcare regulations, using Azure Key Vault for secure storage of sensitive information and Azure Policy for data governance, ensuring protection of sensitive patient and campaign data.
Enabled direct integration of Power BI with Azure Databricks clusters through Partner Connect, leveraging the refined datasets produced by DLT for comprehensive analysis and visualization, thereby supporting evidence-based adjustments to patient care protocols.
Promoted a culture of data excellence and governance using Unity Catalog, ensuring high data quality and integrity across all layers of the Delta Live Tables architecture.
Documented the entire data pipeline architecture, including data ingestion processes, transformation logic, and business rules, to ensure transparency and facilitate knowledge transfer within the team and across the organization.
Championed the use of advanced statistical methods and AI algorithms to analyze complex healthcare data, pushing the boundaries of traditional data analysis in the medical field.
Implemented a continuous monitoring system for the data pipeline's performance and health, ensuring high availability and minimal downtime for critical analytics operations.
Developed custom data transformation scripts and functions to handle unique data types and structures found in healthcare datasets, enhancing the versatility of the data processing pipeline.
Established a structured data labeling process for unstructured and semi-structured medical data, enabling more effective use of natural language processing (NLP) techniques in patient data analysis.
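A minimal sketch of the bronze and silver DLT layers described above. The storage path, column names, and expectation rule are assumptions for illustration, not the actual GE HealthCare pipeline.

```python
# Illustrative Delta Live Tables pipeline: bronze ingestion with Auto Loader,
# silver cleansing with a data-quality expectation. Names and paths are assumed.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw treatment events ingested with Auto Loader")
def treatment_events_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("abfss://raw@examplelake.dfs.core.windows.net/treatment_events/")
    )

@dlt.table(comment="Silver: cleansed, de-duplicated treatment records")
@dlt.expect_or_drop("valid_patient", "patient_id IS NOT NULL")   # drop rows missing a patient id
def treatment_events_silver():
    return (
        dlt.read_stream("treatment_events_bronze")
        .withColumn("event_date", F.to_date("event_ts"))
        .dropDuplicates(["patient_id", "event_ts"])
    )
```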
Technical Stack: Azure Databricks, Delta Live Tables, Azure SQL Database, Auto Loader, Unity Catalog, Databricks SQL Analytics, Power BI, Azure Data Factory, Git, Alteryx, Azure Synapse, Azure Data Lake Storage (ADLS), PySpark and SQL.
Ernst & Young, NJ Jan 2022 – Nov 2022
Databricks/Data Engineer
Project: Logistics Data Optimization and Analysis
The project aimed to improve logistics efficiency for a client by applying advanced analytics and machine learning to enhance freight and warehouse operations. The goal was to reduce costs and improve delivery speeds, with success measured by decreased logistics expenses and faster, more reliable service delivery.
Responsibilities:
Leveraged EY's proprietary Feature Store to facilitate rapid data access and feature engineering, transitioning from traditional data analysis to a machine learning-based approach for logistics optimization.
Collaborated with client stakeholders to identify critical factors affecting logistics performance, including shipment volumes, delivery times, and warehouse capacity, integrating these into a comprehensive analytics model.
Deployed PySpark DataFrames within AWS EMR (Elastic MapReduce) to develop and refine data-centric models for freight and warehouse operation enhancements, prioritizing real-time data analysis and predictive analytics.
Integrated logistics data from client systems into AWS S3 (Simple Storage Service) using AWS Glue for ETL (Extract, Transform, Load) processing, ensuring data integrity and accessibility for analytical work.
Implemented Amazon Kinesis for real-time data ingestion from logistics tracking systems within AWS, boosting capabilities for instantaneous analytics and decision-making support.
Employed Apache Kafka within Databricks for real-time ingestion of logistics tracking events, enhancing the capability for immediate analytics and decision support (a minimal streaming sketch follows this list).
Developed and maintained robust ETL pipelines in Databricks, automating data transformation and enrichment processes to support analytics and machine learning models.
Configured continuous data processing workflows in Databricks, automating operational tasks to improve data pipeline reliability and efficiency.
Implemented comprehensive data security protocols, including data encryption and strict access controls, to protect sensitive logistics information.
Set up continuous data processing workflows in AWS, employing automation to enhance data pipeline reliability and operational efficiency.
Stored processed logistics data in AWS S3 and utilized AWS Glacier for long-term storage, establishing a scalable and secure architecture for data analytics and reporting.
Proactively monitored data storage and processing resources, optimizing for cost-effectiveness and performance scalability.
Conducted rigorous data quality checks, ensuring the cleanliness and accuracy of logistics data for analytics applications.
Automated routine data management tasks, freeing up resources for strategic analysis and model development.
Established error logging and recovery mechanisms within data pipelines, minimizing downtime and ensuring data process resilience.
Maintained code and project documentation in Git repositories, ensuring effective team collaboration and version control.
Managed project tasks and deliverables using JIRA, facilitating transparent communication and efficient workflow management.
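A minimal sketch of the Kafka-based streaming ingestion described above. The broker address, topic, event schema, and target table are assumptions for illustration, not the client’s actual systems.

```python
# Illustrative sketch: stream logistics tracking events from Kafka into a Delta table.
# Broker, topic, schema, and table names are assumed.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

event_schema = StructType([
    StructField("shipment_id", StringType()),
    StructField("status", StringType()),
    StructField("warehouse_id", StringType()),
    StructField("event_ts", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")   # assumed broker
    .option("subscribe", "logistics-tracking")             # assumed topic
    .option("startingOffsets", "latest")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

(events.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/logistics_tracking")
    .toTable("logistics.tracking_events"))                  # Delta table for downstream analytics
```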
Technical Stack: AWS cloud, Databricks, SQL Server, PySpark, Medallion architecture, Databricks SQL, AWS Glue, ECS, S3, Git, JIRA, Power BI, Delta Live Tables, Unity Catalog
VCTPL, India Aug 2016 – May 2021
Software Engineer
Project: Enhancing Port Operations and Efficiency through a Terminal Operating System (TOS)
Integrated the Navis N4 terminal operating system to enhance container handling and movement efficiency within port operations. Developed a data analysis platform to enable predictive analytics for optimizing container flow and designed a hybrid cloud infrastructure to ensure scalable, resilient operations.
Responsibilities:
Played a key role in integrating the Navis N4 terminal operating system, ensuring seamless operations, data accuracy, and real-time tracking of containers, which led to a significant improvement in operational efficiency.
Conducted comprehensive needs analysis to tailor the integration of the Navis N4 system to specific port operational workflows.
Developed custom modules and extensions within Navis N4 using its APIs and JavaScript, facilitating tailored solutions for issues like yard planning, berth scheduling, and equipment dispatching.
Implemented real-time data feeds from Navis N4 into our data analytics pipeline, ensuring up-to-the-minute accuracy in operational monitoring and decision support.
Automated data extraction processes from Navis N4, which allowed for seamless ingestion into our data analytics platform, reducing manual data handling and potential for errors.
Designed and implemented algorithms for operational data analysis in Python, leveraging libraries such as Pandas, NumPy, and Scikit-learn (a minimal modeling sketch follows this list).
Utilized Apache Spark on AWS EMR for efficient processing of large-scale operational datasets, enabling predictive analytics that reduced downtime, optimized container flow, and enhanced decision-making.
Employed Docker for application containerization and Kubernetes for orchestration, facilitating easy deployment and scaling across cloud and on-prem environments.
Developed comprehensive training materials on Agile and Scrum methodologies, enhancing team productivity and operational efficiency.
Utilized JIRA for project management, ensuring effective tracking and coordination of project milestones and deliverables.
Ensured all software solutions adhered to industry standards for security and compliance, implementing robust data security measures.
Conducted performance tuning of software applications and databases, identifying and resolving system bottlenecks.
Ensured that all data handling and processing complied with industry standards and regulations, implementing data encryption, secure access controls, and regular security audits to maintain the confidentiality and integrity of customer data.
Led research initiatives to explore new technologies and methodologies for port operations.
Evaluated emerging technologies for potential application within the organization.
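A minimal sketch of the kind of predictive model described above, assuming an illustrative CSV extract and feature set rather than actual Navis N4 data.

```python
# Illustrative sketch: regression model forecasting hourly container throughput.
# The CSV path and column names are assumed, not the real TOS export.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

ops = pd.read_csv("container_moves.csv")                  # assumed historical extract
features = ops[["vessel_size", "yard_utilization", "crane_count", "hour_of_day"]]
target = ops["containers_moved_next_hour"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```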
Technical Stack: Navis N4, JavaScript, SQL, AWS (EC2, S3, RDS, EMR), Apache Spark, Docker, Kubernetes, Agile, Python, JIRA, Pandas, NumPy, data analysis.