Venugopal Reddy Kandhati
**********.***@*****.*** +1-716-***-**** LinkedIn
PROFESSIONAL SUMMARY
Senior AI Data Engineer with 10+ years of IT experience in designing, developing, and deploying scalable, secure, and cloud-native data pipelines and platforms across the insurance, healthcare, banking, and retail domains.
Hands-on expertise in Big Data technologies including Hadoop, Apache Spark (PySpark), Hive, Databricks, and Snowflake, used to build robust ETL/ELT pipelines, implement feature engineering workflows, and deliver real-time and batch data solutions.
Proficient in implementing cloud-based data architecture and services across AWS (S3, Glue, Lambda, EMR, ECS, Redshift, DynamoDB) and Azure (ADLS Gen2, Azure Data Factory, Synapse, Azure SQL, Event Hubs, Azure ML, Unity Catalog, Purview).
Strong programming and scripting experience in Python, SQL, and R for building automated data processing workflows, transformation logic, data validation routines, and MLOps pipelines.
Experienced in writing complex queries and performing query optimization across Snowflake, Redshift, SQL Server, Teradata, Oracle, and MySQL; skilled in data modeling, dbt transformations, and building semantic layers.
Worked extensively on DevOps automation using Docker, Jenkins, GitHub Actions, Terraform, and CloudFormation to deploy data infrastructure and CI/CD pipelines across environments.
Delivered machine learning pipelines and scoring systems using MLflow, Azure ML, and ECS containers to enable predictive analytics for fraud detection, claims scoring, and customer segmentation.
Built and scheduled complex ingestion and transformation pipelines using Apache Airflow, Azure Data Factory, and StreamSets, enabling SLA compliance and reducing latency in downstream systems.
Strong foundation in legacy Hadoop-based workflows using Hive, MapReduce, and HDFS, supporting migration to Spark and cloud-native environments for modernization efforts.
Designed ETL frameworks using Talend, Informatica Cloud, SSIS, and Azure Data Factory to enable ingestion from RDBMS and API sources; transformed data across staging, curated, and semantic zones.
Worked with diverse data formats including Parquet, Avro, JSON, CSV, and XML for ingestion, processing, and analytics; enforced schema validation and type-checking before promotion.
Developed executive-ready dashboards and KPIs using Tableau and Power BI, enabling business teams to analyze operational efficiency, risk, and customer behavior.
Strong documentation and collaboration skills using Confluence, Lucidchart, and JIRA, with consistent delivery across Agile teams, sprint ceremonies, and cross-functional initiatives.
Exposure to OLAP modeling using SSAS and Redshift, enabling fast multidimensional analysis over large datasets.
Skilled in version control tools like Git, GitHub, and Azure Repos for collaborative engineering workflows and code promotion.
Experienced in the full SDLC from requirements gathering and design through development, deployment, testing, and production support in Agile environments.
Familiar with ticketing and project management tools like JIRA, Azure DevOps Boards, and ServiceNow for sprint tracking, issue resolution, and delivery planning.
A collaborative team player, fast learner, and automation-driven professional who consistently delivers high-performance data engineering solutions aligned with business outcomes.
TECHNICAL SKILLS
Languages & Scripting: Python, SQL, R, PowerShell, Unix Shell Scripting
Big Data & ETL Tools: Apache Spark (Scala, PySpark, Spark SQL), Apache Hive, Hadoop (HDFS, MapReduce), Apache Kafka, Apache Airflow, Azure Data Factory, Informatica Cloud, SSIS, Talend, StreamSets, dbt, Sqoop
Cloud Platforms:
AWS - S3, Glue, Lambda, EMR, ECS, Redshift, Redshift Spectrum, DynamoDB, Bedrock, CloudWatch, CloudTrail, Secrets Manager
Azure - Data Lake Gen2, Data Factory (ADF), Synapse, Azure ML, Event Hubs, Azure Functions, AKS, Key Vault, Azure DevOps, Azure OpenAI, Cognitive Services (Form Recognizer, OCR, Text Analytics), Unity Catalog, Purview
Machine Learning & GenAI Tooling: MLflow, Azure Machine Learning, OpenAI API, Azure OpenAI (GPT-4), LangChain, FAISS, Retrieval-Augmented Generation (RAG), XGBoost, spaCy, scikit-learn
Data Warehousing & Storage: Snowflake, Redshift, Redshift Spectrum, Azure Synapse, Cosmos DB, SQL Server, Oracle, MySQL, PostgreSQL, Teradata
DevOps & CI/CD: Docker, Jenkins, Git, GitHub, GitLab, Azure DevOps, GitHub Actions, Terraform, AWS CloudFormation
Visualization & BI: Tableau, Power BI, Excel
Security & Access Management: IAM (AWS), Azure Active Directory (AAD), Azure Key Vault, AWS Secrets Manager, Encryption, Policy Enforcement, Column-level and Row-level Security
Documentation & Collaboration: Confluence, JIRA, Lucidchart, SharePoint
PROFESSIONAL EXPERIENCE
Liberty Mutual | Senior AI Data Engineer | May 2023 - Present | Boston, MA
Project: Intelligent Claims Automation and Insight Enablement Program
As a Senior AI Data Engineer, contributed to the transformation of Liberty Mutual’s claims operations by building a modern data and intelligence platform that supported automation, real-time insights, and risk-based decisioning. The initiative enabled more efficient data processing, streamlined fraud detection, and faster claims triage through advanced analytics and AI-driven summarization. Delivered governed, secure, and scalable systems to support cross-functional teams with actionable insights.
Roles & Responsibilities:
Engineered a Delta Lakehouse architecture on Azure Data Lake Gen2 and Databricks to support scalable ingestion, processing, and curation of structured and unstructured claims data across multiple business domains.
Built robust batch and streaming pipelines using Azure Data Factory and Azure Event Hubs to ingest data from SQL sources, APIs, and real-time adjuster interactions into bronze and silver Lakehouse zones, ensuring timely data availability.
Developed PySpark-based ETL workflows in Databricks to normalize, cleanse, and enrich policy, claims, and adjuster datasets for consumption by analytics and business reporting teams.
Provisioned Azure Synapse SQL serverless pools and Power BI datasets, enabling business teams to access near real-time KPIs, fraud indicators, and enriched metadata without impacting operational systems.
Architected and optimized Snowflake financial data marts to support IFRS 17, GAAP, and reinsurance reporting, resolving schema drift and automating multi-source reconciliation workflows.
Implemented data governance using Unity Catalog and Azure Purview, automating schema enforcement, lineage tracking, and table-level RBAC across data zones to support audit and compliance needs.
Hardened pipeline security with Azure Key Vault, Managed Identities, and scoped RBAC roles, ensuring protection of sensitive policyholder data and adherence to enterprise access policies.
Automated infrastructure provisioning for Databricks, Azure ML, ADF, Synapse, and Cognitive Search components using Azure DevOps and Terraform, cutting environment setup time by 60% and ensuring consistency across dev, staging, and prod.
Enabled monitoring and auditing of critical data workflows using Azure Log Analytics and Databricks logging integration, improving system observability and reducing troubleshooting time.
Prepared curated feature datasets and transformed model-ready inputs in PySpark for integration with real-time Azure ML endpoints, enabling low-latency scoring pipelines governed with MLflow.
Integrated Azure Cognitive Services (OCR, Text Analytics) to extract structured metadata and sentiment from scanned documents and complaint narratives, automating triage and escalation workflows.
Built semantic search pipelines using Azure OpenAI embeddings and Cognitive Search to index historical claims and retrieve similar case references, improving triage accuracy and decision-making.
Collaborated with Responsible AI teams to implement RBAC, prompt filtering, and PII redaction across GenAI-powered services, aligning with enterprise AI governance standards.
Environment: Azure Cloud • Databricks • Snowflake • Azure Machine Learning • Azure OpenAI • Power BI • Docker • SQL
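Illustrative sketch (not from the production codebase; paths, table names, and columns are hypothetical) of the kind of PySpark cleanse-and-enrich step described above for promoting claims data from the bronze to the silver Delta zone:

# Illustrative PySpark sketch for Databricks: normalize and enrich claims records
# before writing to a curated (silver) Delta zone. Schema and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims_silver_enrichment").getOrCreate()

claims_raw = spark.read.format("delta").load("/mnt/bronze/claims")       # hypothetical path
policies = spark.read.format("delta").load("/mnt/silver/policies")       # hypothetical path

claims_clean = (
    claims_raw
    .dropDuplicates(["claim_id"])                                         # hypothetical business key
    .withColumn("loss_date", F.to_date("loss_date", "yyyy-MM-dd"))        # standardize dates
    .withColumn("claim_amount", F.col("claim_amount").cast("decimal(18,2)"))
    .filter(F.col("claim_id").isNotNull())                                # drop unusable rows
)

# Enrich claims with policy attributes for downstream reporting and scoring.
claims_enriched = claims_clean.join(
    policies.select("policy_id", "product_line", "effective_date"),
    on="policy_id",
    how="left",
)

(claims_enriched.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("product_line")
    .save("/mnt/silver/claims_enriched"))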
HCSC | ML Data Engineer | June 2021 - May 2023 | Chicago, IL
Project: Healthcare Data Modernization and Predictive Intelligence Platform
As an ML Data Engineer, led the development of a secure, cloud-native data and analytics platform to modernize healthcare operations and enable predictive modeling for clinical and claims insights. Delivered real-time and batch pipelines, unified patient data across systems, and operationalized ML workflows for risk scoring, fraud detection, and compliance reporting. The platform supported HIPAA-compliant governance and empowered care teams with timely, AI-driven intelligence.
Roles & Responsibilities:
Led architecture and Agile delivery of a cloud-native data and ML platform using Jira, collaborating with cross-functional teams to enable predictive modeling for readmission risk, fraud detection, and compliance scoring.
Designed a HIPAA-compliant data lake on AWS using S3, Glue, and Redshift Spectrum, ingesting healthcare data weekly across raw, staging, and curated zones to support ML model training, regulatory reporting, and longitudinal member analytics.
Implemented high-volume ingestion pipelines using Informatica Cloud, Kafka, and AWS Lambda, capturing daily claims, eligibility, and HL7/FHIR events with sub-2s latency to support real-time patient alerts and decisioning workflows.
Orchestrated ingestion and transformation pipelines using Apache Airflow, coordinating Informatica, Glue, and Databricks jobs and monitoring success rates and DAG completion to ensure SLA-aligned delivery for downstream ML and reporting systems.
Built high-volume transformation pipelines using Apache Spark on AWS EMR, processing clinical and claims records to normalize ICD/CPT codes, standardize provider data, and construct longitudinal member encounter views, while optimizing EMR job costs through dynamic resource allocation and spot instance usage.
Created robust SQL and Python-based validation frameworks for NPI verification, null conditioning, field length checks, and temporal integrity, reducing rejected records in downstream ML pipelines by 35% and improving overall data quality for model training.
Developed and version-controlled dimensional marts using dbt in Snowflake and Redshift, building semantic layers like Claim Fact, Risk Score Dim, and Provider Summary while enabling collaborative code reviews, tagging, and CI/CD-based promotion via Git and GitHub.
Prepared curated ML features in Databricks using Delta Tables, integrating claims, lab results, demographics, and NLP-derived entities, which improved recall by 15% across readmission and cost anomaly detection models.
Led experiment tracking and model lifecycle management using MLflow, enabling retraining and reproducible versioning of 20+ models, containerized with Docker and deployed to production via AWS ECS.
Implemented Retrieval-Augmented Generation (RAG) pipelines using Amazon Bedrock and LangChain, delivering context-rich GenAI summaries and automated compliance explanations served through SageMaker endpoints.
Developed interactive Tableau dashboards backed by Snowflake views, surfacing patient stratification, cost anomalies, and compliance flags that empowered care managers and actuaries with actionable insights.
Provisioned and automated infrastructure across dev, test, and prod using Terraform and CloudFormation, reducing deployment time by 60% through CI/CD pipelines in Jenkins and GitHub Actions for ML and data services.
Environment: AWS Cloud • Databricks • Snowflake • Docker • SageMaker • Tableau • SQL
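Illustrative sketch (field names and rules are hypothetical) of the kind of Python validation checks described above, covering NPI format, null handling, field length, and temporal integrity:

# Illustrative record-level validation sketch for gating claims data before it
# reaches ML pipelines. Field names and rules are hypothetical simplifications.
import re
from datetime import date

NPI_PATTERN = re.compile(r"^\d{10}$")  # NPI is a 10-digit identifier

def validate_claim(record: dict) -> list[str]:
    """Return a list of validation errors for a single claim record."""
    errors = []
    if not record.get("member_id"):                          # null / missing check
        errors.append("member_id is missing")
    if not NPI_PATTERN.match(str(record.get("provider_npi", ""))):
        errors.append("provider_npi is not a valid 10-digit NPI")
    if len(str(record.get("diagnosis_code", ""))) > 10:      # field length check
        errors.append("diagnosis_code exceeds max length")
    svc = record.get("service_date")
    if svc and svc > date.today():                           # temporal integrity check
        errors.append("service_date is in the future")
    return errors

# Example usage
bad = validate_claim({"member_id": "M123", "provider_npi": "12345", "service_date": date(2030, 1, 1)})
print(bad)  # ['provider_npi is not a valid 10-digit NPI', 'service_date is in the future']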
Bank of America | Data Engineer | March 2019 - May 2021 | Charlotte, NC
Roles & Responsibilities:
Designed and deployed a hybrid data architecture using AWS S3, Redshift, and Snowflake to centralize ingestion, transformation, and analytics workflows, enabling real-time monitoring and regulatory compliance across financial transaction pipelines.
Built real-time ingestion pipelines with Kafka and Spark Structured Streaming (Scala) to capture high-volume merchant and card swipe activity, enabling sub-second anomaly detection and reducing fraud investigation cycles by 40%.
Engineered scalable workflows in Informatica IICS and PowerCenter to ingest and standardize SWIFT, ACH, and bureau feeds into S3, transforming semi-structured XML/CSV into validated Parquet, which improved data readiness for analytics teams.
Developed AML risk scoring logic using PySpark and SQL, flagging suspicious transactions across geographies and account types, and publishing outputs to Snowflake and Redshift, which accelerated investigator decision-making by 25%.
Orchestrated enrichment workflows using AWS Glue, Step Functions, and DynamoDB to automate KYC and blacklist updates, enabling high-speed NoSQL lookups and reducing SLA violations in batch processing.
Enabled real-time decisioning by exposing curated risk views through Redshift Spectrum, cutting model inference time and improving fraud response accuracy and speed.
Modernized legacy batch pipelines by migrating AML and risk data processing from Hadoop to Snowflake and S3 using Spark and Hive on EMR, which cut reconciliation effort and aligned with cloud transformation goals.
Automated CI/CD pipelines using Terraform, GitHub Actions, and Jenkins to deploy Informatica jobs, Glue workflows, and Redshift schema changes, improving release consistency and reducing manual overhead.
Implemented secure data governance using AWS IAM, Snowflake RBAC, and Informatica Monitor, with CloudTrail-integrated audit trails to support SOX, GDPR, and internal audit requirements.
Delivered actionable insights through Power BI dashboards connected to Snowflake, surfacing fraud exposure, anomaly trends, and data quality metrics that helped compliance analysts reduce backlog and respond faster to risk events.
Environment: AWS Cloud • Snowflake • Databricks (Spark, Hive) • Apache Kafka • DynamoDB • Power BI • SQL
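Illustrative sketch (columns, thresholds, and S3 paths are hypothetical; the production scoring rules were more involved) of the kind of rule-based PySpark risk-flagging logic described above:

# Illustrative PySpark sketch: simple rule-based AML risk flags of the kind
# published to Snowflake/Redshift for investigator review. All names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aml_risk_flags").getOrCreate()

txns = spark.read.parquet("s3://example-bucket/curated/transactions/")  # hypothetical path

scored = (
    txns
    .withColumn("high_value_flag", (F.col("amount") > 10000).cast("int"))
    .withColumn("cross_border_flag",
                (F.col("origin_country") != F.col("destination_country")).cast("int"))
    .withColumn("rapid_movement_flag", (F.col("txn_count_24h") > 20).cast("int"))
    .withColumn("risk_score",
                F.col("high_value_flag") * 40
                + F.col("cross_border_flag") * 30
                + F.col("rapid_movement_flag") * 30)
)

# Persist only the transactions that cross the review threshold.
suspicious = scored.filter(F.col("risk_score") >= 60)
suspicious.write.mode("append").parquet("s3://example-bucket/risk/flagged_transactions/")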
Safeway | Data Engineer | Aug 2017 - Jan 2019 | Pleasanton, CA
Roles & Responsibilities:
Architected a distributed data platform on AWS using S3, EMR, Glue, and Redshift Spectrum to support ingestion and processing of real-time and batch datasets for inventory tracking, sales forecasting, and pricing analytics.
Established foundational Big Data infrastructure with Hadoop, HDFS, Hive, and Spark, enabling scalable data storage and schema-on-read transformations to support retail-wide reporting and analytics.
Ingested structured data from Teradata, Oracle, and SQL Server into S3 and HDFS using Talend and Sqoop, converting to partitioned Parquet and Avro formats for efficient querying and parallel processing.
Enabled streaming data ingestion with Apache Kafka, capturing POS logs and transactional events into S3, powering real-time Spark and Hive pipelines for low-latency processing.
Developed PySpark pipelines on AWS EMR to aggregate and transform inventory, vendor, and pricing data, supporting daily replenishment, stockout detection, and promotional effectiveness analysis.
Built and orchestrated ETL pipelines using AWS Glue and Talend, applying deduplication, schema validation, and format standardization across curated data zones.
Provisioned infrastructure with Terraform, automating deployment of EMR clusters, ECS services, and IAM policies, ensuring consistent infrastructure across environments.
Containerized Spark and transformation workloads using Docker, deploying to AWS ECS for scalable and repeatable execution across staging and production environments.
Configured MongoDB and Elasticsearch as NoSQL stores to manage product metadata, transaction logs, and catalog search indexes, enabling real-time lookups, metadata enrichment, and sub-second product discovery across retail systems.
Automated CI/CD pipelines with Jenkins and GitHub, streamlining the deployment of Spark jobs, Talend packages, and Hive scripts, significantly reducing manual deployment errors.
Modeled Redshift schemas and delivered Tableau dashboards, empowering stakeholders to monitor store performance, vendor KPIs, and sales trends, supporting data-driven retail decisions.
Environment: AWS Cloud • Hadoop (HDFS, MapReduce, Hive) • Spark (PySpark) • Apache Kafka • Tableau • Docker • MongoDB • Elasticsearch • SQL (Teradata, Oracle, SQL Server)
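Illustrative sketch (schema and paths are hypothetical) of the kind of PySpark daily inventory roll-up described above, including a simple stockout flag:

# Illustrative PySpark sketch for EMR: daily inventory aggregation supporting
# replenishment and stockout detection. Columns and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_inventory_rollup").getOrCreate()

inventory = spark.read.parquet("s3://example-retail/curated/inventory/")  # hypothetical path

daily_rollup = (
    inventory
    .groupBy("store_id", "sku", "business_date")
    .agg(
        F.sum("units_on_hand").alias("units_on_hand"),
        F.sum("units_sold").alias("units_sold"),
        F.avg("unit_price").alias("avg_unit_price"),
    )
    .withColumn("stockout_flag", (F.col("units_on_hand") <= 0).cast("int"))
)

(daily_rollup.write
    .mode("overwrite")
    .partitionBy("business_date")
    .parquet("s3://example-retail/marts/daily_inventory_rollup/"))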
FuGenX Technologies | Python Developer | Jan 2015 - June 2017 | Hyderabad, India
Roles & Responsibilities:
Designed and developed Python-based ETL workflows to automate data extraction, transformation, and loading from flat files, APIs, and relational databases, reducing manual processing time by 40%.
Implemented SSIS packages and scheduling scripts to integrate legacy SQL Server and Oracle data, enabling seamless reporting across finance and operations teams.
Optimized SQL queries and stored procedures for reporting systems, significantly improving query response times and reducing load on production databases.
Built log processing frameworks in Python to parse and analyze system/application logs, improving monitoring and reducing troubleshooting time.
Developed Python APIs and scripts for data integration between internal applications and partner systems, ensuring consistency and timely synchronization of data.
Enhanced data quality through validation and cleansing routines in Python, ensuring accurate reporting and analytics for business stakeholders.
Established version control practices using Git and GitLab, maintaining code traceability and improving collaboration across development teams.
Automated deployment and build processes with Jenkins, streamlining release cycles and reducing manual errors.
Collaborated with business analysts to deliver ad-hoc reporting solutions that supported campaign effectiveness tracking and operational decision-making.
Environment: Python • SSIS • SQL Server • Oracle • MySQL • Apache Airflow • Git • GitLab • Jenkins
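Illustrative sketch (endpoints, file names, and target table are hypothetical; SQLite stands in for the actual SQL Server/Oracle targets) of the kind of Python ETL workflow described above:

# Illustrative Python ETL sketch: pull a flat file and an API feed, cleanse the
# records, and load them into a relational staging table. Names are hypothetical.
import csv
import sqlite3       # stand-in target; the actual work used SQL Server/Oracle
import requests

def extract_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_api(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()  # assumed to return a list of record dicts

def transform(rows):
    # Basic cleansing: trim whitespace and drop rows missing the business key.
    cleaned = []
    for row in rows:
        row = {k: (v.strip() if isinstance(v, str) else v) for k, v in row.items()}
        if row.get("order_id"):
            cleaned.append(row)
    return cleaned

def load(rows, conn):
    conn.executemany(
        "INSERT INTO orders (order_id, customer_id, amount) VALUES (?, ?, ?)",
        [(r["order_id"], r.get("customer_id"), r.get("amount")) for r in rows],
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("staging.db")
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer_id TEXT, amount REAL)")
    records = transform(extract_csv("orders.csv") + extract_api("https://example.com/api/orders"))
    load(records, conn)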
EDUCATION
Bachelor of Technology in Computer Science Aug 2011 - June 2015
GITAM University, Hyderabad