Gnanitha G
Senior Data Engineer (AI/ML)
Email: ***********@*****.***
Phone: +1-505-***-****
PROFESSIONAL SUMMARY
Senior AI/ML Data Engineer with 9+ years of experience designing and implementing large-scale data, analytics, and machine learning platforms across healthcare, finance, and retail domains.
Specialized in developing LLM, NLP, and retrieval-augmented generation (RAG) systems using BERT, BioBERT, Flan-T5, GPT-3.5, GPT-4, and LLaMA-2, integrated with LangChain, LangGraph, Pinecone, and FAISS for semantic search, summarization, and contextual retrieval.
Incorporated the Model Context Protocol (MCP) into enterprise AI workflows to standardize tool access, enforce structured prompt evaluation, and enhance the reliability of RAG, agent-based systems, and model orchestration pipelines.
Built vector-embedding pipelines in Databricks and Azure Cognitive Search, enabling low-latency query responses and intelligent document understanding.
Implemented Master Data Management (MDM) practices across key business entities, standardizing schemas, resolving duplicates, and maintaining governed golden records across enterprise systems.
Designed feature-store frameworks in Snowflake, Synapse, and SageMaker, supporting consistent model training, inference, and monitoring of drift and embedding quality with MLflow and LangSmith.
Engineered reusable MLOps pipelines across Databricks, SageMaker, and Airflow, automating model retraining, A/B testing, versioning, and CI/CD deployments.
Deployed containerized inference endpoints on Kubernetes (AKS, EKS) and Fargate, exposing REST and gRPC APIs for scalable, low-latency model access.
Integrated AWS Bedrock and Hugging Face Transformers to unify foundation-model access for summarization, classification, and GenAI-driven analytics.
Experienced in designing reusable AI libraries and optimizing large-scale model pipelines for latency, throughput, and compliance.
Developed FastAPI and Flask services for model serving with schema validation, logging, and role-based authentication to ensure secure access.
Created time-series and forecasting pipelines in Databricks and Synapse for claims, sales, and operational trend prediction, driving data-backed decision making.
Architected cloud-native, multi-cloud data ecosystems across Azure, AWS, and GCP, integrating lakehouses, pipelines, and ML environments for batch and streaming workloads.
Designed and modernized enterprise lakehouse platforms using Azure Synapse, Databricks, Delta Lake, and Microsoft Fabric, bridging data engineering, analytics, and AI under a governed framework.
Developed and optimized data warehouses in Synapse, Redshift, Snowflake, and BigQuery, applying Parquet/ORC formats, partitioning, and caching for cost-efficient performance.
Built large-scale PySpark pipelines handling terabytes of healthcare, retail, and financial data for analytics, forecasting, and ML workloads.
Engineered real-time data streaming architectures using Kafka, Kinesis, and Azure Event Hubs, supporting continuous ingestion, anomaly detection, and live model updates.
Implemented ETL/ELT frameworks using Azure Data Factory, AWS Glue, and GCP Dataflow, standardizing data ingestion from APIs, IoT telemetry, and partner systems.
Migrated Hadoop, Oracle, and SQL Server systems into modern cloud lakehouses, improving scalability, governance, and analytical performance.
Designed relational and NoSQL data models in PostgreSQL, MongoDB, and Cosmos DB for hybrid transactional-analytical workloads.
Optimized Spark and SQL transformations through partition pruning, adaptive caching, and broadcast joins for high-volume data operations.
Implemented Airflow and Cloud Composer for end-to-end orchestration, automating dependencies, SLA monitoring, and lineage tracking across data and ML pipelines.
Automated infrastructure provisioning with Terraform, Azure DevOps, and GitHub Actions, ensuring reproducible environments and secure CI/CD pipelines.
Applied A/B testing, Great Expectations, and dbt-style validation to enforce data quality, schema integrity, and drift detection throughout ETL workflows.
Established unified observability with MLflow, CloudWatch, and Azure Monitor, correlating logs, metrics, and model events for proactive incident response.
Integrated Azure Purview, IAM, and Key Vault for data governance, lineage, and compliance with HIPAA, SOX, and GDPR standards.
Implemented secure tokenization and anonymization frameworks to protect sensitive healthcare and financial data assets.
Delivered interactive Power BI, Tableau, and Plotly dashboards combining curated data and ML insights for operational and executive visibility.
Documented complete ETL, ML, and RAG architectures using Lucidchart and Visio, mapping lineage, orchestration, and compliance flows for audit readiness.
Mentored engineers on Spark optimization, cloud architecture, and reproducible MLOps practices across distributed teams.
Drove agile delivery using JIRA and Azure Boards, coordinating multi-cloud data and AI initiatives from design to production rollout.
TECHNICAL SKILLS
Programming & Frameworks: Python (NumPy, Pandas, PySpark, SQLAlchemy), SQL (T-SQL, PL/SQL), Scala, R, Node.js, Shell Scripting, Flask, FastAPI, Django, NLTK
Databases: PostgreSQL, MySQL, SQL Server, Oracle, Snowflake, Redshift, BigQuery, DynamoDB, Cosmos DB, MongoDB, Cassandra, Pinecone, FAISS
Data Engineering & Warehousing: Apache Spark, Kafka, NiFi, Airflow, AWS Glue, Azure Data Factory, Databricks, Synapse, Data Lake Gen2, Stream Analytics, Event Hubs, Event Grid, Great Expectations, Terraform, Photon Engine, Unity Catalog
Analytics & Visualization: Power BI, Tableau, Looker, Plotly, Excel (VBA, Power Query, Pivot Tables), Visio, Lucidchart
Cloud Platforms:
AWS: S3, Glue, EMR, Redshift, DynamoDB, Lambda, Athena, Fargate, Kinesis, Bedrock, EventBridge, IAM, CloudWatch
Azure: Data Factory, Synapse, Databricks, Data Lake Gen2, Event Hubs, Stream Analytics, Monitor, Purview, Active Directory, Key Vault, AKS, DevOps, Microsoft Fabric
GCP: BigQuery, Dataflow, Composer, Pub/Sub, Cloud Functions, Data Catalog, Cloud Monitoring, App Engine
Machine Learning & AI: Scikit-learn, XGBoost, TensorFlow, PyTorch, MLflow, SageMaker, spaCy, NLTK, BERT, BioBERT, RoBERTa, Flan-T5, LLaMA-2, Sentence-BERT, GPT, LangChain, LangGraph, LangSmith, RAG, Vision Transformer, SHAP, Feature Store Design, Drift Detection, A/B Testing
DevOps & CI/CD: Docker, Kubernetes (EKS, AKS), Terraform, Jenkins, Git, GitHub Actions, Azure DevOps, CodePipeline
Governance & Security: Azure Purview, Great Expectations, Key Vault, Active Directory, IAM, SOX, HIPAA, GxP, GDPR, Data Lineage & Audit Logging
Methodologies & Tools: Agile/Scrum, Waterfall, JIRA, Azure Boards, Confluence
Additional Tools: VS Code, JupyterLab, MS Office Suite (Excel, PowerPoint, Word, Visio)
WORK EXPERIENCE
Centene Corporation, NYC Sept 2023 – Present
Role: Sr. AI/ML Data Engineer
Responsibilities:
Designed and developed large-scale ETL and ELT pipelines using Azure Data Factory, Databricks, and PySpark, ingesting data from HL7/FHIR APIs, laboratory systems, and IoT telemetry.
Built automated pipelines to generate and curate training datasets for LLM and RAG workflows, including chunking, embedding generation, and metadata enrichment (illustrated in the sketch after this list).
Engineered and optimized feature-engineering workflows for patient-risk, fraud detection, and outcome forecasting models, versioned and tracked through MLflow feature stores.
Developed semantic-search and entity-linking layers using Pinecone, Azure Cognitive Search, and FAISS, enabling cross-dataset retrieval of medical entities.
Applied BERT, BioBERT, ClinicalBERT, Hugging Face transformer models, and NLTK to extract terminology and relationships from unstructured clinical notes, enhancing the accuracy of downstream analytics.
Fine-tuned DeBERTa-v3 and orchestrated generative AI workflows with GPT-4 for text summarization, clinical document classification, and contextual report generation for care-management teams.
Integrated AWS Bedrock with LLaMA-2, Claude, and domain-tuned Hugging Face models to support summarization, classification, and retrieval tasks across healthcare datasets.
Used LangGraph, LangSmith, and MCP to standardize tooling for RAG evaluation, prompt tuning, and explainability of model responses within Databricks environments.
Implemented prompt-engineering standards and safety guardrails for clinical LLM outputs, combining safety filters, structured prompt tests, and embedding-based consistency checks to reduce hallucinations and enforce domain constraints.
Combined Watson Assistant and Azure Cognitive Services to process imaging metadata and text concurrently, improving cross-modal data comprehension.
Automated MLOps orchestration using Airflow, Terraform, and MLflow, implementing retraining triggers, A/B testing, and drift detection workflows.
Managed model lifecycle and releases through MLflow Model Registry, coordinating approvals, versioning, staging flows, and production rollouts.
Developed low-latency inference endpoints via Lambda, Fargate, and EventBridge, supporting clinical-alerting pipelines and model-driven insights.
Modeled and optimized Azure Synapse schemas and Data Lake Gen2 storage zones for analytics and ML workloads, applying partitioning, caching, and columnar compression to accelerate query performance.
Implemented a Lakehouse framework with Delta Lake and Microsoft Fabric integration, connecting Synapse, Data Factory, and Power BI to deliver GenAI-ready data pipelines, unified data access, and governance.
Designed ingestion and transformation layers across Cosmos DB, DynamoDB, and relational stores, applying MDM-driven standardization, entity resolution, and metadata governance to maintain consistent patient, provider, and account records.
Built real-time pipelines with Event Hubs, Kafka, Kinesis, and Stream Analytics, enabling continuous processing of device and claims streams at sub-minute latency.
Established data-quality automation using Python, Great Expectations, and validation rule sets for HIPAA and GxP compliance.
Defined SLIs and SLOs across the AI pipeline to track latency, throughput, drift patterns, and uptime expectations, enabling consistent operational performance.
Integrated Azure Purview and Key Vault for secure credential management, lineage visibility, and role-based policy enforcement across data domains.
Implemented data observability dashboards with Azure Monitor, CloudWatch, and MLflow, surfacing drift, anomaly, and freshness metrics for production pipelines.
Created governed data marts in Snowflake and Synapse, implementing materialized views and incremental loads for near real-time analytics.
Implemented CI/CD pipelines with GitHub Actions, CodePipeline, and Azure DevOps, automating testing, deployment, and rollback of data and model assets.
Containerized data APIs and ETL microservices using Docker and AKS, improving scalability and isolation for processing workloads.
Built interactive Power BI and Tableau dashboards visualizing ML predictions, trends, and operational KPIs for decision support.
Designed unified observability dashboards correlating PySpark logs, streaming metrics, and model events to enable proactive incident response.
Documented data and AI architectures using Visio and Lucidchart, mapping pipeline lineage, compliance flows, and model dependencies.
Worked closely with data-governance teams to uphold ethical AI and transparency principles, ensuring responsible and auditable model use across healthcare domains.
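A minimal sketch of the chunk-embed-retrieve flow behind the RAG dataset and semantic-search work above, assuming sentence-transformers and FAISS; the embedding model, chunk sizes, and document texts are illustrative placeholders rather than project specifics.
```python
# Minimal chunk -> embed -> retrieve sketch; model name, chunk size, and documents are illustrative.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character chunks for embedding."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")                  # placeholder embedding model
docs = ["...clinical note text...", "...lab report text..."]     # hypothetical source documents
chunks = [c for doc in docs for c in chunk_text(doc)]

embeddings = model.encode(chunks, normalize_embeddings=True)      # (n_chunks, dim) vectors
index = faiss.IndexFlatIP(embeddings.shape[1])                    # inner product = cosine on normalized vectors
index.add(np.asarray(embeddings, dtype="float32"))

query = model.encode(["patient risk factors"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 3)
context = [chunks[i] for i in ids[0]]                             # top chunks passed to the LLM prompt
```
In production the same pattern targets a managed vector store such as Pinecone or Azure Cognitive Search rather than an in-memory FAISS index.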
Environment: Python, SQL, Azure (Data Factory, Synapse, Databricks, Data Lake Gen2, Event Hubs, Stream Analytics, Purview, Key Vault, Monitor, AKS), AWS (Lambda, Fargate, EventBridge, Bedrock, Kinesis), Snowflake, PySpark, Delta Lake, Microsoft Fabric, Airflow, MLflow, Terraform, Great Expectations, Kafka, BERT, BioBERT, GPT-4, LLaMA-2, LangChain, LangGraph, LangSmith, Pinecone, FAISS, Docker, Power BI, Tableau, Visio, Lucidchart.
State of California, Los Angeles, CA Jun 2022 – Aug 2023
Role: ML Data Engineer
Responsibilities:
Built and maintained cloud-based data pipelines on GCP using BigQuery, Dataflow, and Cloud Composer, integrating operational and program datasets across internal enterprise systems.
Developed event-driven ingestion and validation services using Node.js (TypeScript) on GCP Cloud Functions, processing Pub/Sub events and enforcing schema and quality checks before downstream ingestion.
Created curated analytics zones in BigQuery using partitioning and clustering to enable cost-efficient analytics and historical audits.
Developed predictive analytics workflows in TensorFlow and Scikit-learn for program forecasting, resource allocation, and citizen-outcome modeling.
Built feature-engineering pipelines that automated variable computation and dataset versioning across core data sources.
Transformed nested JSON using MongoDB aggregation and PostgreSQL to produce ML-ready, standardized datasets consumed by forecasting, NLP, and evaluation pipelines.
Used Vertex AI to automate model training, tuning, and deployment pipelines, enabling scalable retraining and standardized ML lifecycles.
Developed evaluation and monitoring workflows using Python and MLflow, tracking drift, stability, and threshold performance.
Integrated SHAP-based explainability dashboards to ensure transparency and fairness in AI predictions.
Implemented NLP pipelines using BERT, RoBERTa, and NLTK to classify documents, extract entities, and analyze stakeholder feedback.
Fine-tuned Flan-T5 to summarize regulatory and compliance documents for internal decision support.
Deployed Sentence-BERT semantic search services on BigQuery and Cloud SQL for contextual information retrieval.
Integrated a LangChain-based RAG solution with GPT-3.5, enabling secure natural-language querying of internal policies.
Implemented a proof-of-concept Vision Transformer (ViT) model for anomaly detection on inspection imagery and scanned documents.
Orchestrated ETL and ML workloads using Airflow and Cloud Composer, automating refresh cycles, dependency management, and retraining triggers (illustrated in the sketch after this list).
Applied Great Expectations and custom Python validations for continuous data-quality checks and anomaly detection before ML consumption.
Containerized ML microservices using Docker on Google App Engine and Linux VMs, exposing inference via REST and gRPC.
Implemented data observability and lineage tracking using Cloud Monitoring, Data Catalog, and MLflow, capturing latency, schema drift, and pipeline health.
Designed secure, schema-standardized workflows with role-based access controls to enforce HIPAA and privacy requirements.
Built interactive dashboards in Power BI, Looker, and Plotly for operational and executive visibility.
Collaborated with compliance stakeholders to enforce AI ethics guidelines and model audit logging aligned with governance standards.
Documented end-to-end data and ML architectures in Lucidchart, detailing orchestration flows, lineage, and governance processes.
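A minimal Airflow DAG sketch of the refresh-validate-retrain orchestration described above; the DAG name, task bodies, and schedule are assumptions for illustration.
```python
# Illustrative Airflow DAG: extract -> validate -> retrain; names and schedule are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_data(**_):
    """Pull curated tables (e.g., from BigQuery) into a staging area."""

def validate_data(**_):
    """Run data-quality suites (e.g., Great Expectations) and fail on violations."""

def retrain_model(**_):
    """Kick off a training job (e.g., on Vertex AI) and log metrics to MLflow."""

with DAG(
    dag_id="forecasting_refresh",          # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule="@daily",                     # Airflow 2.4+ style; older versions use schedule_interval
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_data)
    validate = PythonOperator(task_id="validate", python_callable=validate_data)
    retrain = PythonOperator(task_id="retrain", python_callable=retrain_model)

    extract >> validate >> retrain
```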
Environment: Python, SQL, GCP (BigQuery, Dataflow, Composer, Pub/Sub, Cloud Functions, Data Catalog, Cloud Monitoring), PostgreSQL, MongoDB, TensorFlow, Scikit-learn, MLflow, Airflow, Great Expectations, SHAP, spaCy, NLTK, BERT, RoBERTa, Flan-T5, Sentence-BERT, LangChain, GPT-3.5 API, Vision Transformer, Docker, Google App Engine, Power BI, Looker, Plotly, Lucidchart, Linux.
Safeway, Pleasanton, CA Dec 2020 – May 2022
Role: Cloud Data Engineer
Responsibilities:
Migrated large on-prem Oracle and MySQL systems into Azure Data Lake Gen2 and Azure Synapse, improving scalability and reducing reliance on manual file movement across retail data domains.
Designed and automated Delta Lake-based ETL pipelines using Azure Data Factory, Databricks, and Synapse pipelines, applying Photon execution engine and Unity Catalog for high-performance transformations and governed data access.
Orchestrated high-volume data ingestion from retail POS, inventory systems, and vendor APIs using Apache NiFi, enforcing schema consistency and secure routing into Bronze zones in ADLS.
Built and integrated real-time streaming pipelines using Azure Event Hubs, Stream Analytics, Kafka, and Databricks Structured Streaming, enabling sub-minute visibility into sales, pricing, and inventory movements across retail systems.
Built and standardized Lakehouse zones (Bronze, Silver, Gold) using Delta Lake format with Z-ordering and partition pruning, improving query performance and enabling time-travel for data audits (a Bronze-to-Silver sketch follows this list).
Optimized large-scale PySpark transformations for compute-intensive workloads, reducing processing costs while maintaining SLAs for daily merchandising and operational pipelines.
Implemented data-quality and validation checks in Python and SQL, including null detection, schema drift handling, and duplicate control.
Developed feature-engineering and enrichment pipelines in Databricks for sales forecasting, demand planning, and churn analysis.
Applied time-series modeling techniques using Python and PySpark to improve forecast accuracy for SKU-level sales and regional inventory trends.
Optimized Databricks Photon clusters and adaptive caching strategies to reduce compute costs by 30% while maintaining SLA adherence for high-volume data processing.
Built and tracked ML experiments in MLflow, applying Scikit-learn and XGBoost for model training and versioning.
Integrated Vertex AI for production-grade model serving, enabling autoscaling and versioned deployment of Databricks-trained models.
Implemented data governance and access controls with Azure Active Directory and Purview to enforce privacy and audit requirements.
Implemented a data observability layer using Azure Monitor, Log Analytics, and MLflow metrics, providing visibility into latency, schema drift, and pipeline health.
Automated infrastructure provisioning with Terraform, ensuring consistent configuration and reproducible environments.
Deployed containerized data services on Azure Kubernetes Service (AKS) to handle dynamic workloads and reduce operational overhead.
Conducted end-to-end testing of ETL pipelines, model scoring, and report generation using pytest and custom validation scripts.
Designed interactive Power BI and Tableau dashboards, translating curated data into business insights and performance reports.
Tuned SQL queries in Synapse and MySQL, improving report latency and data retrieval times for business users.
Participated in Agile sprints using JIRA for planning, code reviews, and backlog management, ensuring smooth coordination across teams.
Documented data pipelines, architecture, and testing workflows in Lucidchart to maintain transparency and onboarding readiness.
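A minimal PySpark sketch of a Bronze-to-Silver Delta Lake refinement step like those described above, assuming a Spark session with Delta Lake enabled; paths, column names, and the partition key are illustrative.
```python
# Bronze -> Silver refinement: dedupe, basic null handling, partitioned Delta write.
# Paths and columns are hypothetical; assumes a Spark session with Delta Lake configured.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze_to_silver").getOrCreate()

bronze = spark.read.format("delta").load("/mnt/lake/bronze/pos_sales")

silver = (
    bronze
    .dropDuplicates(["transaction_id"])              # duplicate control
    .filter(F.col("store_id").isNotNull())           # drop records missing a store reference
    .withColumn("sale_date", F.to_date("sale_ts"))   # derive the partition column
)

(
    silver.write.format("delta")
    .mode("overwrite")
    .partitionBy("sale_date")                        # enables partition pruning for downstream queries
    .save("/mnt/lake/silver/pos_sales")
)
```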
Environment: Python, PySpark, SQL, Azure (ADF, Databricks, Synapse Analytics, Data Lake Gen2, Event Hubs, Stream Analytics, Functions, Event Grid, Active Directory, Purview, Monitor), Delta Lake, Unity Catalog, Photon Engine, Apache NiFi, Kafka, MLflow, Scikit-learn, XGBoost, Terraform, AKS, Power BI, Tableau, MySQL, Log Analytics, pytest, JIRA, Lucidchart.
Edward Jones, St. Louis, MO Jan 2019 – Nov 2020
Role: Data Engineer
Responsibilities:
Modernized legacy on-prem and Hadoop data platforms by migrating workloads to a hybrid cloud architecture using Oracle Cloud and AWS, improving reliability, scalability, and operational resilience.
Stabilized the existing Cloudera Hadoop ecosystem (Hive, Sqoop, YARN, HBase) during the transition period, ensuring uninterrupted reporting and regulatory compliance while migration plans were executed.
Migrated large-scale Hive workloads to Apache Spark (Scala, PySpark) on AWS EMR, reducing batch runtimes from hours to minutes and improving processing performance by nearly 70%.
Designed hybrid data-storage patterns using AWS Glue Catalog for metadata management, consolidating legacy Cassandra and HBase workloads into DynamoDB (OLTP) and Redshift (OLAP) to balance real-time and analytical use cases.
Built end-to-end ETL pipelines using Spark, AWS Glue, and Apache Airflow to ingest data from Oracle, SQL Server, and Salesforce into S3 and Redshift for enterprise analytics.
Integrated Salesforce CRM data with financial and risk systems to create unified client and transaction views, enabling downstream analytics and fraud detection.
Designed modular Spark SQL transformation frameworks with embedded audit metadata and lineage tracking to support internal reviews and compliance requirements.
Developed reusable Python UDFs for data cleansing, standardization, and schema normalization across JSON, XML, and relational datasets (illustrated in the sketch after this list).
Applied R and Python for early-stage statistical validation and risk-metric checks, supporting downstream model tuning and financial analysis.
Optimized Spark and Redshift workloads through partitioning, caching, and indexing strategies, significantly improving query response times and storage efficiency.
Replaced legacy Oozie workflows with Apache Airflow, adding monitoring, alerting, and retry logic that reduced manual operational effort by approximately 40%.
Implemented near real-time streaming pipelines using Apache Kafka coordinated through Zookeeper, enabling event ingestion and continuous data processing.
Integrated Apache Flink for select low-latency streaming workloads, supporting real-time risk signals and time-sensitive financial events beyond traditional batch and micro-batch processing.
Implemented role-based access controls and data-masking policies using Active Directory, AWS IAM, and Redshift security features to meet SOX and GDPR compliance standards.
Collaborated with security and infrastructure teams to align IAM policies across AWS, Oracle Cloud, and on-prem systems, ensuring consistent auditability and access governance.
Worked with Oracle Cloud teams to synchronize financial-reporting datasets with Redshift and internal dashboards for consistent performance tracking.
Created and maintained PL/SQL procedures, triggers, and functions to manage high-volume transactional feeds while enforcing data integrity constraints.
Built Jenkins-based CI/CD pipelines for testing and deploying Spark and ETL workloads, containerized data-processing services using Docker, and deployed them on Kubernetes (EKS) to improve scalability and release reliability.
Delivered executive and operational dashboards in Power BI backed by Redshift and Spark, while tracking delivery through JIRA and documenting ETL, governance, and lineage workflows in Lucidchart for audit readiness.
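A minimal sketch of the kind of reusable PySpark cleansing UDF described above; the source path and column names are hypothetical.
```python
# Reusable PySpark cleansing UDF sketch; source path and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("cleansing_udfs").getOrCreate()

@F.udf(returnType=T.StringType())
def normalize_account_id(raw):
    """Strip non-alphanumeric characters and upper-case a free-form account identifier."""
    if raw is None:
        return None
    return "".join(ch for ch in raw if ch.isalnum()).upper()

accounts = spark.read.json("s3://example-bucket/raw/accounts/")    # hypothetical landing zone
clean = accounts.withColumn("account_id", normalize_account_id(F.col("account_id_raw")))
```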
Environment: Python, PySpark, SQL, PL/SQL, Apache Spark, Airflow, Kafka, AWS (S3, EMR, Redshift, DynamoDB, EKS, IAM, Glue, CloudFormation), Oracle Cloud, Jenkins, Docker, Kubernetes, Power BI, Cassandra, HBase, Salesforce, Lucidchart.
Dell Technologies, Bengaluru, India July 2016 – Nov 2018
Role: Associate Python Developer
Responsibilities:
Conducted comprehensive EDA using Python, NumPy, and Pandas to identify outliers and inconsistencies, reducing data-processing errors by 20% and improving overall data quality.
Streamlined ETL pipelines through SQL and Hive queries across Oracle and SQL Server systems, automating data extraction and transformation to support four cross-functional analytic reports and raise productivity by 20%.
Optimized SQL Server and Oracle queries (T-SQL, PL/SQL) using indexing, partitioning, and tuning techniques, cutting execution times by 20% and accelerating downstream analytics.
Supported cloud-based data processing by staging extracted datasets in AWS S3 and executing Python-based validation scripts on EC2, helping evaluate cloud feasibility for future analytics workloads.
Worked with Pandas datetime and NumPy arrays for time-series and large-scale data analysis, improving trend forecasting accuracy by 15% and processing speed by 40% (illustrated in the sketch after this list).
Generated ad-hoc reports and performed on-demand SQL analyses, shortening business query response time by 30% and enhancing decision agility.
Developed BI visualizations in Tableau and Excel, designing interactive charts, graphs, and maps that improved engagement and insight delivery across finance and sales teams.
Automated Excel workflows with VBA macros, saving approximately 10 hours per week; implemented Git version control and migrated cron jobs to an Airflow pilot, improving visibility and collaboration.
Implemented AWS IAM and CloudTrail for access management and API monitoring, strengthening data-security compliance and audit traceability across projects.
Established governance standards by creating a data dictionary and defining data-management procedures, enhancing metadata consistency and accuracy.
Delivered validated ETL and reporting cycles within a Waterfall framework, collaborating with analysts and QA teams for full sign-off and release readiness.
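A minimal pandas sketch of the time-series trend analysis described above; the input file and column names are illustrative assumptions.
```python
# Weekly aggregation and moving-average trend on a daily sales extract; file and columns are hypothetical.
import pandas as pd

sales = pd.read_csv("daily_sales.csv", parse_dates=["order_date"])
sales = sales.set_index("order_date").sort_index()

weekly = sales["revenue"].resample("W").sum()       # roll daily revenue up to weekly totals
trend = weekly.rolling(window=4).mean()             # 4-week moving average as a simple trend signal
print(trend.tail())
```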
Environment: SQL, Python, IAM, Tableau, NumPy, CloudTrail, API, Oracle, ETL.
EDUCATION
Bachelor of Technology in Electronics and Communication Engineering from VRSEC, Vijayawada, India - 2016