Aravind Datla
+1-860-***-**** | *******.***.*@*****.*** | www.linkedin.com/in/datla-aravind-6229a6204
Senior Data Engineer | AI/ML Engineer | Data Scientist
PROFESSIONAL SUMMARY:
Senior Data Engineer with 10+ years of experience in designing and implementing scalable data engineering, AI/ML, and GenAI solutions across healthcare, banking, automotive, and financial services domains.
Expertise in cloud platforms including GCP (BigQuery, Dataflow, Dataproc, Vertex AI, Pub/Sub, Cloud Composer, BigLake, Looker), Snowflake (Snowpipe, Streams, Snowpark, Tasks), Azure (ADF, Databricks, Event Hubs, Data Lake Gen2, Azure ML), and AWS (Glue, S3, Redshift, DynamoDB, Athena, Step Functions).
Strong hands-on experience building data pipelines for batch and real-time ingestion using Pub/Sub, Kafka, Dataflow, Snowpipe, Glue, NiFi, and Airflow, processing 10TB+ data monthly.
Skilled in BigQuery + Snowflake integration for advanced analytics, ML-ready datasets, and cost-optimized storage/compute strategies.
Implemented real-time ingestion frameworks with Pub/Sub, Kafka, and Snowflake Streams/Tasks, enabling time-sensitive decision-making in healthcare and finance.
Experienced in ETL/ELT frameworks using DBT, PySpark, Scala, SQL, Informatica, Talend, SSIS, and Apache NiFi.
Optimized data lakehouse architectures on BigLake, GCS, Snowflake, and Delta Lake, improving query performance by up to 40% and reducing storage costs.
Designed transformation workflows using DBT, PySpark, BigQuery SQL, and Snowflake SQL, reducing ETL processing times by 30%.
Extensive experience in data modeling, SQL optimization, schema design, and creating analytics-ready data marts.
Built and deployed predictive ML models (fraud detection, credit risk scoring, pharmacy demand forecasting, patient churn) using XGBoost, LightGBM, Random Forest, TensorFlow, and PyTorch.
Hands-on with deep learning (CNNs, BioBERT, Hugging Face Transformers) for medical note classification, OCR digitization, and anomaly detection.
Fine-tuned open-source LLMs (LLaMA-2, Falcon) and built GPT-4 + LangChain pipelines for GenAI-powered RAG solutions, claims summarization, and healthcare-specific knowledge assistants.
Built MLOps pipelines with Vertex AI, Kubeflow, and MLflow, integrated with Snowflake ML (Snowpark), for automated training, deployment, and monitoring.
Proficient in hyperparameter tuning using Optuna and Hyperopt, improving ML model performance while reducing training times.
Applied model monitoring and drift detection using Cloud Monitoring, BigQuery ML Monitoring, Snowflake Observability, and EvidentlyAI.
Strong knowledge of data governance, PHI/PII masking, RBAC, IAM, and encryption (CMEK), ensuring compliance with HIPAA, HITRUST, SOX, PCI-DSS, and OCC guidelines.
Experienced in DevOps and CI/CD using Terraform, GitHub Actions, GitLab, Azure DevOps, Cloud Build, and Docker for automated data pipeline deployments.
Designed and deployed containerized microservices with Docker on AWS EC2 and GKE, improving modularity and scalability of data processing.
Delivered cross-cloud data interoperability by integrating Azure Data Factory + Databricks with GCP (BigQuery) and Snowflake, enabling enterprise-wide ML-driven applications.
Experienced with BI & visualization tools including Tableau, AWS QuickSight, Looker, and Power BI, creating actionable dashboards for business stakeholders.
Proven experience leading enterprise data lake/warehouse initiatives, integrating 100+ disparate data sources at CVS and 50+ financial sources at Capital One.
Strong leadership and mentorship experience, guiding junior engineers on data engineering best practices, ML deployment, cloud integrations, and GenAI adoption.
Adept at working in agile, cross-functional teams, driving collaboration across data engineering, data science, and business units, consistently delivering projects 15–20% faster.
TECHNICAL SKILLS:
Cloud Platforms: Azure (Data Factory, Databricks, Event Hubs, Data Lake Gen2, Delta Lake, Azure ML, Azure DevOps, Azure Monitor, Key Vault, Active Directory), AWS (Glue, S3, Redshift, DynamoDB, Athena, EMR, EC2, ECS, Lambda, Step Functions, SNS, SQS, EFS, Glacier, CodePipeline, CodeDeploy, CloudWatch), GCP (BigQuery, Dataflow, Dataproc, Pub/Sub, Vertex AI, Cloud Composer, Cloud Storage, BigLake, Data Fusion, Looker, GKE)
Data Engineering & ETL: Apache Spark (PySpark, Scala), Apache Kafka, Apache Airflow, Apache NiFi, Informatica, Talend, SSIS, Apache Oozie, Hadoop (HDFS, Hive, MapReduce), DBT
Databases & Data Warehousing: SQL, PostgreSQL, MySQL, Snowflake (Snowpipe, Streams, Snowpark, Tasks), Amazon Redshift, Azure Synapse, DynamoDB, Hive, Parquet
Programming & Scripting: Python, Scala, SQL
AI/ML & GenAI: Scikit-learn, TensorFlow, PyTorch, Hugging Face Transformers, BioBERT, CNNs, XGBoost, LightGBM, Random Forest, Optuna, Hyperopt, LangChain, LLaMA-2, Falcon, Azure OpenAI (GPT-4), RAG-based GenAI Solutions, EvidentlyAI, MLflow
Business Intelligence & Visualization: Tableau, AWS QuickSight, Looker, Power BI (basic exposure)
DevOps, CI/CD & Containerization: Terraform, Git, GitHub Actions, GitLab, Jira, Docker, Azure DevOps
Data Governance & Quality: Great Expectations, Data Lineage & Metadata Tracking, PHI/PII Masking, RBAC, Compliance with HIPAA, SOX, PCI-DSS, HITRUST
Other Tools: Azure Event Hubs, Azure Cognitive Search, Pinecone/Weaviate (Vector Databases)
PROFESSIONAL EXPERIENCE:
Client: CVS Health, Woonsocket, RI JAN 2024 - PRESENT
ROLE: SENIOR DATA ENGINEER – AI/ML, GenAI
Roles & Responsibilities:
Designed and implemented scalable data pipelines and DAG-based workflows using Dataflow (Apache Beam), Dataproc (PySpark/Scala), BigQuery, and Snowflake (Snowpipe/Streams) to process 10TB+ of structured/unstructured healthcare data monthly for analytics and AI workloads.
Built real-time ingestion frameworks with Pub/Sub, Kafka, and Snowflake Streams/Tasks, enabling streaming of pharmacy transactions and claims data to support time-sensitive healthcare applications.
Optimized data lake and warehouse architecture (BigLake, GCS, BigQuery, Snowflake, Delta Lake), improving query speed by 40% and reducing storage costs while preparing datasets for ML and BI reporting.
Engineered data transformation workflows with DBT, PySpark, BigQuery SQL, and Snowflake SQL, reducing ETL/ELT processing times by 30% and enabling faster availability of curated datasets.
Developed predictive ML models for pharmacy demand forecasting, patient churn, and claims fraud detection using XGBoost, LightGBM, and Random Forest, improving forecast accuracy by 20%.
Deployed deep learning models (BioBERT, GPT-based transformers, CNNs) for medical note classification and OCR-based digitization of handwritten prescriptions, reducing manual processing by 40%.
Implemented hyperparameter tuning with Optuna and Hyperopt, improving model performance while cutting training times by 25%.
Built MLOps pipelines with Vertex AI, Kubeflow, and MLflow, integrated with Snowflake ML (Snowpark), to automate training, deployment, and monitoring of ML models, ensuring reproducibility and reducing deployment cycles by 50%.
Introduced LLM-based GenAI solutions using GPT-4 (Azure OpenAI), LangChain, and Pinecone/Weaviate for vector search, enabling RAG-driven assistants for customer service and internal pharmacy knowledge search.
Fine-tuned domain-specific LLMs (LLaMA-2, Falcon) on CVS’s internal healthcare corpus (EHR notes, claims data), improving medical terminology understanding by 30%.
Implemented document summarization workflows with GPT-4 for claims appeals and prior authorization requests, reducing manual review time by 35%.
Collaborated with data scientists to build risk prediction models (readmission risk, medication adherence classification) leveraging scikit-learn, TensorFlow, and PyTorch.
Established model monitoring and drift detection using Cloud Monitoring, BigQuery ML Monitoring, Snowflake Observability, and EvidentlyAI, ensuring AI/ML models maintained performance over time in production.
Applied governance and security controls (IAM, CMEK encryption, PHI/PII masking) across all GCP, BigQuery, Snowflake, and AI pipelines, ensuring compliance with HIPAA and HITRUST.
Automated deployment and orchestration of DAG-driven data/ML workflows using Terraform, Cloud Build, and GitHub Actions, reducing human error and increasing reliability.
Contributed to CVS’s enterprise data lake + warehouse initiative on GCP (BigQuery) + Snowflake, integrating 100+ disparate data sources into a unified analytics and AI-ready platform, accelerating BI adoption by 15%.
Leveraged Azure Data Factory for ingestion and orchestration of external partner data into Snowflake & BigQuery platforms, ensuring cross-cloud data interoperability.
Built Databricks-based ML pipelines on Azure integrated with Snowflake and BigQuery, powering customer-facing ML-driven applications and process automation.
Led cross-functional teams in agile delivery of AI/ML and GenAI projects, improving collaboration between data engineering, data science, and business teams, delivering projects 20% faster.
Mentored junior engineers on GCP (BigQuery) + Snowflake best practices, ML deployment strategies, and GenAI integration techniques, strengthening team expertise and delivery capacity.
Environment: GCP (BigQuery, Dataflow, Pub/Sub, Dataproc, Vertex AI, Cloud Composer, BigLake, GCS, Looker, Cloud Build, Cloud Monitoring, IAM), Snowflake (Snowpipe, Streams, Snowpark, Tasks), MLflow, LangChain, LLaMA-2, Falcon, Hugging Face, Optuna, PyTorch, TensorFlow, Pinecone/Weaviate, Terraform, GitHub Actions, Docker, GKE, Azure Data Factory, Azure Databricks
CLIENT: CAPITAL ONE, MCLEAN, VA SEP 2021 - JAN 2024
ROLE: DATA ENGINEER/ANALYST
Roles & Responsibilities:
Engineered ETL pipelines with AWS Glue, Apache NiFi, Informatica, and SSIS, processing 5TB+ financial data weekly to support analytics and regulatory reporting.
Built real-time ingestion pipelines with Apache Kafka, AWS Step Functions, and Apache Airflow, enabling near-instant data availability for fraud detection and credit risk monitoring.
Optimized storage and retrieval on AWS S3, Redshift, and DynamoDB, cutting query response times by 30% for high-volume datasets.
Developed PySpark/Scala workflows on AWS EMR, reducing ETL runtimes by 25% and ensuring SLA compliance for nightly batch jobs.
Orchestrated large-scale transaction processing with AWS Batch, streamlining workflows for credit card and compliance datasets.
Integrated ML models (fraud detection, credit scoring) into pipelines using Python, PyTorch, and scikit-learn, improving predictive accuracy of risk models.
Deployed containerized services on AWS ECS and Elastic Beanstalk, enabling scalable APIs for fraud and portfolio risk applications.
Implemented AWS SQS and SNS pipelines to decouple ingestion from downstream analytics, increasing fault tolerance and improving alerting reliability.
Designed SQL/PostgreSQL marts to support BI dashboards, accelerating decision-making for risk and finance teams.
Created dashboards in AWS QuickSight and Tableau, visualizing delinquency and fraud anomalies, boosting BI adoption by 20% across business units.
Utilized AWS EFS for shared storage across Spark clusters and AWS Glacier for long-term archival of compliance datasets, lowering storage costs by 40%.
Automated CI/CD for data pipelines using AWS CodePipeline, CodeDeploy, Git, and Terraform, reducing release cycles by 35%.
Deployed modular microservices on AWS EC2 with Docker, improving throughput and scalability of financial data processing.
Monitored pipelines with AWS CloudWatch and Airflow logs, performing root cause analysis and corrective actions to maintain 99.9% uptime for critical workflows.
Partnered with data scientists to productionize fraud and credit risk models, aligning engineering pipelines with analytical requirements.
Strengthened governance with AWS IAM, RBAC, and encryption, ensuring compliance with SOX and PCI-DSS regulations.
Automated infrastructure provisioning via Terraform, reducing manual deployment errors and accelerating delivery.
Mentored junior engineers on SQL optimization, Spark, and AWS data engineering best practices, enhancing team delivery capacity.
Environment: AWS Glue, AWS S3, Apache NiFi, Informatica, Talend, SSIS, Apache Spark (PySpark/Scala), Apache Kafka, Apache Airflow, AWS Step Functions, AWS Athena, AWS Redshift, AWS DynamoDB, PostgreSQL, SQL, Parquet, Python, PyTorch, scikit-learn, AWS QuickSight, Tableau, Docker, AWS EC2, Terraform, Git, Jira, AWS IAM, AWS CloudWatch.
CLIENT: FORD, DEARBORN, MI DEC 2019 - AUG 2021
ROLE: BIG DATA ENGINEER
Roles & Responsibilities:
Engineered real-time ingestion pipelines for fleet telematics data (sensors, dashcams, EV chargers) using Kafka and the Hadoop Distributed File System (HDFS), processing 10K+ events per second.
Built and maintained a high-volume Hadoop data lake, integrating Apache Hive tables to store and query 500B+ telemetry records per month.
Developed orchestration workflows in Apache Airflow and Apache Oozie to schedule large-scale ETL/ELT jobs, ensuring on-time delivery for downstream analytics.
Created fleet-level Tableau dashboards and MySQL-backed APIs surfacing speeding, idling, and maintenance deviations for over 1M connected vehicles.
Designed MapReduce jobs for batch processing historical telemetry datasets, reducing aggregation runtime from 8 hours to 2 hours.
Implemented anomaly detection models in Python with scikit-learn, flagging unusual EV charging behaviors and preventing system overloads.
Enabled near-real-time alerts via Kafka topics and Airflow triggers, improving response time to mechanical issues by 45%.
Used Hive and SQL for fleet maintenance forecasting models, cutting unplanned downtime by 18%.
Tuned Hive partitioning and bucketing strategies to optimize query performance and reduce compute cost by 40%.
Applied Great Expectations validations on Hive and MySQL data sources, catching 50K+ corrupt records per month.
Established data lineage tracking for Hadoop jobs using Airflow metadata and GitLab CI/CD pipelines.
Built MapReduce workflows for usage-based analytics, improving the accuracy of fuel efficiency reports by 22%.
Automated driver behavior notification triggers via Kafka to Python microservices, improving compliance by 20%.
Developed Python feature extraction libraries for speeding, braking, and idling metrics, cutting new model development time by 35%.
Collaborated with security teams to manage IAM roles for Hadoop clusters and ensure compliance for sensitive telematics data.
Integrated MySQL as a lightweight operational datastore for live fleet status APIs, reducing API response times by 50%.
Leveraged GitLab version control and CI/CD pipelines to deploy ETL scripts, ensuring consistent and audited releases.
Trained 15+ analysts on HiveQL, Airflow, and MapReduce for large-scale telematics data analysis.
Environment: Hadoop, Apache Hive, Apache Oozie, Kafka, MapReduce, Apache Airflow, GitLab, SQL, MySQL, Python, Tableau, Great Expectations, scikit-learn.
CLIENT: INAUTIX TECHNOLOGIES INDIA PVT LTD, INDIA MAY 2016 - SEP 2019
ROLE: SQL DEVELOPER
Roles & Responsibilities:
Developed SQL scripts to manage and transform datasets up to 500M+ rows, enabling faster analytics for cross-department technology projects.
Designed and maintained MySQL and PostgreSQL databases, optimizing schemas and indexing to improve query speed by 35%.
Built and maintained Talend ETL jobs for seamless data integration between 6+ enterprise systems, reducing manual transfers by 90%.
Automated data cleansing and integration processes with Python, cutting processing time from hours to minutes while improving data accuracy by 15%.
Managed Amazon Redshift data warehouses, optimizing cluster configurations to scale storage and boost query performance by 40%.
Created dynamic Tableau dashboards with real-time KPIs, empowering executives with actionable insights and reducing report preparation time by 70%.
Automated recurring ETL pipelines using Apache Airflow, increasing reliability and meeting 99.9% SLA compliance.
Implemented Git workflows for version control, ensuring transparent collaboration and eliminating deployment conflicts.
Modeled and optimized relational schemas, ensuring high data integrity and supporting sub-second query responses for key datasets.
Led data migration initiatives from on-prem to cloud environments, achieving zero data loss and less than 30 minutes of downtime.
Tuned SQL queries and database configurations, reducing critical report run times by 50%.
Developed Python-based validation rules to automatically detect anomalies, ensuring 100% accuracy in monthly reporting datasets.
Provided L3 technical support for data issues, maintaining 99.95% system availability.
Collaborated with engineering teams to integrate new data tools, improving platform efficiency and extending analytical capabilities.
Authored process documentation and data architecture diagrams, enabling faster onboarding for new developers.
Environment: SQL, MySQL, PostgreSQL, Talend, Python, Amazon Redshift, Tableau, Apache Airflow, Git.