
Senior Data Engineer – ETL, Spark, Snowflake, MLOps, Cloud

Location:
Vadodara, Gujarat, India
Posted:
November 12, 2025

Chandrika Miryala

Senior Data and MLOps Engineer | +1-469-***-**** | *********@*****.*** | https://www.linkedin.com/in/chandrikamiryala/

Professional Summary:

Data Engineer and ML/MLOps Specialist with 7 years of experience designing, developing, and deploying large-scale data pipelines, ETL workflows, and machine learning solutions on AWS and GCP. Proficient in big data technologies such as Apache Spark, Hive, Pig, Talend, Kafka, Flink, PySpark, Scala, Hadoop, and Oozie, with hands-on expertise in building batch and real-time streaming pipelines. Well-versed in ETL tools such as IBM DataStage, SSIS, and Talend. Experienced in MLOps practices, including automated model training, deployment, monitoring, and retraining using AWS SageMaker, MCP, Vertex AI, Docker, Airflow, and CI/CD tools. Skilled in managing distributed storage (HDFS, Avro, ORC, Parquet) and cloud databases (BigQuery, Redshift, Snowflake, RDS, MongoDB, Cassandra), as well as developing production-grade ML models for anomaly detection, predictive analytics, and operational insights. Adept at building dashboards with Power BI, Tableau, QuickSight, and Looker to enable data-driven decision-making and improve operational efficiency.

Technical Summary:

Programming / Scripting: Python (Pandas, NumPy, Scikit-learn, TensorFlow), SQL, Hive SQL, T-SQL, Shell scripting, Scala

Big Data / ETL: Apache Spark, PySpark, Ab Initio, SSIS, Hive, Pig, Talend, Apache Kafka, Flink, Spark Streaming, Apache Airflow, Oozie, GCP Dataflow, GCP Dataproc, AWS Glue, Kinesis Data Streams, Kinesis Data Firehose, Teradata, BTEQ

Machine Learning / MLOps: AWS SageMaker, Vertex AI, MLOps pipelines, Docker, CI/CD (Jenkins, GitLab CI/CD, GitHub Actions, Cloud Build), Hyperparameter tuning, Experiment tracking, Model versioning, Model Monitoring (CloudWatch, Vertex AI Model Monitoring)

Cloud Platforms / Services: AWS (S3, EC2, Lambda, Redshift, Elastic Beanstalk, CodePipeline, CodeBuild, Step Functions, CloudWatch), GCP (BigQuery, Pub/Sub, Functions, Composer/Airflow), Cloud VMs, Microsoft Fabric (OneLake, DirectLake, Data Activator)

Data Storage / File Formats: HDFS, Avro, ORC, Parquet, RDS, MongoDB, Cassandra

Visualization / Reporting: Power BI (Fabric DirectQuery, Data Activator, Semantic Models), Tableau, Amazon QuickSight

Infrastructure / DevOps: Terraform, Ansible, CloudFormation, Docker, CI/CD pipelines, Argo CD, GitOps, Cloud Build

Feature Engineering / Analytics: Data preprocessing, Feature engineering, EDA, Anomaly detection, Predictive modeling, Batch & real-time workflows

Experience:

Bank of America February 2024 – Present

Data and MLOps Engineer

Responsibilities:

Developed scalable ETL pipelines using Apache Spark, Hive, Kafka, Flink, PySpark, and Talend, processing structured and unstructured banking data including transactions, KYC, and credit bureau data for AI/ML models and real-time decisioning.

Developed and optimized PySpark DataFrame transformations and joins for large-scale banking datasets, improving ETL performance and reducing latency by 40%.

Implemented PySpark Structured Streaming pipelines to process Kafka topics in real time, delivering clean and curated data to Snowflake and ML workflows.
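
A minimal sketch of this streaming pattern, assuming placeholder broker, topic, schema, and S3 staging paths (the actual Snowflake load step, e.g. via the Spark Snowflake connector or Snowpipe, is omitted):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("txn-stream").getOrCreate()

    schema = StructType([
        StructField("txn_id", StringType()),
        StructField("account_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_ts", StringType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
           .option("subscribe", "transactions")                # placeholder topic
           .load())

    curated = (raw.selectExpr("CAST(value AS STRING) AS json")
               .select(F.from_json("json", schema).alias("t"))
               .select("t.*")
               .withColumn("event_ts", F.to_timestamp("event_ts")))

    query = (curated.writeStream
             .format("parquet")                                # staged for the warehouse load
             .option("path", "s3://bucket/curated/txns/")      # placeholder path
             .option("checkpointLocation", "s3://bucket/chk/txns/")
             .trigger(processingTime="1 minute")
             .start())
    query.awaitTermination()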

Built PySpark UDFs and window functions for advanced feature engineering and time-series analytics supporting credit risk and fraud detection models.
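
Illustrative only: a trailing 7-day feature set computed with a window function, assuming columns account_id, event_ts, and amount in a curated transactions table (the path is a placeholder):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("txn-features").getOrCreate()
    txns = spark.read.parquet("s3://bucket/curated/txns/")    # placeholder path

    w = (Window.partitionBy("account_id")
         .orderBy(F.col("event_ts").cast("long"))
         .rangeBetween(-7 * 24 * 3600, 0))                    # trailing 7 days, in seconds

    features = (txns
                .withColumn("amt_sum_7d", F.sum("amount").over(w))
                .withColumn("txn_cnt_7d", F.count(F.lit(1)).over(w))
                .withColumn("amt_avg_7d", F.avg("amount").over(w)))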

Optimized Spark configurations (executors, partitions, caching) using Spark UI and Airflow integration to enhance cluster efficiency and job reliability.

Designed and implemented MLOps pipelines using Python, Docker, AWS SageMaker, and CI/CD tools, enabling automated model training, deployment, monitoring, and versioning for banking ML workflows.

Integrated external financial and banking datasets via MCP-compliant connectors, ensuring secure, governed, and auditable access for AI agents and LLMs.

Implemented monitoring and logging for MCP endpoints to ensure reliable, scalable, and auditable data access for AI agents.

Deployed AI-powered chatbots and LLM copilots for banking staff, masking sensitive customer data while ensuring contextual accuracy.

Implemented role-based access controls, data masking, and governance policies in Snowflake to ensure secure handling of sensitive banking data while enabling scalable analytics and ML model consumption.

Designed AI-ready data pipelines that supply clean, structured, and governed data directly to ML models and MCP endpoints for real-time decisioning.

Designed scalable ETL pipelines feeding ML models with transaction, KYC, and credit bureau data accessible via MCP endpoints.

Automated SSIS job scheduling and monitoring through SQL Server Agent for end-to-end ETL execution.

Developed, scheduled, and optimized Ab Initio graphs and workflows for large-scale data processing.

Applied plan files, parallelism (component and partition), and parameter sets to optimize Ab Initio job performance.

Performed data extraction, transformation, and loading using Ab Initio, SQL, and shell scripting.

Applied data validation, error handling, and logging frameworks within SSIS packages to maintain data integrity and auditability.

Tuned SQL queries, indexes, and SSIS workflows to enhance performance of data loading and reporting operations.

Performed complex SQL transformations and data reconciliations using Teradata and BTEQ scripts, optimizing query performance for high-volume financial datasets.

Designed and implemented batch and real-time streaming workflows using Apache Kafka, Flink, Spark Streaming, and Oozie, ensuring scalable and reliable data integration.

Managed distributed storage and file formats (HDFS, Avro, ORC, Parquet) for efficient data processing and analytics readiness.

Designed and automated data pipelines to ingest banking data into Snowflake, leveraging Snowpipe, Streams, and Tasks for near real-time processing and integration with ML workflows.

Integrated Microsoft Fabric (OneLake and DirectLake) with existing Snowflake data warehouse to enable unified analytics and low-latency data access for dashboards and real-time reporting.

Configured and optimized DirectQuery and mirrored datasets in Fabric for seamless connectivity with Snowflake, ensuring consistent performance and schema synchronization.

Developed semantic models and metrics layers in Fabric for business users, improving query response time and consistency across Power BI reports.

Leveraged Data Activator in Fabric for proactive alerts and event-driven insights from Snowflake and Kafka streams.

Supported BI and analytics teams with Fabric-based self-service data exploration using Power BI integration and shared workspaces.

Integrated model training and evaluation workflows into CI/CD pipelines with Jenkins, GitLab CI/CD, and GitHub Actions, enabling reproducible and scalable ML deployments.

Automated workflow orchestration and job scheduling using Apache Airflow, Oozie, and Shell scripting, improving operational efficiency and reducing manual interventions.
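
A hedged sketch of the orchestration pattern: a small Airflow DAG chaining extract, transform, and load steps (the DAG id, schedule, and commands are placeholders, not the production jobs):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="nightly_banking_etl",          # placeholder DAG id
        start_date=datetime(2024, 1, 1),
        schedule_interval="0 2 * * *",         # nightly at 02:00
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="python extract.py")
        transform = BashOperator(task_id="transform",
                                 bash_command="spark-submit transform_job.py")
        load = BashOperator(task_id="load", bash_command="python load_warehouse.py")

        extract >> transform >> load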

Developed SQL, Hive/SQL queries, and T-SQL scripts for data extraction, transformation, aggregation, and reporting across relational and NoSQL databases (Oracle, MySQL, MongoDB, Cassandra, RDS).

Monitored production ML models for performance drift and retraining triggers using CloudWatch, custom Python scripts, and logging frameworks, ensuring consistent model accuracy.
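
A minimal sketch of the custom-metric side of this monitoring, assuming a hypothetical namespace and metric name; a CloudWatch alarm on the metric would then trigger the retraining workflow:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def publish_drift_score(model_name: str, drift_score: float) -> None:
        """Emit a custom drift metric that a CloudWatch alarm can act on."""
        cloudwatch.put_metric_data(
            Namespace="MLOps/ModelMonitoring",          # assumed namespace
            MetricData=[{
                "MetricName": "FeatureDriftScore",      # assumed metric name
                "Dimensions": [{"Name": "ModelName", "Value": model_name}],
                "Value": drift_score,
                "Unit": "None",
            }],
        )

    publish_drift_score("credit-risk-model", 0.12)      # illustrative values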

Implemented cloud infrastructure for big data environments using AWS EC2, S3, Lambda, Redshift, Elastic Beanstalk, and Cloud VMs, ensuring scalable, secure, and highly available deployments.

Developed RESTful APIs using Flask to streamline data retrieval processes, reducing system latency and improving data accessibility.
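
A minimal Flask sketch of such a retrieval endpoint; the in-memory dictionary stands in for the actual production data store and the route name is illustrative:

    from flask import Flask, jsonify, abort

    app = Flask(__name__)

    # Stand-in for a real lookup against Redshift/Snowflake/RDS.
    ACCOUNTS = {"A100": {"account_id": "A100", "status": "active"}}

    @app.route("/accounts/<account_id>", methods=["GET"])
    def get_account(account_id):
        record = ACCOUNTS.get(account_id)
        if record is None:
            abort(404)
        return jsonify(record)

    if __name__ == "__main__":
        app.run(port=5000)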

Designed and optimized Hadoop-based data pipelines leveraging HDFS, Hive, Pig, and Spark for high-volume data ingestion, transformation, and analytics.

Performed data validation, reconciliation, and quality checks using Python, Talend, Unix shell scripting, ensuring accurate and reliable datasets.

Delivered LIG file export and quote functionality, enabling accurate and governed data delivery for downstream ML models and reporting workflows.

Tuned Spark, Hive, and SQL queries and optimized Hadoop cluster utilization, improving processing efficiency and throughput.

Built LLM-based intent detection and NLG pipelines using AWS Bedrock and SageMaker for banking copilots.

Developed LangChain-style autonomous agent workflows for document summarization and customer query resolution.

Integrated Pinecone vector database for semantic search and RAG pipelines supporting financial document Q&A.

Fine-tuned Claude, GPT-4, and Amazon Titan models for fraud detection and credit scoring.

Created CI/CD pipelines for LLM workflows using GitHub Actions and SageMaker Pipelines.

Environment/Tools: Apache Spark / PySpark, Hive / Hive SQL / SQL / T-SQL, Apache Kafka / Flink / Spark Streaming, Talend, AWS (S3, EC2, Lambda, Redshift, Elastic Beanstalk), Docker, AWS SageMaker / MLOps tools, AI, MCP, Apache Airflow / Oozie, Power BI / Apache Superset, HDFS / Avro / ORC / Parquet.

Frontier Communications – Dallas, TX August 2023 – December 2023

Data Engineer

Collected, ingested, and pre-processed large-scale network traffic data from logs, real-time packet streams, and monitoring tools using AWS S3, AWS Kinesis Data Streams, Python, and SQL.

Built scalable ETL pipelines with AWS Glue and Kinesis Data Firehose to clean, transform, and standardize raw network data for downstream analytics and ML applications.

Engineered features for anomaly detection workflows, including IP behavior metrics, connection frequency, and performance anomalies, ensuring high-quality inputs for ML models.

Built ML-ready data pipelines from customer usage, call records, and IoT devices, leveraging MCP for secure and context-rich data delivery.

Optimized data retrieval and preprocessing pipelines using MCP tools to improve LLM response accuracy and reduce token consumption.

Optimized real-time recommendation systems for service upsell/cross-sell using MCP for tokenized and contextual AI inputs.

Supported migration of Ab Initio jobs across DEV, QA, and PROD environments and resolved issues during QA testing.

Implemented data validation, error handling, and logging within Ab Initio to ensure high-quality and auditable data movement.

Conducted exploratory data analysis (EDA) using Python (Pandas, NumPy, Matplotlib) and SQL to identify patterns, latency spikes, and abnormal traffic behaviors.

Developed and trained machine learning models for anomaly detection (Logistic Regression, Random Forest, Gradient Boosting) using AWS SageMaker, Scikit-learn, and TensorFlow, integrating with production pipelines.
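
A hedged sketch of the scikit-learn side of this step; the feature file, column names, and hyperparameters are illustrative, and the SageMaker/TensorFlow variants are omitted:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Assumed output of the upstream feature-engineering step (numeric features + label).
    df = pd.read_parquet("network_features.parquet")
    X = df.drop(columns=["is_anomaly"])
    y = df["is_anomaly"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))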

Implemented MLOps pipelines with SageMaker Pipelines, automated hyperparameter tuning, experiment tracking, and model versioning for reproducible and scalable ML deployment.

Deployed ML models in production using AWS Lambda for event-driven triggers and AWS Fargate for serverless orchestration, enabling scalable and continuous anomaly detection.

Built CI/CD pipelines using AWS CodePipeline and CodeBuild to automate end-to-end ML workflows, from data ingestion to model deployment.

Integrated anomaly detection and ML workflows into Frontier’s network monitoring systems, improving real-time detection of potential failures and security incidents.

Designed scalable ETL pipelines processing telecom billing, usage logs, and CRM data for AI/ML modeling.

Developed ETL pipelines to load and transform telecom billing, usage logs, and CRM data into Snowflake, enabling unified reporting and advanced ML feature engineering.

Optimized Snowflake queries, clustering, and partitioning strategies to improve performance and reduce compute costs across large-scale network and IoT datasets.

Designed and implemented Microsoft Fabric workspaces to mirror Snowflake datasets into OneLake, supporting near real-time analytics for network performance dashboards.

Utilized DirectQuery and semantic modeling to connect Snowflake data with Power BI reports, improving data freshness and interactive exploration.

Collaborated with data architects to establish Fabric–Snowflake integration pipelines ensuring high data availability for analytics and predictive monitoring.

Developed and optimized ETL pipelines in Apache Spark using Scala for high-volume data transformations and aggregations.

Developed and optimized PySpark ETL pipelines on AWS EMR for large-scale network traffic data ingestion, transformation, and enrichment.

Implemented PySpark DataFrame APIs and SQL transformations to standardize and aggregate high-volume IoT and telecom datasets for ML model training.

Optimized Spark job performance through partitioning, caching, and broadcast joins, reducing overall job execution time and improving cluster efficiency.
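
An illustrative example of the tuning pattern described above: broadcast the small dimension table, repartition the large fact data on the join key, and cache the reused result (paths and the partition count are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("traffic-enrich").getOrCreate()
    traffic = spark.read.parquet("s3://bucket/raw/traffic/")     # large fact data (placeholder)
    devices = spark.read.parquet("s3://bucket/dim/devices/")     # small lookup table (placeholder)

    enriched = (traffic
                .repartition(200, "device_id")                   # tuned partition count
                .join(F.broadcast(devices), "device_id", "left") # avoids a shuffle join
                .cache())                                        # reused by several aggregations

    daily = enriched.groupBy("device_id", F.to_date("event_ts").alias("dt")).count()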

Built PySpark-based feature engineering pipelines for anomaly detection models, computing rolling metrics, session statistics, and time-based aggregations.

Implemented reusable Scala-based Spark jobs for processing structured and unstructured datasets on AWS EMR.

Established monitoring and retraining pipelines with SageMaker, CloudWatch, and Step Functions to detect model drift and automate retraining, maintaining model accuracy and reliability.

Designed and delivered interactive dashboards using Amazon QuickSight and Tableau, combining SQL and Python scripts to visualize network traffic metrics, anomaly trends, and model performance for operations teams.

Built LangChain-based RAG pipelines for real-time telecom recommendations using Kinesis, SageMaker, and DynamoDB.

Integrated FAISS vector search for anomaly detection in network traffic and semantic alerting.
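
A minimal FAISS sketch showing the nearest-neighbour lookup behind this kind of similarity search; the embedding dimension and vectors are placeholders, and embedding generation is out of scope here:

    import numpy as np
    import faiss

    dim = 384
    index = faiss.IndexFlatL2(dim)            # exact L2 search; IVF/HNSW would suit larger corpora

    vectors = np.random.rand(10_000, dim).astype("float32")   # placeholder traffic-pattern embeddings
    index.add(vectors)

    query = np.random.rand(1, dim).astype("float32")
    distances, ids = index.search(query, 5)   # 5 closest historical patterns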

Deployed LLM-powered chatbots for customer support using AWS Lambda and Bedrock.

Implemented LangGraph-style agent orchestration for telecom diagnostics and service recommendations.

Environment: Python (Pandas, NumPy, Scikit-learn, TensorFlow), SQL, AWS (S3, Kinesis Data Streams, Glue, SageMaker, Lambda, Fargate, CodePipeline, CodeBuild, Step Functions, CloudWatch), MLOps, AI, MCP, Logistic Regression, Random Forest, Gradient Boosting, Tableau, Amazon QuickSight

IKS Health – India May 2018 – July 2022

Data Engineer

Designed and implemented ETL pipelines using GCP Dataflow to ingest, clean, transform, and load multi-source healthcare data (EHRs, IoT devices, insurance claims, patient activity logs) into BigQuery, enabling large-scale analytics and ML-ready datasets.

Developed automated data ingestion pipelines with GCP Pub/Sub and Dataflow, processing over 120M daily health events including vitals, lab results, medication adherence, and hospital workflows.
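
A hedged Apache Beam sketch of this ingestion path (the subscription, table, and schema are assumptions): read events from Pub/Sub, parse the JSON payload, and stream the rows into BigQuery:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadEvents" >> beam.io.ReadFromPubSub(
               subscription="projects/my-project/subscriptions/health-events")  # placeholder
         | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "WriteToBQ" >> beam.io.WriteToBigQuery(
               "my-project:health.events",                                      # placeholder table
               schema="patient_id:STRING,metric:STRING,value:FLOAT,event_ts:TIMESTAMP",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))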

Developed and optimized PySpark ETL pipelines for large-scale healthcare data processing, transforming patient, claims, and IoT data from multiple sources before loading into BigQuery.

Implemented PySpark DataFrame transformations, UDFs, and window functions to generate time-series and aggregated features for patient risk prediction and anomaly detection models.

Optimized PySpark jobs through partitioning, caching, and broadcast joins, improving execution efficiency and reducing Dataflow job costs.

Integrated PySpark jobs within Cloud Composer (Airflow) for automated scheduling, dependency management, and monitoring of healthcare data pipelines.

Built feature engineering workflows with Python (Pandas, NumPy, Scikit-learn, TensorFlow) to transform patient data into ML-ready formats for predictive modeling and clinical analytics.

Implemented MLOps pipelines with Vertex AI Pipelines and Cloud Composer (Airflow) for model training, evaluation, deployment, and automated retraining.

Monitored data and ML pipelines using Vertex AI Model Monitoring, Cloud Logging, Cloud Monitoring, and alerting policies via Cloud Operations Suite, ensuring pipeline reliability and proactive issue resolution.

Developed scalable data pipelines to support ML workflows, including patient risk stratification, anomaly detection in vitals/lab results, and treatment adherence prediction.

Built CI/CD pipelines using Cloud Build and Argo CD (GitOps) to automate ML model deployment and ensure reproducibility, compliance, and clinical safety.

Monitored data and ML pipelines using Vertex AI Model Monitoring and Cloud Logging, detecting model drift, data drift, and anomalies and triggering automated retraining workflows.

Integrated GCP Cloud Build with GitHub Actions to automate testing and deployment of data pipelines, enhancing version control and reducing release errors.

Designed and maintained a data warehouse in BigQuery, improving data accessibility, query performance, and providing structured datasets for analytics and ML tasks.

Migrated healthcare datasets from BigQuery to Snowflake for cross-platform analytics, enabling interoperability between GCP and Snowflake environments.

Implemented Microsoft Fabric integration with Snowflake to centralize healthcare analytics and reporting workloads under a unified governance model.

Configured OneLake storage and DirectLake connections to deliver near-real-time insights from Snowflake data warehouses to Power BI dashboards.

Created Fabric semantic models and DirectQuery datasets for clinical, IoT, and claims data, supporting self-service BI and regulatory reporting.

Used Fabric Data Activator for proactive notifications on patient metrics and anomaly detection triggers from Snowflake event streams.

Partnered with data governance teams to implement Fabric-based access control aligned with HIPAA compliance and healthcare data privacy standards.

Implemented Snowflake data governance policies, including role-based access, dynamic data masking, and HIPAA-compliant security controls, ensuring secure handling of sensitive patient data.

Refactored legacy ETL workflows and SQL logic from on-prem and Snowflake environments into GCP-native pipelines using Dataflow, Composer, and BigQuery, improving maintainability, scalability, and pipeline performance.

Created interactive dashboards in Google Data Studio and Looker to visualize patient risk scores, readmission probabilities, adherence metrics, and ML model performance.

Optimized data pipelines and ML workflows to reduce operational costs, improve throughput, and ensure high availability of production pipelines for healthcare analytics.

Collaborated with healthcare analysts, ML engineers, and IT teams to define data and ML requirements, ensuring seamless integration across clinical data, IoT streams, and ML systems.

Developed NLP pipelines for patient intent classification and clinical note summarization using HuggingFace Transformers.
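
An illustrative HuggingFace sketch for the two tasks named above; the model checkpoints are common public defaults rather than the production choices, and the clinical note is made up:

    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    intent_clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    note = "Patient reports improved adherence to metformin; mild dizziness in the mornings."
    print(summarizer(note, max_length=40, min_length=10)[0]["summary_text"])
    print(intent_clf(note, candidate_labels=["medication adherence", "side effect", "scheduling"]))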

Built LangChain-style agents for healthcare Q&A using BigQuery, Snowflake, and FAISS-based retrieval.

Fine-tuned LLaMA and GPT-4 models for clinical risk prediction and treatment adherence explanations.

Integrated CI/CD for LLM pipelines using Cloud Build and Argo CD with model versioning and drift monitoring.

Environment: Vertex AI, GCP Dataflow, BigQuery, Cloud Composer (Airflow), GCP Functions, Cloud Build, Python (Pandas, NumPy, Scikit-learn, TensorFlow), Pub/Sub, Google Data Studio, Looker, MLOps

Education:

Master's in Computer Science – Southern Arkansas University – 2023

Bachelor's in Computer Science – RBVRR, Hyderabad – 2018


