
Senior Data Engineer - Cloud Data Platform & MLOps Expert

Location:
Jersey City, NJ, 07306
Posted:
November 19, 2025

Resume:

SUMANTH CHAPPIDI

PH: +1-614-***-**** ************@*****.***

Sr Data Engineer

www.linkedin.com/in/sumanth-ch-b1a532235

PROFESSIONAL SUMMARY:

8+ years of full Software Development Life Cycle (SDLC) experience spanning system analysis, design, development, testing, deployment, maintenance, enhancement, re-engineering, migration, troubleshooting, and support of multi-tiered applications.

Delivered cross-cloud data solutions on AWS, Azure, and GCP integrating Databricks, Airflow, and Snowflake.

Hands-on expertise in Big Data ecosystem: Hadoop (HDFS, MapReduce), Hive, Pig, Sqoop, HBase, Zookeeper, Couchbase, Storm, Solr, Oozie, Flume, Spark, and Kafka.

Skilled in Cloud Platforms: AWS (EC2, S3, Lambda, Redshift, EMR, DynamoDB, RDS, VPC, CloudFormation), Azure (Data Factory, ADLS, Synapse, Databricks, HDInsight, Stream Analytics), GCP (BigQuery, Vertex AI, Pub/Sub, Dataflow, Dataproc, GCS).

Proficient in ML-Ops tools: MLflow, Kubeflow, Airflow, Vertex AI pipelines, and CI/CD for ML workflows.

Strong knowledge of Infrastructure-as-Code (IaC) with Terraform, CloudFormation, and Ansible for automated environment provisioning.

Experienced in Containerization & Orchestration using Docker, Kubernetes, and OpenShift for scalable ML and data applications.

Solid background in CI/CD systems: Jenkins, GitHub Actions, GitLab CI/CD, Azure DevOps pipelines, and TFS Git.

Enabled analytics and ML teams by delivering unified, high-quality feature data pipelines for predictive insights.

Built fault-tolerant data pipelines with retry logic, SLA monitoring, and self-healing orchestration to ensure 99.9% uptime.
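
A minimal Airflow sketch of the retry and SLA pattern described above is shown below; the DAG name, schedule, task, and thresholds are hypothetical placeholders rather than details of a specific project.

# Illustrative retry logic and SLA monitoring in an Airflow DAG (all names and thresholds are hypothetical).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "owner": "data-eng",
    "retries": 3,                         # automatic retries on transient failures
    "retry_delay": timedelta(minutes=5),  # back off between attempts
    "sla": timedelta(hours=1),            # flag tasks that miss the SLA window
}

def load_daily_extract():
    # placeholder for the actual ingestion/validation logic
    pass

with DAG(
    dag_id="daily_ingest_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",
    default_args=default_args,
    catchup=False,
) as dag:
    PythonOperator(task_id="load_daily_extract", python_callable=load_daily_extract)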

Expertise in Data Lakes & Governance: Delta Lake, Unity Catalog, Lakehouse architecture design, and secure data governance practices.

Skilled in Data Integration & ETL Tools: Informatica, Talend, SSIS, MuleSoft, Ab Initio, Apache NiFi, AWS Glue, and DBT.

Implemented a data observability framework using Grafana, Prometheus, and custom Python validators to ensure SLA adherence.

Experience building and orchestrating real-time streaming pipelines using Kafka, Kinesis, Spark Streaming, and Flink.

Strong programming experience with Python, Scala, Java, SQL, Shell scripting, and COBOL.

Proficient in Python libraries: Pandas, NumPy, PySpark, SciPy, scikit-learn, GeoPandas, TensorFlow, PyTorch, and FastAPI.

Knowledge of TypeScript, JavaScript, and REST API development for integrating ML models into applications.

Expertise in Data Warehousing: Snowflake, BigQuery, Redshift, Azure Synapse, Teradata, DB2, Oracle, and SQL Server.

Skilled in NoSQL databases: MongoDB, Cassandra, HBase, DynamoDB, Couchbase.

Strong knowledge of Data Modeling: Star schema, Snowflake schema, normalization, denormalization, and dimensional modeling.

Designed and implemented Spark jobs for data cleaning, preprocessing, machine learning model training, and advanced analytics.

Developed real-time ML pipelines integrating Spark, Kafka, and TensorFlow for predictive analytics use cases.

Hands-on experience with workflow orchestration: Apache Airflow, Oozie, Azkaban, and Control-M.

Implemented data quality frameworks and monitoring solutions using Great Expectations and custom Python scripts.

Built and optimized data pipelines for batch, streaming, and near real-time analytics.

Deployed and managed ML models in production with monitoring, automated retraining, and version control.

Previously developed and maintained Ab Initio ETL workflows for large-scale data integration projects; experience translating existing ETL logic into metadata-driven frameworks.

Strong experience with monitoring & logging tools: Prometheus, Grafana, ELK Stack, CloudWatch, Datadog.

Developed and deployed microservices-based data ingestion APIs using Java/Spring Boot and RESTful interfaces for streaming and batch data pipelines.

Implemented end-to-end CI/CD pipelines for ML and Data applications using Jenkins, GitHub Actions, and GitLab CI.

Strong background in Data Security and IAM: Role-based access, encryption (KMS, HashiCorp Vault), and compliance (GDPR, HIPAA).

Experience with Data Visualization & Reporting: Power BI, Tableau, SSRS, and Looker.

Worked on geo-spatial data engineering with GeoPandas, PostGIS, and Google BigQuery GIS.

Knowledge of Message Queues & Event-Driven Architectures: RabbitMQ, Kafka, AWS Kinesis, Azure Event Hub.

Skilled in Linux/AIX system administration with Shell scripting for automation and job scheduling (Cron, Autosys).

Strong understanding of parallel computing and distributed processing (MapReduce, Spark DAGs, RDDs, lineage graphs).

Performed performance tuning and query optimization across databases, Hadoop clusters, and Spark jobs.

Migrated legacy ETL and ML workflows to cloud-native platforms like AWS Glue, GCP Dataflow, and Azure Data Factory.

Proven ability to collaborate with data scientists, frontend engineers, and platform teams to integrate ML models into production systems.

Strong problem-solving, analytical, communication, and leadership skills, with experience leading data engineering teams in Agile environments.

TECHNICAL SKILLS:

Operating Systems: Linux, UNIX, AIX, Windows

Development Methodologies: Agile/Scrum, Waterfall

IDEs: Eclipse, NetBeans, IntelliJ, VS Code, PyCharm

Big Data Platforms: Hortonworks, Cloudera CDH4/CDH5, MapR, Databricks Lakehouse

Programming Languages & Libraries: Python, Scala, SQL, COBOL, Shell Scripting, TypeScript, NumPy, SciPy, Pandas, GeoPandas, Matplotlib, Seaborn, NLTK, TextBlob, BeautifulSoup, PySpark, TensorFlow, PyTorch, scikit-learn

Hadoop Components: HDFS, Sqoop, Hive, Pig, MapReduce, YARN, Impala, Hue, Zookeeper, Oozie, Flume

Spark Modules: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, SparkR, Delta Lake, Unity Catalog

ETL & Data Integration: Informatica, Ab Initio, Talend, MuleSoft, Apache NiFi, AWS Glue, SSIS, DBT

RDBMS/NoSQL Databases: Oracle, MySQL, SQL Server, Teradata, PostgreSQL, DB2, MongoDB, Cassandra, HBase, DynamoDB, Couchbase

Cloud Technologies:

AWS: EC2, IAM, S3, Autoscaling, CloudWatch, Route53, EMR, Redshift, Glue, Lambda, Kinesis, DynamoDB

Azure: Azure Data Factory, ADLS, Synapse SQL DW, HDInsight, Databricks, Stream Analytics

GCP: BigQuery, Vertex AI, Dataflow, Dataproc, Pub/Sub, GCS

DevOps & Version Control: Git, GitHub, GitLab, Git TFS, Jenkins, Maven, Ant, Azure DevOps, GitHub Actions

Containerization & Orchestration: Docker, Kubernetes, OpenShift

Infrastructure-as-Code (IaC): Terraform, AWS CloudFormation, Ansible

MLOps Tools: MLflow, Kubeflow, Vertex AI Pipelines, Airflow

Monitoring & Logging: Prometheus, Grafana, ELK Stack, Datadog, Nagios, CloudWatch

Visualization & Reporting: Power BI, Tableau, SSRS, Looker

PROFESSIONAL EXPERIENCE:

Citi - New York, NY Aug 2024 - Present

Role: Sr Data Engineer

Responsibilities:

Led a large-scale cloud migration initiative between AWS and Azure (including hybrid scenarios), modernizing Citi’s global data platform for investment banking analytics.

Architected and deployed a Snowflake-based data lakehouse integrated with Azure Data Factory, AWS Glue, and Databricks for unified ELT pipelines.

Migrated legacy Redshift, Teradata, and Hadoop workloads to Snowflake, improving performance, scalability, and governance.

Designed multi-zone data architecture (Raw → Curated → Analytics) using ADLS Gen2, S3, and Delta Lake for secure and governed data management.
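
A minimal PySpark/Delta sketch of the raw-to-curated hop in such a multi-zone layout follows; the storage paths, container names, and columns are illustrative assumptions, and it presumes the Delta Lake connector is available (e.g., on Databricks).

# Illustrative raw -> curated hop on Delta Lake; paths, containers, and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw_to_curated").getOrCreate()

raw = spark.read.json("abfss://raw@examplelake.dfs.core.windows.net/trades/")  # raw zone on ADLS Gen2

curated = (
    raw.dropDuplicates(["trade_id"])                  # basic de-duplication
       .withColumn("ingest_ts", F.current_timestamp())
)

(curated.write
        .format("delta")                              # requires the delta-spark connector
        .mode("append")
        .save("abfss://curated@examplelake.dfs.core.windows.net/trades/"))  # curated zone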

Built and optimized PySpark, DBT, and SQL-based ETL pipelines to process financial, trading, and market data at scale.

Implemented real-time data streaming using Kafka, Kinesis, and Snowpipe for continuous ingestion and event-driven analytics.

Automated infrastructure provisioning and data pipeline deployments using Terraform, CloudFormation, and Azure DevOps CI/CD, reducing manual setup by 60%.

Integrated AWS Lambda, Azure Functions, and Glue Jobs to automate data ingestion, validation, and transformation workflows.

Developed and maintained data quality and observability frameworks using Great Expectations, Airflow, and custom Python checks.
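
As a sketch of the kind of check such a framework runs, the snippet below uses the classic Great Expectations pandas API; the file path, columns, and thresholds are hypothetical, and newer Great Expectations releases expose a different context-based API.

# Hypothetical data quality gate using the legacy Great Expectations pandas dataset API.
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.read_parquet("daily_extract.parquet"))  # path is illustrative

results = [
    df.expect_column_values_to_not_be_null("trade_id"),
    df.expect_column_values_to_be_between("notional", min_value=0),
]

if not all(r.success for r in results):
    raise ValueError("Data quality checks failed; halting downstream load")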

Created monitoring and alerting dashboards in Grafana, CloudWatch, and Azure Monitor to ensure SLA compliance and cost transparency.

Architected AI/ML pipelines using MLflow, Kubeflow, and Vertex AI for training, versioning, and deployment of predictive models.

Collaborated with data scientists to develop and operationalize AI models for trade-risk scoring, anomaly detection, and liquidity forecasting.

Implemented MLOps workflows integrated with Snowflake and Databricks to streamline model retraining and batch inference.

Developed GenAI assistants using Azure OpenAI and LangChain to automatically generate SQL, summarize datasets, and document ETL logic.

Enabled retrieval-augmented generation (RAG) by connecting LLMs to Snowflake metadata and data catalogs for intelligent query generation.

Containerized Spark and Python applications with Docker and Kubernetes, enabling scalable, cross-cloud execution.

Integrated Power BI, Tableau, and Looker dashboards with Snowflake to deliver real-time insights for trading and compliance teams.

Designed and enforced RBAC, masking, and encryption policies in Snowflake and Azure for secure data access and compliance.

Partnered with global architecture and DevOps teams to design a multi-cloud deployment strategy for data pipelines and AI services.

Delivered a high-performance, AI-ready, and GenAI-enabled multi-cloud data ecosystem supporting 20+ TB of financial data with 99.9% uptime and 40% faster analytics delivery.

Environment: Azure (Data Factory, ADLS Gen2, Synapse, Databricks, HDInsight, Purview, Azure Functions, Azure DevOps), AWS (S3, Lambda, Glue, Redshift, Kinesis, DynamoDB, CloudFormation), Snowflake (Snowpipe, Streams, Tasks, RBAC, Masking Policies), Hadoop Ecosystem (HDFS, Hive, Pig, Sqoop, Oozie, HBase, Cassandra), Spark (Core, SQL, Streaming, MLlib, Delta Lake, PySpark), Kafka, NiFi, DBT, Terraform, Docker, Kubernetes, MLflow, Kubeflow, Vertex AI, Great Expectations, Airflow, Grafana, CloudWatch, Azure Monitor, Power BI, Tableau, Looker, GitHub Actions, Jenkins, GitLab CI/CD, Python, Scala, Shell Scripting, PowerShell.

Molina Health - Long Beach, California Jun 2023 – Jul 2024

Role: Data Engineer

Responsibilities:

Analyzed user stories and participated in grooming sessions and story-point estimation in accordance with Agile methodology.

Developed Spark jobs using Scala and Python on top of YARN / MRv2 for interactive and batch analysis.

Developed Databricks ETL pipelines using PySpark/Delta Lake/ADF, handling batch and real-time data.

Implemented CI/CD deployment for notebooks and jobs.

Optimized Databricks SQL/Spark SQL queries, reducing runtime by 25%.

Built streaming pipelines using Kafka + Spark Structured Streaming.
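
A minimal Structured Streaming sketch of this pattern follows; the broker address, topic, schema, and sink paths are hypothetical, and it assumes the Spark-Kafka and Delta connectors are installed on the cluster.

# Illustrative Kafka -> Spark Structured Streaming -> Delta pipeline (all names are placeholders).
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("claims_stream").getOrCreate()

schema = StructType([
    StructField("claim_id", StringType()),
    StructField("amount", DoubleType()),
])

events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "claims")
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
         .select("e.*")
)

query = (
    events.writeStream
          .format("delta")
          .option("checkpointLocation", "/chk/claims")   # enables recovery after failures
          .outputMode("append")
          .start("/delta/claims")
)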

Applied Docker/Kubernetes for scalable pipeline execution.

Ensured data governance, integrity, and monitoring across pipelines.

Developed Spark jobs using PySpark and Scala for batch and interactive analytics, optimizing existing workflows and improving pipeline efficiency by 25%.

Built NiFi and Kafka workflows for near real-time streaming ingestion and processing.

Migrated data pipelines from on-premises to Azure cloud, integrating ADF, ADLS, Synapse, and Azure Databricks.

Implemented Oozie and Airflow orchestrations for batch ETL and automated data processing.

Built dashboards and visualizations using Power BI to support operational and business insights.

Worked with Apache Spark to improve the performance of existing Hadoop algorithms, using SparkContext, Spark SQL, PySpark, pair RDDs, Spark on YARN, and Spark MLlib.

Involved in data movement implementation from on-premises to cloud in MS Azure.

Evaluated the performance of Apache Spark in analyzing genomic data.

Developed a NiFi workflow to pick up data from SFTP server and send to Kafka broker.

Used Python (NumPy, SciPy, Pandas, NLTK, Matplotlib, Beautiful Soup, TextBlob) for ETL processing of clinical data for NLP analysis.

Migrated data from traditional databases to Azure SQL databases.

Ran Spark jobs with machine learning modules on daily and weekly schedules.

Migrated complex MapReduce programs into Spark RDD transformations & actions.

Designed and developed Oozie workflows to orchestrate Hive scripts and Sqoop jobs.

Developed the Spark Framework to structure and process data using Spark Core API, DataFrame, Spark-SQL, Scala.

Recreated existing logic in Azure Data Lake, Data Factory, Azure SQL Database.

Developed Spark scripts to import large files from Azure, pulling data from HDFS/HBase into Spark RDD.
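
A small sketch of that kind of RDD-based ingestion is shown below; the HDFS path, delimiter, and field names are assumptions made for illustration.

# Illustrative ingestion of delimited HDFS files into an RDD, then a DataFrame (names are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs_ingest").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/landing/claims/*.csv")        # large files on HDFS

records = (
    lines.map(lambda line: line.split(","))
         .filter(lambda fields: len(fields) >= 3)                # drop malformed rows
         .map(lambda f: (f[0], f[1], f[2]))                      # keep the first three fields
)

df = records.toDF(["member_id", "claim_date", "amount"])         # promote to DataFrame for Spark SQL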

Worked extensively in Python to build a custom ingestion framework.

Designed, built, and delivered operational & management tools for Azure Data Lake, driving implementations to Cloud Operations team.

Worked with Spark on YARN for interactive and batch analysis.

Implemented Spark SQL queries based on business requirements.

Used Spark API over Hortonworks Hadoop YARN to perform analytics on HDFS data.

Designed Column Families in Cassandra, ingested data from RDBMS, performed transformations, exported to Cassandra.

Migrated Data Pipeline jobs from Oozie to Airflow.

Developed Scala scripts and UDFs using Spark DataFrames, Spark SQL, Datasets, and RDDs for data aggregation and queries, writing results back to the OLTP system via Sqoop.

Created Azure Stream Analytics Jobs to replicate real-time data into Azure SQL Data Warehouse.

Implemented real-time streaming ingestion using Kafka + Spark Streaming.

Built visualizations & dashboards using Power BI.

Created POS data dashboards with Power BI for business insights.

Environment: Big Data, Spark, Python, NumPy, SciPy, Pandas, NLTK, Matplotlib, Beautiful Soup, TextBlob, MS Azure, Azure SQL, Azure Data Lake, Azure Data Factory, Scala, YARN, SparkContext, Spark SQL, PySpark, pair RDDs, Spark on YARN, Spark MLlib, NiFi, Kafka, Oozie, OLAP, Power BI.

Ulta Beauty - Bolingbrook, IL Jan 2022 - May 2023

Role: Data Engineer

Responsibilities:

Administered, provisioned, patched, and maintained Cloudera Hadoop clusters on Linux.

Analyzed the Hadoop stack and various Big Data analytics tools, including Pig, Hive, HBase, and Sqoop.

Administered and optimized Cloudera Hadoop clusters, including HDFS, Hive, Pig, and HBase.

Built ETL pipelines using Python, MapReduce, Hive, and Spark, processing data from multiple sources (CSV, JSON, XML, Parquet).
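
A compact sketch of multi-format ingestion of this kind follows; the S3 paths and column names are hypothetical, and XML reads would additionally require the spark-xml package.

# Illustrative multi-format ingestion (CSV, JSON, Parquet); paths and columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi_source_etl").getOrCreate()

orders_csv  = spark.read.option("header", True).csv("s3a://landing/orders/*.csv")
events_json = spark.read.json("s3a://landing/events/*.json")
skus_parq   = spark.read.parquet("s3a://landing/skus/")          # Parquet follows the same pattern

combined = (
    orders_csv.select("order_id", "sku", "amount")
              .unionByName(events_json.select("order_id", "sku", "amount"))
)

combined.write.mode("overwrite").parquet("s3a://curated/orders/")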

Migrated ETL pipelines to AWS cloud, leveraging S3 and Redshift, improving storage efficiency and query performance.

Developed Oozie workflows and SSIS packages for batch ETL processing.

Ensured data quality and validation using custom Python scripts and Hive queries.

Utilized Python (Matplotlib, SciKit-Learn) for prototype visualization and Tableau for advanced dashboards.

Wrote multiple MapReduce programs to perform extraction, transformation, and aggregation from 20+ data sources (XML, JSON, CSV, and compressed formats).

Worked in AWS environment for development and deployment of Custom Hadoop Applications.

Created Oozie workflows for Hadoop-based jobs (Sqoop, Hive, Pig).

Involved in file movement between HDFS and AWS S3.

Created Hive external tables, loaded and queried data using HQL.

Performed data validation on ingested data using MapReduce by building custom cleansing models.

Imported data from various sources, performed transformations using Hive & MapReduce, loaded into HDFS, and extracted data from MySQL into HDFS via Sqoop.

Transferred data using Informatica from AWS S3 to AWS Redshift.

Wrote HiveQL queries, configuring the number of reducers and mappers as required.

Transferred data between Pig scripts & Hive using HCatalog, migrated RDBMS data with Sqoop.

Involved in ETL process using SSIS, and generated reports using SSRS.

Analyzed data with HiveQL and Pig Latin scripts.

Managed cluster coordination services with Zookeeper; installed and configured Hive and wrote Hive UDFs.

Environment: Hadoop, Pig, Hive, HBase, Sqoop, Python, Matplotlib, scikit-learn, Tableau, HDFS, MapReduce, AWS, S3, Redshift, MySQL, OLAP, Oozie, HQL, SSIS, SSRS, MS SQL Server, HiveQL.

PNC - Pittsburgh, PA Feb 2020 – Dec 2021

Role: Data Engineer

Responsibilities:

Worked with BI team in gathering report requirements and used Sqoop to export data into HDFS and Hive.

Developed MapReduce and Hive workflows to extract, transform, and load data into Hadoop ecosystem.

Automated ETL processes with Oozie and Pig, reducing manual effort by 30%.

Worked with NoSQL databases (HBase, Cassandra) for high-throughput batch processing.

Performed data profiling, cleansing, and validation, supporting BI reporting and analytics.

Involved in analytics using R, Python, Jupyter Notebook.

Data collection and treatment: analyzed internal and external data; handled entry errors, classification errors, and missing values.

Data Mining: Cluster Analysis (customer segments), Decision Trees (profitable vs non-profitable customers), Market Basket Analysis (customer purchasing behavior, product association).

Developed multiple MapReduce jobs (Java) for data cleaning & preprocessing.

Assisted with data capacity planning and node forecasting.

Installed, configured, and managed Flume infrastructure.

Administrator for Pig, Hive, HBase (installing updates, patches, upgrades).

Worked with claims processing team to identify patterns in fraudulent claims.

Developed MapReduce programs to extract & transform data, exported results back to RDBMS via Sqoop.

Applied text mining in R and Hive to detect fraud patterns.

Exported processed data to RDBMS using Sqoop for downstream analytics.

Built staging tables and partitioned tables in EDW from MapReduce outputs.

Created Hive tables and loaded structured data.

Wrote HiveQL queries for extracting insights, supporting market analysts with trend analysis.

Imported data (e.g., log files) into HDFS using Flume.

Automated data loading with Oozie and pre-processed using Pig.

Managed and reviewed Hadoop log files.

Tested raw data and executed performance scripts.

Environment: HDFS, Pig, Hive, MapReduce, Linux, HBase, Flume, Sqoop, R, VMware, Eclipse, Cloudera, and Python.

Axis Bank - Mumbai, India Sep 2017 – Nov 2019

Role: Hadoop Developer

Responsibilities:

Involved in the evaluation of functional and non-functional requirements.

Installed and configured Hadoop MapReduce, HDFS; developed multiple MapReduce jobs (Java) for data cleaning and pre-processing.

Installed and configured Pig and wrote Pig Latin scripts.

Wrote MapReduce jobs using Pig Latin.

Managed and reviewed Hadoop log files.

Used Sqoop to import data from MySQL into HDFS on a regular basis.

Developed scripts and batch jobs to schedule various Hadoop programs.

Wrote Hive queries for data analysis to meet business requirements; created Hive tables and worked with HiveQL.

Imported and exported data into HDFS and Hive using Sqoop.

Experienced in defining job flows.

Hands-on experience with HBase (NoSQL) and the Solr search platform.

Loaded data into Hive tables and constructed Hive queries executing in MapReduce fashion.

Created a custom FileSystem plug-in for Hadoop to access files on the data platform.

This enabled Hadoop MapReduce applications, HBase, Pig, and Hive to access those files directly.

Designed and implemented MapReduce-based large-scale parallel relation-learning system.

Extracted feeds from social media sites (Facebook, Twitter) using Python scripts.

Set up and benchmarked Hadoop clusters for internal purposes.

Environment: Hadoop, MapReduce, HDFS, Hive, Java, Hortonworks and Cloudera Hadoop distributions, Pig, HBase, Linux, XML, MySQL, MySQL Workbench, Java 6, Eclipse, Oracle 10g PL/SQL, SQL*Plus, Subversion, Cassandra.

Education:

Bachelor’s in Computer Science

Jawaharlal Nehru Technological University - Kakinada

August 2013 - May 2017


