
Machine Learning Data Engineer

Location: Plano, TX
Posted: August 25, 2025

Resume:

Malavi Padala

M: ******.****.********@*****.*** C: +1-945-***-****

SUMMARY:

Data Engineer with 6+ years of experience specializing in AWS services (Glue, S3, Redshift), AI/ML workflows, and analytical solutions to drive business intelligence and decision-making.

Proficient in SQL, Python, Spark, Hadoop, Kafka, and Hive, with expertise in transforming complex datasets into actionable insights using Power BI, Tableau, and machine learning techniques.

Skilled in Big Data technologies, cloud platforms (AWS/Azure), and data warehousing (Snowflake, Redshift, BigQuery).

Adept at handling structured and unstructured data, implementing real-time streaming solutions, and improving data quality and performance.

Passionate about leveraging data to enhance operational efficiency and support data-driven strategies. Experienced with tools such as Jenkins, GitLab, GitHub, Autosys, and Airflow for building efficient, scalable data pipelines.

Skilled in Python and experienced working in Agile environments, delivering real-time data solutions and machine learning models to enhance business performance and optimize data processing.

Proven expertise in improving data accuracy, efficiency, and operational outcomes through innovative cloud-based solutions.

Extensive experience with machine learning (ML) solutions, leveraging Python to generate insights and visualizations for business analysis.

Hands-on experience using Python tools like Pandas, NumPy, Matplotlib, and Scikit-Learn to build machine learning models and data analysis workflows.

Proficient in machine learning algorithms including Naive Bayes, Random Forests, Decision Trees, Linear and Logistic Regression, PCA, SVM, Clustering, and Neural Networks.

Passionate about implementing deep learning techniques with frameworks like Keras and Theano for complex data models.

Experienced in applying time series forecasting methods for demand prediction and sales forecasting using techniques like Moving Average and Holt-Winters.

Extensive expertise with business intelligence (BI) technologies, including OLAP, data warehousing, reporting, and querying tools, to drive data-driven business decisions.

Managed GitHub repositories including branching, tagging, and version control for efficient team collaboration and project management.

Proficient in using analytics tools such as Anaconda, Jupyter Notebooks, and R for data analysis and visualization.

Skilled in using Python packages for developing visualizations, including Seaborn, Matplotlib, ggplot, and Pygal to present data insights effectively.

Extracted and worked with data from multiple databases such as Oracle, SQL Server, DB2, MongoDB, PostgreSQL, Teradata, and Cassandra.

Strong understanding of Systems Development Life Cycle (SDLC) methodologies, including Agile and Waterfall, for managing data engineering projects.

Experienced in ingesting and processing data from various sources, such as Oracle Database, Flat Files, and CSV files, and loading it into target data warehouses.

Proficient in writing SQL queries for data extraction, transformation, and performance optimization.

Hands-on experience with a wide range of ML and DL algorithms, including logistic regression, random forest, AdaBoost, KNN, ANN, RNN, LSTM, and clustering techniques.

Developed predictive models using decision trees, random forests, Naive Bayes, logistic regression, clustering, and neural networks to derive insights from data.

Built NLP models for Topic Extraction and Sentiment Analysis using libraries like NLTK for processing text data and finding patterns.

Worked with Proof of Concepts (POCs), performing gap analysis and preparing data for exploration and analysis using data munging techniques.

Well-versed in OLTP/OLAP systems with a focus on Oracle Hyperion Suite, developing Star Schema and Snowflake Schema for relational and dimensional data modeling using Erwin tool.

Expertise in normalization and de-normalization techniques to optimize performance in both relational and dimensional databases.

Strong ability to communicate data insights effectively to both technical and non-technical stakeholders to inform decision-making.

Experienced in designing and evaluating controlled experiments for optimization, advising teams on experimental designs.

Skilled in supervised and unsupervised learning techniques for predictive modeling and analysis.

Hands-on experience in containerizing applications using Docker for efficient development and deployment workflows.

Proficient in Keras and TensorFlow, with experience in deep learning models for complex data processing tasks.

Highly skilled in using Tableau and Power BI to create interactive dashboards for data visualization and reporting.

TECHNICAL SKILLS:

Languages: Python, JavaScript, MATLAB, SAS, Spark, Docker, SQL, VBA, C++, C

Python Packages: NumPy, Pandas, Scikit-learn, TensorFlow, SciPy, Matplotlib, Seaborn, Numba, spaCy, NLTK, LightGBM, XGBoost, CatBoost, Dask, Gensim

Machine Learning: Time Series Prediction, Natural Language Processing (NLP), Support Vector Machines (SVM), Machine Intelligence, Generalized Linear Models, Clustering, Regression

Deep Learning: Machine Perception, Neural Networks, TensorFlow, Keras, PyTorch

Data Query: Azure, Google Cloud, Oracle, Amazon Redshift, Kinesis, EMR, RDBMS, Snowflake, SQL and NoSQL databases (MongoDB, Cassandra, PostgreSQL)

Cloud Platforms: Amazon AWS (Lambda, S3, SNS/SQS, DynamoDB, Glue, Redshift, Kinesis, CloudFormation, Step Functions, OpenSearch, QuickSight), GCP, Heroku, Microsoft Azure

Big Data: Hadoop, Spark (PySpark, MLlib), Kafka, HDFS, MapReduce

Visualization: Power BI, Cognos 11 Analytics, Tableau, Jupyter Notebook, IBM DataStage, AWS QuickSight

Version Control: Git, GitHub, SVN

ETL Tools: Apache NiFi, AWS Glue, Talend, Informatica

Databases: SQL Server, Oracle, PostgreSQL, MongoDB, Cassandra, MySQL

Data Modeling: Star Schema, Snowflake Schema, Data Warehousing, OLTP, OLAP, Dimensional Modeling, Erwin

Containerization: Docker, Kubernetes

Operating Systems: Linux, Windows

Data Processing: Apache Spark, Apache Hive, Apache Flume, Apache Kafka, EMR

Development Frameworks: Flask, Django

DevOps: Jenkins, CI/CD Pipelines

Others: Agile/Scrum methodologies, Data Governance, Data Security, Anaconda, Jupyter Notebooks, Gen AI, Linux shell scripts

CERTIFICATION:

Advanced Java Programming - Completed an advanced Java course from Codecademy.

Python for Data Science - Completed a Python-focused course for data science from Coursera.

AWS Certified Data Analytics – Specialty: Designed for data engineers, data architects, and analysts who work with big data solutions on AWS; validates expertise in designing, building, securing, and maintaining analytics solutions on AWS.

Azure Data Engineer Associate (DP-203): Designed for data engineers who work with Azure data services, including ingestion, transformation, storage, and security of data in the cloud.

PROFESSIONAL EXPERIENCE:

Data Engineer, Visa Inc. Jan 2025 – Present

Responsibilities:

Developed sentiment analysis using the BERT LLM by training on historical data provided by the organization to understand end-user sentiment, implemented in Python on Google Vertex AI.
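
The flow above can be illustrated with a minimal Python sketch; the model name, file name, and column are placeholder assumptions, and the Vertex AI training/deployment steps are omitted.

    # Illustrative sentiment-scoring sketch; model, file, and column names are
    # assumptions, not the production Vertex AI setup.
    import pandas as pd
    from transformers import pipeline

    feedback = pd.read_csv("historical_feedback.csv")   # assumed export with a "text" column

    sentiment = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",  # stand-in for the fine-tuned BERT model
    )

    feedback["sentiment"] = [r["label"] for r in sentiment(feedback["text"].tolist(), truncation=True)]
    print(feedback["sentiment"].value_counts())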

Used Guardrails AI with the LLM to enforce a structured response format for queries.

Designed and implemented serverless data pipelines using AWS Lambda and Glue, ensuring automated and seamless data transformation into Redshift.

Built scalable data processing workflows, leveraging AWS CloudFormation to manage infrastructure and ensure the deployment of cloud resources.

Designed, developed, and maintained robust data ingestion and transformation pipelines using Python and SQL to support high-volume data processing.

Built a fault-tolerant and scalable data processing architecture using Spark on YARN, with checkpointing, partitioning, and stateful streaming.

Optimized Spark jobs by tuning memory configurations, serialization formats, and using broadcast variables to reduce shuffle operations and job latency.
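
A rough PySpark sketch of these tuning ideas (Kryo serialization, explicit shuffle partitions, and a broadcast join); the table paths and partition count are illustrative assumptions, not the production jobs.

    # Minimal Spark tuning sketch: Kryo serialization, tuned shuffle partitions,
    # and a broadcast join to keep the large table from shuffling. Paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = (
        SparkSession.builder
        .appName("tuning-sketch")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.sql.shuffle.partitions", "400")   # sized to the cluster; value illustrative
        .getOrCreate()
    )

    transactions = spark.read.parquet("s3://bucket/transactions/")   # large fact table (placeholder)
    merchants = spark.read.parquet("s3://bucket/dim_merchants/")     # small dimension table (placeholder)

    # Broadcasting the small side keeps the join map-side and avoids shuffling the fact table.
    enriched = transactions.join(broadcast(merchants), "merchant_id")
    enriched.write.mode("overwrite").partitionBy("txn_date").parquet("s3://bucket/enriched/")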

Created custom Kafka connectors using Kafka Connect for seamless integration with RDBMS, NoSQL stores, and cloud storage systems.

Designed and developed machine learning models for classification, regression, and clustering using Python libraries like Scikit-learn and TensorFlow.
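
As a simple illustration of the scikit-learn side of this work, a hedged classification sketch; the feature table, label column, and model choice are assumptions.

    # Sketch of a classification workflow; dataset and columns are placeholders.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    df = pd.read_parquet("features.parquet")          # assumed feature table with a "label" column
    X, y = df.drop(columns=["label"]), df["label"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))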

Assisted in implementing data preprocessing pipelines including data cleaning, transformation, and normalization for both structured and unstructured data sources.

Supported the development and fine-tuning of predictive models for customer behavior analysis and fraud detection.

Collaborated with cross-functional teams to gather, analyze, and translate business requirements into functional specifications for data onboarding processes.

Performed auditing and validation of incoming client/participant data (SSN, DoB, etc.) to ensure accuracy, completeness, and de-duplication.

Utilized Linux shell scripting to automate data validation, file comparison, and integrity checks across onboarding workflows.

Identified and resolved data anomalies, inconsistencies, and duplication issues during ingestion of new client data.

Acted as a liaison between business stakeholders and technical teams to ensure alignment on data quality expectations and onboarding rules.

Supported root cause analysis for data issues and worked with downstream teams to implement remediation plans.

Developed and maintained documentation for data validation rules, exception handling processes, and business workflows.

Conducted gap analysis between incoming data and system requirements to drive continuous data quality improvements.

Collaborated with senior data scientists to build and test hypotheses, improving model accuracy through iterative experimentation.

Utilized GenAI and large language models (LLMs) for natural language understanding tasks, including sentiment analysis and text summarization.

Applied statistical techniques to validate model assumptions and ensure robustness of results under various business conditions.

Worked with tools like Pandas, NumPy, and SQL to extract, manipulate, and analyze large datasets from multiple data sources.

Contributed to building scalable data pipelines for ingesting, processing, and storing real-time data using platforms like Apache Spark or Airflow.

Implemented data quality validation and schema enforcement processes to ensure accuracy, consistency, and compliance across datasets.

Designed and developed large-scale distributed data processing systems using Apache Spark and Apache Kafka in a hybrid cloud/on-prem environment.

Built real-time and batch data pipelines leveraging Kafka Streams and Spark Streaming for high-throughput, low-latency data processing.

Implemented end-to-end ETL workflows using Apache Spark (Core, SQL, and MLlib) and integrated with data warehouses like Snowflake.

Developed robust Kafka solutions including topic design, partitioning strategy, producer/consumer configuration, and message schema management using Avro and Schema Registry.

Led the optimization and tuning of Spark jobs for performance and scalability in a distributed cluster environment.

Participated in end-to-end ML workflows, from data ingestion to model deployment, under the guidance of senior engineers.

Used TensorFlow and PyTorch to implement deep learning models for image recognition and NLP use cases.

Assisted in anomaly detection models to identify outliers and reduce operational risk in business processes.

Documented experiment design, modeling results, and feature engineering techniques to maintain transparency and reproducibility.

Supported the integration of AI/ML services into existing business applications to drive automation and decision-making.

Designed fault-tolerant data ingestion and processing pipelines using Kafka Connect, Spark Structured Streaming, and custom connectors.

Wrote efficient SQL queries and Unix shell scripts to support data extraction, transformation, and loading tasks across diverse systems.

Built and orchestrated end-to-end workflows using Apache Airflow, automating complex dependencies and reducing manual interventions.
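
A condensed example of this kind of Airflow orchestration; the DAG id, schedule, and task callables are illustrative rather than the actual workflow definitions.

    # Minimal Airflow DAG sketch: three tasks with an explicit dependency chain.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull data from source systems")        # placeholder logic

    def transform():
        print("clean and reshape the extracted data")

    def load():
        print("write results to the warehouse")

    with DAG(
        dag_id="example_etl",                          # hypothetical DAG id
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        t_extract >> t_transform >> t_load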

Supported Actuarial and analytics teams by delivering customized datasets and pipeline modifications for predictive modeling and statistical analysis.

Proactively monitored and resolved data pipeline issues, including job failures, performance bottlenecks, and data anomalies.

Utilized AWS services such as EMR, S3, Athena, and Glue to enable scalable, cloud-native data processing architectures.

Integrated AWS DynamoDB for low-latency storage and AWS S3 for scalable data storage.

Developed real-time analytics applications using AWS Quicksight and Redshift to generate insights for various business use cases.

Built and optimized AWS Glue ETL pipelines, enabling efficient data integration with Amazon Redshift and RDS for real-time data analysis and reporting.
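
A sketch of what a Glue job along these lines can look like; the catalog database, table, connection name, and column mappings are placeholders, not the actual job definitions.

    # Glue ETL sketch: read from the Data Catalog, remap columns, load into Redshift.
    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())

    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="orders"         # placeholder catalog entries
    )
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("order_id", "string", "order_id", "string"),
            ("amount", "double", "amount", "double"),
        ],
    )
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=mapped,
        catalog_connection="redshift-conn",            # assumed Glue connection name
        connection_options={"dbtable": "analytics.orders", "database": "dev"},
        redshift_tmp_dir="s3://bucket/glue-tmp/",
    )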

Developed and deployed AI/ML models using Python, improving business decision-making and data predictions.

Collaborated with cross-functional teams in an Agile environment, delivering data engineering solutions within sprint cycles.

Developed reusable libraries and components in Scala/Python for data processing and transformation tasks.

Collaborated with cross-functional teams including Data Engineers, Architects, and Business Analysts to design data-driven solutions.

Ensured data integrity and pipeline resilience by implementing checkpointing, data deduplication, and replay mechanisms in Kafka and Spark.

Monitored and troubleshot Spark and Kafka clusters using tools like YARN, Ganglia, and Kafka Manager, ensuring high system availability.

Applied best practices in data modeling, including dimensional and relational modeling, to support reporting and analytics use cases.

Created interactive dashboards and visualizations using Tableau to provide actionable insights for business teams.

Improved data querying performance by 40% by optimizing SQL queries in Redshift and leveraging window functions.

Performed data collection, data cleaning, data visualization, and text feature extraction, and derived key statistical findings to develop business strategies using Python.

Employed NLP to classify text within the dataset. Categorization involved labeling natural language texts with relevant categories from a predefined set using Python.

Performed root cause analysis and deployed fixes to ensure high pipeline reliability and minimize downtime.

Collaborated cross-functionally with platform engineering, analytics, and business teams to align data delivery with evolving business priorities.

Authored and maintained technical documentation and data dictionaries to promote transparency and knowledge sharing within the team.

Employed dbt to manage SQL-based transformations and enforce data modeling standards across the analytics layer.

Established monitoring and alerting for pipeline health, leveraging logging frameworks and performance dashboards.

Developed reusable Python modules and frameworks to accelerate development and ensure consistent engineering practices.

Managed GitHub repositories and permissions, including branching and tagging.

Applied Python and deep learning algorithms to develop an ATM cash demand prediction model that forecasts ATM cash demand with 25% more accuracy than the previous mechanism, reducing transport/logistics, freezing, and insurance costs.

Created a model to validate Health Insurance claims using ML and DL techniques.

Applied strong technical business/domain knowledge and Big Data analytics to develop and deliver multiple business analytics visualizations using TIBCO Spotfire, SAP BusinessObjects, and Crystal Reports.

Created an ML model for health insurance prediction to match customers with better health plans per their needs and help organizations avoid unforeseen losses and liabilities.

Developed a model to recognize fraudulent card transactions; the model detects both fraudulent and legitimate credit card transactions and increased detection accuracy by 30%.

Built models using Python to predict the probability of attendance for various campaigns and events.

Developed a Machine Learning CI/CD pipeline on Google Cloud.

Analyzed text to understand user sentiment over time, with data drawn from sources such as the company's official website, Twitter, Facebook, and Quora.

Performed text pre-processing phases such as tokenizing, stemming, lemmatization, stop-word removal, vocabulary phrase matching, and POS tagging using the NLTK and spaCy libraries in Python, converting raw text into structured data.
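
A condensed Python sketch of these pre-processing steps; the sample sentence and model names are illustrative.

    # Text pre-processing sketch: stop-word removal, stemming, lemmatization, POS tagging.
    import nltk
    import spacy
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("stopwords", quiet=True)
    nlp = spacy.load("en_core_web_sm")                 # small English model, assumed installed

    raw = "The customers were complaining about delayed card payments."
    doc = nlp(raw)

    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    tokens = [t.text for t in doc if t.text.lower() not in stop_words and not t.is_punct]
    stems = [stemmer.stem(t) for t in tokens]
    lemmas = [t.lemma_ for t in doc if not t.is_stop and not t.is_punct]
    pos_tags = [(t.text, t.pos_) for t in doc]
    print(stems, lemmas, pos_tags, sep="\n")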

Developed time-series ARIMA models to understand sales and the factors influencing them.
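
A minimal statsmodels sketch of such an ARIMA fit; the data file, column, and (p, d, q) order are assumptions and would normally be chosen from ACF/PACF plots or information criteria.

    # ARIMA forecasting sketch on a monthly sales series; order and inputs are illustrative.
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    sales = pd.read_csv("monthly_sales.csv", index_col="month", parse_dates=True)["revenue"]

    model = ARIMA(sales, order=(1, 1, 1))              # placeholder order
    fit = model.fit()
    print(fit.forecast(steps=6))                       # forecast the next six periods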

Created a sparse dataset using CountVectorizer, a document-term matrix, and TF-IDF vectorization, assigning an ID to each word and checking word frequencies in the corpus using Python.
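
The vectorization step can be sketched as follows; the toy corpus is purely illustrative.

    # Sparse document-term and TF-IDF matrices with scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    corpus = ["card declined at atm", "great customer service", "atm out of cash"]

    count_vec = CountVectorizer()
    dtm = count_vec.fit_transform(corpus)              # sparse document-term matrix
    print(count_vec.vocabulary_)                       # word -> column id mapping

    tfidf = TfidfVectorizer().fit_transform(corpus)    # TF-IDF-weighted version of the same matrix
    print(tfidf.shape)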

Environment: Machine learning, AWS, MS Azure, Cassandra, SAS, GitHub, Spark, HDFS, Jupyter Notebooks, Hive, Pig, Linux, Python, MySQL, Eclipse, PL/SQL, SQL connector, Spark ML.

Data Engineer, Accenture, India March 2023 – Jan 2024

Responsibilities:

Responsible for developing and implementing software applications.

Designed and implemented scalable ETL pipelines using Apache Spark (PySpark/Scala) and Hadoop Ecosystem (HDFS, Hive, Sqoop, Kafka) to process large datasets efficiently.

Developed and optimized Spark jobs using RDDs, DataFrames, and Datasets, improving data processing speed by 40% through partitioning, caching, and optimized transformations.

Ingested and processed real-time streaming data from sources like Kafka, implementing Spark Streaming for low-latency analytics and anomaly detection.
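
A hedged sketch of this kind of Kafka-to-Spark streaming read, shown here with Structured Streaming; brokers, topic, schema, and paths are placeholders, and the spark-sql-kafka package is assumed on the classpath.

    # Structured Streaming sketch: consume JSON events from Kafka, land them as Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StringType, DoubleType

    spark = SparkSession.builder.appName("stream-sketch").getOrCreate()
    schema = StructType().add("event_id", StringType()).add("amount", DoubleType())

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder brokers
        .option("subscribe", "transactions")                # placeholder topic
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    query = (
        events.writeStream
        .format("parquet")
        .option("path", "s3://bucket/stream-out/")
        .option("checkpointLocation", "s3://bucket/checkpoints/")
        .start()
    )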

Optimized Hadoop and Spark cluster performance, fine-tuning YARN resource allocation, HDFS block sizes, and Hive table partitioning, reducing query execution time by 30%.

Automated data workflows using Apache Airflow and Autosys, reducing manual intervention by 50% and ensuring reliable data orchestration.

Built scalable data lake solutions using HDFS and integrated with Hive for structured query and data analytics.

Ensured compliance with enterprise data security policies by implementing encryption, fine-grained access controls, and secure cluster configurations.

Collaborated with cross-functional teams to operationalize machine learning models, integrating them into real-time pipelines for business decision-making.

Led integration testing efforts across Spark, Hive, and Kafka layers to validate data quality and schema consistency in a distributed environment.

Worked with NoSQL databases like HBase and MongoDB to store and retrieve structured and unstructured data for real-time and batch analytics.

Implemented data governance and security best practices, ensuring data encryption, access control, and compliance with enterprise policies in a multi-tenant Hadoop environment.

Collaborated with cross-functional teams including Data Scientists, DevOps, and Business Analysts to deploy machine learning models and integrate analytics solutions into production systems.

Automated job scheduling and workflow orchestration using Apache Airflow, Autosys, and shell scripting, streamlining data pipeline execution and reducing manual intervention by 50%.

Environment: Spark Streaming, Kafka, Hive, HBase, MongoDB, Airflow, Autosys.

Big Data Engineer, Capgemini, India Jan 2018 – March 2023

Responsibilities:

Assisted in the design and development of software applications; involved in gathering and analyzing business requirements and designing the Hadoop stack per requirements.

Imported and exported data into HDFS and Hive using Sqoop. Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.

Data Processing & Transformation: Developed and optimized PySpark scripts to process large-scale datasets efficiently on Apache Spark, improving ETL pipeline performance.

Data Migration: Led data migration projects, transferring terabytes of structured and unstructured data from On-Prem Clusters to AWS (S3, EMR, Glue, Redshift, Athena) ensuring minimal downtime and data integrity.

Big Data Ecosystem Management: Managed Apache Spark and Hive clusters for data warehousing and analytics, optimizing query performance and resource utilization.

AWS Cloud Integration: Designed and implemented serverless data processing workflows using AWS Glue, Lambda, and Step Functions, automating data ingestion and transformation pipelines.

Designed and developed complex ETL pipelines using PySpark, Hive, and HDFS, supporting high-volume batch data processing for enterprise analytics.

Migrated terabytes of data from on-prem Hadoop clusters to AWS cloud storage (S3, Glue, EMR, Redshift) with zero data loss and minimal downtime.

Built serverless workflows using AWS Glue, Lambda, and Step Functions to automate ingestion and transformation processes.

Designed scalable data architectures integrating Cassandra, MongoDB, and MySQL to support hybrid transactional/analytical use cases.

Managed Apache Spark and Hive clusters, tuning configurations for performance, cost optimization, and scalability in multi-tenant environments.

Created Hive partitioning and bucketing strategies, improving query performance for large datasets by 60%.
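
A rough sketch of the partitioning/bucketing idea, expressed here with Spark's writer API against the Hive metastore; the actual strategy may have been defined in native HiveQL, and table, column, and bucket-count values are placeholders.

    # Partitioned, bucketed table layout: partition pruning on order_date,
    # bucketing on customer_id for bucketed joins. Names and counts are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().appName("layout-sketch").getOrCreate()

    sales = spark.read.parquet("s3://bucket/raw/sales/")     # placeholder source

    (sales.write
          .mode("overwrite")
          .partitionBy("order_date")
          .bucketBy(32, "customer_id")
          .sortBy("customer_id")
          .format("parquet")
          .saveAsTable("analytics.sales_partitioned"))

    # A date filter now scans only the matching partitions.
    spark.sql(
        "SELECT SUM(amount) FROM analytics.sales_partitioned WHERE order_date = '2022-06-01'"
    ).show()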

Used Apache Airflow for end-to-end data pipeline scheduling and monitoring, enhancing job visibility and failure recovery processes.

Implemented logging, monitoring, and alerting using AWS CloudWatch and Spark UIs to support proactive maintenance and issue resolution.

Cluster Management & Optimization: Monitored and optimized on-premises Hadoop and Spark clusters, tuning configurations to enhance processing efficiency and reduce costs.

SQL & Hive Query Optimization: Wrote and optimized complex Hive queries, leveraging partitioning and bucketing techniques to improve execution speed for large datasets.

Data Pipeline Automation: Built and scheduled end-to-end ETL workflows using Apache Airflow, ensuring seamless data movement across environments.

Performance Tuning & Debugging: Diagnosed and resolved performance bottlenecks in Spark jobs, leveraging Spark UI, YARN logs, and AWS CloudWatch for monitoring and troubleshooting.

Collaboration & Stakeholder Communication: Worked closely with data analysts, data scientists, and business teams to ensure data availability and consistency across platforms.

Environment: AWS (S3, EMR, Glue), Spark, Hive, Airflow, Cassandra, Hadoop

EDUCATION:

Bachelor's in Computer Science, JNTUH, India


