Machine Learning Data Engineer

Location:
Texas City, TX
Posted:
September 10, 2025

Resume:

Sreya Reddy Golipally

*********************@*****.*** Ph: 469-***-****

www.linkedin.com/in/sreya-somireddy-100647287

PROFESSIONAL SUMMARY:

Data Engineer with 9+ years of experience in designing, building, and maintaining scalable ETL/ELT pipelines, data ingestion frameworks, and cloud data platforms.

Experienced with machine learning algorithms such as logistic regression, KNN, SVM, random forest, neural networks, linear regression, lasso regression, and k-means.

Adept in statistical programming languages like Python and R.

Strong expertise in SQL (Postgres, Redshift, SQL Server, MySQL) and hands-on experience with Azure Data Factory (ADF), AWS Glue, DBT, Airflow, and Redshift.

Experienced Data Engineer skilled in designing and optimizing Snowflake data warehouses, focusing on efficient data ingestion, transformation, and cost management.

Proven experience applying supervised and unsupervised machine learning (ML) techniques such as regression, classification, and clustering to manage data-driven programs.

Expertise in Exploratory Data Analysis (EDA), using numerical summaries and relevant visualizations to drive feature engineering and assess feature importance.

Expertise in Data Analysis, Data Migration, Data Profiling, Data Cleansing, Transformation, Integration, Data Import, and Data Export using multiple ETL tools.

Proficient in Machine Learning, Data/Text Mining, Statistical Analysis & Predictive Modelling.

Experienced in building data models using machine learning (ML) techniques for classification, regression, clustering, and association rule mining.

Strong experience provisioning virtual clusters on AWS using services such as EC2.

Proficient in data visualization tools such as Matplotlib in Python to create visually powerful, actionable interactive reports and dashboards.

Strong understanding of supervised and unsupervised machine learning (ML) algorithms such as random forest, SVM, clustering, and neural networks.

Bring a passionate vision and a flexible, customer-focused approach that drive the successful delivery of major projects.

Adept at troubleshooting data quality issues, monitoring pipeline performance, and collaborating with cross-functional Agile teams.

TECHNICAL SKILLS:

Languages: Python, R, SQL

Python Libraries/Packages: NumPy, SciPy, Boto, Pickle, PySide, PyTables, pandas (DataFrames), Matplotlib, SQLAlchemy, httplib2, urllib2, BeautifulSoup, PyQuery

Machine Learning Methods: Classification, regression, prediction, dimensionality reduction, density estimation, and clustering applied to problems in retail, manufacturing, marketing science, finance, and banking

Statistical Methods: Hypothesis Testing, ANOVA, Time Series, Confidence Intervals, Bayes' Law, Principal Component Analysis (PCA), Dimensionality Reduction, Cross-Validation, Autocorrelation

Cloud Data Systems: Azure, AWS (Redshift, Kinesis, EMR)

Reporting & Visualization: Tableau, Crystal Reports, SSRS

Databases: SQL Server, MySQL, Postgres, MongoDB

Methodologies: Agile, Scrum, and Waterfall

Version Control Tools: SVN, GitHub

Operating Systems: Linux and Windows

PROFESSIONAL EXPERIENCE:

Mabrey Bank, Bixby, OK Jan 2024 – Present

Data Engineer

Applied supervised machine learning (ML) algorithms such as logistic regression, decision tree, and random forest for predictive modelling of various types of problems.

Partnered with business teams to translate reporting requirements into efficient database-layer solutions.

Used pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing various machine learning (ML) algorithms.

Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
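
A minimal sketch of this kind of cleaning and scaling step, using pandas and NumPy on a hypothetical customer extract (the file name and column names are illustrative, not the actual project data):

```python
import numpy as np
import pandas as pd

# Hypothetical input: a raw customer extract with duplicates, missing values, and skewed amounts.
df = pd.read_csv("customers_raw.csv")

# Basic cleaning: drop exact duplicates, fill missing numeric values with column medians.
df = df.drop_duplicates()
num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Simple feature engineering: log-transform a skewed amount and bucket account tenure.
df["log_balance"] = np.log1p(df["balance"])
df["tenure_bucket"] = pd.cut(df["tenure_months"], bins=[0, 12, 36, 120],
                             labels=["new", "mid", "long"])

# Min-max scaling of the numeric features before handing off to a model.
df[num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
```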

Implemented query optimization techniques in Snowflake, including clustering, partitioning, and materialized views, resulting in significant query performance improvements.
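
A hedged illustration of the Snowflake tuning described above (a clustering key plus a materialized view), issued through the snowflake-connector-python client; the table, columns, and connection parameters are placeholders rather than the production values:

```python
import snowflake.connector

# Placeholder credentials; in practice these would come from a secrets manager.
conn = snowflake.connector.connect(account="my_account", user="etl_user", password="***",
                                    warehouse="ANALYTICS_WH", database="SALES", schema="PUBLIC")
cur = conn.cursor()

# Define a clustering key so micro-partitions are pruned on common filter columns.
cur.execute("ALTER TABLE fact_transactions CLUSTER BY (transaction_date, region_id)")

# Materialize an expensive daily aggregate so dashboards hit a precomputed result.
cur.execute("""
    CREATE OR REPLACE MATERIALIZED VIEW mv_daily_sales AS
    SELECT transaction_date, region_id, SUM(amount) AS total_amount
    FROM fact_transactions
    GROUP BY transaction_date, region_id
""")
cur.close()
conn.close()
```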

Led sprint planning, backlog refinement, and sprint retrospectives to ensure timely and efficient delivery of data solutions in an Agile environment.

Integrated Apache Kafka with Spark and Hadoop for real-time data processing, ensuring minimal latency and efficient message delivery.

Integrated YARN with Hadoop ecosystems, boosting the throughput of data pipelines and reducing job failures by optimizing memory and resource allocation strategies.

Integrated Glue with AWS Lake Formation for secure, governed data access.

Performed data profiling and quality checks in Postgres to ensure consistency before ETL execution.

Set up storage and data analysis tools in the AWS cloud computing infrastructure.

Developed tools using Python, shell scripting, and XML to automate routine tasks.

Managed partitioning and compression in Glue ETL jobs to reduce query cost in Redshift.

Managed DBT documentation and lineage tracking, improving data discoverability for analysts.

Participated in all phases of data mining: data collection, data cleaning, model development, validation, and visualization; performed gap analysis.

Optimized Kafka producers and consumers to handle millions of events per second, improving data pipeline performance and reliability.
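
A hedged sketch of producer-side tuning of this sort, using the confluent-kafka client; the broker addresses, topic name, and specific settings are illustrative, not the production configuration:

```python
from confluent_kafka import Producer

# Batching, compression, and idempotence settings trade a little latency for
# much higher throughput; real values would be tuned against production load.
producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "linger.ms": 20,                 # wait briefly to build larger batches
    "batch.size": 512 * 1024,        # larger batches per partition
    "compression.type": "lz4",       # cheap compression for high event rates
    "enable.idempotence": True,      # avoid duplicates on retries
    "acks": "all",
})

def delivery_report(err, msg):
    # Called once per message; surfaces broker-side failures for alerting.
    if err is not None:
        print(f"Delivery failed for key {msg.key()}: {err}")

for event in (b'{"id": 1}', b'{"id": 2}'):
    producer.produce("transactions", value=event, callback=delivery_report)

producer.flush()
```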

Successfully managed and scaled clusters using Apache YARN, providing dynamic resource scheduling and fault tolerance for high availability in mission-critical applications.

Monitored Redshift workload management (WLM) queues to optimize query performance.

Collaborated with business teams to translate reporting needs into DBT models for dashboards.

Developed and optimized data workflows, integrating Spark, Kafka, and Snowflake for scalable data ingestion and processing pipelines.

Led the implementation and optimization of Apache YARN in the cluster management architecture, improving resource allocation and job scheduling efficiency for large-scale data processing.

Built reusable SQL views and stored procedures to support BI reporting and analytics.

Optimized Glue jobs using PySpark scripts for handling semi-structured JSON/XML data.
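
A minimal PySpark sketch of this pattern, flattening semi-structured JSON and writing partitioned Parquet so Redshift scans less data; the S3 paths, field names, and nesting are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("glue-style-json-flatten").getOrCreate()

# Hypothetical S3 locations; a real Glue job would read these from job arguments.
raw_path = "s3://example-raw-bucket/events/"
curated_path = "s3://example-curated-bucket/events_parquet/"

# Read semi-structured JSON; Spark infers a nested schema.
events = spark.read.json(raw_path)

# Flatten the nested payload and derive a partition column.
flat = (events
        .withColumn("event_date", F.to_date("event_timestamp"))
        .select("event_id", "event_date",
                F.col("payload.customer_id").alias("customer_id"),
                F.col("payload.amount").cast("double").alias("amount")))

# Write compressed, partitioned Parquet so downstream queries prune by date.
(flat.write.mode("overwrite")
     .partitionBy("event_date")
     .option("compression", "snappy")
     .parquet(curated_path))
```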

Extracted data from databases, copied it into the HDFS file system, and used Hadoop tools such as Hive and Pig to retrieve the data required for building models.

Architected and designed robust data pipelines using Apache Spark and Kafka, integrating data from diverse sources to enable real-time analytics.

Tuned Redshift clusters by managing distribution keys, sort keys, and compression encodings.
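
An illustrative Redshift DDL for this kind of tuning, issued here through psycopg2 (Redshift speaks the Postgres protocol); the cluster endpoint, table, columns, and encodings are hypothetical examples:

```python
import psycopg2

# Placeholder connection details; real credentials would come from a secrets store.
conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com", port=5439,
                        dbname="analytics", user="etl_user", password="***")
cur = conn.cursor()

# Distribution on the join key co-locates rows, the sort key speeds range filters,
# and explicit column encodings reduce storage and I/O.
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_orders (
        order_id      BIGINT        ENCODE az64,
        customer_id   BIGINT        ENCODE az64,
        order_date    DATE          ENCODE az64,
        order_amount  DECIMAL(12,2) ENCODE az64,
        order_status  VARCHAR(32)   ENCODE zstd
    )
    DISTKEY (customer_id)
    SORTKEY (order_date)
""")
conn.commit()
cur.close()
conn.close()
```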

Configured AWS DMS tasks to replicate Postgres data into Redshift with minimal downtime.

Monitored and troubleshot pipeline issues, adding alerting/recovery workflows in Airflow.

Integrated DBT workflows with Airflow for automated execution, ensuring smooth CI/CD deployment.

Built custom Airflow operators and hooks to connect Postgres, Redshift, and AWS S3 pipelines.

Implemented alerting and monitoring in Airflow, ensuring timely failure notifications and rerun strategies.
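
A minimal sketch of how such a DAG could wire ingestion, transformation, and validation with retries and a failure callback; the DAG id, task callables, and schedule are placeholders, not the production pipeline:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_failure(context):
    # Placeholder alerting hook; production would send a Slack/email notification here.
    print(f"Task {context['task_instance'].task_id} failed")

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_failure,
}

with DAG(
    dag_id="postgres_to_redshift_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_from_postgres", python_callable=lambda: print("extract"))
    transform = PythonOperator(task_id="transform_models", python_callable=lambda: print("transform"))
    validate = PythonOperator(task_id="validate_row_counts", python_callable=lambda: print("validate"))

    # Dependencies: ingestion, then transformation, then validation.
    extract >> transform >> validate
```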

Migrated legacy relational databases into AWS using DMS full-load and CDC (Change Data Capture) methods.
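
A hedged boto3 sketch of creating such a DMS task; the ARNs, schema, and task identifier are placeholders, and full-load-and-cdc performs the initial copy followed by change streaming:

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Map which source schemas/tables to replicate; "%" wildcards every table in the schema.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-public-schema",
        "object-locator": {"schema-name": "public", "table-name": "%"},
        "rule-action": "include",
    }]
}

# Placeholder ARNs for the source, target, and replication instance.
dms.create_replication_task(
    ReplicationTaskIdentifier="postgres-to-redshift-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
```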

Developed complex SQL queries in Redshift to support reporting and predictive models.

Used AWS S3, DynamoDB, AWS Lambda, and AWS EC2 for data storage and model deployment.

Tuned Redshift workloads and pipelines, improving performance and reducing query costs.

Developed SQL scripts in Postgres for complex joins, aggregations, and stored procedures supporting business intelligence use cases.

Updated Python scripts to match training data with our database stored in AWS CloudSearch so that each document could be assigned a response label for further classification.

Configured AWS DMS tasks for continuous replication from Postgres into Redshift with minimal downtime.

Environment: Python (Pandas, NumPy, SciPy, Scikit-learn, NLTK, Matplotlib, Seaborn), Hadoop (HDFS, Hive, Pig Latin), Apache Spark (MLLib), AWS (S3, EC2, Lambda, DynamoDB, Cloud Search), MySQL, JSON, XML, Shell Scripting.

National Western Life Insurance, Austin, TX Jan 2022 – Dec 2023

Data Engineer

Gathered, retrieved, and organised data and used it to reach meaningful conclusions.

Conducted data quality checks and implemented corrective measures, ensuring alignment with regulatory standards.

Designed and implemented a high-throughput Kafka-based event-driven architecture to support real-time data feeds into business intelligence tools.

Orchestrated ETL workflows in Apache Airflow, scheduling Python and SQL tasks for data cleansing, ingestion, and transformation.

Collaborated with business teams to migrate on-prem SQL Server workloads to Azure SQL Database for scalable analytics.

Supported troubleshooting and performance tuning of cluster-based data workloads using Apache YARN, addressing bottlenecks and resource contention for Spark jobs.

Participated in Agile Scrum teams, collaborating with product managers and developers to deliver incremental updates to the data platform.

Built and optimized Azure Data Factory pipelines for ingestion of financial datasets into Redshift and Snowflake (POC).

Configured Azure Key Vault and Managed Identities for secure credential management in data pipelines.

Derived data from relational databases to perform complex data manipulations and conducted extensive data checks to ensure data quality. Performed data wrangling to clean, transform, and reshape the data using the NumPy and pandas libraries.

Designed DBT macros and tests to enforce data quality, ensuring accuracy of curated financial reporting layers.

Designed and maintained CI/CD pipelines using Azure DevOps to streamline data pipeline deployments.

Coordinated with DBAs to assess migration performance and optimize table mappings in DMS.

Built layered models (staging, intermediate, marts) in DBT to support consistent data lineage.

Configured Airflow DAGs to manage dependencies between ingestion, transformation, and validation steps.

Developed Snowflake schema for organizing large datasets, optimizing both data storage and analytical query performance.

Enabled continuous replication pipelines via DMS to keep source and target data synchronized.

Implemented data ingestion from Postgres to Redshift, enabling downstream reporting and analytics.

Performed data modelling operations using Power BI, pandas, and SQL.

Researched customer characteristics extensively and designed multiple models tailored to the client's needs; performed extensive behavioural modelling and customer segmentation to discover customer behaviour patterns using K-means clustering.
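
A minimal scikit-learn sketch of K-means customer segmentation of this kind, run here on synthetic stand-in features (spend, frequency, recency) rather than the actual customer data; the number of clusters is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer behaviour features (spend, frequency, recency).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))

# Standardize so no single feature dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Fit K-means with a small number of segments; in practice k would be chosen
# with the elbow method or silhouette scores.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
segments = kmeans.fit_predict(X_scaled)

print("Customers per segment:", np.bincount(segments))
```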

Orchestrated ETL workflows in Apache Airflow, scheduling Python and SQL tasks for data cleansing and transformation.

Orchestrated Airflow DAGs integrating Python, SQL, and DBT tasks for automated CI/CD data workflows.

Designed and optimized Postgres schemas and indexes to improve query performance for high-volume transactional datasets.

Performed data modelling and built reusable reporting objects to support actuarial and compliance reporting.

Environment: Python, R, Hadoop (Hive, Impala), SQL, ETL Tools, Tableau, scikit-learn, NLTK, NumPy, Pandas, Matplotlib.

Virinchi Technologies Ltd, Hyderabad, India Mar 2018 – Aug 2021

Data Engineer

Applied various machine learning algorithms and statistical models such as decision tree, Naive Bayes, logistic regression, and linear regression in Python to determine the accuracy of each model.

Developed Informatica ETL mappings and workflows for ingestion from multiple source systems into a centralized data warehouse.

Developed collaborative filtering-based recommendation engines using Python and R to recommend retail products.

Integrated diverse data sources into Snowflake from on-premises and cloud-based systems, ensuring seamless and scalable data ingestion pipelines.

Analyzed data trends using packages like NumPy, Pandas and Matplotlib in Python.

Created sentiment analysis model and complex query model of Twitter data using Hadoop ecosystem, HiveQL, Impala, and regular expression.

Designed and implemented data quality rules and profiling checks to identify anomalies before data loads.

Designed and implemented data warehousing solutions using Snowflake, optimizing data storage and query performance for large-scale analytics.

Built real-time data streaming pipelines using Apache Kafka, integrating data from multiple sources for near-instantaneous processing and analysis.

Performed Exploratory Data Analysis and Data Visualizations.

Performed text analytics on review data with machine learning (ML) techniques in Python using NLTK.

Generated ETL mappings, sessions, and workflows based on business user requirements to load data from source files and RDBMS tables into target tables.

Used a metadata tool to import metadata from the repository, add new job categories, and create new data elements.

Built data pipelines integrating Python and SQL to support predictive modelling and reporting dashboards.

Used text mining on review data to determine customers' main areas of concern.

Designed and implemented a probabilistic churn prediction model on data for 80k customers, predicting the probability of customer churn using logistic regression in Python.
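
A hedged sketch of such a churn model with scikit-learn, using synthetic features in place of the real customer attributes; the feature construction and split ratio are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for ~80k customers with a handful of behavioural features.
rng = np.random.default_rng(0)
X = rng.normal(size=(80_000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=80_000) > 0.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Logistic regression yields a churn probability per customer, not just a hard label.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
churn_prob = model.predict_proba(X_test)[:, 1]

print("Hold-out AUC:", round(roc_auc_score(y_test, churn_prob), 3))
```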

Designed reporting views and materialized tables to improve performance for business intelligence teams.

Performed data wrangling to clean, transform, and reshape the data using the NumPy and pandas libraries.

Designed Tableau bar graphs, scatter plots, and geographical maps to create detailed summary reports and dashboards.

Migrated datasets into cloud platforms (AWS S3 + Redshift), enabling downstream analytics.

Environment: Python, NumPy, Pandas, Matplotlib, NLTK, Tableau, Agile, GitHub, Windows.

W3softech, Hyderabad, India July 2015 – Feb 2018

Data Engineer

Gathered, documented, and implemented business requirements for analysis or as part of a long-term document/report generation.

Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing various machine learning (ML) algorithms.

Implemented predictive analytics and machine learning (ML) algorithms to forecast key metrics, delivered as dashboards deployed on AWS (S3/EC2).

Participated in all phases of data mining: data collection, data cleaning, model development, validation, and visualization; performed gap analysis.

Worked in data science with Python, conducting experiments and campaigns on the PGC tool using A/B testing and collecting data from varied data sources.

Programmed a utility in Python that used multiple packages (NumPy, SciPy, pandas).

Detected source data anomalies using SQL queries and improved the overall data load operations by 23%.

Used Python to develop machine learning (ML) algorithms such as linear regression, multivariate regression, Naive Bayes, random forests, and K-means, based on supervised and unsupervised models.

Managed AWS EC2 instances using Auto Scaling groups and used ticketing tools such as JIRA to monitor work.

Provided support for data processes, including monitoring data, profiling database usage, troubleshooting, tuning, and ensuring data integrity.

Environment: Python, Pandas, NumPy, SciPy, Seaborn, Matplotlib, Scikit-learn, NLTK, SQL, AWS (S3, EC2, Auto Scaling), JIRA, PGC Tool, Predictive Analytics, Machine Learning (Supervised & Unsupervised), Data Mining, Data Cleaning, Data Visualization.


