
Data Scientist Machine Learning

Location:
Cleveland, OH
Posted:
June 03, 2025


Resume:

SAIKIRAN THODETI

Sr. AI/ML Data Scientist

Mobile: +1-216-***-****

Email ID: *********@*****.***

Linkedin: https://www.linkedin.com/in/sai-kiran-2355221b5/

Background Summary:

Senior Data Scientist with 11+ years of experience in developing and deploying scalable AI/ML solutions, specializing in customer analytics, personalization, and real-time data applications. Strong background in statistical modeling, causal inference, and experimentation, with hands-on experience in A/B testing, uplift modeling, and performance measurement. Proven success in working with cross-functional teams to optimize product and marketing strategies using advanced machine learning and big data tools. Passionate about leveraging data to improve user engagement and drive advertiser ROI through intelligent targeting and auction optimization. Experienced in designing, building, and maintaining ETL jobs and data pipelines to integrate data from sources such as Kafka, S3, SFTP servers, and RDBMSs, and from data stores such as HBase, Hive, Athena, and DynamoDB.

Experienced in NLP model development for Smart Inventory Management and sentiment analysis.

Hands-on experience across the Hadoop ecosystem, including Big Data technologies such as HDFS, MapReduce, YARN, Apache Cassandra, HBase, Hive, Oozie, Impala, Pig, ZooKeeper, Flume, Kafka, Sqoop, and Spark.

Expertise in leveraging GCP, Databricks, and Snowflake for scalable data solutions.

Hands-on experience in machine learning algorithms such as Linear Regression, GLM, CART, SVM, KNN, LDA/QDA, Naive Bayes, Random Forest, and Boosting.

Strong technical background with a focus on integrating AI technologies into development environments.

Experience in developing real-time analytics, data pipelines (data ingestion, cleaning, enrichment), and prediction engines using Java, Akka, Apache Kafka, Storm, Flink, Sqoop, Flume, Spark Streaming, Spark MLlib, Scikit-learn, Cassandra, MongoDB, the Cloudera/HDP stack (HDFS, HBase, Impala, MapReduce, YARN, Oozie, Pig, Hive, Kerberos/SASL), and Elasticsearch.

Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data set processing and storage; experienced in maintaining the Hadoop cluster on AWS EMR. Hands-on experience with Amazon RDS, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Redshift, DynamoDB, and other services of the AWS family.

Experienced in data manipulation using Python for loading and extraction, as well as with Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computations.
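
A minimal Pandas/NumPy sketch of the kind of loading and cleaning described above (the synthetic frame, column names, and thresholds are illustrative assumptions):

    import numpy as np
    import pandas as pd

    # Small synthetic extract standing in for a raw source file
    df = pd.DataFrame({
        "order_id": [1, 2, 2, 3, 4],
        "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-20", "2024-02-02", "2024-02-15"]),
        "amount": [120.0, 85.5, 85.5, np.nan, 4200.0],
    })

    # Cleaning: drop duplicate orders, fill missing amounts, cap extreme outliers
    df = df.drop_duplicates(subset="order_id")
    df["amount"] = df["amount"].fillna(df["amount"].median())
    df["amount"] = np.clip(df["amount"], 0, df["amount"].quantile(0.95))

    # Simple monthly aggregation for downstream analysis
    monthly = df.groupby(df["order_date"].dt.to_period("M"))["amount"].agg(["count", "sum", "mean"])
    print(monthly)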

Developed end-to-end machine learning workflows using Vertex AI Pipelines.

Extensively worked on Spark using Python and Scala on the cluster for computational analytics, installed it on top of Hadoop, and performed advanced analytical applications by making use of Spark with Hive and SQL/Oracle/Snowflake.

Good Understanding of Data mining, Data Preprocessing, Machine Learning, and Analyzing large data sets using Python, T-SQL, Power BI & Tableau.

In-depth understanding of Enterprise Data Warehouse systems and Dimensional Modeling using Facts, Dimensions, Star Schema, Snowflake Schema, and OLAP Cubes such as MOLAP, ROLAP, and HOLAP (hybrid).

Executed various OLAP operations of Slicing, Dicing, Roll-Up, Drill-Down, and Pivot on multidimensional data and analyzed reports with the Analysis ToolPak in MS Excel.

Vast experience in using Excel, Minitab, Python, R, Tableau, and Qlik Sense to generate visualizations including density charts, tree maps, choropleth maps, page trails, and animations to better express data.

Ability to quickly learn new technology, business domains, and processes.

Known for being a sharp analytical thinker who can approach work with logic and enthusiasm.

Proficient in using ETL packages to apply control flow tasks (For Loop, For Each Loop, Execute SQL task) and data flow tasks using various transformations (Conditional Split, Data Conversion, Lookup, Derived Column, Merge, Union All).

Expert in exploratory data analysis (EDA) with Python and visualization tools such as Tableau, Power BI, Python Matplotlib, Seaborn, and R ggplot2 to identify patterns, correlations, and data quality issues.

Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra, Teradata databases, and MongoDB.

Hands-on experience in creating real-time data streaming solutions using Apache Spark Core, Spark SQL, and DataFrames.

Extensive knowledge in implementing, configuring, and maintaining Amazon Web Services (AWS) such as EC2, S3, Redshift, Glue, and Athena for processing, high availability, fault tolerance, and scalability, along with Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, DynamoDB, CloudWatch, CloudFormation, CloudTrail, OpsWorks, Kinesis, IAM, SQS, SNS, and SES.

Expertise in developing Spark applications for interactive analysis, batch processing, and stream processing, using programming languages like PySpark, Scala, and Java.

Advanced knowledge in Hadoop-based Data Warehousing (Hive) and database connectivity (Sqoop).

Ample experience using Sqoop to ingest data from RDBMS - Oracle, MS SQL Server, Teradata, PostgreSQL, and MySQL.

Experience in working with various streaming ingest services with Batch and Real-time processing using Spark Streaming, Kafka.

Proficient in using Spark API for streaming real-time data, staging, cleaning, applying transformations and preparing data for machine learning needs.

Extensive knowledge in working with Amazon EC2 to provide a solution for computing, query processing, and storage across a wide range of applications.

Expertise in using AWS S3 to stage data and to support data transfer and data archival. Experience in using AWS Redshift for large-scale data migrations.

Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data and used Spark DataFrame operations to perform required validations in the data.

Good understanding and knowledge of NoSQL and relational databases like MongoDB, PostgreSQL, HBase, and Cassandra.

Strong experience in core Java, Scala, SQL, PL/SQL and Restful web services.

Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL), and used UDFs from the Piggybank UDF repository.

Experience in Microsoft Azure/Cloud Services like SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, and Azure Data Factory.

Worked on various programming languages using IDEs like Eclipse, NetBeans, and IntelliJ, along with tools such as PuTTY and Git.

Working experience with Linux distributions such as Red Hat and CentOS.

Technical Skills:

Languages:

Python, Java, Shell Scripting

Cloud Platform:

AWS (Amazon Web Services), Microsoft Azure, GCP

Reporting Tools:

Business Objects, Crystal Reports.

Tools & Software:

TOAD, MS Office, BTEQ, Teradata SQL Assistant.

ETL Tools:

Pentaho, Informatica Power, SAP Business Objects XIR3.1/XIR2, Web Intelligence.

Big Data Technologies:

Yarn, Spark SQL, Kafka, Presto, Hadoop, HDFS, Hive, Pig, HBase, Sqoop, Flume.

BI Tools:

SSIS, SSRS, SSAS.

Modeling Tools:

IBM InfoSphere, SQL Power Architect, Oracle Designer, Erwin, ER/Studio, Sybase PowerDesigner.

Database Tools:

Oracle 12c/11g/10g, MS Access, Microsoft SQL Server, Teradata, PostgreSQL, DB2

Operating System:

Windows, Mac OS, Linux/Unix

Other tools:

TOAD, SQL*Plus, SQL*Loader, MS Project, MS Visio, and MS Office; also worked with C++, UNIX, PL/SQL, etc.

Machine Learning/AI Technologies:

Regression, Classification, Clustering, Dimensionality Reduction, Ensemble Methods (Random Forest), Neural Nets, Deep Learning (CNN, RNN, LSTM), Natural Language Processing (BERT), Decision Tree, Naïve Bayes, LLM, RAG, multi-agent frameworks

Work Experience:

Client: JB Hunt Transport Services, Lowell, AR

February 2022 to Present

Role: Sr. AI/ML Data Scientist

Responsibilities:

Designed and deployed machine learning models to improve customer segmentation, personalized targeting, and cross-channel engagement, resulting in a 20% increase in user retention and marketing ROI.

Led the design and execution of controlled experiments and A/B testing frameworks to evaluate the impact of feature rollouts and personalization strategies.

Applied causal inference techniques (including propensity score matching and uplift modeling) to distinguish correlation from causation in user behavior and marketing attribution.
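
A minimal sketch of propensity score matching of the kind described above (the synthetic data, covariate names, treatment flag, and outcome column are illustrative assumptions, not the production implementation):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import NearestNeighbors

    # Synthetic observational data standing in for real user-level data
    rng = np.random.default_rng(0)
    n = 2000
    df = pd.DataFrame({
        "age": rng.integers(18, 70, n),
        "tenure_months": rng.integers(1, 60, n),
        "avg_weekly_sessions": rng.uniform(0, 20, n),
    })
    df["treated"] = (df["avg_weekly_sessions"] + rng.normal(0, 5, n) > 10).astype(int)
    df["outcome"] = 0.5 * df["treated"] + 0.05 * df["avg_weekly_sessions"] + rng.normal(0, 1, n)

    features = ["age", "tenure_months", "avg_weekly_sessions"]

    # 1. Estimate propensity scores P(treated | X)
    ps_model = LogisticRegression(max_iter=1000).fit(df[features], df["treated"])
    df["pscore"] = ps_model.predict_proba(df[features])[:, 1]

    # 2. Match each treated unit to its nearest control on propensity score
    treated = df[df["treated"] == 1]
    control = df[df["treated"] == 0]
    nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
    _, idx = nn.kneighbors(treated[["pscore"]])
    matched_control = control.iloc[idx.ravel()]

    # 3. Estimate the average treatment effect on the treated (ATT)
    att = treated["outcome"].mean() - matched_control["outcome"].mean()
    print(f"Estimated ATT: {att:.3f}")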

Built end-to-end ML pipelines using Python, Spark, and AWS SageMaker, automating training, evaluation, and deployment of models for real-time recommendations.
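
A minimal sketch of the training/evaluation portion of such a pipeline, shown here with a local scikit-learn Pipeline on synthetic data; feature names are assumptions, and SageMaker packaging/deployment of the persisted artifact is not shown:

    import joblib
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Synthetic customer features standing in for the real feature-store extract
    rng = np.random.default_rng(0)
    X = pd.DataFrame({
        "recency_days": rng.integers(1, 365, 500),
        "frequency": rng.integers(1, 50, 500),
        "monetary": rng.uniform(10, 500, 500),
        "channel": rng.choice(["email", "push", "web"], 500),
        "segment": rng.choice(["new", "active", "lapsed"], 500),
    })
    y = rng.integers(0, 2, 500)

    preprocess = ColumnTransformer([
        ("num", StandardScaler(), ["recency_days", "frequency", "monetary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel", "segment"]),
    ])
    pipeline = Pipeline([
        ("prep", preprocess),
        ("model", GradientBoostingClassifier()),
    ])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pipeline.fit(X_train, y_train)
    print("AUC:", roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1]))

    # Persist the fitted artifact for packaging into a serving container
    joblib.dump(pipeline, "model.joblib")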

Collaborated closely with product managers, marketing stakeholders, and engineering teams to define success metrics, model KPIs, and actionable insights, driving strategy for personalized ad delivery.

Contributed to auction and bidding model enhancements by simulating ad ranking mechanisms and optimizing conversion probabilities.

Facilitated knowledge transfer of delivered work to customer internal teams, ensuring ongoing support and usability of solutions. Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.

Developed and deployed deep learning models for NLP tasks, optimizing performance through MLOps practices.

Provide technical guidance on LLM and RAG architecture, helping the offshore team define integration points and overall workflow for AI deployments.

Built RESTful APIs using FastAPI and Flask for seamless integration of ML models into applications.
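
A minimal FastAPI sketch of this kind of model-serving endpoint (the field names and the model.joblib artifact are assumptions, carried over from the training sketch above):

    import joblib
    import pandas as pd
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI(title="Churn scoring service")
    model = joblib.load("model.joblib")  # assumed artifact from the training pipeline

    class Features(BaseModel):
        recency_days: int
        frequency: int
        monetary: float
        channel: str
        segment: str

    @app.post("/predict")
    def predict(features: Features):
        # Wrap the request payload in a one-row DataFrame so the pipeline's
        # ColumnTransformer sees the expected column names
        row = pd.DataFrame([features.dict()])
        proba = float(model.predict_proba(row)[:, 1][0])
        return {"churn_probability": proba}

    # Run locally with: uvicorn app:app --reload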

Migrate data from traditional database systems to Azure databases.

Demonstrated expertise in AI-specific utilities, including proficiency in ChatGPT, Hugging Face Transformers, and associated data analysis methods, highlighting a comprehensive understanding of advanced artificial intelligence tools and techniques.

Developed NLP models for sentiment analysis using Vertex AI's AutoML capabilities.

Expert knowledge of AI/ML application lifecycles and workflows, from data ingestion to model deployment, in cloud environments like Azure, AWS, and GCP.

Deployed and fine-tuned LLMs, including Azure OpenAI and Llama 2/3 models, to create a chatbot that finds relevant content from organization documents (process documents). This chatbot improved processes and reduced content search time by providing better summaries.
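
A heavily simplified sketch of a retrieval-augmented chatbot flow of the kind described above, assuming the openai Python SDK (>=1.0) with Azure OpenAI; the endpoint, key, API version, and deployment names are placeholders, and the in-memory document list stands in for a real document index:

    import numpy as np
    from openai import AzureOpenAI

    # Placeholder endpoint, key, API version, and deployment names
    client = AzureOpenAI(
        azure_endpoint="https://<resource>.openai.azure.com",
        api_key="<key>",
        api_version="2024-02-01",
    )

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
        return np.array([d.embedding for d in resp.data])

    documents = [
        "Expense reports are filed in the finance portal.",
        "VPN access requests go through the IT service desk.",
    ]
    doc_vectors = embed(documents)

    def answer(question):
        q_vec = embed([question])[0]
        # Cosine similarity to pick the most relevant process-document chunk
        sims = doc_vectors @ q_vec / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec))
        context = documents[int(np.argmax(sims))]
        resp = client.chat.completions.create(
            model="gpt-4o",  # deployment name is an assumption
            messages=[
                {"role": "system", "content": "Answer using only the provided context."},
                {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
            ],
        )
        return resp.choices[0].message.content

    print(answer("Where do I file an expense report?"))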

Mentor Data Scientists/ML Engineers to get up to speed and start contributing to the client project.

Participate in customer communication to discuss challenges and provide development status updates.

Collaborate with cross-functional teams to gather requirements and explore AI solutions for their problems by developing proofs of concept (PoCs) utilizing the latest technologies like Azure OpenAI and Azure Search.

Build complex distributed systems involving large-scale data handling, metrics collection, data pipeline construction, and analytics.

Oversee the deployment and fine-tuning of LLM models, ensuring compliance with customer-specific requirements and optimizing performance.

Experience managing Azure Data Lakes (ADLS) and Data Lake Analytics and an understanding of how to integrate with other Azure Services like Synapse and Azure Data Factory.

Extracted indicators of service variability that caused bad customer experience by clustering similar tickets using NLP techniques such as document embeddings and topic modeling.

Implemented real-time inference solutions on Vertex AI for dynamic user recommendations.

Environment: Azure, SQL, ETL, ML, LLM, AWS, GCP, RAG, MLOps, Data Warehouse, Python, Vertex AI

Client: CBRE, Dallas, TX

August 2020 to January 2022

Role: Sr. Data Scientist/ML Engineer

Responsibilities:

Designed and optimized data processing workflows for ML applications, ensuring high availability and reliability.

Implemented Python-based k-means clustering via PySpark to analyze the spending habits of different customer groups.
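
A minimal PySpark k-means sketch of the kind of clustering described above (the spend features, values, and k are illustrative assumptions):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler, StandardScaler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("customer-spend-clusters").getOrCreate()

    # Hypothetical per-customer spend features
    df = spark.createDataFrame(
        [(1, 120.0, 4, 30.0), (2, 900.0, 20, 45.0), (3, 60.0, 2, 30.0), (4, 1100.0, 25, 44.0)],
        ["customer_id", "total_spend", "num_orders", "avg_basket"],
    )

    # Assemble and scale features before clustering
    assembler = VectorAssembler(inputCols=["total_spend", "num_orders", "avg_basket"], outputCol="raw_features")
    scaler = StandardScaler(inputCol="raw_features", outputCol="features")
    assembled = assembler.transform(df)
    scaled = scaler.fit(assembled).transform(assembled)

    # Fit k-means and attach a cluster label to each customer
    kmeans = KMeans(k=2, seed=42, featuresCol="features", predictionCol="cluster")
    model = kmeans.fit(scaled)
    model.transform(scaled).select("customer_id", "cluster").show()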

Utilized Pandas and NumPy for data cleaning, feature engineering, and normalization to prepare datasets for modeling.

Designed and implemented predictive models using TensorFlow and PyTorch, experimenting with various deep learning architectures including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to handle sequential data effectively.

Used Python to create statistical models involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forests, Decision Trees, and Support Vector Machines for estimating the risks of welfare dependency.
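
A minimal scikit-learn sketch combining standardization, PCA, and logistic regression, in the spirit of the modeling described above (the synthetic dataset stands in for the real risk data):

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for the risk dataset (features and labels are illustrative)
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)

    # Standardize, reduce dimensionality with PCA, then fit a logistic regression
    clf = make_pipeline(StandardScaler(), PCA(n_components=5), LogisticRegression(max_iter=1000))
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print("Mean cross-validated AUC:", scores.mean())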

Developed NLP models for sentiment analysis using Vertex AI's AutoML capabilities.

Led the design and implementation of a customer segmentation project using AWS S3 for data storage, Python, and Pandas for data manipulation, applying K-means clustering in Scikit-learn to segment customers, enhancing marketing strategies.

Developed a GAN-based model to generate high-quality synthetic images for training a computer vision system, significantly improving its accuracy and robustness.

Applied advanced natural language processing (NLP) methodologies to extract insights from unstructured data sources.

Integrated AI/ML models and APIs into production on AWS SageMaker.

Designed and developed natural language processing (NLP) pipelines to enhance search relevance and user experience by integrating semantic search capabilities.

Implemented real-time inference solutions on Vertex AI for dynamic user recommendations.

Worked with cross-functional teams (including the data engineering team) to extract data and rapidly execute from MongoDB through the MongoDB Connector for Hadoop.

Conducted performance testing and benchmarking of cognitive search systems to identify bottlenecks and optimize system scalability and response times.

Worked with large amounts of cloud image storage to identify faces of the same person and faces with similar features using the NumPy, Seaborn, PIL, Matplotlib, Pandas, OpenCV, and scikit-learn libraries.

Created a Flask API to process input failure log files, generate summarized content, and integrate this with a Large Language Model (LLM) to produce concise text summaries.

Developed different Python workflows triggered by events from other systems. Collected, analyzed, and interpreted raw data from various clients' REST APIs.

Created interactive dashboards in Tableau that provide a high-level overview of transaction activities and fraud detection metrics. Used Tableau’s built-in statistical tools to perform analyses like correlation studies, regression analysis, or time-series forecasting.

Developed PySpark and Spark SQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations based on the STMs developed.

Integrated Vertex AI with BigQuery for seamless data analysis and model training.

Applied supervised machine learning algorithms for predictive modeling to tackle various types of problems.

Deployed LLMs in customer interaction systems to enhance virtual assistants and chatbots.

Environment: Python, R, Tableau, Power BI, Machine Learning (Scikit-Learn, Keras, PyTorch), Generative AI, Deep Learning, Natural Language Processing, Cognitive Search, Data Analysis (Pandas, NumPy), Vertex AI, SQL, NoSQL (MySQL, PostgreSQL), Django Web Framework, HTML, XHTML, AJAX, CSS, JavaScript, XML, JSON, Flask, Agile Methodologies, SCRUM Process, LLM

Client: Qualcomm, San Diego, CA Sep 2019 to July 2020

Role: Data Scientist

Responsibilities:

Built a real-time pipeline for streaming data using Kafka and Spark Streaming.
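
A minimal sketch of a Kafka-to-Spark streaming job of the kind described above, using the Structured Streaming API; the broker address, topic name, and event schema are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json, window
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("kafka-events-stream").getOrCreate()

    # Hypothetical JSON event schema and Kafka topic/broker names
    schema = StructType([
        StructField("device_id", StringType()),
        StructField("metric", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "device-metrics")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    # 5-minute windowed averages, written to the console sink for illustration
    agg = events.groupBy(window(col("event_time"), "5 minutes"), col("device_id")).avg("metric")
    query = agg.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()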

Wrote HiveQL as per the requirements, processed data in the Spark engine, and stored results in Hive tables.

Imported existing datasets from Oracle to the Hadoop system using Sqoop.

Responsible for importing data from Postgres to HDFS and Hive using the Sqoop tool.

Developed stored procedures/views in Snowflake and used them in Talend for loading Dimensions and Facts.

Sqoop jobs and Hive scripts were created for data ingestion from relational databases to compare with historical data.

Hive tables were created on HDFS to store the data processed by Apache Spark on the Cloudera Hadoop Cluster in Parquet format.

Wrote a tool in Scala and Akka that scrubs numerous files in Amazon S3, removing unwanted characters and performing other cleanup activities.

Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.

Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala.

Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, database triggers to be used by the team and satisfying the business rules.

Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources

Developed and implemented an R and Shiny application that showcases machine learning for business forecasting. Developed predictive models using Python and R to predict customer churn and classify customers.

Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP.

Rapid model creation in Python using pandas, NumPy, sklearn, and plot.ly for data visualization. These models are then implemented in SAS where they are interfaced with MSSQL databases and scheduled to update on a timely basis.

Environment: MapReduce, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, AWS, Kafka, JSON, XML, PL/SQL, SQL, HDFS, Unix, Python, PySpark.

Client: Deutsche Bank, Cary, NC

Jun 2016 to Aug 2019

Role: Data Scientist

Responsibilities:

Extensively worked with Spark-SQL context to create data frames and datasets to preprocess the model data.

Involved in designing the row key in HBase to store Text and JSON as key values in HBase table and designed row key in such a way to get/scan it in a sorted order.

Ran routine reports on a scheduled basis as well as ad-hoc reports based on key performance indicators.

Developed DataStage jobs to cleanse, transform, and load data to the Data Warehouse, and sequencers to encapsulate the DataStage job flow.

Responsible for ETL design (identifying the source systems, designing source-to-target relationships, data cleansing, data quality, creating source specifications, and ETL design documents).

Installed and configured Airflow for workflow management and created workflows in Python.
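
A minimal Airflow DAG sketch of the kind of Python workflow described above, assuming Airflow 2.x-style imports; the DAG id, schedule, and task callables are placeholders:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Hypothetical callables standing in for the real extract/transform/load steps
    def extract(): ...
    def transform(): ...
    def load(): ...

    with DAG(
        dag_id="daily_warehouse_load",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        # Run the steps in sequence
        t_extract >> t_transform >> t_load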

Involved in creating UNIX shell scripts. Performed defragmentation of tables, partitioning, compression, and indexing for improved performance and efficiency.

Wrote JUnit tests and integration test cases for those microservices.

Worked in Azure environment for development and deployment of Custom Hadoop Applications.

Developed a NiFi workflow to pick up multiple files from an FTP location and move them to HDFS on a daily basis.

Environment: Linux, Erwin, SQL Server, Crystal Reports 9.0, HTML, DTS, SSIS, Informatica, DataStage Version, Azure, Oracle, Toad, MS Excel, Pow.

Client: Vedic Soft Solutions, PVT LTD, India Dec 2013 - Apr 2016

Role: Data Analyst

Responsibilities:

Gained knowledge in creating Tableau dashboards for reporting on analyzed data.

Extracted, transformed, and loaded data from source systems to the Data Warehouse using a combination of SSIS, T-SQL, and Spark SQL.

Worked with the NLTK library for NLP data processing and pattern discovery.

Worked on migration and conversion of data using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data using Python.
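
A minimal PySpark sketch of multi-format extraction, transformation, and aggregation as described above (file paths, column names, and formats are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, sum as spark_sum

    spark = SparkSession.builder.appName("multi-format-etl").getOrCreate()

    # Hypothetical source paths for two different file formats
    orders_csv = spark.read.option("header", True).csv("/data/raw/orders.csv")
    customers_json = spark.read.json("/data/raw/customers.json")

    # Transform: type casting, join, and aggregation
    orders = orders_csv.withColumn("amount", col("amount").cast("double"))
    joined = orders.join(customers_json, on="customer_id", how="inner")
    summary = joined.groupBy("region").agg(spark_sum("amount").alias("total_amount"))

    # Load: write the aggregated result as Parquet for downstream analysis
    summary.write.mode("overwrite").parquet("/data/curated/sales_by_region")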

Applied the DataFrame API to complete data manipulation within the Spark session.

Created data quality scripts to compare data built from the Spark DataFrame API.

Designed and developed ETL integration patterns using Python on Spark.

Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.

Performed ETL transformation activities in SSIS, built several packages, and loaded data to the Data Warehouse.

Performed data cleaning and feature selection using the MLlib package in PySpark and worked with deep learning frameworks such as Caffe and Keras.

Involved in writing stored procedures in T-SQL to do the transformations of the data.

Involved in Data Modeling

Engage with business users to gather requirements, design visualizations and provide training to use self-service BI tools.

Environment: SSIS, SSRS, Report Builder, Office, Excel, Flat Files, NLTK, MLlib, MLflow, T-SQL, MS SQL Server, SQL Server Business Intelligence Development Studio.


