Post Job Free

Data Engineer Machine Learning

Location:
Richardson, TX
Salary:
80000
Posted:
October 15, 2025

Contact this candidate

Resume:

PRASHANTH THAMSHETTI

DATA ENGINEER

TX, USA 201-***-**** *********.*****@*****.*** LinkedIn

SUMMARY

Data Engineer with 5 years of experience delivering scalable data solutions across healthcare, finance, and telecom, leveraging Azure Databricks, Spark SQL, Snowflake, Apache Hadoop, and Hive to modernize pipelines and build regulatory-compliant data ecosystems.

Specialized in orchestrating end-to-end ingestion and transformation workflows using AWS Glue, S3, Redshift, Apache Airflow, PostgreSQL, MongoDB, and Oracle, enabling machine learning readiness and reducing manual processing workloads across global analytics platforms.

Proven expertise in data visualization and reporting through Power BI, Tableau, Plotly, and SSRS, creating 250+ interactive dashboards that improved decision-making for clinical operations and financial risk monitoring.

Strong proficiency in predictive modeling and quality frameworks with Scikit-learn, Talend Data Quality, and Informatica, achieving measurable improvements in default detection accuracy and anomaly identification across multi-million record datasets.

Hands-on experience in deploying and optimizing production systems using Docker, Jenkins, Kubernetes, Google BigQuery, and SSIS, enhancing cluster performance, reducing downtime by 18%, and ensuring high-frequency release reliability.

SKILLS

Languages: Python, SQL, R

Packages & Libraries: NumPy, Pandas, SciPy, Scikit-learn, TensorFlow, PySpark

Visualization Tools: Tableau, Power BI, Advanced Excel (Pivot Tables, VLOOKUP), Matplotlib, Plotly, Seaborn

IDEs & Notebooks: Visual Studio Code, PyCharm, Jupyter Notebook

Cloud Platforms: Amazon Web Services (AWS), Azure Databricks, EC2, S3, Glue, Redshift, Athena, AWS Lambda, Snowflake, DynamoDB, Kinesis, SQS, SNS

Databases: MySQL, PostgreSQL, MongoDB, Oracle, SQL Server, DynamoDB

Big Data Technologies: Apache Airflow, Apache Spark, Apache Hadoop, Apache Kafka, ETL/ELT, HDFS, Hive, NiFi

Other Technical Skills: SSIS, SSRS, SSAS, Docker, Kubernetes, Jenkins, Terraform, Informatica, Talend, Dataiku, Google BigQuery, Data Quality and Governance, Machine Learning Algorithms, Big Data, Advanced Analytics, Statistical Methods, Data Mining, Data Warehousing, Git, GitHub

Operating Systems: Windows, Linux, Mac

Methodologies: SDLC, Agile, Waterfall

WORK EXPERIENCE

HCA Healthcare, TX, USA Data Engineer May 2024 – Present

Automated the ingestion of clinical trial datasets from S3 to Redshift using AWS Glue, enabling seamless integration of over 12 data sources, and reducing manual processing time by 45% across analytics teams.

Engineered distributed data transformation pipelines using Apache Hadoop and Hive, processing over 2 TB of clinical trial datasets to standardize reported and lab data, ensuring alignment with CDISC standards and regulatory compliance frameworks.

Created 16 subject-level analytics datasets using Snowflake, leveraging dynamic partitioning and metadata tagging to facilitate real-time clinical monitoring, leading to improved protocol deviation detection across oncology trials.

Built and maintained 20+ production-grade Apache Airflow DAGs for scheduling extraction from PostgreSQL, MongoDB, and Oracle systems, ensuring timely data availability for machine learning model training and validation.
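The extraction DAGs described above can be sketched as follows. This is an illustrative configuration sketch only, assuming a running Airflow 2.x deployment; the dag_id, source names, and the extract_table helper are hypothetical placeholders, not artifacts from the actual project.

```python
# Hypothetical sketch of a nightly multi-source extraction DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_table(source: str, table: str) -> None:
    """Placeholder: pull `table` from `source` and land it for downstream ML use."""
    ...

with DAG(
    dag_id="nightly_source_extracts",   # hypothetical name
    start_date=datetime(2024, 5, 1),
    schedule="0 2 * * *",               # nightly at 02:00
    catchup=False,
) as dag:
    # One extraction task per (source system, table) pair.
    for source, table in [("postgres", "visits"),
                          ("mongo", "enrollments"),
                          ("oracle", "labs")]:
        PythonOperator(
            task_id=f"extract_{source}_{table}",
            python_callable=extract_table,
            op_kwargs={"source": source, "table": table},
        )
```

Generating one task per source table in a loop keeps the DAG file short while still giving each extraction its own retry and monitoring unit in the scheduler.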

Developed interactive monitoring dashboards in Power BI and Plotly to visualize patient enrollment, visit trends, and dropout rates, supporting clinical operations teams across 5 global study locations.

Containerized micro-batch ETL services using Docker and orchestrated deployments with Jenkins, reducing release failure rates by 28% and improving deployment frequency across the organization's cloud-native data platform.

Collaborated with cross-functional teams including clinicians, data scientists, and compliance officers, fostering strong communication and leadership to ensure alignment of technical deliverables with clinical research objectives.

Deloitte, India Data Engineer Sept 2020 – Nov 2022

Designed and automated complex ETL workflows using Informatica to integrate 14 disparate financial data sources, ensuring schema validation and producing harmonized datasets for reporting across treasury and investment teams.

Built advanced visualization dashboards in Tableau to track liquidity, derivative exposure, and credit line utilization, enabling business users to monitor 42+ KPIs in real time with drill-down capability.

Orchestrated large-scale event-driven data pipelines with Apache Kafka, handling 1.2 million daily trade settlement messages and ensuring low-latency processing for monitoring of settlement failures across global trading desks.

Conducted predictive modeling workflows using Scikit-learn, developing classification models for early default detection and achieving a 17% improvement in accuracy compared to previous legacy rules-based identification.
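A minimal sketch of the kind of Scikit-learn classification workflow described above, trained on synthetic data; the feature names, numbers, and model choice here are hypothetical illustrations, not details of the actual Deloitte models.

```python
# Illustrative default-detection sketch on synthetic borrower data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical features: [debt_to_income, utilization, late_payments]
X = rng.normal(size=(1000, 3))
# Synthetic label: default grows more likely as the features grow.
y = (X.sum(axis=1) + rng.normal(scale=0.5, size=1000) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"holdout accuracy: {acc:.2f}")
```

A logistic model is a common first step past rules-based flags because its coefficients stay auditable, which matters in credit-risk settings.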

Deployed interactive workflow monitoring solutions via SSIS and SSRS, generating 240+ scheduled reports monthly and enabling senior managers to review compliance and operational reports in centralized portals without manual intervention.

Implemented data quality frameworks with Talend Data Quality, identifying anomalies in 3.8 million daily records, enforcing cleansing routines, and ensuring adherence to Basel regulatory standards across loan and credit portfolios.
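Talend Data Quality is a GUI/ETL tool, but the anomaly rule it enforces can be illustrated with a stdlib stand-in. This sketch flags records whose amount deviates from the batch mean by more than a chosen number of standard deviations; the threshold and sample values are hypothetical.

```python
# Stdlib sketch of a z-score data-quality rule for anomaly identification.
from statistics import mean, stdev

def find_anomalies(amounts, threshold=2.0):
    """Return indices of values more than `threshold` std devs from the mean."""
    mu = mean(amounts)
    sigma = stdev(amounts)
    if sigma == 0:
        return []
    return [i for i, x in enumerate(amounts)
            if abs(x - mu) / sigma > threshold]

# One outlier (5000.0) hidden in an otherwise tight batch of settlements.
batch = [100.0, 102.5, 99.8, 101.2, 100.7, 5000.0, 98.9, 100.1]
print(find_anomalies(batch))  # → [5]
```

In production such a rule would typically use robust statistics (median/MAD) since a large outlier inflates the standard deviation and can mask itself at stricter thresholds.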

Built scalable analytical queries on Google BigQuery, enabling portfolio segmentation for 25 million accounts and supporting risk modeling teams with faster simulation turnaround by reducing runtime from 46 minutes to 21 minutes.
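The segmentation query pattern described above can be sketched with stdlib sqlite3 as a local stand-in for Google BigQuery; the SQL shape (CASE bucketing plus GROUP BY) is the same, while the table and column names are invented for illustration.

```python
# Portfolio-segmentation sketch: bucket accounts into risk segments.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER, balance REAL, days_past_due INTEGER);
    INSERT INTO accounts VALUES
        (1, 12000.0,  0),
        (2,   300.0, 95),
        (3,  4500.0, 35),
        (4,  8800.0,  0);
""")

# Assign each account a segment, then count accounts per segment.
rows = conn.execute("""
    SELECT CASE
             WHEN days_past_due >= 90 THEN 'default'
             WHEN days_past_due >= 30 THEN 'delinquent'
             ELSE 'current'
           END AS segment,
           COUNT(*) AS n_accounts
    FROM accounts
    GROUP BY segment
    ORDER BY segment
""").fetchall()
print(rows)  # → [('current', 2), ('default', 1), ('delinquent', 1)]
```

On BigQuery the same query runs unchanged over tens of millions of rows; pushing the bucketing into SQL rather than client code is what keeps the simulation turnaround short.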

Containerized batch-processing jobs with Kubernetes, automating resource allocation across compute nodes, reducing downtime incidents by 18% & ensuring consistent scaling performance during quarterly reporting workloads.

Optimized storage and retrieval layers with Hive on Hadoop, reducing report generation timelines by 32% while cutting infrastructure overhead costs by 11%, directly improving profitability of enterprise reporting infrastructure.

Tata Consultancy Services, India Data Analyst Mar 2019 – Aug 2020

Led the design and implementation of scalable data pipelines using Azure Databricks, reducing data processing time by 25% and enabling faster insights for stakeholders.

Spearheaded the optimization of Spark SQL queries for data transformation and aggregation, improving query performance by 30% and enhancing overall system efficiency.

Led cross-functional teams to integrate Azure Databricks with Azure Data Lake Storage and Azure Synapse Analytics, enabling seamless data flow and real-time analytics capabilities.

Directed the automation of data ingestion and ETL processes using Azure Data Factory and Databricks Notebooks, reducing manual intervention by 40% and improving operational efficiency.

Developed and led the implementation of advanced data visualization solutions using Power BI connected to Azure Databricks, delivering actionable insights to stakeholders and improving decision-making efficiency.

Led the monitoring and optimization of Databricks clusters for cost-efficiency and performance, achieving a 20% reduction in cloud infrastructure costs.

Initiated and conducted data quality assessments, implementing data cleansing routines that improved data accuracy by 15%.

EDUCATION

MS in Business Analytics and Artificial Intelligence, University of Texas at Dallas, Texas, USA (Jan 2023 – Dec 2024)

Integrated Dual Degree in Civil Engineering (B.Tech + M.Tech), JNTUH College of Engineering, Hyderabad, India (Jul 2013 – Jan 2019)

CERTIFICATIONS

AWS Certified Cloud Practitioner - LINK

Microsoft Certified: Power BI Data Analyst Associate - LINK

Microsoft Certified: Azure Data Engineer Associate - LINK
