Rajakoti Dasari
Data Engineer (AI & ML)
Austin, TX | 540-***-**** | ***************@*****.*** | LinkedIn | US Citizen

PROFESSIONAL SUMMARY
Data Engineer (AI/ML) with over 6 years of experience designing, building, and optimizing scalable data architectures across cloud platforms including Azure, AWS, and GCP. Proven success in developing end-to-end batch and streaming pipelines, modernizing analytics infrastructure, and enabling machine learning workflows across industries such as finance, healthcare, and e-commerce. Expertise in big data processing (Apache Spark, Hadoop, Hive), real-time data streaming (Kafka, Pub/Sub, Dataflow), and data warehousing (Snowflake, Redshift, BigQuery). Strong background in Python (PySpark, Pandas, Polars), SQL, and MLOps tools including SageMaker, Azure ML, Kubernetes, and TensorFlow/PyTorch. Adept at partnering with cross-functional teams to deploy data solutions that drive measurable business outcomes, such as reducing data latency, accelerating ML model deployment, and improving analytics performance. Known for a pragmatic approach to data quality, pipeline reliability, and production scalability in Agile/Scrum environments.
TECHNICAL SKILLS
Programming Languages: Python, SQL, Java, Scala, Bash
Big Data & Distributed Computing: Apache Spark (PySpark, Spark SQL), Hadoop (HDFS, YARN, MapReduce, Hive, Pig)
Streaming & Messaging Systems: Apache Kafka, Spark Structured Streaming, AWS Kinesis, GCP Pub/Sub
ETL / ELT & Orchestration Tools: Apache Airflow, dbt (Data Build Tool), Talend, AWS Glue, Azure Data Factory, SSIS
Machine Learning & Feature Engineering: TensorFlow, PyTorch, scikit-learn, XGBoost, pandas, NumPy, Polars, PySpark MLlib, feature engineering, data preprocessing
Model Deployment & MLOps: AWS SageMaker, Azure ML, TensorFlow Serving, TorchServe, MLflow, Kubeflow
DevOps & Infrastructure Automation: Docker, Kubernetes, Jenkins, Terraform, GitHub Actions, Git, CI/CD pipelines, IaC
Cloud Platforms: AWS (S3, EMR, Redshift, Glue, Lambda), Azure (Data Lake Storage, Synapse, Databricks, Azure ML), Google Cloud Platform (BigQuery, Cloud Storage, Dataflow, Pub/Sub, Vertex AI)
Databases & Data Warehousing: PostgreSQL, MySQL, SQL Server, Oracle, MongoDB, Cassandra, DynamoDB, Snowflake
Data Visualization & Business Intelligence: Power BI, Tableau, Looker, Plotly, Seaborn, Matplotlib
Methodologies & Practices: Agile (Scrum, SAFe), DataOps, data lake architecture, data modeling, system design

EDUCATION
Post-Graduation in Machine Learning & Artificial Intelligence, University of Texas at Austin, USA
Master’s in Engineering, Indian Institute of Science (IISc), India
Bachelor’s in Electrical & Electronics Engineering, GMRIT, Rajam, India

PROFESSIONAL EXPERIENCE
Join Parachute LLC, TX | Data Engineer (AI & ML) | July 2022 – Present
• Architected a scalable AWS data lake using S3 (Parquet/Delta Lake), AWS Glue, and EMR, centralizing 8TB+ of structured/unstructured data daily, enabling 50% faster pipeline development via Athena-based federated queries for AI/ML feature stores.
• Optimized ETL/ELT workflows by implementing incremental loading and parallel processing in PySpark, reducing data pipeline execution time by 40% and cutting cloud compute costs by $15K/month.
• Developed hybrid storage solutions by integrating PostgreSQL (structured data) with MongoDB (unstructured trade logs), reducing data access latency by 25% for AI-powered customer applications.
• Built real-time ML monitoring dashboards using Spark Streaming, Tableau, and Python (Plotly, Dash), reducing decision latency by 20% through dynamic visualization of model drift and prediction accuracy.
• Automated MLOps pipelines using Jenkins, Terraform, and CI/CD, streamlining Spark/Airflow deployments by 30% and ensuring reliability for batch/streaming ML workflows.
• Implemented SQL-driven hyperparameter tuning for production ML models by leveraging Snowflake UDFs and dynamic SQL queries, optimizing model accuracy by 15% while reducing training costs by 20% through efficient parameter search.
• Led Agile initiatives as a Scrum lead, mentoring 4 junior engineers in sprint planning, backlog refinement, and iterative delivery of data/ML pipelines, improving team velocity by 35%.

BlackRock, NY | Data Engineer | April 2021 – July 2022
• Designed and deployed 20+ scalable data pipelines on Azure using Azure Data Factory, Databricks, and Azure Synapse Analytics, accelerating ETL performance by 40% and reducing cloud costs.
• Implemented distributed processing of large-scale financial data using Apache Spark (batch and streaming), orchestrated with Apache Airflow, reducing analytics latency by 35%.
• Rebuilt analytics infrastructure with dbt, Snowflake, and Power BI, delivering modular data transformations, row-level security, and automated data lineage, accelerating reporting speed by 50%.
• Built automated data engineering frameworks in Python using Pandas, Polars, and PySpark, improving dataset reliability and cutting manual validation time by 60%.
• Developed a high-throughput real-time streaming platform with Apache Kafka, processing over 300K market events/second with sub-100ms latency for risk management and trading applications.
• Containerized PyTorch and TensorFlow models using Azure ML and deployed them via Kubernetes for real-time ML inference, reducing rebalancing costs by 18%.
• Engineered SQL-driven hyperparameter tuning workflows using dynamic SQL to automate grid searches across 30+ ML model configurations, improving prediction accuracy by 12%.

UnitedHealth Group, MA | Data Engineer | May 2020 – March 2021
• Engineered scalable ETL pipelines with AWS Glue and Amazon Redshift to automate ingestion from clinical and claims systems, improving data availability for analytics by 35%.
• Optimized large-scale batch data workflows on Hadoop using YARN and MapReduce, reducing execution time by 30% and ensuring SLA compliance for nightly jobs.
• Developed DynamoDB-backed microservices using event-driven architecture, reducing data retrieval latency by 25% and enabling dynamic autoscaling for patient-facing apps.
• Created reusable PySpark pipelines integrated with AWS SageMaker for ML feature engineering, boosting predictive performance in medication adherence models by 18%.
• Automated legacy data integration via SQL Server Integration Services (SSIS), streamlining workflows across Medicare, pharmacy, and EHR systems to ensure reliable and consistent data flow.
• Enhanced distributed data processing with Hive (HiveQL) and Apache Pig (Pig Latin), improving distributed computing efficiency by 30%.
• Collaborated with data scientists to deploy and monitor ML models for fraud detection, integrating retraining workflows and reducing claim review turnaround time by 22%.
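The medication adherence features mentioned above can be illustrated with a proportion-of-days-covered (PDC) calculation, a standard adherence metric. This is a minimal pure-Python sketch, not the actual PySpark/SageMaker pipeline; the function name, field layout, and sample data are illustrative assumptions.

```python
from datetime import date

def pdc(fills, period_start, period_end):
    """Proportion of days covered: the share of days in the observation
    window on which the patient had medication on hand.
    `fills` is a list of (fill_date, days_supply) tuples."""
    total_days = (period_end - period_start).days + 1
    covered = set()  # ordinal day numbers with medication on hand
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date.toordinal() + offset
            if period_start.toordinal() <= day <= period_end.toordinal():
                covered.add(day)
    return len(covered) / total_days

# Illustrative example: two 30-day fills inside a Jan-Mar 2020 window.
fills = [(date(2020, 1, 1), 30), (date(2020, 2, 15), 30)]
score = pdc(fills, date(2020, 1, 1), date(2020, 3, 31))
```

In a production pipeline a feature like this would typically be expressed as a grouped aggregation over claim records rather than a per-patient loop, but the definition of the metric is the same.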
Wayfair, MA | Data Engineer | January 2019 – May 2020
• Built and maintained batch and streaming pipelines using Google Cloud Platform tools (Pub/Sub, Dataflow, BigQuery) to support search, personalization, and ad attribution, increasing personalization throughput by 25%.
• Developed scalable big data workflows using Hadoop, Apache Spark, and Hive, improving data integration efficiency across structured and unstructured sources by 20%.
• Optimized ETL workflows for product and logistics data using Apache Spark (EMR) and PySpark, reducing job execution time by 30%.
• Created dynamic dashboards and visualizations using Power BI and Python libraries (Plotly, Seaborn), improving executive reporting and strategic decision-making.
• Engineered and optimized SQL data models in PostgreSQL and Snowflake, enhancing reporting performance and data accessibility.
• Acted as Scrum Lead in a SAFe Agile environment, coordinating sprint planning and stand-ups to ensure timely delivery of key data products, including a real-time recommendation API completed in under 3 weeks.
• Partnered with cross-functional teams (Data Science, Product, and Analytics) to define SLAs and metric standards, enabling consistent reporting such as daily SKU-level demand metrics.
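A daily SKU-level demand metric like the one mentioned above is, at its core, a grouped rollup over order lines. The sketch below shows the shape of that query; table and column names are assumptions, and in-memory SQLite stands in for the actual warehouse (PostgreSQL/Snowflake/BigQuery).

```python
import sqlite3

# Toy order-line data standing in for a warehouse fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_date TEXT, sku TEXT, qty INTEGER);
INSERT INTO orders VALUES
  ('2020-01-01', 'SKU-1', 2),
  ('2020-01-01', 'SKU-1', 3),
  ('2020-01-01', 'SKU-2', 1),
  ('2020-01-02', 'SKU-1', 4);
""")

# Daily SKU-level demand: units sold and order-line count per (day, SKU).
rows = conn.execute("""
    SELECT order_date, sku, SUM(qty) AS units, COUNT(*) AS order_lines
    FROM orders
    GROUP BY order_date, sku
    ORDER BY order_date, sku
""").fetchall()
```

Defining the metric once in SQL (or a dbt model) and having every team read from the same rollup is what makes the SLA/metric-standard agreements above enforceable.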