Data Engineer

Location:

Posted:

October 15, 2025

Resume:

Yash Pankhania

+1-857-***-**** ****.************@*****.*** linkedin.com/in/yash-pankhania github.com/Draconian10 Portfolio Website

Education

Northeastern University Boston, MA

Master of Science in Information Systems; GPA: 3.88 Dec 2024

• Relevant Coursework: Data Engineering with LLMs, Prompt Engineering, Machine Learning in Fintech, Advanced Data Science, Designing Data Architectures for BI, Data Science, Database Design

University of Mumbai Mumbai, India

Bachelor of Engineering in Computer Science May 2018

• Relevant Coursework: Artificial Intelligence, Data Warehouse and Mining, Database Management System

Professional Experience

AI Engineer Feb 2025 – Present

Humanitarians AI Boston, MA

• Created an AI-based Case Study portal, Case Crackers, leveraging GPT-4o with fine-tuning capabilities and Pinecone to store vector embeddings, containerized with Docker and integrated with CI/CD pipelines to boost feedback precision by 25%.

• Orchestrated OpenAI API integrations with ELT pipelines on Databricks for feature engineering and real-time data processing, using Supabase for data storage and optimizing data pipelines to generate 100,000+ contextual learning scenarios.

AI Engineer Co-Op Jan 2024 – Aug 2024

Northeastern University, Office of the Provost Boston, MA

• Developed the Patient Persona Chat Bot using a fine-tuned LLaMA 3 model on a Nursing dataset with retrieval augmented generation (RAG) via LangChain, deployed on Azure AI Studio, improving outcomes for 200+ nursing students by 30%.

• Implemented a BERT-based skill extraction and similarity search model across 2,000+ course syllabi using Amazon SageMaker, Amazon S3, and Amazon RDS, enabling real-time, low-latency querying and seamless integration with analytics dashboards.

Research Assistant – Data Specialist Sep 2023 – Jun 2024

Northeastern University, D’Amore-McKim School of Business Boston, MA

• Processed 17.8M MIMIC-IV records into Snowflake using Snowpark, dbt and an Agentic Workflow for data transformation, boosting pipeline efficiency by 20%, and ensuring collaboration with version control and automated deployments via Jenkins.

• Designed MediAssist by fine-tuning a LLaMA 3 model from Hugging Face to perform patient summarization from EHR notes, extract relevant ICD codes, and classify patient risk categories with an F1-score of 0.87, adhering to HIPAA privacy standards.

Data Engineer Dec 2018 – Jul 2022

Tata Consultancy Services Mumbai, India

• Engineered PySpark jobs on Amazon EMR to cleanse high-volume raw stock market data (Bank of America), provisioning infrastructure with Terraform and orchestrating robust data pipelines using Airflow, achieving a 95% improvement in data accuracy.

• Optimized HiveQL queries to process 350,000+ daily records, integrated with AWS Glue ETL workflows and Apache Flink for real-time stream transformations, and delivered monitoring dashboards using Amazon QuickSight in an Agile environment.

Academic Projects

Yelp Data Analysis on Azure Databricks Big Data Technologies, Northeastern University Sep 2024 – Oct 2024

• Processed and transformed 6+ GB of the Yelp dataset stored in Parquet format using PySpark on Azure Databricks, deploying REST APIs via FastAPI to enable real-time querying and achieving a 30% improvement in query performance.

• Designed and orchestrated an end-to-end data pipeline with Azure Data Factory, storing data in Azure Data Lake Storage and building data visualizations in Databricks that rank 20,000+ restaurants in Phoenix by customer feedback.

New York City Motor Vehicle Collisions Data Ingestion and Visualization, Northeastern University May 2024 – Jun 2024

• Conducted data profiling and loaded datasets totaling around 11 million records into staging tables using Talend, migrated ingestion workflows to AWS Glue and stored intermediate datasets in Amazon S3.

• Modeled and transformed data into integration schemas using MySQL on Amazon RDS and Amazon Redshift and developed dynamic dashboards in Tableau and Amazon QuickSight to monitor KPIs and generate valuable traffic insights.

Technical Skills

Programming Languages: Python (NumPy, Pandas, PyTorch, TensorFlow), JavaScript, TypeScript, Node.js, SQL, R

Big Data Systems: Apache Airflow, Apache Spark, Apache Flink, Apache Hive, Hadoop Distributed File System (HDFS)

Database Systems: Amazon RDS, Amazon Redshift, Snowflake, Databricks, MongoDB (NoSQL database), Google BigQuery, PL/SQL (Oracle 11g databases), MySQL, PostgreSQL, Microsoft SQL Server, Azure SQL

BI Tools: Tableau, Microsoft Power BI, Looker Studio

Data Integration: Data Build Tool (dbt), Azure Data Factory, Alteryx Designer, Talend Studio, Apache Kafka

Software and Tools: Jenkins, Git, Docker, Kubernetes, JIRA, Salesforce, Terraform, ER Studio, Jupyter Notebook, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)
