ALPESH VIRANI
BIG DATA / LEAD ML DATA ENGINEER / ML OPERATIONS
Email: *************@*****.*** Cell: 408-***-****
EXPERIENCE SUMMARY
● Experienced Data Engineer with 14 years of experience in Big Data, Machine Learning, Cloud Technologies, and Distributed Systems. Strong expertise in AWS, GCP, Lambda, Kinesis streaming, Kafka, Redis, AWS data lakehouse, APIs on Docker, NoSQL, Teradata, DynamoDB, Athena, Redshift, Python, Jupyter, Snowflake, Databricks, ML algorithms, MLflow, ML DevOps, scikit-learn, NLP, spaCy, Transformers, vector databases, Java, J2EE, Java-based frameworks, Spark on Kubernetes, Hadoop, MPI and Hadoop clusters, MapReduce, Docker, and distributed and parallel programming. Have worked on Compute Engine, App Engine, Cloud Functions, BigQuery, Cloud Storage, Kubernetes Engine (GKE), Cloud SQL, and Pub/Sub.
● Have ML engineering experience working with health data and ML models, using technologies such as Spark on Kubernetes, PySpark, Python, Go, AWS, S3, AWS Batch/Lambda, Glue, EKS, ML libraries, SageMaker, Twilio, Mixpanel, Athena, MySQL, DynamoDB, DynamoDB Streams, SQS, Gunicorn, Flask, Redis Cache, Airflow, Sumo Logic, and scikit-learn.
● Functional knowledge of, and hands-on work with, AI technologies such as OpenAI GPT-3.5/4, GPT-4 Turbo, PyTorch, TensorFlow, LangChain, Hugging Face, LlamaIndex, LLaMA 2/3, Falcon, Mistral, SentenceTransformers, and Gemini Pro Vision.
● Designed and developed a diabetes chatbot support system using NLP, Transformers, NLTK, and BERT, covering text generation, generative AI, sentiment analysis, summarization, and entity recognition.
● Working on Big Data, AWS, MPI clusters, distributed and parallel architecture and development, and the Hadoop ecosystem (MapReduce, HDFS, Hive, HBase, Pig, Impala, Avro, etc.) since 2013. Have experience in web-based applications using Java, multithreading, Spring, MVC, J2EE, JavaScript, AngularJS, Shell, Jacl, AJAX, JSON, and thread pools.
SKILLS
● Big Data Technologies: Hadoop (CDH or Hortonworks), Spark, Hive, HBase, Kafka, AWS Lambda, DynamoDB, S3, GCP, BigQuery, Cloud Functions
● Cloud Services: AWS, Databricks, Snowflake, Terraform, Docker
● Programming Languages: Python, Java, Scala, Go
● Machine Learning: scikit-learn, NLP (spaCy, Hugging Face Transformers), TensorFlow, PyTorch, MLflow
● DevOps: Airflow, Jenkins, Bitbucket Pipelines, Docker, Kubernetes
● Tools: Git, Jupyter, SQS, Redis, Twilio, Mixpanel
● Certifications: Cloudera Certified Hadoop Developer, Big Data Fundamentals
EDUCATION
● Master of Philosophy in Computer Science & Engineering, 2010, TGOU
● Master of Science in Advanced Software Technology, 2008, IIIT, Pune
● Bachelor of Engineering in Instrumentation & Control, 2004, DDIT, Nadiad
WORK DETAILS
Santa Clara, CA – Sr. ML Lead Data Engineer
Duration: Nov 2024 to present
Works and Responsibilities:
● Working on a federal project for the GSA to build an ML data pipeline for confidential data; project details can be discussed in an interview if required.
ENVIRONMENT/SOFTWARE: EKS, PySpark, Python, Java, Go, Azure, AWS, S3, data lakehouse, AWS Batch/Lambda, MLflow, Redshift, Athena, MySQL, DynamoDB, NLP, spaCy, scikit-learn, Snowflake, PyGWalker, Sweetviz, petl, SQS, FastAPI, Redis Cache, Airflow, AI chatbot, Transformers, Jupyter, Docker, vector DB, etc.
Twin Health, Mountain View CA – Sr. ML Lead Data Engineer
Duration: Jan 2021 to Aug 2024
Works and Responsibilities:
● Twin has a successful diabetes-reversal product powered by AI & IoT, with very complex data. The objective of this project was to write data cleaning jobs for the machine learning models in a scalable form, handling a growing number of rows while maintaining model accuracy.
● Designed and architected the S3 data lakehouse and end-to-end data pipeline for ML. Performed feature engineering, data analysis, and model selection, and developed ML models for Zoom meeting conclusions and Twilio chat data.
● Led a team of 3 data engineers and reduced AWS cost through performance tuning and continuous monitoring.
● Worked on a spaCy model for entity recognition and PHI masking of meeting and chat data (a sketch of the masking approach follows below).
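A minimal sketch of the PHI-masking step, assuming spaCy's stock English model; the entity labels and the mask_phi helper are illustrative, not the production code:

    # Sketch: PHI masking via spaCy NER (illustrative labels and helper).
    import spacy

    # Assumes the small English model is installed:
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    # Labels treated as PHI here are an assumption for the sketch.
    PHI_LABELS = {"PERSON", "GPE", "ORG", "DATE"}

    def mask_phi(text: str) -> str:
        """Replace PHI-like entities with their label, e.g. 'John' -> '[PERSON]'."""
        doc = nlp(text)
        out, last = [], 0
        for ent in doc.ents:
            if ent.label_ in PHI_LABELS:
                out.append(text[last:ent.start_char])
                out.append(f"[{ent.label_}]")
                last = ent.end_char
        out.append(text[last:])
        return "".join(out)

    print(mask_phi("John met Dr. Smith in Mountain View on Jan 5."))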
● Worked on a vector DB to store the green/red food prediction model data, as well as the end-client API.
● Transformed machine learning jobs to scale vertically; designed and architected the new data flow, data processing, and data pipeline to scale horizontally.
● Migrated all ML model job reads and writes from MySQL to S3 to improve performance.
● Developed tools in PySpark for daily MySQL-to-S3 and S3-to-MySQL imports, data accuracy comparison, and S3-to-DynamoDB writes, reducing data latency by 50% (see the import sketch below).
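A minimal sketch of the daily MySQL-to-S3 import in PySpark; the host, table, and bucket names are placeholders, not the production values:

    # Sketch: daily MySQL-to-S3 import with PySpark (placeholder names).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mysql-to-s3-daily").getOrCreate()

    # Read one day's slice from MySQL over JDBC
    # (requires the MySQL JDBC driver on the Spark classpath).
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://db-host:3306/appdb")
          .option("dbtable", "(SELECT * FROM events WHERE event_date = '2024-01-01') t")
          .option("user", "reader")
          .option("password", "***")
          .load())

    # Land as date-partitioned Parquet in the S3 lakehouse; columnar reads
    # are what cut downstream latency.
    (df.write.mode("overwrite")
       .partitionBy("event_date")
       .parquet("s3://example-lakehouse/events/"))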
● Moved APIs from MySQL to DynamoDB, with real-time DynamoDB Streams feeding S3 and SQS (a fan-out sketch follows below).
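A minimal sketch of the stream fan-out, assuming a Lambda trigger on the DynamoDB stream; the bucket, queue URL, and key layout are placeholders:

    # Sketch: Lambda consuming DynamoDB Streams, fanning out to S3 and SQS.
    import json
    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    BUCKET = "example-stream-archive"  # placeholder
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"

    def handler(event, context):
        for record in event["Records"]:
            # NewImage is present for INSERT/MODIFY when the stream view
            # includes new images.
            image = record.get("dynamodb", {}).get("NewImage")
            if not image:
                continue
            body = json.dumps(image)
            key = f"stream/{record['eventID']}.json"
            s3.put_object(Bucket=BUCKET, Key=key, Body=body)        # archive change
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)  # notify consumers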
● Trained, built, and deployed ML models in the data pipeline, with feature data monitoring.
● Developed a data quality tool for ML feature engineering and feature selection.
● Developed a hyperparameter tuning tool to improve accuracy during ML training (a sketch of the idea follows below).
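A minimal sketch of the tuning idea using scikit-learn's GridSearchCV; the estimator and grid are illustrative, not the in-house tool:

    # Sketch: cross-validated hyperparameter search with scikit-learn.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, n_features=20, random_state=42)

    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [5, 10, None],
    }

    # best_params_ from the search feeds the training pipeline.
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, cv=3, n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))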
● Built, maintained, and deployed regression, classification, and NLP-based models in the data pipeline.
● Worked on ML models such as AI progress notes using the GPT-3.5 LLM, food prediction, and hunger prediction.
● Worked on the Twilio data pipeline for HIPAA-based masking, A/B testing, and NLP word embeddings.
● Have worked on RAG and prompt engineering for the Twilio data pipeline.
ENVIRONMENT/SOFTWARE: Spark on Kubernetes, PySpark, Python, Java, Go, Azure, AWS, S3, data lakehouse, AWS Batch/Lambda, EKS, Databricks, ML libraries, SageMaker, Twilio, Mixpanel, ML regression & classification algorithms, deep learning, MLflow, Redshift, Athena, MySQL, DynamoDB, Snowflake, DynamoDB Streams, SQS, FastAPI, Redis Cache, Airflow, AI chatbot, Azure OpenAI Service for GPT-3.5/4, LangChain, Hugging Face, LlamaIndex, scikit-learn, spaCy, Transformers, NLP, NLTK, PyTorch, TensorFlow, Keras, LangGraph, CUDA, GPU, Jupyter, Bitbucket pipeline deployment, Redash, Docker, vector DB, etc.
Gracenote, A Nielsen Company, Sunnyvale CA – Lead Big Data Principal Software Engineer
Duration: Relycom Inc: November 2016 to April 2018; Gracenote: May 2018 to Dec 2020
Works and Responsibilities:
● The objective of this project was to remove the Oracle dependency and migrate all Oracle databases to Hadoop/AWS, creating a scalable architecture that supports millions of Avro files from mobile devices.
● Wrote design documents and architected the data lakehouse, data flow, data processing, and data pipeline.
● Built a serverless data pipeline with AWS Lambda and Kinesis streaming to handle terabytes of data daily (a handler sketch follows below).
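A minimal sketch of a Lambda handler on the Kinesis side of that pipeline; the record schema and downstream step are assumptions:

    # Sketch: Lambda consuming a Kinesis stream in the serverless pipeline.
    import base64
    import json

    def handler(event, context):
        for record in event["Records"]:
            # Kinesis delivers each payload base64-encoded.
            payload = base64.b64decode(record["kinesis"]["data"])
            doc = json.loads(payload)
            # Downstream: enrich and land in S3 for the Hadoop/Spark jobs
            # (omitted in this sketch).
            print(doc.get("device_id"), record["kinesis"]["sequenceNumber"])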
● Handled deployment, performance, and DevOps for AWS, Hadoop, Lambda, S3, Ansible, and Terraform.
● Created tools for Avro extraction, data extraction, and table comparison on the S3 file system.
● Wrote Java code and Hive queries to read, write, and enrich Avro files (a PySpark sketch of the pattern follows below).
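A minimal sketch of the Avro read-enrich-write pattern, shown in PySpark for brevity (the production code was Java/Scala); paths and columns are placeholders, and the spark-avro package is required (e.g. --packages org.apache.spark:spark-avro_2.12:3.5.0):

    # Sketch: Avro read-enrich-write in PySpark (placeholder paths/columns).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("avro-enrich").getOrCreate()

    raw = spark.read.format("avro").load("s3://example-ingest/raw/")

    # Example enrichment: normalize the device id, stamp the processing date.
    enriched = (raw
                .withColumn("device_id", F.lower(F.col("device_id")))
                .withColumn("process_date", F.current_date()))

    enriched.write.format("avro").mode("overwrite").save("s3://example-ingest/enriched/")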
● Wrote Java/Scala code for Spark to transform and process the Avro data.
ENVIRONMENT/SOFTWARE: Hadoop (Hortonworks HDP 2.6, CDH), Ansible, Teradata, Terraform, GCP, BigQuery, Cloud Functions, AWS, AWS Lambda, Splunk, Kinesis Firehose/Data Streams, Scala, MapReduce, Spark, Kafka, DataFrames, Redshift, Snowflake, Spark SQL and Hive context, Hive, Impala, Kerberos, Docker, Tableau, Grafana, YARN, AWS clusters, Databricks, Airflow, Oozie, Amazon S3, Java, Python, JUnit, Maven, Jira, P4, JSON, Avro, ORC and Parquet files, etc.
Relycom Inc. – Sr. Implementation Analyst / Big Data Engineer (Client: TiVo Inc., San Jose CA)
Duration: August 2016 to October 2016
Works and Responsibilities:
● The objective of this project was to prepare a data pipeline for TiVo box viewer data and analyze viewer counts, program ratings, and active set-top boxes, e.g., the 2016 Presidential Debate.
● Wrote Java/Scala code for Spark and Hive queries to produce second-by-second ratings for the networks (see the aggregation sketch below).
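A minimal sketch of the second-by-second aggregation, shown in PySpark for brevity (the production code was Java/Scala); column names and paths are placeholders:

    # Sketch: second-by-second network ratings in PySpark.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("ratings").getOrCreate()

    # Viewer events: one row per (device, network, second) from the boxes.
    events = spark.read.parquet("s3://example-viewer-data/events/")

    ratings = (events
               .groupBy("network", "event_second")
               .agg(F.countDistinct("device_id").alias("viewers"))
               .orderBy("network", "event_second"))

    ratings.write.mode("overwrite").parquet("s3://example-viewer-data/ratings/")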
● Prepared the final output as Excel spreadsheets and line graphs for news and program ratings.
● Worked on terabytes of data and was actively involved in enhancing the data pipeline.
ENVIRONMENT/SOFTWARE: DevOps, Grafana, Tableau, Splunk, Teradata, Terraform, Hadoop, Databricks, Spark, DataFrames, Spark SQL and Hive context, Hive, YARN, AWS clusters, Oozie, Amazon S3, Java, JUnit, Maven, Jira, Git, Parquet files, etc.
Persistent Systems Limited – Sr. Implementation Engineer Big Data (Client: Wells Fargo Bank, Fremont CA)
Duration: May 2015 to July 2016
Persistent Systems Limited – Sr. Implementation Engineer Big Data (Client: Hitachi, Santa Clara USA)
Duration: Feb 2015 to April 2015
Persistent Systems Limited – Sr. Implementation Engineer Big Data, SCM & SAP Global System (Client: Apple Inc., Sunnyvale USA)
Duration: March 2013 to Jan 2015
Persistent Systems Limited (Module Lead) – OIOP PS (Client: Eaton)
Duration: August 2010 to Feb 2013
IBM (Software Engineer) – UPS (United Parcel Service) website, Pune, India
Duration: April 2010 to June 2010
I2IT (Research Assistant) – Analyzing power generation process data using a distributed environment and MPI cluster, Pune, India
Duration: August 2008 to March 2010
Spark InfoTech (Project Engineer) – Implementation of a measurement system for the diamond industry, global immigration, consultancy, and financial accounting, Surat, India
Duration: March 2004 to Jan 2007
ACHIEVEMENTS
Certificates & Awards
● First-level French language certificate from MS University, Vadodara.
● Finolex Hope Foundation scholarship, 2008.
● Invited as a speaker at VIIT, Pune, on parallel and distributed computing, 2009.
● Two ‘You Made a Difference’ awards from Persistent.
● Faculty colloquium on parallel and multi-core programming at I2IT, Pune (sponsored by Intel).
Publications
● Graded and Hessenberg Form Together for Symmetric Matrices in the QR Algorithm, International Journal of Scientific and Research Publications (IJSRP), Volume 3, February 2013
● Different Parallel Approaches for Matrix Multiplication (MPI and MPI + OpenMP), International Conference on Information Technology and Business Intelligence, Nagpur, India, November 2009
● Blocked Matrix Multiplication Algorithms, presented at the International Conference on Mathematics and Computer Science, Chennai, India, January 2009
Workshops and Interests
● Actively participate in ML/AI, Hadoop, and Big Data meetups
● Big Data Fundamentals certificate from Big Data University and Cloudera Hadoop Developer certificate
● Follow many Spark, Hadoop, Hive, and Big Data communities and groups, and take part in discussion threads