MANOJ PALADI (He/Him/His)
+1-716-***-**** | Jersey City, NJ | *************@*****.*** | LinkedIn | GitHub
Professional Summary
Data Science professional with over 3.5 years of expertise in developing end-to-end data-driven solutions, including data warehousing, building ETL pipelines, leveraging statistical and machine learning models, and handling large datasets using big data technologies.
Work Experience
AXIS MY INDIA Mumbai
Data Engineer Sep 2021 – Jan 2023
• Migrated over 20 TB of data from legacy systems to Snowflake by designing robust ETL pipelines using Databricks and Azure Data Factory, employing batch processing techniques and schema optimization, reducing retrieval time from 2 days to 15 minutes.
• Engineered reliable data ingestion workflows by integrating REST APIs and processing flat files (JSON, CSV) for ingestion into the data warehouse, achieving 99% pipeline uptime.
• Implemented Spark-based distributed computing workflows, reducing ETL execution time by 60% and cutting resource costs by 35%.
• Monitored pipelines via Azure Monitor and Databricks Jobs, implementing automated alerts and error-handling scripts, reducing failure rates by 30% and ensuring real-time resolution.
• Optimized data extraction processes in ADF through parallelism techniques, cutting ETL execution time by 80% and lowering resource utilization by 50%, achieving significant cost savings.
• Implemented SQL procedures and triggers in Snowflake to automate data validation, deduplication, and transformation, improving data consistency and reducing manual errors by 40%.
• Leveraged Snowflake features like time travel, cloning, and data sharing to enhance data management, streamline collaboration across teams, and minimize data duplication.
• Worked closely with clients on 12+ market research projects to define objectives, conducted data-driven modeling and advanced analytics using SQL, R, and Python, and delivered tailored solutions across industries including consumer goods, retail, politics, and finance.
• Delivered 20+ actionable dashboards in Power BI and Tableau, visualizing KPIs like market penetration and customer satisfaction, increasing operational efficiency by 30%.
National Institute of Technology Puducherry Puducherry
Data Analyst/Engineer Dec 2020 – Aug 2021
• Streamlined data migration to PostgreSQL by developing ETL pipelines using Python and Apache Airflow, implementing schema validation, pre-deployment sandbox testing, and data auditing, reducing downtime to less than 1 hour and ensuring 99.5% accuracy.
• Built reporting workflows with Apache Airflow, SQL, and Python, automating reports for 10+ departments and saving 400+ hours annually.
• Programmed and refined SQL-based views for attendee registration, feedback analysis, and resource management, integrating data seamlessly for 90,000+ event attendees and ensuring 100% reliability through join optimizations and indexing.
• Implemented Linux-based automation scripts, improving ETL reliability by 30%.
Projects
Vehicle Coupon Recommendation System: Pandas, Scikit-learn, Matplotlib, Streamlit, MLflow, DagsHub, Docker, FastAPI Feb 2024 – Apr 2024
• Built an 8-step data science pipeline to predict coupon acceptance, improving coupon selection and targeting strategies.
• Designed and logged ML pipelines in MLflow on DagsHub, using preprocessing techniques, classification models, and hyperparameter tuning to achieve reproducible experimentation and identify key factors driving coupon acceptance, improving prediction accuracy by 20%.
• Deployed a FastAPI application containerized with Docker, hosted it on a cloud platform, and integrated it with a Streamlit app for real-time interaction, enabling efficient predictions and boosting marketing effectiveness by 10%.
Real-Time Insights for E-commerce: Postgres, Kafka, Hive, MapReduce, Hadoop HDFS, PySpark, Airflow, Docker Oct 2023 – Dec 2023
• Implemented a Kafka-based real-time data ingestion pipeline, handling 50K+ user click events per second and enabling low-latency event streaming for analytics.
• Developed ETL workflows using PySpark and Hive, optimizing query execution time by 60% and reducing batch processing latency.
• Built MapReduce jobs to process 10+ GB of clickstream logs, enabling session-based analytics and reducing log processing time by 70%.
Knowledge Enhancement & Career Advancement with MOOC: R Apr 2023 – May 2023
• Applied supervised learning techniques to model course completion rates and student engagement on a 400k+ MOOC dataset using R.
• Mitigated class imbalance by 25% using class weights and improved accuracy by 20% through refined preprocessing.
• Implemented Decision Tree, Logistic Regression, and Bagging models, achieving 85% accuracy in predicting course completion rates.
Education
State University of New York at Buffalo, NY Jan 2023 – May 2024
Master of Science in Data Science; GPA: 4.0/4.0
Courses: Data Intensive Computing, Deep Learning, Machine Learning, Analysis of Algorithms, Data Models and Query Language.
Technical Skills
Programming Languages: Python, C, C++, SQL, R, Java, LaTeX, HTML, MATLAB.
Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, PyTorch, TensorFlow, Requests, Flask.
Technologies: Databricks, Hadoop, Spark, PySpark, Docker, Azure Data Factory, Apache Airflow, MLflow, CI/CD pipelines, Kafka, Hive.
Databases & Cloud Services: PostgreSQL, MySQL, SQL Server, SQLite, Snowflake, Azure, AWS.
IDEs, Build Tools & Version Control: Jupyter Notebook, JupyterLab, Visual Studio Code, Google Colab, Git, GitHub.
Methodologies: Agile, Scrum.
Data Visualization Tools: Power BI, Tableau.