
Data Scientist Machine Learning

Location:
Leesburg, VA
Posted:
November 19, 2024

Resume:

AAMER MOHAMMED

Senior Data Scientist

Email: *****.*.***********@*****.*** | Phone: 469-***-****

PROFESSIONAL SUMMARY

●Experienced Data Scientist with over 10 years of experience spanning the Software Development Life Cycle (SDLC), specializing in Data Analysis, Data Engineering, Statistical Modeling, Machine Learning, Deep Learning, and Large Language Models (LLMs).

●Experienced in Python, SQL, and AWS Cloud technologies, including Amazon SageMaker and AWS Lambda for model development and deployment, AWS Glue, S3, and Amazon Redshift; adept at managing and executing entire data projects successfully.

●Proficient in leveraging state-of-the-art Large Language Models (LLMs) like the GPT (Generative Pre-trained Transformer) series by OpenAI and Hugging Face models such as BERT (Bidirectional Encoder Representations from Transformers).

●Data Security and Compliance Expert in Snowflake, with hands-on experience implementing robust data governance practices, including access control, data encryption, and compliance with industry standards.

●Experienced in developing and deploying enterprise-scale machine learning solutions, particularly in the healthcare sector

●Experienced in deploying ML models using MLflow, Azure, and Databricks, with a strong background in the healthcare domain and its systems.

●Experienced in deploying AI models on cloud platforms (GCP, Azure) and on-premises infrastructure

●Experience with various NLP methods for information extraction, topic modeling, parsing, and relationship extraction. Strong command of SQL for querying, manipulating, and managing relational databases.

●Proficient in data mining with large datasets of structured and unstructured data, encompassing data acquisition, validation, predictive modeling, and data visualization.

●Possesses advanced proficiency in analytical techniques including Regression, Classification, Neural Networks, Natural Language Processing, and Computer Vision.

●Advanced SQL and Query Optimization Skills within Snowflake, optimizing complex queries for improved performance and reduced computational costs, particularly in high-volume data environments

●Well-versed in theoretical foundations and practical application of supervised learning techniques (e.g., linear and logistic regression, decision trees, Random Forest, Support Vector Machines, neural networks, NLP) and unsupervised learning methods (clustering, dimensionality reduction, recommender systems).

●Proficient in MLOps practices, deploying machine learning models into production, and writing clean, production-level code.

● Experienced in building Python libraries and APIs to streamline data processes and enhance model deployment efficiency.

●Experienced in end-to-end LLM application development, including training, fine-tuning, and deploying models using GPT-3, GPT-4, and Hugging Face. Proficient in advanced generative AI techniques such as Retrieval-Augmented Generation (RAG).

●Possesses expertise in probability and statistics, including experiment analysis, confidence intervals, and A/B testing (illustrated in the sketch following this summary), alongside a strong grasp of algorithms and data structures.

●Proficiency in databases, including design, management, and visualization using technologies like Oracle, MySQL, SQL Server, and NoSQL databases like MongoDB.

●Skilled in leveraging various data visualization tools and techniques, including Power BI and Matplotlib, to present insights effectively to stakeholders, enhancing decision-making processes.

●Exceptional communication skills enabling effective collaboration with technical and non-technical stakeholders, adept at translating complex concepts into actionable insights.
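As an illustration of the experiment-analysis work summarized above, the following is a minimal A/B testing sketch: a two-proportion z-test with a 95% confidence interval for the lift. The group sizes and conversion counts are hypothetical placeholders, not figures from any engagement described in this resume.

```python
# Minimal A/B test sketch: two-proportion z-test plus a 95% CI for the lift.
# All counts below are hypothetical placeholders.
from math import sqrt
from scipy.stats import norm

# Hypothetical results: conversions / visitors for control (A) and variant (B)
conv_a, n_a = 480, 10_000
conv_b, n_b = 540, 10_000

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)

# z statistic under the pooled null hypothesis p_a == p_b
se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se_pooled
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value

# 95% confidence interval for the absolute difference (unpooled standard error)
se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
margin = norm.ppf(0.975) * se_diff
lo, hi = p_b - p_a - margin, p_b - p_a + margin

print(f"z = {z:.3f}, p-value = {p_value:.4f}, 95% CI for lift = ({lo:.4f}, {hi:.4f})")
```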

TECHNICAL EXPERTISE

Programming Languages: Python, R, NumPy, Pandas, SQL, Scala, Unix Scripting, Spark

Web Technologies: HTML, CSS, JavaScript, and Bootstrap

Data Visualization Tools: Tableau, Power BI, Advanced Excel, and Data Studio

Database Systems: MySQL, MongoDB, SQL Server

API Development: FastAPI, Flask, Model Server

Libraries: NumPy, Scikit-learn, TensorFlow, and Pandas

Cloud Tools: Amazon Web Services (AWS) and Azure

Version control: Git

Big Data Technologies: Hadoop Ecosystem (HDFS, Hive, MapReduce, Pig), Kafka, Apache Spark, Snowflake

Operating Systems: Windows, macOS, Unix, and Linux

Tools and Technologies: Informatica PowerCenter (ETL Tool), Google Cloud Platform (GCP), and Git

ML Algorithms: Regression, Classification, Clustering, NLP, Large Language Models (LLMs) with Hugging Face and LangChain, BERT, GPT-3, GPT-4, Transformers, Speech Recognition

PROFESSIONAL EXPERIENCE

Senior Data Scientist

HonorHealth, Scottsdale, AZ April 2022 - Present

●Assisted in extracting, cleaning, and preprocessing diverse healthcare datasets, including electronic health records (EHR), medical claims, and demographic information, ensuring data quality and consistency.

●Conducted rigorous exploratory data analysis (EDA) using statistical techniques and visualization tools such as Matplotlib and Seaborn in Python. Identified relevant patterns, trends, and outliers within the healthcare data.

●Applied advanced machine learning algorithms to build predictive models, including logistic regression, random forest, support vector machines (SVM), gradient boosting, and recurrent neural networks (RNNs).

●Built and managed ETL pipelines within Snowflake, utilizing Snowpipe for real-time data ingestion and integrating with external tools like Apache Kafka and Spark.

●Developed statistical algorithms involving multivariate regression, linear regression, logistic regression, principal component analysis (PCA), random forest models, decision trees, and support vector machines (SVM) for estimating welfare dependency risks using Python, R, and SQL.

●Adapted quickly to evolving business strategies, delivering actionable solutions in high-pressure environments within tight deadlines.

●Implemented NER and clinical text summarization to automate data extraction from physician notes and improve patient care recommendations

●Created comprehensive documentation and visually engaging presentations to communicate complex data findings to diverse audiences, supporting strategic decision-making

●Optimized complex SQL queries in Snowflake to improve query performance and reduce computational costs, using techniques like result caching, clustering, and partitioning.

●Integrated cloud-native solutions with existing data pipelines to enhance model performance and data accessibility

●Spearheaded the development and implementation of advanced analytics solutions to track, measure, and improve HEDIS and STARS performance metrics.

●Leveraged Snowflake’s secure data-sharing capabilities to enable seamless data collaboration across teams and third parties without duplicating data, enhancing efficiency.

●Read, debugged, and optimized code for ETL pipelines, ensuring alignment between technical implementations and business objectives.

●Employed NLP techniques to process and analyze unstructured text data from customer interactions, improving sentiment analysis and enhancing the customer support experience.

●Conducted exploratory data analysis (EDA) on security and networking data, uncovering key trends that informed security policies

●Automated deployments with CI/CD pipelines for continuous integration and delivery of machine learning models.

●Developed NLP pipelines for processing electronic health records (EHR), using SpaCy and Hugging Face transformers for entity extraction, reducing manual data extraction effort by 30% and improving clinical decision-making (a minimal pipeline sketch appears near the end of this entry).

●Developed and fine-tuned predictive models based on historical data insights and emerging clinical trends, empowering proactive healthcare decision-making and optimizing patient care strategies.

●Created and fine-tuned models with Hugging Face and BLOOM, optimizing NLP capabilities for specific applications like text summarization and sentiment analysis.

●Performed periodic tuning and retraining to optimize model performance and ensure continuous relevance

●Designed and optimized scalable Python code and complex SQL queries for data pipelines, automation, and analytics

●Developed a Snowflake-based forecasting model for a retail company, using real-time sales and inventory data to adjust stock levels dynamically across locations

●Built RAG pipelines using vector embeddings to allow for high-speed querying across extensive document repositories, enabling quick retrieval of accurate, relevant information for customer service and research teams. This setup reduced document retrieval time by 50%

●Leveraged SQL Server and Cosmos DB for handling patient and medical data, creating optimized queries to retrieve insights quickly. Implemented NoSQL with Cosmos DB for unstructured EHR data, improving query response time by 40% for large datasets

●Managed the end-to-end ML lifecycle using MLflow for model versioning, tracking, and deployment, ensuring efficient production pipelines in healthcare projects (see the MLflow sketch at the end of this entry).

●Implemented real-time analytics pipelines using GCP Dataflow, enabling real-time fraud detection and reducing event processing latency by 35%, critical for financial transactions.

●Managed and processed Protected Health Information (PHI) while ensuring strict compliance with HIPAA and other healthcare regulations

●Generated actionable reports and visualizations to monitor key healthcare metrics and track patient outcomes.

●Implemented data manipulation and preprocessing techniques in Python to optimize model performance and interpretability.

●Managed large-scale healthcare datasets using AWS Redshift and S3, ensuring seamless data storage and retrieval for AI/ML applications

●Stayed abreast of emerging technologies and methodologies in healthcare analytics, such as telemedicine platforms, wearable health devices, and federated learning, to incorporate innovative approaches into data science projects.

●Contributed to developing and implementing scalable, reproducible analytical workflows, ensuring robustness and reliability in healthcare analytics pipelines.

●Collaborated with teams to integrate AI/ML models into production environments using NVIDIA TensorRT and RAPIDS for enhanced speed and performance.

●Developed and deployed scalable machine learning models using NVIDIA GPUs to accelerate model training and inference times by up to 10x.
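The entity-extraction bullet above references clinical NER with Hugging Face transformers; the sketch below shows a minimal version of such a pipeline. The model checkpoint (dslim/bert-base-NER) and the sample note are illustrative assumptions, not the models or data used at HonorHealth.

```python
# Minimal NER sketch with the Hugging Face transformers pipeline.
# The checkpoint and the note text below are illustrative placeholders.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",      # hypothetical choice of a public NER checkpoint
    aggregation_strategy="simple",    # merge word pieces into whole entities
)

note = "Patient John Smith was seen at Scottsdale Clinic and prescribed metformin."
for entity in ner(note):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```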

Environment: Python, MySQL, SQL, Kafka, Machine Learning (Regression, Classification, Clustering, Dimensionality Reduction, Ensemble Methods), Deep Learning (Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks), NLP, Hugging Face, LLMs, Generative AI.
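The MLflow lifecycle work referenced in this entry could be tracked roughly as in the sketch below, which logs hyperparameters, a test metric, and a scikit-learn model artifact for later versioned deployment. The experiment name, model choice, and synthetic data are assumptions for illustration, not the actual production setup.

```python
# Minimal MLflow tracking sketch: log params, metrics, and a model artifact.
# Experiment name, hyperparameters, and the toy data are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_experiment("readmission-risk-demo")  # hypothetical experiment name

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 6}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    mlflow.log_params(params)
    mlflow.log_metric("test_auc", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    mlflow.sklearn.log_model(model, artifact_path="model")  # versioned artifact for deployment
```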

Data Scientist (Fraud Detection)

Fortitude Reinsurance Company, Ltd., Jersey City, NJ Sept. 2019 – March 2022

●Engaged with stakeholders to define precise overpaid claims detection metrics and performance benchmarks, aligning objectives with industry standards and regulatory compliance, utilizing Python and SQL Server.

●Conducted thorough exploratory data analysis (EDA) on heterogeneous financial datasets sourced from transaction logs, account histories, and external APIs, identifying potential patterns of overpaid claims and anomalies, leveraging Python, Pandas, NumPy, and SQL databases.

●Integrated GPT models with APIs to provide real-time language processing services, reducing customer query resolution time by 20%

●Created a dashboard solution using Snowflake’s data sharing and real-time data ingestion capabilities, enabling stakeholders to monitor key performance metrics with minimal delay

●Implemented automated data lineage and governance practices within Snowflake, ensuring real-time compliance checks and audits for sensitive healthcare data

●Ensured data integrity and compliance by setting up data governance policies within Snowflake, including monitoring data quality and managing version control for critical datasets.

●Integrated Retrieval-Augmented Generation (RAG) to improve the accuracy of responses in NLP applications, combining large language models with a vector database (such as FAISS or Pinecone) to provide context-rich answers from company knowledge bases; this approach reduced irrelevant responses by 35%, improving user satisfaction (see the retrieval sketch at the end of this entry).

●Managed large-scale data storage using vector search tools like FAISS, enabling retrieval-augmented workflows for complex NLP tasks.

●Deployed fraud detection algorithms on Linux and Windows-based environments, ensuring compatibility and seamless operation across systems

●Leveraged Airflow to process large datasets through the ML pipeline, enhancing data availability for chatbots.

●Established data quality checks and validation processes to ensure high-quality input for model development

●Utilized BigQuery and Dataflow on GCP to streamline the fraud detection workflow, improving data processing speed by 35% and enabling real-time analytics.

●Developed automated data transformation workflows within Snowflake using SQL stored procedures and tasks to clean, aggregate, and transform incoming raw data into analysis-ready formats.

●Used Oracle and MySQL databases to store and retrieve claims data, developing complex SQL queries for pattern recognition that led to a 30% increase in fraud detection accuracy. Integrated Redis for caching recent transaction data, improving query performance for high-frequency fraud checks

●Optimized Python code for training NLP models, reducing processing time for large datasets by implementing efficient data structures and parallel processing

●Translated complex quantitative analyses into user-friendly visuals using Power BI and Tableau for non-technical stakeholders

●Managed scalable data warehousing for a retail client experiencing high seasonal data volumes by utilizing Snowflake’s elastic compute scaling.

●Managed individual components of complex projects, ensuring timely delivery and customization of analytic solutions to meet client-specific requirements

●Employed advanced data preprocessing techniques, including noise reduction, missing value imputation, and feature scaling, to ensure data quality and prepare it for modeling, utilizing Python and SQL Server.

●Designed interactive data visualizations using Matplotlib, Seaborn, Plotly, and Dash to visualize patterns and trends in overpaid claims for investigative purposes, enhancing detection strategies.

●Utilized Generative AI models to analyze large datasets and automatically generate business insights, resulting in a 20% improvement in decision-making speed for senior management

●Set up robust data governance controls within Snowflake, including role-based access, multi-factor authentication, and regular access audits to comply with regulatory standards.

●Developed Python APIs to streamline ML workflows and enable efficient data integration

●Implemented ensemble methods like Random Forests and Gradient Boosting Machines (GBM) to build robust models capable of handling complex, high-dimensional data for detecting overpaid claims, utilizing Python.

●Implemented supervised learning algorithms, including Support Vector Machines (SVM) and Neural Networks, to classify claims as overpaid or legitimate based on historically labeled data, leveraging Python.

●Developed a dashboard and data pipelines to monitor model performance metrics in production, ensuring real-time alerts for key metric drops, which reduced downtime by 20% and improved issue resolution speed.

●Leveraged unsupervised learning techniques, such as clustering algorithms (e.g., K-means, DBSCAN), to group similar claims or entities for anomaly detection and pattern discovery related to overpaid claims, utilizing Python.

●Built dashboards with user-friendly interfaces and configurable parameters using Tableau, Power BI, or custom dashboards integrated with Python-based web frameworks like Flask or Django, enabling real-time monitoring of overpaid claims metrics and trends.

●Deployed overpaid claims detection algorithms and models in real-time or batch processing environments to identify and flag potential cases, ensuring timely detection and mitigation of overpayments.

●Regularly updated detection algorithms and models with fresh data and refined techniques to enhance accuracy and effectiveness, leveraging Python and SQL Server.

●Conducted in-depth technical workshops and training sessions for clients to enhance their teams' proficiency with NVIDIA GPU-powered frameworks and AI tools.

●Monitored the performance of detection systems and promptly addressed any technical issues or false positives to ensure reliable capabilities, ensuring continuous improvement and optimization of overpaid claims detection strategies.
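The RAG integration described above relies on vector retrieval; the sketch below shows a minimal retrieval step using sentence-transformers embeddings and a FAISS index. The embedding model, documents, and query are hypothetical placeholders rather than actual knowledge-base content.

```python
# Minimal retrieval sketch for a RAG pipeline: embed documents, index them in
# FAISS, and fetch the top matches for a query. Everything below is illustrative.
import faiss
from sentence_transformers import SentenceTransformer

documents = [
    "Claims above the policy limit require a secondary review.",
    "Refunds for overpaid claims are issued within 30 days.",
    "Fraudulent claims are escalated to the special investigations unit.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # hypothetical embedding model choice
doc_vectors = encoder.encode(documents, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(doc_vectors.shape[1])     # inner product == cosine on normalized vectors
index.add(doc_vectors)

query = "How are overpaid claims refunded?"
query_vec = encoder.encode([query], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_vec, 2)            # top-2 matching passages

for score, doc_id in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {documents[doc_id]}")
# The retrieved passages would then be passed to an LLM as grounding context.
```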

Environment: Python, MySQL, SQL, Machine Learning, Random Forest, Deep Learning, Power BI, Git.

Data Scientist (Predictive Analytics)

Clinical Ink, Rockville, MD Feb 2018 – Aug 2019

●Engaged with healthcare professionals to establish precise metrics and benchmarks for predicting patient survival, aligning with clinical standards and regulatory requirements.

●Optimized complex, multi-join queries on large datasets by implementing partitioning and clustering keys, along with Snowflake’s automatic query caching features

●Implemented a RAG model to dynamically fetch data from structured databases and unstructured knowledge sources, providing users with up-to-date insights. This process was especially beneficial for supporting real-time decision-making in customer support and compliance reviews

●Integrated GPT-based AI to automatically generate business reports, reducing reporting time by 40% and ensuring consistent quality across teams

●Structured patient and treatment data in SQL Server and used MongoDB for NoSQL storage of clinical notes, enabling both structured and unstructured data access in a single analytics platform. This setup facilitated faster model training on diverse data types

●Enabled auto-scaling and fault tolerance by deploying models on AWS Lambda, Vertex AI, and Kubernetes, maintaining high availability under variable workloads

●Built NLP models using Python libraries to analyze patient feedback and physician notes, improving classification accuracy by 30% and enabling faster decision-making in clinical settings

●Leveraged Snowflake's data warehouse to manage large datasets, ensuring seamless model integration and real-time analytics.

●Conducted root cause analysis on model performance issues and implemented corrective measures to maintain high reliability

●Built and deployed sentiment analysis systems using Python NLP libraries to classify and analyze customer feedback, improving feedback categorization accuracy by 30% (see the classifier sketch at the end of this entry).

●Used Git for version control to manage code changes and collaborate on predictive analytics projects

●Employed NLP techniques to process and analyze unstructured text data from customer interactions, improving sentiment analysis and enhancing the customer support experience.

●Conducted comprehensive exploratory data analysis (EDA) on varied healthcare datasets, including patient records and treatment outcomes, identifying crucial patterns and anomalies relevant to survival prediction.

●Implemented advanced data preprocessing techniques, such as noise reduction, managing missing values, and scaling features, ensuring data readiness for predictive modeling.

●Developed an automated customer support system using GPT-based models, reducing ticket resolution time by 35% and increasing customer satisfaction

●Utilized Python libraries like Pandas and NumPy for efficient data manipulation, feature engineering, and statistical analysis, extracting pertinent information from healthcare databases.

●Employed ensemble methods such as Random Forests and Gradient Boosting Machines (GBM), as well as supervised learning algorithms like Support Vector Machines (SVM) and Neural Networks, to develop robust survival prediction models.

●Collaborated with business stakeholders to align models with strategic goals and technical requirements

●Implemented stringent data governance practices to ensure HIPAA compliance while managing Protected Health Information (PHI) across datasets, maintaining data integrity and privacy.

●Developed and fine-tuned predictive models based on historical data insights and emerging clinical trends, supporting proactive healthcare decision-making and optimizing patient care strategies.

●Continuously updated and refined survival prediction models with the latest data and advanced methodologies to enhance predictive accuracy, responsiveness to evolving healthcare dynamics, and robustness in clinical applications.

●Led the optimization of existing model architectures, improving the efficiency of NLP models by implementing state-of-the-art techniques such as Transformer-based architectures (e.g., BERT, GPT-3) to handle complex financial text data

●Monitored and managed the performance of survival prediction systems rigorously, promptly addressing technical issues, refining models, and integrating feedback loops to ensure reliable predictions and effective clinical decision support.

●Collaborated closely with multidisciplinary teams to integrate predictive analytics seamlessly into healthcare workflows, fostering a culture of data-driven innovation and evidence-based practice.

●Employed best practices in data governance and ethical considerations to uphold data integrity, privacy, and compliance with healthcare regulations throughout the predictive modeling lifecycle.
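The feedback-classification work noted above can be illustrated with a minimal scikit-learn sketch: TF-IDF features feeding a logistic-regression classifier. The labeled snippets are invented examples, not Clinical Ink data.

```python
# Minimal text-classification sketch: TF-IDF features + logistic regression.
# The toy feedback snippets and labels are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

feedback = [
    "The scheduling process was quick and the staff were helpful.",
    "I waited two hours and nobody explained the delay.",
    "Clear instructions after the visit, very satisfied.",
    "The portal kept crashing when I tried to view results.",
]
labels = ["positive", "negative", "positive", "negative"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(feedback, labels)

print(clf.predict(["Very satisfied with how quickly results were explained."]))
```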

Environment: Python (NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn), Tableau, Hadoop, Hive, SQL Server, MS Access, MS Excel

Data Analyst

Wells Fargo, San Francisco, CA April 2015 – Jan 2018

●Analyzed large and complex datasets to extract actionable insights crucial for strategic business decisions, utilizing advanced analytical techniques and tools.

●Used Generative AI models for content generation in marketing and communication strategies, improving content creation efficiency by 60%

●Applied segmentation techniques to customer data for targeted marketing (see the segmentation sketch at the end of this entry), and developed NLP models to analyze text data, automating sentiment analysis and document classification.

●Incorporated data cleaning and transformation techniques to preprocess datasets effectively for modeling, ensuring data integrity and enhancing analytical outcomes.

●Designed interactive dashboards using Power BI and Tableau, providing senior management with real-time insights into business performance, which improved decision-making speed by 20%.

●Built predictive models and analytical frameworks to derive deeper insights and forecast trends, leveraging statistical methods and machine learning algorithms.

●Worked with Azure Blob Storage to store and retrieve large volumes of customer and transaction data, improving data retrieval speeds by 20% in high-traffic scenarios. Configured Azure Data Lake for long-term data retention and analysis, ensuring compliance with data governance policies

●Utilized advanced SQL skills to perform in-depth analysis of databases, identifying performance bottlenecks and recommending enhancement strategies for optimized data retrieval and analysis.

●Developed and optimized complex SQL queries, stored procedures, and joins to facilitate seamless data manipulation, analysis, and export operations.

●Conducted thorough data analysis and reporting tailored to meet diverse business unit needs, addressing unique requirements and optimizing data utilization.

●Developed interactive dashboards and comprehensive reports using Tableau, Power BI, MS Access, and Excel, providing stakeholders with clear visualizations and insights for informed decision-making.

●Collaborated closely with cross-functional teams to ensure data accuracy, completeness, and consistency, fostering a culture of data-driven decision-making.

●Ensured data quality and governance through rigorous processes, maintaining integrity, accuracy, and completeness of organizational data assets.

●Documented functional and supplementary requirements meticulously, guiding project execution and ensuring alignment with organizational objectives and stakeholder needs.

●Conducted comprehensive testing and tuning exercises to improve database performance and query execution times, enhancing overall system efficiency and responsiveness.

●Managed data import processes into Oracle databases, employing advanced SQL techniques for efficient data extraction, transformation, and loading (ETL) operations.
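The customer-segmentation bullet above can be illustrated with a minimal k-means sketch: standardize a few behavioral features and profile the resulting clusters. The feature names and simulated values are assumptions, not Wells Fargo data.

```python
# Minimal customer-segmentation sketch: scale features, fit k-means, profile segments.
# The features and values are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "avg_monthly_balance": rng.gamma(2.0, 2_000, size=500),
    "txn_count_90d": rng.poisson(40, size=500),
    "products_held": rng.integers(1, 6, size=500),
})

X = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

customers["segment"] = kmeans.labels_
print(customers.groupby("segment").mean().round(1))  # profile each segment for targeting
```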

Environment: SQL, Tableau, MS Access, Excel, Oracle, Python

EDUCATION AND PROFESSIONAL DEVELOPMENT

Jawaharlal Nehru Technological University (2006)

●Bachelor's in Computer Science and Technology


