Machine Learning (ML) Engineer

Location:

Houston, TX

Posted:

May 02, 2024

Resume:

SATISH CHITIMOJU

AI/ML Engineer

***************.*****@*****.***

+1-443-***-****

https://www.linkedin.com/in/satish-chitimoju-65200189

Houston, Texas, USA 77063

Objective

AI/ML engineer with 12+ years of IT industry experience, including 8+ years in AI/ML engineering. I specialize in orchestrating AWS SageMaker pipelines, writing Terraform scripts, overseeing model deployments and experiments, tracking lineage, establishing CI/CD pipelines, and configuring CloudWatch events for monitoring. I also develop and implement generative AI text generation applications using LangChain, Llama 2, Bard, and Gemini. Throughout my career I have applied advanced ML techniques and algorithms to complex real-world business problems. I am highly motivated, have a strong grasp of technology, and learn new technologies quickly.

Professional Experience

Implemented and integrated Meta's Llama-2-7b-chat text generation model in the AWS Cloud environment, deploying it on an AWS SageMaker hosting instance of type ml.g5.2xlarge. Mapped the required IAM security policies. Tested the deployment both through the SageMaker console and through an HTTP endpoint in Amazon API Gateway backed by Lambda functions.

Implemented MLflow for end-to-end batch processing testing, covering model training, packaging, validation, deployment, and monitoring, ensuring robustness and efficiency in the machine learning lifecycle.

Over three years of hands-on experience in developing and operationalizing Machine Learning (ML) and Artificial Intelligence (AI) models.

Proven track record of utilizing Google ML/AI pipelines for seamless development and deployment of machine learning models.

Extensive experience in working with Vertex AI Search, formerly known as "Enterprise Search," demonstrating proficiency in leveraging advanced search capabilities.

Skilled in PaLM 2, particularly in building with text-bison, chat-bison, and other large language models (LLMs) to enhance search functionality.

Successfully operationalized ML/AI models, ensuring their integration into real-world applications and systems.

Hands-on experience in implementing models and solutions using Google's Vertex AI, showcasing a deep understanding of Google Cloud's AI capabilities.

Proficient in optimizing ML/AI pipelines for efficiency, scalability, and improved model performance.

Demonstrated expertise in managing the entire lifecycle of machine learning models, from development and training to deployment and monitoring.

Adept at collaborating with cross-functional teams, including data scientists, engineers, and stakeholders, to achieve project goals and objectives.

Committed to staying abreast of the latest advancements in ML/AI technologies and consistently integrating new methodologies into projects for continuous improvement.

Demonstrated proficiency in leveraging ML frameworks such as TensorFlow and PyTorch to design and implement robust machine learning models, contributing to the development of advanced AI solutions.

Utilized critical ML frameworks like TensorFlow and PyTorch to enhance model training, optimize performance, and deploy effective machine learning solutions, showcasing a deep understanding of industry-standard tools in the field of artificial intelligence.

Implemented and managed data versioning using Data Version Control (DVC) within the MLOps pipeline, enhancing reproducibility and collaboration.

Implemented LangChain and LangIndex, demonstrating proficiency in language modeling technologies.

Strong understanding of AI principles and techniques, particularly in natural language processing.

Experience in developing and optimizing Gen AI systems for real-world applications.

Successfully set up and validated prominent AI models and tools, including Llama 2, LaMDA, PaLM, BERT, Gemini, Falcon, Mistral, Vicuna-33B, LangChain, and ChatGPT (GPT-4), in local and inference environments.

Demonstrated proficiency in configuring and optimizing AI tools across various environments, showcasing adaptability and hands-on experience with cutting-edge technologies.

Played a pivotal role in the development of a Text Generation Search Engine by implementing and fine-tuning a diverse array of AI tools, enhancing the engine's capabilities for natural language understanding and generation. This hands-on experience underscores a deep understanding of the AI landscape and its practical applications in real-world search engine solutions.

Ability to work with large datasets and apply statistical analysis and modeling techniques.

Developed AI models using LangChain and LangIndex to achieve specific outcomes such as improved text generation, language translation, and question answering.

Implemented the full MLOps lifecycle, including data preprocessing, feature engineering, model training, validation, deployment, and monitoring.

Proficient in Python, Terraform, Ansible, and shell scripting, with hands-on use of tools and libraries including MLflow, Prefect, TensorBoard, Pandas, Parquet, and Airflow DAGs. Leveraged these technologies to streamline machine learning workflows, manage data pipelines efficiently, and visualize model performance.

Automated AWS infrastructure using the Terraform Enterprise Registry within Jenkins pipelines for continuous integration, streamlining the deployment of libraries and dependencies essential for the model.

Generated distinct versions of model artifacts in model.version.tar.gz format, together with deployment pipeline configurations, and pushed them to the ECR registry.

Implemented continuous delivery by retrieving the registered artifact model.version.tar.gz file, input/output datasets, and deployment pipeline configuration from ECR, seamlessly deploying them into an EKS cluster.

Orchestrated the creation of an EKS cluster with five dedicated pods, each serving a specific function (routing, model state management, SageMaker configuration, primary decision model API calls, and streaming requests in MSK), plus default pods for additional tools. Integrated model repositories, S3 buckets, and SageMaker with the configuration pod in the SageMaker Config controller to enhance collaboration and resource utilization.

Established a streamlined process for analyzing inference requests and prediction outputs by reading data from a Kafka topic into the model monitoring pod, then forwarding the inference inputs and prediction outputs to the model monitoring dashboard for comprehensive analysis.

Crafted Terraform and containerization scripts with a high level of expertise and precision and bridged the gap between data science and infrastructure by creating efficient and automated pipelines for deploying machine learning models and workflows.

Leveraged Terraform scripts to spin up SageMaker instances and orchestrate SageMaker pipelines for deploying machine learning models across various AWS resources, including EC2 instances, EMR clusters, and Docker containers.

Collaborated closely with data engineering teams to prepare and preprocess data for model training and inference, creating robust data pipelines. Utilized AWS services such as Athena, Redshift, S3, and AWS Glue to streamline and automate data processing workflows and scheduled ETL jobs.

Experienced in version control using Git, GitHub, and Bitbucket, with DVC for data versioning; followed best practices for managing repositories and branches.

Demonstrated a deep understanding of various algorithmic paradigms, including supervised, unsupervised, and reinforcement learning.

Deployed machine learning models to AWS SageMaker endpoints, utilizing a range of deployment configurations for real-time processing, batch processing, and managing the entire machine learning lifecycle.

Managed multiple model versions and implemented strategies for seamless model updates based on model performance metrics.

Established and customized Amazon SageMaker environments, including GPU instances, shared notebook instances, Spark EMR clusters, Zeppelin notebooks, containers, and Docker images. Ensured the security and isolation of SageMaker resources by configuring Amazon VPC, IAM roles, and security groups.

Hands-on experience in using Terraform to efficiently create and provision AWS resources, including instances, S3 buckets, AWS Glue jobs, and various other cloud resources. This expertise allows for streamlined and reliable resource deployment within AWS infrastructure.

Collaborated with data science teams and stakeholders to define the metrics to monitor for model performance.

Implemented mechanisms to collect and record metrics on model accuracy, latency, and throughput.

Established real-time monitoring solutions to continuously track model behavior and performance in production.

Implemented alerts and notifications to proactively detect anomalies or deviations from expected behavior.

Designed monitoring processes for batch processing scenarios, ensuring accurate data and model processing in scheduled intervals.

Utilized Amazon CloudWatch, AWS CloudTrail, Amazon EventBridge, and other monitoring and logging services to capture and visualize relevant metrics and logs.

Defined thresholds for different metrics and configured alerts to trigger notifications upon crossing predefined thresholds.

Analyzed historical monitoring data to identify long-term trends, patterns, and potential issues.

Used historical data to make informed decisions about model updates and improvements.

Monitored resource usage, including CPU, GPU, memory, and network, to ensure deployed models operated within acceptable resource and time limits.

Implemented mechanisms to detect model drift by comparing model predictions against actual outcomes.

Triggered alerts when model drift was detected, indicating the need for model retraining or updates.

Initiated a cost optimization strategy by creating an automated system driven by AWS Lambda and EventBridge schedules. The system managed non-essential resources during non-working hours, leading to substantial cost savings while leaving critical workloads unaffected.

AI/ML Engineer, Tech AI (Client: Mondee Travel), Jan 2023 – Present