CHETAN ANAND PANTHUKALA
AI/ML Engineer
San Francisco, CA | Open to relocation: Dallas, TX; Seattle, WA; Chicago, IL; New York, NY
******.*******@*****.*** | 469-***-**** | www.linkedin.com/in/Chetananand9

SUMMARY
AI/ML engineer with 5+ years of experience building and optimizing large language models, RAG systems, and distributed MLOps pipelines across AWS, Azure, and GCP. Skilled in PyTorch, DeepSpeed, TensorFlow, and Kubernetes, delivering scalable, production-grade AI systems. Experienced in fine-tuning LLMs, implementing quantization and MoE strategies, and deploying inference pipelines with Triton and SageMaker. Passionate about designing cloud-native AI infrastructure that bridges research innovation and real-world enterprise applications.

PROFESSIONAL EXPERIENCE
Perplexity AI – Perplexity Sonar (in-house search LLM & API) | January 2025 – Present
Backend Software Engineer | San Francisco, CA
• Engineered a next-generation LLM backend leveraging Llama 3.3-70B and speculative decoding, improving factual Q&A reliability and substantially scaling inference throughput across Cerebras clusters.
• Designed and implemented Mixture-of-Experts (MoE) routing and FP8 quantization pipelines, reducing inference latency and improving compute efficiency across large-scale distributed GPU instances.
• Architected a Retrieval-Augmented Generation (RAG) stack with FAISS, BM25, and LangChain, boosting contextual precision by 20% and reducing the hallucination rate by 18% on LM Arena benchmarks.
• Automated benchmark evaluation across MMLU, TruthfulQA, and GSM8K, integrating results through Model Context Protocol (MCP)-based services to expose structured evaluation context to AI assistants; improved F1 scores by 13% and reduced operational costs by 38% via optimized AWS SageMaker workloads.
• Deployed Sonar Pro across AWS EKS and Triton Inference Server, integrating MCP servers to enable AI assistants to access runtime model metadata, inference status, and deployment context, ensuring high availability and driving API adoption across 1,200+ enterprise developers.
• Leveraged PyTorch, DeepSpeed, vLLM, and FlashAttention-3 for distributed fine-tuning, implementing LoRA, SFT, and RLHF pipelines while exposing training and inference workflows through MCP-compatible server interfaces to support AI-assisted orchestration on AWS Cloud.
• Built hybrid retrieval systems integrating FAISS, Elasticsearch, and LangChain, optimizing response latency and contextual recall for large-scale factual Q&A applications (a minimal hybrid-retrieval sketch follows this section).
• Utilized AWS Lambda, Step Functions, and EFS to orchestrate inference workflows, improving scalability and fault tolerance across multi-cluster LLM serving pipelines.
• Monitored distributed inference with Prometheus, Datadog, and Weights & Biases, enabling automated alerting and a 25% improvement in observability and cost-performance balance.
• Applied speculative decoding, MoE routing, RAG orchestration, FlashAttention, and self-verification frameworks to enhance model reasoning reliability and drive enterprise-grade AI adoption.
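A minimal, self-contained sketch of the hybrid dense-plus-sparse retrieval pattern referenced above, fusing FAISS vector search with BM25 rankings via reciprocal-rank fusion. The corpus, the embed() stub, and the fusion constant are hypothetical placeholders for illustration, not Perplexity production code:

```python
# Hybrid retrieval sketch: dense FAISS search + sparse BM25, fused by
# reciprocal-rank fusion (RRF). Corpus, embed() stub, and constants are
# hypothetical placeholders, not production code.
import numpy as np
import faiss
from rank_bm25 import BM25Okapi

corpus = [
    "Speculative decoding accelerates LLM inference.",
    "FAISS performs approximate nearest-neighbor vector search.",
    "BM25 ranks documents by lexical term overlap.",
]

def embed(texts):
    # Stand-in for a real text encoder (e.g. a sentence-transformer);
    # deterministic per-text random vectors keep the sketch self-contained.
    vecs = np.stack([
        np.random.default_rng(abs(hash(t)) % 2**32).standard_normal(64)
        for t in texts
    ]).astype("float32")
    faiss.normalize_L2(vecs)  # unit vectors -> inner product == cosine
    return vecs

index = faiss.IndexFlatIP(64)  # exact dense index over 64-d embeddings
index.add(embed(corpus))
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])  # sparse index

def hybrid_search(query, k=3):
    _, dense_ids = index.search(embed([query]), k)
    sparse_ids = np.argsort(bm25.get_scores(query.lower().split()))[::-1][:k]
    fused = {}
    # RRF: credit each doc with 1/(60 + rank) in every ranking it appears in.
    for ranking in (dense_ids[0], sparse_ids):
        for rank, doc_id in enumerate(ranking):
            fused[int(doc_id)] = fused.get(int(doc_id), 0.0) + 1.0 / (61 + rank)
    return [corpus[i] for i in sorted(fused, key=fused.get, reverse=True)[:k]]

print(hybrid_search("how does FAISS nearest-neighbor search work?"))
```

Rank fusion is used here because dense cosine scores and BM25 scores live on incompatible scales; combining ranks rather than raw scores avoids per-query calibration.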
Meta | April 2024 – December 2024
Software Engineer – Machine Learning | San Francisco, CA
• Engineered large-scale training pipelines for LLaMA 3 (70B–130B) using PyTorch 2.2, DeepSpeed, and AWS Cloud, enabling faster throughput and reduced compute cost across 12K-GPU clusters.
• Optimized multimodal architecture by integrating text-vision encoders with FP8 quantization on AWS infrastructure, enhancing inference efficiency and improving cross-modal accuracy on internal benchmarks.
• Implemented RLHF + DPO fine-tuning pipelines with human-feedback curation, improving alignment safety metrics by 27% and reducing the hallucination rate in production responses by 33% (a toy DPO-loss sketch follows this section).
• Collaborated on the open-source LLaMA 3.2 release, developing deployment scripts and evaluation suites on AWS, improving reproducibility and accelerating global community adoption beyond 1.7M downloads.
• Built distributed data-processing workflows on AWS EMR and S3 using Spark, Ray, PyArrow, and Hydra for trillion-token multilingual datasets, ensuring high reliability in continuous large-scale ingestion pipelines.
• Deployed an optimized inference stack via Triton Inference Server, ONNX Runtime, and TorchServe, leveraging MTIA v2 and H100 GPUs for latency-aware serving across Meta AI Assistant endpoints.
• Automated monitoring and performance analytics with Prometheus, Grafana, and Meta AI Dashboards, enabling predictive scaling and reducing infrastructure incidents by 22% in training clusters.
• Applied MLOps and DevOps practices using Kubernetes, Docker, CI/CD (GitHub Actions), accelerating experimentation cycles by 35% and improving model deployment stability across environments.
• Utilized an advanced AI stack (Transformers, PyTorch Lightning, LangChain, Weights & Biases, TensorRT, FAISS, and LLMOps frameworks), driving reproducible, scalable, and explainable model delivery in the 2025 AI ecosystem.
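A toy sketch of the Direct Preference Optimization (DPO) objective behind the fine-tuning pipelines above. The sequence log-probabilities and beta value are fabricated stand-ins for real policy and reference model outputs; this is illustrative, not Meta's training code:

```python
# Toy sketch of the DPO (Direct Preference Optimization) objective.
# Log-probabilities and beta are fabricated stand-ins for real policy /
# reference model outputs, not production training code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO: -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)]), where each
    # term is a summed log-prob of the chosen (w) or rejected (l) response.
    margins = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -F.logsigmoid(beta * margins).mean()

# A batch of 4 preference pairs: per-sequence log-probs under the policy
# and under the frozen reference model.
policy_chosen = torch.tensor([-12.1, -9.8, -15.0, -11.2], requires_grad=True)
policy_rejected = torch.tensor([-13.5, -9.2, -16.4, -12.9], requires_grad=True)
ref_chosen = torch.tensor([-12.8, -10.1, -15.3, -11.0])
ref_rejected = torch.tensor([-13.1, -9.9, -16.0, -12.5])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # gradients raise chosen log-probs, lower rejected ones
print(float(loss))
```

The gradient raises the policy's log-probability on preferred responses relative to the frozen reference while lowering it on rejected ones, which is what drives alignment gains without training an explicit reward model.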
Accenture | April 2019 – June 2023
Software Engineer | India
• Built scalable knowledge graph embeddings on AWS Cloud using TensorFlow and Keras, assisting research teams in optimizing TransE and ComplEx models and boosting link prediction accuracy by 32% (a toy TransE scoring sketch follows this section).
• Developed automated ML pipelines for training, evaluation, and deployment using MLflow and Docker, supporting faster experimentation and improving reproducibility by 45% across distributed cloud environments.
• Contributed APIs and integration modules to AmpliGraph’s open-source release on AWS, assisting enterprise teams in seamless adoption for large-scale graph analytics and AI-driven insights.
• Tuned GPU-based distributed training setups and analyzed performance metrics to enhance convergence rates, supporting cost-efficient optimization that reduced cloud compute expenses by 18%.
• Applied advanced Graph ML and embedding techniques using Python, TensorFlow, PyTorch, and Keras, assisting data scientists in extracting relational patterns from structured and semi-structured datasets.
• Designed and implemented MLOps pipelines leveraging Docker, Airflow, and MLflow, supporting CI/CD automation, reproducible experiments, and robust retraining workflows in production.
• Integrated AWS SageMaker, Azure ML, and Vertex AI for model lifecycle management, analyzing Generative AI and LLM capabilities to support knowledge graph reasoning and infrastructure scalability.
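A toy sketch of the TransE scoring idea behind the link-prediction work above: a plausible triple (h, r, t) should satisfy h + r ≈ t in embedding space. The entities, relations, dimensionality, and margin are invented for illustration and do not reflect the AmpliGraph production setup:

```python
# Toy sketch of TransE scoring for link prediction: a plausible triple
# (h, r, t) should satisfy h + r ≈ t in embedding space. Entities,
# relations, dimensionality, and margin are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
dim = 8
entities = {name: i for i, name in enumerate(["acme_corp", "dallas", "texas"])}
relations = {name: i for i, name in enumerate(["headquartered_in", "located_in"])}

E = rng.standard_normal((len(entities), dim))   # entity embeddings
R = rng.standard_normal((len(relations), dim))  # relation embeddings

def score(h, r, t):
    # TransE distance ||h + r - t||: lower means a more plausible triple.
    return float(np.linalg.norm(E[entities[h]] + R[relations[r]] - E[entities[t]]))

# Margin ranking loss against a corrupted (negative) triple, as in training.
pos = score("acme_corp", "headquartered_in", "dallas")
neg = score("acme_corp", "headquartered_in", "texas")  # corrupted tail
loss = max(0.0, 1.0 + pos - neg)  # margin = 1.0
print(f"positive={pos:.3f}  negative={neg:.3f}  margin_loss={loss:.3f}")
```

During training, this margin ranking loss pushes corrupted triples to score at least one margin worse than true triples, which is what improves link-prediction accuracy.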
TECHNICAL SKILLS
Programming & Scripting: Python (FastAPI, Flask, Django, gRPC), JavaScript/TypeScript, Bash, SQL, NoSQL (MongoDB, DynamoDB)
AI/ML & Data Engineering: Large Language Models (LLMs – GPT-4.1, Claude, LLaMA 3.x, Mistral), PyTorch, TensorFlow, Hugging Face, DeepSpeed, vLLM, Transformers, TensorRT-LLM, Retrieval-Augmented Generation (RAG), Semantic Search, Knowledge Graphs, Vector Databases (FAISS, Pinecone, Weaviate, Milvus), MoE Routing, Quantization (FP8), LoRA, RLHF, DPO, NLP (Text Classification, Summarization), Data Ingestion & ETL Pipelines, Kafka, Pub/Sub
MLOps & Model Lifecycle: MLflow, Weights & Biases (W&B), Airflow, SageMaker, Model Deployment & Monitoring, CI/CD Automation, Feature Stores
Cloud Platforms & Services: AWS (EC2, S3, Lambda, EKS, SageMaker, Step Functions, EFS), Azure (AKS, Azure ML, Storage), GCP, Cloud-Native Architectures
DevOps, Infrastructure & Observability: Docker, Kubernetes, Terraform, GitHub Actions, Prometheus, Grafana, Datadog, High Availability Systems
Databases & Storage: PostgreSQL, MySQL, DynamoDB, Redis (caching & performance optimization), Elasticsearch, FAISS, Time-Series Databases (InfluxDB, Prometheus TSDB), Data Indexing & Sharding Strategies
Testing & Performance Optimization: PyTest, Load & Stress Testing, Model Profiling, Latency Optimization, Performance Benchmarking
Soft Skills & Leadership: Agile/Scrum Methodologies, Cross-functional Collaboration, Problem Solving, Code Review, Mentorship, Ownership & Delivery
EDUCATION
Master's in Advanced Data Analytics
University of North Texas – TX, USA