SDE I - AI Infrastructure & ML Systems

Location:

Vapi, Gujarat, India

Posted:

June 16, 2026


Resume:

Rehan Alam

+91-728******* | *************@*****.***

linkedin.com/in/rehan018 | github.com/Rehan018 | LeetCode | HackerRank | GFG

Professional Summary

SDE1 AI Infrastructure and ML Systems Engineer with 2+ years of experience deploying high-throughput LLM serving pipelines, inference optimisation stacks, and distributed GPU workloads in regulated production environments. Hands-on with ONNX Runtime, TensorRT, vLLM, and NVIDIA Triton Inference Server for enterprise-scale model serving; experienced operating GPU clusters on Kubernetes with Ray for distributed compute orchestration. Focused on fundamentals: kernel-level GPU performance, memory hierarchy, and measurable systems impact.

Technical Skills

GPU Programming: CUDA, HIP/ROCm, OpenCL, PTX, SIMD programming

Inference & Serving: NVIDIA Triton Inference Server, vLLM, TensorRT, ONNX Runtime, SGLang, INT8/FP16 quantisation, layer fusion

Performance & Profiling: NVIDIA Nsight Systems/Compute, rocprof, Perfetto, kernel occupancy tuning, shared-memory bank conflict analysis, memory-access pattern optimisation

LLM & DL Frameworks: PyTorch, TensorFlow, LangChain, OpenAI API, Hugging Face Transformers, operator design, pruning

Languages: C/C++, Python, Go, SQL, Java, JavaScript

Cluster Orchestration: Kubernetes (GPU operators, device plugins), Ray (distributed inference & training), Docker, Terraform

Cloud & Infra: AWS (EC2 P3/P4, S3, ECS, Lambda), GCP (GKE, BigQuery), GitHub Actions, CI/CD

Observability & Eval: Grafana, Prometheus, structured LLM evaluation pipelines, hallucination detection, context-limit stress testing

Data & Pipelines: Apache Spark/PySpark, Airflow, Kafka, Pandas, ETL/ELT, anomaly detection

Tools & Env: Linux (Ubuntu/CentOS), Git, Postman, Jira, Confluence, VS Code

Professional Experience

Meril Life Sciences Jun 2024 – Present

Software Development Engineer I Vapi, Gujarat

– Deployed NVIDIA Triton Inference Server as the primary model-serving layer for healthcare ML workloads; configured dynamic batching, concurrent model instances, and model ensemble pipelines, improving GPU utilisation by 45% over the previous single-request ONNX Runtime setup.

– Integrated vLLM-style continuous batching for an internal LLM document-extraction service; replaced a sequential inference loop with a paged KV-cache architecture, increasing throughput by 2.4× at a constant GPU memory footprint.

– Optimised deep learning inference using ONNX Runtime (CUDA EP) and TensorRT INT8 quantisation; profiled kernel bottlenecks with Nsight Systems and tuned memory-access patterns, reducing model prediction latency by 35–40% on air-gapped Linux hospital infrastructure.

– Ported CPU-bound ML preprocessing to CUDA-accelerated implementations; eliminated host-device transfer bottlenecks by using pinned memory and overlapping H2D transfers with kernel execution, achieving a 3× throughput improvement on batch workloads.

– Managed GPU cluster provisioning (AWS P3 + NVIDIA A100 nodes) using Kubernetes GPU operators and Helm-based deployment automation; defined shared-memory allocation and kernel launch configs for stable concurrent HPC throughput; reduced the deployment cycle from 6 weeks to 3 weeks via reusable Terraform + Docker templates parameterised per client GPU topology.

– Built an LLM evaluation harness for the document-extraction pipeline: automated hallucination detection across 5,000+ document samples, context-limit stress tests, and regression alerts via Prometheus + Grafana, reducing silent model failures before production releases.

– Engineered 12 production Python/PySpark data pipelines processing 20GB+/day of operational data; applied GPU-accelerated Spark operators to cut data-to-insight latency from 24 hours to under 15 minutes.

– Authored performance runbooks covering Nsight Systems traces, rocprof counter analysis, and kernel occupancy tuning; trained 15+ engineers on GPU performance regression debugging and LLM serving operational practices.

Projects

Early Chronic Disease Detection System GitHub

– Built a GPU-accelerated healthcare prediction system using ONNX Runtime, TensorRT, and FastAPI for low-latency disease prediction and real-time inference.

– Improved inference performance using FP16/INT8 optimisation, efficient preprocessing pipelines, and GPU-backed model serving, achieving roughly 35–40% lower latency.

– Developed a complete ML workflow including feature engineering, SMOTE balancing, model explainability (SHAP), and containerised deployment using Docker.

– Designed the system for scalable deployment on Linux-based GPU environments with production-oriented API architecture and modular inference pipelines.

VulnHunter – Distributed Security Scan Orchestration Platform GitHub

– Built a distributed security scanning platform using FastAPI, SQLAlchemy, and containerised worker services for automated vulnerability validation workflows.

– Designed a parallel task execution and job orchestration system with heartbeat monitoring, retry handling, and scalable worker coordination for long-running scan operations.

– Implemented modular scanning workers using Python, httpx, and Docker, enabling independent deployment and scalable execution across multiple environments.

– Added structured audit logging, scan traceability, and asynchronous execution pipelines to improve reliability and operational monitoring of distributed workloads.

Education

RGPV University, Madhya Pradesh 2018 – 2022

B.Tech in Computer Science CGPA: 7.9

Achievements & Competitive Programming

• Solved 1,200+ DSA problems across LeetCode, GeeksforGeeks, and HackerRank, with emphasis on graph algorithms, parallel search, and memory-optimal data structures applicable to GPU kernel design.

• Ranked 3rd in Coding Ninjas CodeKaze · 5th in ServiceNow Hire-Thon · 2nd in GeeksforGeeks Coding Contests — consistent performance under time-constrained conditions.
