Rehan Alam
+91-728******* | *************@*****.***
linkedin.com/in/rehan018 | github.com/Rehan018 | LeetCode | HackerRank | GFG
Professional Summary
SDE1 AI Infrastructure and ML Systems Engineer with 2+ years of experience deploying high-throughput LLM serving pipelines, inference optimisation stacks, and distributed GPU workloads in regulated production environments. Hands-on with ONNX Runtime, TensorRT, vLLM, and NVIDIA Triton Inference Server for enterprise-scale model serving; experienced operating GPU clusters on Kubernetes with Ray for distributed compute orchestration. Focused on fundamentals: kernel-level GPU performance, memory hierarchy, and measurable systems impact.
Technical Skills
GPU Programming: CUDA, HIP/ROCm, OpenCL, PTX, SIMD Programming
Inference & Serving: NVIDIA Triton Inference, vLLM, TensorRT, ONNX Runtime, SGLang, INT8/FP16 quantisation, layer fusion
Performance & Profiling: NVIDIA Nsight Systems/Compute, rocprof, Perfetto, kernel occupancy tuning, shared-memory bank conflict analysis, memory-access pattern optimisation
LLM & DL Frameworks: PyTorch, TensorFlow, LangChain, OpenAI API, Hugging Face Transformers, operator design, pruning
Languages: C/C++, Python, Go, SQL, Java, JavaScript
Cluster Orchestration: Kubernetes (GPU operators, device plugins), Ray (distributed inference & training), Docker, Terraform
Cloud & Infra: AWS (EC2 P3/P4, S3, ECS, Lambda), GCP (GKE, BigQuery), GitHub Actions, CI/CD
Observability & Eval: Grafana, Prometheus, structured LLM evaluation pipelines, hallucination detection, context-limit stress testing
Data & Pipelines: Apache Spark/PySpark, Airflow, Kafka, Pandas, ETL/ELT, anomaly detection
Tools & Env: Linux (Ubuntu/CentOS), Git, Postman, Jira, Confluence, VS Code
Professional Experience
Meril Life Sciences Jun 2024 – Present
Software Development Engineer I Vapi, Gujarat
– Deployed NVIDIA Triton Inference Server as the primary model-serving layer for healthcare ML workloads; configured dynamic batching, concurrent model instances, and model ensemble pipelines, improving GPU utilisation by 45% over the previous single-request ONNX Runtime setup.
– Integrated vLLM-style continuous batching for an internal LLM document-extraction service; replaced a sequential inference loop with a paged KV-cache architecture, increasing throughput by 2.4× at constant GPU memory footprint.
– Optimised deep learning inference using ONNX Runtime (CUDA EP) and TensorRT INT8 quantisation; profiled kernel bottlenecks with Nsight Systems and tuned memory-access patterns, reducing model prediction latency by 35–40% on air-gapped Linux hospital infrastructure.
– Ported CPU-bound ML preprocessing to CUDA-accelerated implementations; eliminated host-device transfer bottlenecks via pinned memory and overlapping H2D transfers with kernel execution, achieving a 3× throughput improvement on batch workloads.
– Managed GPU cluster provisioning (AWS P3 + NVIDIA A100 nodes) using Kubernetes GPU operators and Helm-based deployment automation; defined shared-memory allocation and kernel launch configs for stable concurrent HPC throughput; reduced deployment cycle from 6 weeks to 3 weeks via reusable Terraform + Docker templates parameterised per client GPU topology.
– Built an LLM evaluation harness for the document-extraction pipeline: automated hallucination detection across 5,000+ document samples, context-limit stress tests, and regression alerts via Prometheus + Grafana, reducing silent model failures before production releases.
– Engineered 12 production Python/PySpark data pipelines processing 20+ GB/day of operational data; applied GPU-accelerated Spark operators to cut data-to-insight latency from 24 hours to under 15 minutes.
– Authored performance runbooks covering Nsight Systems traces, rocprof counter analysis, and kernel occupancy tuning; trained 15+ engineers on GPU performance regression debugging and LLM serving operational practices.
Projects
Early Chronic Disease Detection System GitHub
– Built a GPU-accelerated healthcare prediction system using ONNX Runtime, TensorRT, and FastAPI for low-latency disease prediction and real-time inference.
– Improved inference performance using FP16/INT8 optimisation, efficient preprocessing pipelines, and GPU-backed model serving, achieving roughly 35–40% lower latency.
– Developed a complete ML workflow including feature engineering, SMOTE balancing, model explainability (SHAP), and containerised deployment using Docker.
– Designed the system for scalable deployment on Linux-based GPU environments with production-oriented API architecture and modular inference pipelines.
VulnHunter – Distributed Security Scan Orchestration Platform GitHub
– Built a distributed security scanning platform using FastAPI, SQLAlchemy, and containerised worker services for automated vulnerability validation workflows.
– Designed a parallel task execution and job orchestration system with heartbeat monitoring, retry handling, and scalable worker coordination for long-running scan operations.
– Implemented modular scanning workers using Python, httpx, and Docker, enabling independent deployment and scalable execution across multiple environments.
– Added structured audit logging, scan traceability, and asynchronous execution pipelines to improve reliability and operational monitoring of distributed workloads.
Education
RGPV University, Madhya Pradesh 2018 – 2022
B.Tech in Computer Science CGPA: 7.9
Achievements & Competitive Programming
• Solved 1,200+ DSA problems across LeetCode, GeeksforGeeks, and HackerRank, with emphasis on graph algorithms, parallel search, and memory-optimal data structures applicable to GPU kernel design.
• Ranked 3rd in Coding Ninjas CodeKaze · 5th in ServiceNow Hire-Thon · 2nd in GeeksforGeeks Coding Contests — consistent performance under time-constrained conditions.