
Senior DevOps/SRE Platform Leader with 15+ Years

Location: Santa Clara, CA
Posted: June 04, 2026

Resume:

Hassaan Moin Khan

San Francisco, CA ******@*****.***

Summary

Senior engineering leader (DevOps/SRE/Platform) with 15+ years building and operating large-scale distributed systems and cloud infrastructure. Deep experience in Kubernetes, Docker, Terraform, GCP/AWS/Azure, Hadoop/Spark/Kafka, high-throughput services, disaster recovery, performance optimization, and automation. Strong Linux systems background, networking/security, and on-call operations. AI-literate with hands-on GenAI/LLM integrations.

Core Skills

● Cloud/Infra: GCP (preferred), AWS, Azure; Kubernetes, Docker; Terraform; CI/CD

● Big Data/Streaming: Hadoop, Spark, Kafka, Hive; Airflow; Presto; (Druid/OpenSearch familiarity)

● SRE/DevOps: Reliability engineering, DR planning, fault tolerance, scalability, incident response, on-call

● Systems: Linux/Unix internals, performance profiling, low-latency services, C/C++/Go/Java/Python, Bash

● Automation/Config: Shell scripting, Python, Ansible/Puppet/Chef (experience across environments)

● Observability: Grafana, custom metrics, alerting/on-call practices (PagerDuty-style workflows)

● Security/Networking: Network architecture, HAProxy, TLS, authN/Z, data security

● AI/ML: GenAI/LLM integration, AI-driven ops and recommendations

Experience

CTO, Quartech (Stealth) May 2024 – Present

● Designed and operated a cloud-native, horizontally scalable services architecture on GCP/AWS with Kubernetes and Docker; defined IaC with Terraform for multi-environment provisioning.

● Established SRE best practices: service SLIs/SLOs, runbooks, incident response, disaster recovery testing, and capacity planning; implemented health checks, canaries, and autoscaling.

● Built low-latency media and chat features using FFmpeg and custom messaging; performed performance benchmarking and kernel/network tuning on Linux.

● Implemented monitoring/alerting (Grafana/Prometheus stack) and on-call rotations for a globally distributed team.

● Integrated LLM tooling (LlamaIndex), Bayesian clustering, and Microsoft Copilot to deliver AI-driven recommendations and support automation.

Founder/CEO, SwiftTaxi Sep 2021 – May 2024

● Architected reliable backend for autonomous systems with Cassandra, Kafka-style messaging, and resilient data pipelines; implemented DR/backup strategies and HA clusters.

● Led Linux-based systems engineering for edge devices; created automation scripts (Python/Bash) for telemetry ingestion, monitoring, and recovery.

● Built maps and SLAM pipelines; enforced security and networking hardening for edge-to-cloud comms; conducted performance profiling and low-level debugging.

Software Engineer, Facebook (Payments) Jan 2019 – Sep 2021

● Developed low-latency backend services in C++ with sharded MySQL for high-throughput payments; designed PSP onboarding and sandboxes.

● Worked on reliability and scale: performance measurement, resource utilization tuning, and rollout safety; integrated observability and alerting with on-call responsibilities.

● Built and maintained test environments and automation for infra and integrations.

TL, Uber Sep 2016 – Jan 2019

● Established production readiness review (PRR) across org: DR planning, failure mode analysis, fault tolerance, graceful degradation.

● Built core distributed services (distributed locking, unique ID generation, timers) in Java/Go/Python; operated on large-scale clusters with HA and low-latency constraints.

● Owned Cassandra-as-a-service tooling and maintenance; implemented automation for cluster operations, backups, and repair.

● Co-led architecture for Uber Elevate simulations on Hadoop; built regression/validation frameworks on Hadoop/Spark.

Engineering Manager, Zenefits Nov 2015 – Jul 2016

● Led infrastructure services team; created internal service framework on Jetty/Java with protobufs for consistent, reliable service deployments.

● Built Pub/Sub and async processing; created pipelines and data sync frameworks to support microservices migration; introduced DR and monitoring practices.

Lead Engineer, Apple Mar 2012 – Nov 2015

● Built a high-throughput, replicated, fault-tolerant queue on Cassandra (250k+ ops/sec on 16-node cluster); implemented multi-partition caching and HA designs.

● Upgraded Apple Maps data pipeline (CDH5 + Cassandra) and improved publishing performance; conducted extensive performance profiling and low-level debugging.

● Implemented security infrastructure for internal/external communication; contributed to common infra components and reliability engineering.

Earlier Roles (selected)

● XA.net: Built RTB-scale user store on MongoDB with <15ms lookups; Hadoop analytics; automation and performance tuning.

● Cooliris: Designed backend ads infra (Java, protos, embedded Tomcat), integrated third-party networks, built real-time tracking and Hadoop analytics; wrote monitoring/alerting.

● Cloudmark: Multithreaded C/C++ systems on Solaris; 4x performance improvements; cross-platform C libraries; spam attack detection (real-time).

● Neato Robotics: Linux/C++ robotics stack; SLAM; embedded C/Assembly; systems performance and reliability at the edge.

Education

● MS, Electrical & Computer Engineering, Carnegie Mellon University

● BS, Electrical & Computer Engineering; Minor in Computer Science, Carnegie Mellon University

Certifications & Patents

● Patent: LiDAR for Robotics

Highlights Aligned to Role

● Cloud & IaC: Extensive GCP/AWS; Kubernetes, Docker; Terraform for repeatable provisioning and DR.

● Big Data: Hadoop/Spark at Uber/Apple; Kafka pipelines; Airflow-based workflows; Hive/Presto familiarity.

● SRE Practices: PRRs, SLIs/SLOs, chaos and DR tests, failure mode analysis, on-call rotations, runbooks, Grafana/Prometheus; PagerDuty-style escalations.

● Systems/Perf: Low-level Linux tuning, C/C++ performance optimization, high-throughput queues, low-latency services.

● Automation: Python/Bash scripting for cluster ops, CI/CD, self-healing actions, and infra reactions.

● Security/Networking: HAProxy, TLS, network segmentation, data security controls for payments and distributed systems.

● AI Literacy: Direct experience integrating LLMs (LlamaIndex, Copilot) and AI-driven features; curiosity and hands-on experimentation.


