Senior Site Reliability Engineer - Cloud & Observability Expert

Location:

Irving, TX

Posted:

April 14, 2026

Resume:

Mahammad Rasool Shaik

Irving, TX

214-***-****

****************@*****.***

ABOUT

Senior Site Reliability Engineer with five-plus years keeping large-scale distributed systems healthy across AWS, Azure, and GCP. I've defined SLIs, SLOs, and error budgets from scratch, led incident response for high-severity production issues end-to-end, and turned recurring operational pain into tracked, owned improvements. I write Python the way most people breathe; it's my go-to for automation, health checks, and anything that shouldn't be done by hand twice. I'm strong in observability (Prometheus, Grafana, Splunk, OpenTelemetry), comfortable mentoring junior engineers, and I genuinely enjoy the postmortem process when it's done right.

TECHNICAL SKILLS

Languages: Python (expert), Java, J2EE, Bash/Shell, SQL, JavaScript, Node.js, React

SRE & Reliability: SLIs, SLOs, Error Budgets, Toil Reduction, Blameless Postmortems, Incident Management, On-Call, Release Readiness, Graceful Degradation, Automated Failover, Disaster Recovery

Cloud: AWS (EC2, EKS, ECS, S3, RDS, Lambda, VPC, IAM, CloudWatch, API Gateway, SQS, SNS, EventBridge, X-Ray), Azure (AKS, VMs, Azure DevOps, Azure Monitor), GCP

Containers: Kubernetes (EKS/AKS), Docker, Helm, Istio (Service Mesh), Pod Lifecycle, HPA, Rollout/Rollback

Observability: Prometheus, Grafana, Splunk, ELK Stack, CloudWatch, OpenTelemetry, Dynatrace, AppDynamics, QuantumMetric/Tealeaf, Akamai CDN

CI/CD: Jenkins, GitLab CI/CD, GitHub Actions, Azure DevOps, CodePipeline, Databricks, Git, Bitbucket, Helm

IaC: Terraform, CloudFormation, Ansible, ARM Templates

Networking: HTTP/HTTPS, SSH, VPN, Firewalls, Routing, Load Balancing (ALB/NLB), Nginx, F5, Reverse Proxies, DNS

Security: OAuth 2.0, OpenID Connect, SAML 2.0, JWT, IAM, RBAC, Secrets Management

Databases: PostgreSQL, MySQL, MongoDB, Cassandra, Cosmos DB, Couchbase, RabbitMQ, Kafka

Tracking: Jira, ServiceNow, Confluence

EXPERIENCE

Mastercard Jun 2024 – Present

Site Reliability Engineer

–Define, implement, and own SLIs, SLOs, and error budgets for critical microservices — using error budgets as a real decision-making tool to push back on risky releases and prioritize reliability work.

–Lead incident response for high-severity production incidents: I'm the escalation point, I drive the blameless postmortem, and I make sure every action item has an owner and a due date in Jira.

–Design and maintain observability platforms using Prometheus, Grafana, Splunk, CloudWatch, and OpenTelemetry (metrics, logs, traces, and real-time telemetry) so problems surface before customers notice.

–Perform deep root cause analysis across distributed systems using CloudWatch metrics, SQS/RabbitMQ logs, and X-Ray traces, identifying the actual cause rather than just the symptom.

–Build resiliency mechanisms including graceful degradation, automated failover, and disaster recovery across AWS-hosted platforms — the boring work that means production stays up.

–Convert repetitive operational toil into Jira epics with clear ownership. Less manual work, more capacity for reliability improvements.

–Partner with scrum teams on release readiness reviews and shift-left reliability — making sure operability and failure testing are part of the build, not an afterthought.

–Automate infrastructure provisioning with Terraform and CloudFormation; build Python scripts for deployment validation, health checks, and operational automation.

–Mentor junior SREs on observability best practices, incident response, and SLO-driven reliability — helping them grow from following runbooks to owning services.

Tata Consultancy Services Apr 2021 – Aug 2022

DevOps Engineer

–Built and maintained CI/CD pipelines using Azure DevOps, Jenkins, GitHub Actions, and Databricks — automated build, test, and deploy across multiple environments, cutting release cycle time significantly.

–Provisioned and managed AWS infrastructure (EC2, EKS, S3, IAM, VPC, RDS, ALB/NLB) for secure, scalable, high-availability cloud operations, and maintained Linux systems and network configurations day-to-day.

–Implemented SLI/SLO tracking and observability with CloudWatch, Prometheus, and Grafana — dashboards and alerting pipelines covering availability, latency, error rates, and throughput.

–Containerized applications on Kubernetes/EKS — managing pod lifecycle, resource limits, HPA, and rollout strategies to improve scalability, resilience, and fault tolerance.

–Developed Python and Bash automation scripts for operational tasks, deployment validation, and system health monitoring — reducing toil and speeding up incident detection.

–Performed root cause analysis through log correlation, metrics analysis, and distributed tracing; documented findings and drove fixes that actually stuck.

–Managed networking configurations including HTTP/HTTPS routing, firewall rules, VPN, load balancer policies, and Nginx reverse proxy settings.

APSLOG Tech Jun 2019 – Mar 2021

Release Engineer

–Built backend REST APIs using Python and AWS SDKs for internal tools and third-party integrations — platform extensibility with security and reliability built in from the start.

–Developed admin utilities with Java and React that cut manual user-management overhead by over 60%.

–Automated infrastructure provisioning with Terraform and CloudFormation — repeatable, scalable deployments with IaC best practices enforced across dev and production.

–Integrated Git-based version control with Jenkins and GitLab CI/CD pipelines, improving deployment reliability with automated validation stages.

–Implemented IAM roles, encryption policies, and network access controls to enforce least-privilege access and meet compliance requirements.

–Set up CloudWatch and Splunk monitoring to track system health metrics and log patterns proactively and to support incident triage.

–Documented API specs, runbooks, and operational procedures — because good documentation is a reliability tool, not an afterthought.

PROJECT HIGHLIGHTS

SLO-Driven Reliability & Observability Platform

–Defined SLIs and SLOs for critical microservices and built an end-to-end observability stack using Prometheus, Grafana, Splunk, and OpenTelemetry. The goal was earlier detection and faster recovery: reduced MTTR by 45% and used error budgets as a real lever for release and reliability decisions.

Toil Reduction & Incident Management Program

–Tracked recurring operational pain points as Jira epics, then automated the repetitive ones with Python and Terraform. Reduced manual operational load by 35% over six months — freeing the on-call team for work that actually required a human.

Resilient CI/CD Pipeline with Automated Failover

–Designed Jenkins and GitLab CI/CD pipelines with automated rollback, health-check gates, and DR validation stages on EKS — zero-downtime deployments with graceful degradation capabilities built directly into the release process.

EDUCATION

Southeast Missouri State University

M.S., Applied Computer Science

Gudlavalleru Engineering College

B.Tech, Electronics & Communication Engineering

CERTIFICATIONS

–AWS Certified Solutions Architect – Associate

–Certified Kubernetes Administrator (CKA) – In Progress

–HashiCorp Terraform Associate – In Progress

–Microsoft Certified: Azure Data Engineer Associate – In Progress
