Site Reliability Engineer Cloud & Kubernetes Expert

Location:

Dallas, TX

Salary:

75000

Posted:

February 23, 2026


Resume:

SAMPATH KANDADI

Site Reliability Engineer

*************@*****.*** +1-774-***-**** Austin, TX LinkedIn

PROFESSIONAL SUMMARY

Cloud & Site Reliability Engineer with over 4 years of experience specializing in Kubernetes orchestration, database performance optimization, and distributed systems resiliency. Expert in resolving complex production bottlenecks including OOM kills, connection pool exhaustion, and race conditions. Proven track record of improving system availability through self-healing architectures, graceful degradation strategies, and proactive monitoring using Prometheus and Grafana.

SKILLS

• Cloud Platforms: AWS (EC2, S3, RDS, Lambda, EKS), Azure (AKS, VM), GCP (GKE)

• Infrastructure as Code: Terraform, Ansible, CloudFormation, Helm

• Containerization & Orchestration: Docker, Kubernetes (Multi-replica coordination, PodLockManager)

• Databases & Caching: PostgreSQL (PgBouncer, Prisma), Redis (Cache sync, Rate limiting), MySQL, DynamoDB

• Monitoring & Observability: Prometheus, Grafana, Datadog, ELK Stack, CloudWatch

• CI/CD & Tools: GitHub Actions, Jenkins, Argo CD, GitOps, Jira

• Reliability Engineering: OOM debugging, Race condition resolution, Deadlock prevention, Self-healing systems, Graceful degradation

• Programming: Python, Go, Bash, Node.js

WORK EXPERIENCE

Johnson Controls Site Reliability Engineer Austin, TX Jan 2025 – Present

- Achieved 99.9% application uptime by implementing self-healing logic and graceful degradation for microservices during temporary database and Redis unavailability.

- Eliminated connection pool exhaustion and reduced database latency by 25% through the integration of PgBouncer and optimization of Prisma Query Engine configurations.

- Decreased pod restart frequency by 40% by performing OOM debugging on unbounded in-memory buffers and right-sizing Kubernetes resource limits for spend log transactions.

- Reduced mean time to recovery (MTTR) by 35% by defining SLIs/SLOs and automating incident alerts via Datadog and Prometheus for multi-pod deployments.

- Optimized cluster utilization and reduced compute costs by 25% by implementing Kubernetes Horizontal Pod Autoscaler (HPA) and resource quotas.

- Improved system visibility and reduced issue detection time by 75% by designing an end-to-end observability stack using Grafana, Loki, and Prometheus to track budget metrics and connection limits.

Capgemini Services Pvt Ltd Cloud Engineer Hyderabad, India Aug 2022 – July 2023

- Resolved critical API deadlocks and race conditions in backend Node.js services by implementing PodLockManager for multi-replica coordination and fixing Redis cache sync issues.

- Increased deployment scalability for 50+ microservices by orchestrating containerized applications on EKS/AKS using custom Helm charts and YAML manifests.

- Improved deployment speed by 60% and ensured consistent rollouts across environments by designing automated Azure DevOps CI/CD pipelines.

- Reduced latency and improved page load speeds by 45% by configuring Akamai CDN with AWS CloudFront to optimize global content delivery.

- Enhanced production stability by participating in 24/7 on-call rotations and engaging in open-source community debugging on GitHub to resolve upstream dependency issues.

- Prevented 10+ high-severity vulnerabilities pre-deployment by integrating Trivy scans and IAM policy validation into GitOps-driven pipelines.

PepsiCo DevOps Engineer Hyderabad, India Jan 2021 – Jul 2022

- Reduced environment setup time from 2 days to under 2 hours by provisioning automated AWS infrastructure (EC2, S3, RDS) for 6+ concurrent development projects.

- Decreased manual deployment effort by 80% by developing Jenkins CI/CD pipelines for automated build, test, and deployment stages.

- Improved operational uptime by 25% by creating Bash scripts for system health checks, automated backups, and log cleanup.

- Strengthened security posture by standardizing least-privilege access for 25+ developers through managed IAM roles and access policies.

- Achieved 100% backup reliability and reduced recovery validation time by 80% by automating EBS snapshot scheduling and RDS backup validation.

PROJECTS

EDUCATION

Trine University Master's in Information Studies Angola, Indiana Aug 2023 – May 2025
