Senior SRE/Platform Engineer, Kubernetes & Cloud Cost Focus (W2)

Location:
Irving, TX
Salary:
125000
Posted:
February 27, 2026

Resume:

VAMSHI GOGULA

Senior Site Reliability Engineer / Platform Engineer / Cloud DevOps / SRE

**************@*****.*** | +1-972-***-**** | TX | https://www.linkedin.com/in/vamshi-gv/ | github.com/Vamshi Gogula

PROFESSIONAL SUMMARY

Seasoned Site Reliability Engineer with 10+ years architecting, operating, and optimizing mission-critical platforms across the Healthcare (Cigna, Leidos Health), Fintech (Fidelity Investments), and Telecom (Verizon Wireless) sectors. Deep expertise in AWS and Azure cloud ecosystems, EKS/AKS Kubernetes orchestration, and enterprise application support at massive scale.

Proven track record driving FinOps initiatives ($500K+ cost savings), SLO/SLI definition, error budget management, and chaos engineering implementations. Expert in incident response, RCA, on-call operations, and release management for 24x7 production environments. Skilled in observability (Datadog, New Relic, Splunk, Prometheus, Grafana, Catchpoint), CI/CD (Jenkins, Azure DevOps, GitHub Actions, Argo CD), and Infrastructure as Code (Terraform, CloudFormation, ARM, Bicep, Ansible).

Certified AWS Solutions Architect and CKA with hands-on experience in Kafka upgrades, Cassandra administration, SQL optimization, namespace management, cluster upgrades, DR activities, and batch job orchestration. Strong focus on upstream/downstream system reliability, PDB implementation, Vault secrets management, and production management with rigorous metrics.

PROFESSIONAL EXPERIENCE

FIDELITY INVESTMENTS, Westlake, TX (Fintech Domain)

Senior Site Reliability Engineer / Cloud Platform Engineer / SRE Lead (Aug 2024 – Present)

Collaboration with Development Teams & Stakeholder Management

Orchestrated cross-functional alignment with 15+ engineering squads utilizing Jira for sprint tracking and Confluence for technical documentation, establishing standardized SLOs and operational runbooks

Championed release management coordination through Change Advisory Board processes, synchronizing deployment schedules across upstream and downstream dependencies

Facilitated blameless postmortems and knowledge-sharing sessions, improving team MTTR by 40% through enhanced communication protocols

Partnered with product owners to translate business requirements into technical reliability targets, defining error budgets for critical payment processing services

Mentored junior engineers on SRE best practices, incident response procedures, and observability implementation

Performance & Capacity Planning

Architected EKS cluster elasticity leveraging AWS Auto Scaling Groups, Karpenter, and Cluster Autoscaler, absorbing 300% traffic spikes during market volatility

Conducted capacity forecasting for platforms processing 10M+ daily transactions, ensuring 99.99% availability during peak trading volumes

Optimized namespace resource quotas and limits across 50+ EKS clusters, preventing noisy neighbor issues and ensuring fair resource allocation

Implemented HPA and VPA policies for Java/Spring Boot microservices, reducing over-provisioning by 35% while maintaining performance SLAs

Performed SQL query optimization and database performance tuning using DBeaver, reducing query execution time by 50% for reporting batch jobs

Infrastructure & Cloud Engineering

Led EKS cluster upgrades spanning 5 Kubernetes minor releases (1.24 to 1.29) for 100+ microservices, validating CRD compatibility and API deprecations

Provisioned consistent environments across dev/staging/production using Terraform modules and CloudFormation templates, ensuring infrastructure immutability

Engineered multi-AZ EKS architectures with AWS NLB/ALB integration, achieving 99.99% uptime for critical brokerage applications

Automated namespace creation workflows with Terraform and custom operators, reducing provisioning time from 2 days to 15 minutes

Implemented AWS VPC peering, PrivateLink, and transit gateway configurations for secure cross-account communication

Observability & Monitoring

Architected comprehensive telemetry pipelines using Datadog APM, custom metrics, and distributed tracing, correlating upstream and downstream service dependencies

Integrated New Relic for Java application performance monitoring, identifying memory leaks and thread contention, improving P99 latency by 40%

Created Grafana dashboards and Prometheus alerting rules for EKS cluster health, node utilization, and pod resource consumption

Configured Splunk log aggregation and Splunk dashboards for audit compliance and security event monitoring

Established synthetic monitoring and SLO dashboards in Datadog, providing real-time visibility into error budget consumption

Automation & Toil Reduction

Constructed robust CI/CD pipelines via Jenkins, GitHub Actions, and Argo CD embracing GitOps principles, enabling declarative, auditable deployments

Automated Vault registration, secret injection, and certificate lifecycle management, eliminating manual toil by 80%

Developed Ansible playbooks for node configuration, security patching, and compliance hardening across 200+ EKS worker nodes

Implemented automated batch job scheduling and monitoring using AWS Lambda and EventBridge, ensuring timely completion of critical financial reconciliations

Created self-service portals for developers to provision namespaces, configure PDBs, and deploy applications without platform team intervention

Incident Response & Production Support

Spearheaded 24x7 on-call rotation for mission-critical payment gateway, serving as Incident Commander during high-severity outages

Drove incident response and RCA processes using Jira for ticket tracking and Confluence for documentation, reducing MTTR from 2 hours to 15 minutes (87% improvement)

Conducted chaos engineering experiments (pod kills, node failures, AZ outages) to validate resilience, identifying and remediating 12 single points of failure

Implemented automated remediation runbooks using AWS Lambda and Datadog webhooks, resolving common alerts without human intervention

Shielded $2M+ daily revenue exposure through rapid incident response and proactive monitoring of payment processing batch jobs

Service Reliability & Uptime Ownership

Custodian of availability for enterprise applications processing 10M+ daily transactions; architected SLIs (availability, latency, error rate, throughput) and SLOs (99.99%)

Managed error budgets and reliability roadmaps, balancing feature velocity with platform stability

Executed DR activities including route-aways, rehydration exercises, and failover testing, validating RPO < 5 minutes and RTO < 30 minutes

Implemented PDBs (Pod Disruption Budgets) and graceful termination policies, ensuring zero-downtime deployments and node rotations

Owned production management with rigorous metrics, presenting platform health dashboards to C-level executives monthly

Data & Messaging Infrastructure

Architected real-time payment processing using Kafka Streams, handling 5M+ events/day with exactly-once processing semantics

Led Kafka upgrades from 2.8 to 3.5 across 10 clusters, performing rolling upgrades with zero downstream impact

Built consumer lag monitoring and dead letter queue automation, reducing failed transaction recovery time by 60%

Tuned Kafka producer/consumer configurations for optimal throughput and latency, achieving 100K+ messages/second per cluster

Managed Cassandra clusters for time-series data storage, implementing repair schedules and compaction strategies

Security & Compliance

Automated SSL certificate renewal and Vault integration using Venafi and custom scripts, eliminating certificate expiry incidents

Implemented IAM roles for service accounts (IRSA), RBAC policies, and network policies for EKS workload isolation

Configured AWS KMS encryption for EBS volumes, S3 buckets, and RDS databases, ensuring data-at-rest protection

Enforced Pod Security Standards and security contexts, preventing privileged container escalations

Maintained PCI-DSS compliance for payment processing infrastructure, passing annual audits with zero critical findings

LEIDOS HEALTH (Healthcare Domain)

Site Reliability Engineer / Platform Engineer / MLOps Engineer (Oct 2023 – Aug 2024)

Collaboration with Development Teams & Stakeholder Management

Synchronized with data engineers, ML practitioners, and clinical teams via Jira and Confluence to blueprint HIPAA-compliant platform architecture

Shepherded Agile sprint execution and release management for healthcare AI/ML applications serving 5M+ patient records

Facilitated cross-functional war rooms during critical incidents, coordinating upstream system owners and downstream consumers

Documented operational procedures, runbooks, and architecture decisions in Confluence, improving on-call readiness

Partnered with compliance officers to ensure HIPAA adherence in all infrastructure changes and batch job processing

Performance & Capacity Planning

Calibrated AKS autoscaling policies, PDBs, and resource quotas for ML training workloads requiring GPU/CPU optimization

Orchestrated capacity planning for batch and streaming inference pipelines, ensuring optimal resource utilization for enterprise applications

Implemented node pools with taints/tolerations for GPU workloads, isolating expensive resources from general workloads

Monitored Cassandra cluster performance and adjusted replication factors for read/write latency optimization

Forecasted storage growth for patient data and imaging workloads, provisioning Azure Blob Storage with lifecycle policies

Infrastructure & Cloud Engineering

Automated infrastructure delivery via Azure DevOps, Terraform, ARM templates, and Bicep for consistent AKS provisioning

Engineered AKS cluster topology with Azure Site Recovery, enabling cross-region failover for life-critical healthcare systems

Performed cluster upgrades for 8 AKS environments, validating Azure CNI compatibility and API server availability

Automated namespace creation with resource templates, network policies, and RBAC bindings for multi-tenant workloads

Configured Azure Private Link, VNet peering, and firewall rules for secure hybrid cloud connectivity

Observability & Monitoring

Forged comprehensive observability fabric using Datadog, Azure Monitor, Log Analytics, and New Relic

Crafted tailored dashboards for healthcare compliance workloads, guaranteeing SLA adherence and preemptive anomaly detection

Implemented Prometheus and Grafana for AKS cluster metrics, custom application metrics, and alerting

Configured Splunk log forwarding for security audit trails and PHI access monitoring

Established distributed tracing for microservices, correlating upstream API calls with downstream database queries

Automation & Toil Reduction

Streamlined CI/CD orchestration for application and ML pipelines via Jenkins and Azure DevOps, compressing deployment cycles by 65%

Implemented GitOps with Argo CD for declarative AKS deployments, enabling automated rollbacks and drift detection

Automated batch job scheduling using Azure Data Factory triggers and Azure Logic Apps, ensuring timely data pipeline execution

Developed Ansible playbooks for AKS node hardening, log agent installation, and monitoring agent configuration

Created self-service namespace provisioning via Azure DevOps pipelines, empowering developers with guardrails

Incident Response & Production Support

Rotated through on-call duties for Contact Center and patient data platforms; steered incident response and RCA via Jira

Documented resolution playbooks in Confluence, curtailing repeat incidents by 40% through knowledge institutionalization

Participated in DR activities and failover testing for Azure Site Recovery, validating RTO/RPO objectives

Implemented PDBs and graceful shutdown hooks for ML inference services, preventing request drops during deployments

Managed incident command during HIPAA-related security events, coordinating with legal and compliance teams

Service Reliability & Uptime Ownership

Steward of reliability for Amazon Connect-based Contact Center serving millions of patients; sustained 99.9% uptime for voice/email/chat

Defined SLOs for patient appointment scheduling, prescription refills, and claims inquiry systems

Managed error budgets and reliability investments, prioritizing technical debt reduction based on SLO breach analysis

Conducted chaos engineering experiments on AKS workloads, validating resilience of patient-facing services

Owned production management metrics, presenting monthly reliability reviews to healthcare leadership

Data & Messaging Infrastructure

Supported Kafka clusters for event-driven patient data integration between upstream EHR systems and downstream analytics platforms

Managed Cassandra deployments for high-velocity time-series health metrics, implementing compaction and repair automation

Orchestrated batch job processing for nightly claims adjudication, eligibility checks, and reporting workflows

Configured Azure Data Factory pipelines for HIPAA-compliant data movement between on-prem and cloud stores

Optimized SQL queries for reporting batch jobs, reducing execution time from 4 hours to 45 minutes using DBeaver analysis

Security & Compliance

Implemented Azure Key Vault integration for secrets management, certificate storage, and key rotation policies

Configured RBAC and Azure AD integration for AKS, enabling SSO and conditional access policies

Enforced network policies and Azure Firewall rules for PHI protection and lateral movement prevention

Automated SSL certificate renewal via Azure Key Vault and custom automation, ensuring zero expiry events

Maintained HIPAA audit trails via Splunk and Azure Monitor, supporting compliance investigations and reporting

CIGNA, Plano, TX & Connecticut (Healthcare Domain)

Senior SRE / Cloud Engineer / Kubernetes Architect (Oct 2021 – Sept 2023)

Partnered with 20+ engineering teams via Jira and Confluence to refine release workflows; facilitated blameless postmortems and AWS excellence sessions, driving cultural shift toward reliability engineering while ensuring HIPAA-compliant controls through security team engagement.

Designed EKS auto-scaling architecture orchestrating 1M+ daily healthcare claims; ensured peak performance during 300% enrollment surges through predictive scaling, optimized Cassandra clusters for claims data, and reduced SQL query latency by 45% via database tuning.

Migrated 200+ on-premises servers to AWS using CloudFormation/Terraform; executed zero-downtime migration of claims processing applications to EKS, performed cluster upgrades (1.21 to 1.27) across 6 production environments, and architected multi-region infrastructure with Route 53 failover and RDS read replicas.

Configured AWS CloudWatch, Datadog, Prometheus, Grafana, and Splunk for EKS clusters, EC2 fleets, and RDS databases; established SLO dashboards, custom metrics exporters, and distributed tracing (AWS X-Ray/Datadog APM), enabling rapid resolution through operational runbooks.

Consolidated 100+ Jenkins instances into containerized enterprise Jenkins on EKS, reducing maintenance overhead 60%; automated configuration governance via Ansible (75% effort reduction), implemented CI/CD pipelines enabling 30+ daily deployments with 99% success rate, and developed Terraform modules for self-service infrastructure provisioning.

Managed on-call rotation for healthcare ecosystems; reduced MTTR 50% through automated alerting and runbook automation, preserved HIPAA compliance during incident response, led war rooms for critical outages coordinating upstream/downstream systems, and achieved 99.95% availability through SLO monitoring and error budget management.

Configured IAM roles, RBAC, service mesh mTLS, and AWS KMS encryption for HIPAA compliance; administered Cassandra and Kafka clusters for claims processing, orchestrated batch job workflows, enforced Pod Security Policies, and maintained audit trails via CloudTrail and Splunk for compliance investigations.

VERIZON WIRELESS, Irving, TX (Telecom Domain)

Production Support Engineer / Platform Engineer / Cloud Architect (Apr 2017 – Oct 2021)

Engaged network engineering and 20+ application teams via Jira and Confluence to orchestrate release management for 50+ microservices; coordinated upstream billing system integrations with downstream customer notification platforms, facilitated Agile ceremonies, and partnered with vendor teams for telco-specific escalations while documenting architecture decisions and SLO targets.

Architected AWS and Azure Kubernetes foundations via Terraform and CloudFormation, standardizing on EKS/AKS for 200+ microservices; executed zero-downtime cluster upgrades across 12 environments, implemented multi-cloud DR strategies with Azure Site Recovery and AWS Pilot Light, and automated namespace provisioning with guardrails, resource limits, and network isolation policies.

Deployed Datadog, Grafana, Prometheus, New Relic, and Splunk for holistic monitoring of containerized telecom services; established synthetic monitoring for customer-facing APIs, created real-time dashboards for upstream network events and downstream customer impact, and reduced MTTR by 55% through automated remediation, improved runbooks, and Datadog-driven diagnostics while maintaining 24x7 on-call coverage.

Pioneered CI/CD automation with Jenkins, Git, Docker, Ansible, and Argo CD, enabling 50+ daily releases with 99.5% success rate; automated Vault integration and certificate lifecycle management, implemented GitOps for declarative deployments with automated rollbacks, developed self-service portals for namespace provisioning, and automated batch job scheduling for billing cycles and usage calculations.

Maintained 99.9% uptime for mission-critical billing platforms and customer-facing services; administered Kafka clusters processing 100K+ messages/sec for event-driven billing, managed Cassandra deployments for subscriber data, optimized SQL databases reducing execution time by 60%, defined SLIs/SLOs for 50+ microservices with error budget management, and enforced PCI-DSS compliance through IAM, RBAC, network policies, and automated security scanning in CI/CD pipelines.

YASH TECHNOLOGIES, India (Enterprise Domain)

DevOps Engineer / Production Support Engineer (Apr 2013 – Dec 2015)

Migrated and stabilized large-scale platforms on AWS and Azure, standardizing on Kubernetes-first architectures.

Operated AKS/EKS clusters with integrated monitoring, alerting, and log aggregation using Datadog, Grafana, and Splunk.

Developed CI/CD automation with Jenkins, Git, Docker, and Ansible, enabling reliable and frequent releases.

Participated in on-call rotations and incident response, applying post-incident learnings to improve platform resilience.

Wrote PowerShell scripts to automate Azure cloud system creation, including end-to-end infrastructure, VMs, storage, and firewall rules.

Managed release builds and Maven repository management, sharing snapshots and releases of internal projects through JFrog Artifactory.

Created Docker containers from existing Linux containers and AMIs, as well as from scratch.

Created additional Docker slave nodes for Jenkins using custom Docker images and pushed the images to ECR.

Worked with all major Docker components, including the Docker daemon, Hub, images, registry, and Swarm.

Implemented infrastructure automation with Ansible playbooks for auto-provisioning, code deployments, software installation, configuration updates, and configuration management.

CERTIFICATIONS

AWS Certified Solutions Architect – Professional

Certified Kubernetes Administrator (CKA)

EDUCATION

Master of Science – Computer Information Sciences, New England College, Henniker, NH, 2018
