VAMSHI GOGULA
Senior Site Reliability Engineer | Platform Engineer | Cloud DevOps SRE | **************@*****.*** | +1-972-***-**** | TX | https://www.linkedin.com/in/vamshi-gv/ | github.com/Vamshi Gogula
PROFESSIONAL SUMMARY
Seasoned Site Reliability Engineer with 10+ years architecting, operating, and optimizing mission-critical platforms across Healthcare (Cigna, Leidos Health), Fintech (Fidelity Investments), and Telecom (Verizon Wireless) sectors. Deep expertise in AWS and Azure cloud ecosystems, EKS/AKS Kubernetes orchestration, and enterprise application support at massive scale.
Proven track record driving FinOps initiatives ($500K+ cost savings), SLO/SLI definition, error budget management, and chaos engineering implementations. Expert in incident response, RCA, on-call operations, and release management for 24x7 production environments. Skilled in observability (Datadog, New Relic, Splunk, Prometheus, Grafana, Catchpoint), CI/CD (Jenkins, Azure DevOps, GitHub Actions, Argo CD), and Infrastructure as Code (Terraform, CloudFormation, ARM, Bicep, Ansible).
Certified AWS Solutions Architect and CKA with hands-on experience in Kafka upgrades, Cassandra administration, SQL optimization, namespace management, cluster upgrades, DR activities, and batch job orchestration. Strong focus on upstream/downstream system reliability, PDB implementation, Vault secrets management, and production management with rigorous metrics.
PROFESSIONAL EXPERIENCE
FIDELITY INVESTMENTS Westlake, TX Fintech Domain
Senior Site Reliability Engineer / Cloud Platform Engineer / SRE Lead Aug 2024 – Present
Collaboration with Development Teams & Stakeholder Management
Orchestrated cross-functional alignment with 15+ engineering squads utilizing Jira for sprint tracking and Confluence for technical documentation, establishing standardized SLOs and operational runbooks
Championed release management coordination through Change Advisory Board processes, synchronizing deployment schedules across upstream and downstream dependencies
Facilitated blameless postmortems and knowledge-sharing sessions, improving team MTTR by 40% through enhanced communication protocols
Partnered with product owners to translate business requirements into technical reliability targets, defining error budgets for critical payment processing services
Mentored junior engineers on SRE best practices, incident response procedures, and observability implementation
Performance & Capacity Planning
Architected EKS cluster elasticity leveraging AWS Auto Scaling Groups, Karpenter, and Cluster Autoscaler, absorbing 300% traffic spikes during market volatility
Conducted capacity forecasting for platforms processing 10M+ daily transactions, ensuring 99.99% availability during peak trading volumes
Optimized namespace resource quotas and limits across 50+ EKS clusters, preventing noisy neighbor issues and ensuring fair resource allocation
Implemented HPA and VPA policies for Java/Spring Boot microservices, reducing over-provisioning by 35% while maintaining performance SLAs
Performed SQL query optimization and database performance tuning using DBeaver, reducing query execution time by 50% for reporting batch jobs
Infrastructure & Cloud Engineering
Led EKS cluster upgrades across 5 Kubernetes minor versions (1.24 to 1.29) for 100+ microservices, validating CRD compatibility and API deprecations
Provisioned consistent environments across dev/staging/production using Terraform modules and CloudFormation templates, ensuring infrastructure immutability
Engineered multi-AZ EKS architectures with AWS NLB/ALB integration, achieving 99.99% uptime for critical brokerage applications
Automated namespace creation workflows with Terraform and custom operators, reducing provisioning time from 2 days to 15 minutes
Implemented AWS VPC peering, PrivateLink, and transit gateway configurations for secure cross-account communication
Observability & Monitoring
Architected comprehensive telemetry pipelines using Datadog APM, custom metrics, and distributed tracing, correlating upstream and downstream service dependencies
Integrated New Relic for Java application performance monitoring, identifying memory leaks and thread contention, improving P99 latency by 40%
Created Grafana dashboards and Prometheus alerting rules for EKS cluster health, node utilization, and pod resource consumption
Configured Splunk log aggregation and Splunk dashboards for audit compliance and security event monitoring
Established synthetic monitoring and SLO dashboards in Datadog, providing real-time visibility into error budget consumption
Automation & Toil Reduction
Constructed robust CI/CD pipelines via Jenkins, GitHub Actions, and Argo CD embracing GitOps principles, enabling declarative, auditable deployments
Automated Vault registration, secret injection, and certificate lifecycle management, eliminating manual toil by 80%
Developed Ansible playbooks for node configuration, security patching, and compliance hardening across 200+ EKS worker nodes
Implemented automated batch job scheduling and monitoring using AWS Lambda and EventBridge, ensuring timely completion of critical financial reconciliations
Created self-service portals for developers to provision namespaces, configure PDBs, and deploy applications without platform team intervention
Incident Response & Production Support
Spearheaded 24x7 on-call rotation for mission-critical payment gateway, serving as Incident Commander during high-severity outages
Drove incident response and RCA processes using Jira for ticket tracking and Confluence for documentation, reducing MTTR from 2 hours to 15 minutes (87% improvement)
Conducted chaos engineering experiments (pod kills, node failures, AZ outages) to validate resilience, identifying and remediating 12 single points of failure
Implemented automated remediation runbooks using AWS Lambda and Datadog webhooks, resolving common alerts without human intervention
Shielded $2M+ daily revenue exposure through rapid incident response and proactive monitoring of payment processing batch jobs
Service Reliability & Uptime Ownership
Custodian of availability for enterprise applications processing 10M+ daily transactions; architected SLIs (availability, latency, error rate, throughput) and SLOs (99.99%)
Managed error budgets and reliability roadmaps, balancing feature velocity with platform stability
Executed DR activities including route-aways, rehydration exercises, and failover testing, validating RPO < 5 minutes and RTO < 30 minutes
Implemented PDBs (Pod Disruption Budgets) and graceful termination policies, ensuring zero-downtime deployments and node rotations
Owned production management with rigorous metrics, presenting platform health dashboards to C-level executives monthly
Data & Messaging Infrastructure
Architected real-time payment processing using Kafka Streams, handling 5M+ events/day with exactly-once processing semantics
Led Kafka upgrades from 2.8 to 3.5 across 10 clusters, performing rolling upgrades with zero downstream impact
Built consumer lag monitoring and dead letter queue automation, reducing failed transaction recovery time by 60%
Tuned Kafka producer/consumer configurations for optimal throughput and latency, achieving 100K+ messages/second per cluster
Managed Cassandra clusters for time-series data storage, implementing repair schedules and compaction strategies
Security & Compliance
Automated SSL certificate renewal and Vault integration using Venafi and custom scripts, eliminating certificate expiry incidents
Implemented IAM roles for service accounts (IRSA), RBAC policies, and network policies for EKS workload isolation
Configured AWS KMS encryption for EBS volumes, S3 buckets, and RDS databases, ensuring data-at-rest protection
Enforced Pod Security Standards and security contexts, preventing privileged container escalations
Maintained PCI-DSS compliance for payment processing infrastructure, passing annual audits with zero critical findings
LEIDOS HEALTH Healthcare Domain
Site Reliability Engineer / Platform Engineer / MLOps Engineer Oct 2023 – Aug 2024
Collaboration with Development Teams & Stakeholder Management
Synchronized with data engineers, ML practitioners, and clinical teams via Jira and Confluence to blueprint HIPAA-compliant platform architecture
Shepherded Agile sprint execution and release management for healthcare AI/ML applications serving 5M+ patient records
Facilitated cross-functional war rooms during critical incidents, coordinating upstream system owners and downstream consumers
Documented operational procedures, runbooks, and architecture decisions in Confluence, improving on-call readiness
Partnered with compliance officers to ensure HIPAA adherence in all infrastructure changes and batch job processing
Performance & Capacity Planning
Calibrated AKS autoscaling policies, PDBs, and resource quotas for ML training workloads requiring GPU/CPU optimization
Orchestrated capacity planning for batch and streaming inference pipelines, ensuring optimal resource utilization for enterprise applications
Implemented node pools with taints/tolerations for GPU workloads, isolating expensive resources from general workloads
Monitored Cassandra cluster performance and adjusted replication factors for read/write latency optimization
Forecasted storage growth for patient data and imaging workloads, provisioning Azure Blob Storage with lifecycle policies
Infrastructure & Cloud Engineering
Automated infrastructure delivery via Azure DevOps, Terraform, ARM templates, and Bicep for consistent AKS provisioning
Engineered AKS cluster topology with Azure Site Recovery, enabling cross-region failover for life-critical healthcare systems
Performed cluster upgrades for 8 AKS environments, validating Azure CNI compatibility and API server availability
Automated namespace creation with resource templates, network policies, and RBAC bindings for multi-tenant workloads
Configured Azure Private Link, VNet peering, and firewall rules for secure hybrid cloud connectivity
Observability & Monitoring
Forged comprehensive observability fabric using Datadog, Azure Monitor, Log Analytics, and New Relic
Crafted tailored dashboards for healthcare compliance workloads, guaranteeing SLA adherence and preemptive anomaly detection
Implemented Prometheus and Grafana for AKS cluster metrics, custom application metrics, and alerting
Configured Splunk log forwarding for security audit trails and PHI access monitoring
Established distributed tracing for microservices, correlating upstream API calls with downstream database queries
Automation & Toil Reduction
Streamlined CI/CD orchestration for application and ML pipelines via Jenkins and Azure DevOps, compressing deployment cycles by 65%
Implemented GitOps with Argo CD for declarative AKS deployments, enabling automated rollbacks and drift detection
Automated batch job scheduling using Azure Data Factory triggers and Azure Logic Apps, ensuring timely data pipeline execution
Developed Ansible playbooks for AKS node hardening, log agent installation, and monitoring agent configuration
Created self-service namespace provisioning via Azure DevOps pipelines, empowering developers with guardrails
Incident Response & Production Support
Rotated through on-call duties for Contact Center and patient data platforms; steered incident response and RCA via Jira
Documented resolution playbooks in Confluence, curtailing repeat incidents by 40% through knowledge institutionalization
Participated in DR activities and failover testing for Azure Site Recovery, validating RTO/RPO objectives
Implemented PDBs and graceful shutdown hooks for ML inference services, preventing request drops during deployments
Managed incident command during HIPAA-related security events, coordinating with legal and compliance teams
Service Reliability & Uptime Ownership
Steward of reliability for Amazon Connect-based Contact Center serving millions of patients; sustained 99.9% uptime for voice/email/chat
Defined SLOs for patient appointment scheduling, prescription refills, and claims inquiry systems
Managed error budgets and reliability investments, prioritizing technical debt reduction based on SLO breach analysis
Conducted chaos engineering experiments on AKS workloads, validating resilience of patient-facing services
Owned production management metrics, presenting monthly reliability reviews to healthcare leadership
Data & Messaging Infrastructure
Supported Kafka clusters for event-driven patient data integration between upstream EHR systems and downstream analytics platforms
Managed Cassandra deployments for high-velocity time-series health metrics, implementing compaction and repair automation
Orchestrated batch job processing for nightly claims adjudication, eligibility checks, and reporting workflows
Configured Azure Data Factory pipelines for HIPAA-compliant data movement between on-prem and cloud stores
Optimized SQL queries for reporting batch jobs, reducing execution time from 4 hours to 45 minutes using DBeaver analysis
Security & Compliance
Implemented Azure Key Vault integration for secrets management, certificate storage, and key rotation policies
Configured RBAC and Azure AD integration for AKS, enabling SSO and conditional access policies
Enforced network policies and Azure Firewall rules for PHI protection and lateral movement prevention
Automated SSL certificate renewal via Azure Key Vault and custom automation, ensuring zero expiry events
Maintained HIPAA audit trails via Splunk and Azure Monitor, supporting compliance investigations and reporting
CIGNA Plano, TX & Connecticut Healthcare Domain
Senior SRE / Cloud Engineer / Kubernetes Architect Oct 2021 – Sept 2023
Partnered with 20+ engineering teams via Jira and Confluence to refine release workflows; facilitated blameless postmortems and AWS excellence sessions, driving cultural shift toward reliability engineering while ensuring HIPAA-compliant controls through security team engagement.
Designed EKS auto-scaling architecture orchestrating 1M+ daily healthcare claims; ensured peak performance during 300% enrollment surges through predictive scaling, optimized Cassandra clusters for claims data, and reduced SQL query latency by 45% via database tuning.
Migrated 200+ on-premises servers to AWS using CloudFormation/Terraform; executed zero-downtime migration of claims processing applications to EKS, performed cluster upgrades (1.21 to 1.27) across 6 production environments, and architected multi-region infrastructure with Route 53 failover and RDS read replicas.
Configured AWS CloudWatch, Datadog, Prometheus, Grafana, and Splunk for EKS clusters, EC2 fleets, and RDS databases; established SLO dashboards, custom metrics exporters, and distributed tracing (AWS X-Ray/Datadog APM), enabling rapid resolution through operational runbooks.
Consolidated 100+ Jenkins instances into containerized enterprise Jenkins on EKS, reducing maintenance overhead 60%; automated configuration governance via Ansible (75% effort reduction), implemented CI/CD pipelines enabling 30+ daily deployments with 99% success rate, and developed Terraform modules for self-service infrastructure provisioning.
Managed on-call rotation for healthcare ecosystems; reduced MTTR 50% through automated alerting and runbook automation, preserved HIPAA compliance during incident response, led war rooms for critical outages coordinating upstream/downstream systems, and achieved 99.95% availability through SLO monitoring and error budget management.
Configured IAM roles, RBAC, service mesh mTLS, and AWS KMS encryption for HIPAA compliance; administered Cassandra and Kafka clusters for claims processing, orchestrated batch job workflows, enforced Pod Security Policies, and maintained audit trails via CloudTrail and Splunk for compliance investigations.
VERIZON WIRELESS Irving, TX Telecom Domain
Production Support Engineer / Platform Engineer / Cloud Architect Apr 2017 – Oct 2021
Engaged network engineering and 20+ application teams via Jira and Confluence to orchestrate release management for 50+ microservices; coordinated upstream billing system integrations with downstream customer notification platforms, facilitated Agile ceremonies, and partnered with vendor teams for telco-specific escalations while documenting architecture decisions and SLO targets.
Architected AWS and Azure Kubernetes foundations via Terraform and CloudFormation, standardizing on EKS/AKS for 200+ microservices; executed zero-downtime cluster upgrades across 12 environments, implemented multi-cloud DR strategies with Azure Site Recovery and AWS Pilot Light, and automated namespace provisioning with guardrails, resource limits, and network isolation policies.
Deployed Datadog, Grafana, Prometheus, New Relic, and Splunk for holistic monitoring of containerized telecom services; established synthetic monitoring for customer-facing APIs, created real-time dashboards for upstream network events and downstream customer impact, and reduced MTTR by 55% through automated remediation, improved runbooks, and Datadog-driven diagnostics while maintaining 24x7 on-call coverage.
Pioneered CI/CD automation with Jenkins, Git, Docker, Ansible, and Argo CD, enabling 50+ daily releases with 99.5% success rate; automated Vault integration and certificate lifecycle management, implemented GitOps for declarative deployments with automated rollbacks, developed self-service portals for namespace provisioning, and automated batch job scheduling for billing cycles and usage calculations.
Maintained 99.9% uptime for mission-critical billing platforms and customer-facing services; administered Kafka clusters processing 100K+ messages/sec for event-driven billing, managed Cassandra deployments for subscriber data, optimized SQL databases reducing execution time by 60%, defined SLIs/SLOs for 50+ microservices with error budget management, and enforced PCI-DSS compliance through IAM, RBAC, network policies, and automated security scanning in CI/CD pipelines.
YASH TECHNOLOGIES India Enterprise Domain
DevOps Engineer / Production Support Engineer Apr 2013 – Dec 2015
Migrated and stabilized large-scale platforms on AWS and Azure, standardizing on Kubernetes-first architectures.
Operated AKS/EKS clusters with integrated monitoring, alerting, and log aggregation using Datadog, Grafana, and Splunk.
Developed CI/CD automation with Jenkins, Git, Docker, and Ansible, enabling reliable and frequent releases.
Participated in on-call rotations and incident response, applying post-incident learnings to improve platform resilience.
Wrote PowerShell scripts to automate Azure cloud provisioning, including end-to-end infrastructure, VMs, storage, and firewall rules.
Managed release builds and Maven repositories in JFrog Artifactory to share snapshots and releases of internal projects.
Created Docker containers from existing Linux containers and AMIs, as well as from scratch.
Created additional Docker agent nodes for Jenkins using custom Docker images and pushed the images to ECR.
Worked with all major Docker components, including the Docker daemon, Hub, images, registry, and Swarm.
Implemented infrastructure automation with Ansible playbooks for auto-provisioning, code deployments, software installation, configuration management, and configuration updates.
CERTIFICATIONS
AWS Certified Solutions Architect – Professional
Certified Kubernetes Administrator (CKA)
EDUCATION
Master of Science – Computer Information Sciences, New England College, Henniker, NH, 2018