DevOps/SRE Engineer - AWS - Azure - K8s - Terraform

Location:

New York City, NY

Posted:

April 30, 2026


Resume:

SIRI CHANDANA BATHINI

DevOps / SRE Engineer | AWS | Azure | Kubernetes | Terraform | CI/CD | Python

228-***-**** | ***************.****@*****.*** | LinkedIn: www.linkedin.com/in/bchandana2309

PROFESSIONAL SUMMARY

DevOps / SRE Engineer with 4+ years building, automating, and operating production infrastructure across AWS, Azure, and GCP in regulated, multi-tenant environments. Hands-on expertise with Terraform and Pulumi IaC, Kubernetes cluster operations across EKS and AKS, and CI/CD pipeline automation using GitHub Actions, GitLab CI, Jenkins, and Azure DevOps. Proven track record of reducing deployment risk, accelerating release velocity, and improving system reliability through observability, on-call incident response, and post-incident reviews. Strong scripting skills in Python, Bash, and PowerShell, with experience enforcing security best practices, compliance standards, and operational excellence across distributed cloud platforms.

PROFESSIONAL EXPERIENCE

DevOps / SRE Engineer, Costco, USA July 2025 – Present

Own and maintain Terraform and Pulumi infrastructure as code across AWS and Azure, authoring reusable modules for EKS, ECS/Fargate, VPC, IAM, RDS, and S3, reducing environment provisioning time by 70% and enforcing consistent, auditable configurations across production and pre-production.

Design, implement, and maintain CI/CD pipelines using GitHub Actions, GitLab CI, Jenkins, and Azure DevOps, automating build, test, deployment, and rollback workflows across Git-based multi-environment delivery, cutting release cycle time by 40% and improving code quality gates.

Provision, upgrade, and operate Kubernetes clusters on EKS and AKS, managing workload orchestration, autoscaling, Blue-Green and Canary deployments, and rollback strategies for Dockerized applications, ensuring zero-downtime releases in 24/7 production environments.

Build and maintain observability systems using Prometheus, Grafana, Datadog, Splunk, ELK Stack, and CloudWatch, defining SLIs, SLOs, and error budgets, reducing mean time to detection by 50% and enabling proactive reliability management across distributed systems.

Lead on-call rotations and post-incident reviews using structured RCA methodology, driving long-term reliability improvements and implementing preventative fixes that reduced repeat incidents by 35%.

Enforce security and compliance controls including SAST/DAST pipeline integration, IAM least-privilege, secrets management, Zero Trust architecture, and HIPAA/NIST-aligned audit logging across multi-tenant cloud environments.

Automate operational workflows, diagnostics, and ChatOps tooling using Python, Bash, and PowerShell, reducing manual operational effort by 40% and building self-healing infrastructure patterns that improve engineering productivity and developer velocity.

ML Data Associate – SupportOps, Amazon, Hyderabad August 2022 – November 2023

Supported AWS-hosted distributed production systems under strict SLA requirements in Agile/Scrum environments, performing RCA on infrastructure, networking, Kubernetes workload, and application-layer failures to restore service availability.

Automated operational diagnostics and workflows using Python and Bash, reducing manual troubleshooting effort by 30% and improving incident investigation consistency across distributed cross-functional engineering teams.

Monitored infrastructure health and application reliability using CloudWatch, Prometheus, and ELK Stack, proactively detecting PostgreSQL and RDS performance degradation and compute bottlenecks before customer impact.

Participated in incident postmortems and security audits, maintaining clear documentation of architecture decisions, deployment procedures, and operational runbooks to improve response consistency and audit readiness.

Collaborated with engineering and security teams on CI/CD pipeline improvements, deployment reliability, compliance validation, and cloud security hardening across AWS-hosted services.

S&R Specialist, Amazon, Hyderabad July 2018 – March 2022

Supported large-scale AWS and Azure cloud infrastructure and distributed systems, performing Linux system administration, networking troubleshooting, and application failure resolution during production incidents.

Assisted with deployments, rollbacks, patch management, and infrastructure updates following Agile change-management procedures, escalating critical incidents and coordinating cross-team resolution.

Improved operational runbooks, Confluence documentation, and troubleshooting procedures, building foundational expertise in cloud infrastructure, containerization, IaC, and SRE practices.

EDUCATION & CERTIFICATIONS

Master of Science in Computer and Information Science – University of Southern Mississippi Jan 2024 – May 2025

Generative AI Badge – Databricks

TECHNICAL SKILLS

Cloud Platforms: AWS (EKS, ECS/Fargate, EC2, VPC, IAM, S3, RDS, Lambda, CloudWatch, Route 53, ELB, ASG), Azure (AKS, VMs, VNets, App Services, Azure SQL, Monitor, Storage, DevOps), GCP fundamentals

IaC and Config Management: Terraform (modules, remote state, multi-environment), Pulumi, CloudFormation, Bicep, ARM Templates, Ansible, Packer

Containers and Orchestration: Kubernetes (EKS, AKS, GKE), Docker, Helm, ECS/Fargate, cluster provisioning, autoscaling, workload orchestration, rollout/rollback, namespace isolation

CI/CD and GitOps: GitHub Actions, GitLab CI/CD, Jenkins, Azure DevOps, CircleCI, Git, YAML pipelines, ArgoCD, Flux, GitOps principles, Blue-Green, Canary deployments, feature flagging

Observability: Prometheus, Grafana, Datadog, ELK Stack, Kibana, Splunk, New Relic, CloudWatch, OpenTelemetry, Sumo Logic, Azure Monitor, SLIs, SLOs, error budgets, alerting

Security and Compliance: IAM least-privilege, secrets management, SAST/DAST/SCA, Zero Trust architecture, HIPAA, NIST, ISO 27001, PCI DSS, GDPR, audit logging, compliance validation

Scripting and Dev: Python, Bash, PowerShell, YAML, Go fundamentals, Ruby fundamentals, REST API integration, workflow automation, operational diagnostics tooling

Databases: PostgreSQL, MySQL, Azure SQL, Snowflake, RDS, SQL, NoSQL fundamentals, database availability management

SRE and Practices: On-call rotations, incident response, RCA, postmortems, disaster recovery, SLIs/SLOs/error budgets, Chaos Engineering, Agile/Scrum

Tools and Platforms: Confluence, Jira, Atlassian Suite, Kafka, ChatOps (Slack, Teams automation), vendor management, cost optimization
