Cloud Infrastructure Reliability Engineer

Location:

Texas City, TX

Posted:

May 23, 2025


Resume:

Sravya T

Email: ******.***@*****.*** Phone: 940-***-****

Summary:

Results-driven DevOps/Site Reliability Engineer with 8+ years of experience in designing, automating, and optimizing mission-critical deployments in cloud environments. Expertise in cloud infrastructure, CI/CD pipelines, observability, security, and incident management, ensuring high availability, scalability, and system reliability. Strong advocate of Infrastructure as Code (IaC), SRE best practices, and DevSecOps principles.

Professional Experience Highlights:

Designed and deployed cloud infrastructure using AWS, Terraform, Kubernetes, and modern DevOps practices.

Designed, deployed, and managed cloud-native architectures using AWS, Azure, Kubernetes (EKS, AKS), Terraform, and CloudFormation to improve scalability and cost efficiency.

Automated provisioning of cloud resources using Terraform, Ansible, Packer, Helm, and CloudFormation for repeatability and compliance.

Expertise in Docker, Kubernetes (EKS, AKS), ECS, OpenShift, Helm, and Istio for microservices deployment, scaling, and networking.

Developed end-to-end CI/CD pipelines using Jenkins, GitHub Actions, GitLab CI/CD, Azure DevOps, ArgoCD, and Spinnaker to enable fast and reliable software delivery.

Implemented real-time monitoring and logging solutions with Prometheus, Grafana, ELK Stack, CloudWatch, Datadog, Splunk, and New Relic, improving system uptime and incident response.

Integrated DevSecOps best practices, including IAM, AWS KMS, HashiCorp Vault, SSO, RBAC, Secrets Manager, CIS Benchmarking, and vulnerability scanning to secure cloud environments.

Automated system tasks and workflows using Python, Bash, Shell, PowerShell, and YAML to optimize cloud operations and reduce manual effort.

Experience with Git, GitHub, GitLab, Bitbucket, and SVN, ensuring efficient code management and collaboration across teams.

Expertise in Load Balancing, CDN, DNS, TCP/IP, TLS/SSL, and DDoS mitigation to improve application performance and security.

Led incident response, root cause analysis, and reliability engineering using SLOs, SLIs, SLAs, Chaos Engineering, Postmortems, and Runbooks, ensuring 99.99% uptime.

Hands-on experience with ServiceNow, ManageEngine, and HP Service Manager for efficient incident tracking and problem resolution.

Worked on multi-cloud and hybrid cloud strategies, integrating services across AWS, Azure and Kubernetes clusters.

Education:

Bachelor's degree from JNTU, India.

Key Skills & Expertise:

Cloud & Infrastructure: AWS (EC2, S3, IAM, Lambda, API Gateway, Kinesis, Redshift, VPC, Route 53, CloudFormation, CloudWatch), Azure, Terraform, Kubernetes (EKS, AKS), Docker, Helm.

Automation & Configuration Management: Terraform, Ansible, Puppet, Chef, Packer.

CI/CD & DevOps Pipelines: Jenkins, GitHub Actions, GitLab CI/CD, Azure DevOps, ArgoCD, Spinnaker.

Observability & Monitoring: Prometheus, Grafana, ELK Stack, Datadog, Splunk, CloudWatch, New Relic.

Scripting & Programming: Python, Shell, Bash, YAML, PowerShell.

Version Control & Collaboration: Git, GitHub, GitLab, Bitbucket, SVN.

Security & Compliance: IAM, AWS KMS, Secrets Manager, HashiCorp Vault, SSO, RBAC, CIS benchmarking.

Networking & Performance Optimization: Load Balancers, CDN, DNS, TCP/IP, TLS/SSL, DDoS mitigation.

Incident Management & SRE Practices: SLOs, SLIs, SLAs, Chaos Engineering, Postmortems, Runbooks.

Containerization & Orchestration: Kubernetes (Helm, Istio, Calico), Docker, ECS, OpenShift.

Ticketing Tools: ServiceNow, ManageEngine, and HP Service Manager.

Professional Experience:

Sr. DevOps/Site Reliability Engineer (SRE), Navy Federal Credit Union, VA (Remote), May 2023 – Present

Design, deploy, and maintain highly available, scalable, and secure cloud infrastructure on AWS and Azure using Terraform, CloudFormation, Kubernetes (EKS/AKS), and Helm.

Automate cloud resource provisioning with Terraform, Ansible, and Packer for infrastructure as code (IaC) adoption.

Optimize cloud costs and performance while ensuring compliance with financial regulatory requirements (e.g., PCI-DSS, SOC 2).

Build and maintain CI/CD pipelines using Jenkins, GitHub Actions, GitLab CI/CD, Azure DevOps, and ArgoCD to ensure secure and efficient software delivery.

Implement GitOps workflows with ArgoCD and Spinnaker for declarative application management.

Automate software deployment strategies (Blue-Green, Canary, Rolling Updates) to minimize downtime in production.

Develop and manage real-time monitoring, logging, and alerting using Prometheus, Grafana, ELK Stack, CloudWatch, Splunk, and Datadog to ensure system reliability.

Define and monitor SLOs, SLIs, and SLAs for financial services, ensuring high uptime and performance.

Implement Chaos Engineering practices to proactively test system resilience and improve fault tolerance.

Lead incident response, root cause analysis, and postmortem analysis to improve system reliability.

Implement IAM, AWS KMS, Secrets Manager, HashiCorp Vault, RBAC, and SSO for securing cloud resources and access management.

Enforce CIS Benchmarks, vulnerability scanning, and compliance auditing to meet regulatory requirements in financial environments.

Secure CI/CD pipelines with code scanning, secret detection, and container security best practices.

Implement DDoS mitigation, TLS/SSL, and WAF (Web Application Firewall) to protect financial applications.

Deploy and manage containerized workloads using Kubernetes (EKS/AKS), Docker, Helm, Istio, and OpenShift.

Implement service mesh and network policy architectures (Istio, Calico) for secure and scalable microservices communication.

Optimize Kubernetes clusters for high availability, auto-scaling, and cost efficiency in production environments.

Design and manage high-performance networking solutions including Load Balancers, CDN, DNS, TCP/IP, TLS/SSL, and DDoS mitigation.

Optimize API Gateway and Kinesis data streaming for real-time financial transactions.

Implement caching strategies (Redis, CloudFront, Memcached) to enhance application performance.

Work with ServiceNow, ManageEngine, and HP Service Manager for incident tracking, change management, and ticketing.

Develop automated runbooks and playbooks for incident response and disaster recovery.

Participate in on-call rotations and incident response handling to ensure 24/7 system availability.

Architect hybrid and multi-cloud solutions integrating AWS, Azure, and on-prem infrastructure.

Implement cross-cloud networking, authentication, and data synchronization strategies.

Manage data lakes, Redshift, and Big Data pipelines for financial analytics and reporting.

Work closely with development, security, and operations teams to enforce Agile, Scrum, and SRE best practices.

Improve developer productivity by automating test environments and integrating DevOps pipelines.

Conduct knowledge-sharing sessions, workshops, and training for DevOps and SRE best practices.

Design and implement disaster recovery (DR) and business continuity (BCP) strategies for financial applications.

Automate backup and failover mechanisms to ensure high availability during outages.

Test and refine incident response plans and DR procedures regularly.

Sr. DevOps Engineer, Trusted Media Brands, Milwaukee, WI, January 2021 – April 2023

Designed and automated CI/CD pipelines using Jenkins, GitHub Actions, GitLab CI/CD, and Azure DevOps, ensuring seamless application deployment and faster release cycles.

Provisioned and managed cloud environments in AWS, Azure, and Kubernetes (EKS, AKS) using Terraform, CloudFormation, and Ansible for scalable and repeatable deployments.

Built and maintained Docker containers and managed containerized applications using Kubernetes, Helm, and ECS to ensure microservices scalability and reliability.

Automated infrastructure provisioning with Terraform, Ansible, and Packer, ensuring consistency and compliance across environments.

Implemented monitoring, logging, and alerting solutions using Prometheus, Grafana, ELK Stack, Datadog, Splunk, and CloudWatch, improving incident response and system uptime.

Enforced security best practices, including IAM, AWS KMS, Secrets Manager, HashiCorp Vault, RBAC, and CIS benchmarking, to enhance system security and access control.

Optimized application performance through Load Balancers, CDN, DNS, TLS/SSL, and DDoS mitigation, ensuring high availability and reduced latency.

Maintained system reliability using SLOs, SLIs, SLAs, Chaos Engineering, Postmortems, and Runbooks, minimizing downtime and improving recovery processes.

Developed automation scripts using Python, Bash, Shell, and PowerShell to streamline deployments, configuration management, and system monitoring.

Managed repositories with Git, GitHub, GitLab, and Bitbucket, ensuring proper branching, version control, and team collaboration.

Used ServiceNow, ManageEngine, and HP Service Manager for tracking incidents, performing root cause analysis, and implementing permanent fixes.

Worked on multi-cloud and hybrid cloud strategies, integrating services across AWS, Azure, and Kubernetes clusters for business continuity and cost optimization.

Implemented GitOps practices with ArgoCD and Flux to streamline Kubernetes deployments and improve infrastructure consistency.

Worked on AWS Cost Explorer, Azure Cost Management, and Kubecost to optimize cloud spending and improve cost efficiency.

Designed and deployed AWS Lambda, API Gateway, Kinesis, and EventBridge for scalable and cost-effective serverless applications.

Configured Edge Locations and AWS Wavelength for improved latency and performance in financial applications.

Supported MLOps pipelines by integrating Kubeflow, SageMaker, and TensorFlow into DevOps workflows for financial predictive analytics.

SRE (Site Reliability Engineer), JPMC, NYC, NY, June 2018 – December 2020

Identified and resolved issues in production environments to ensure high availability and reliability.

Configured and tested disaster recovery (DR) solutions to ensure business continuity.

Managed Unix/Linux and Windows environments, including server migration (V2V, VMware Converter) and virtual machine provisioning.

Configured and managed DNS records, NIS, and LDAP for secure and efficient connectivity.

Developed shell scripts to automate administrative tasks, patch management (RPM, YUM), and system monitoring.

Managed user accounts and access controls, and worked with Microsoft SQL Server for database support.

Worked with internal and external support teams on incident resolution, change management, and performance tuning.

Deployed and managed infrastructure using AWS, Azure, Kubernetes (EKS, AKS), Terraform, and Ansible.

Implemented CI/CD pipelines with Jenkins, GitHub Actions, GitLab CI/CD, and ArgoCD, and automated deployments using Infrastructure as Code (IaC).

Set up monitoring solutions using Prometheus, Grafana, ELK Stack, Datadog, and CloudWatch to track SLOs, SLIs, and SLAs.

Implemented IAM, RBAC, AWS KMS, HashiCorp Vault, and CIS Benchmarking for secure access management.

Conducted root cause analysis, incident response, and postmortems to improve system reliability and prevent future failures.

Managed load balancers, CDN, and DDoS mitigation, and optimized TCP/IP and TLS/SSL configurations.

Implemented fault injection testing, automated failovers, and resilience testing to improve system stability.

DevOps Cloud Engineer, Cigniti Technologies, Hyderabad, India, October 2016 – April 2018

Managed the private cloud environment using Chef.

Managed and optimized continuous delivery tools such as Jenkins.

Installed, configured, and administered the Jenkins continuous integration tool.

Developed and implemented Software Release Management strategies for various applications according to the agile process.

Developed build and deployment scripts using Maven as the build tool in Jenkins to promote builds from one environment to another.

Performed branching, tagging, and release activities in Git.

Automated deployment of builds to different environments using Jenkins.

Built and deployed Java/J2EE applications to a web application server in an Agile continuous integration environment and automated the whole process.

Used Jenkins for Continuous Integration and deployment into WebSphere Application Server.

Used Maven as the build tool on Java projects to produce build artifacts from source code.

Developed build and deployment processes for pre-production environments.

Used Subversion for source code repositories.

Developed automation scripts in Python (core) and Puppet to deploy and manage Java applications across Linux servers.

Managed SVN repositories for branching, merging, and tagging.

Set up continuous delivery with Puppet by creating manifests and maintaining templates for different environments; migrated shell scripts into Puppet manifests.

Wrote Puppet code to provision infrastructure.

Developed Puppet modules (blueprints) for installation, configuration, and continuous integration (CI).

Wrote parent POM files to integrate code quality tools.

Installed, configured, and administered the Jenkins CI tool on Linux machines.

Installed and configured the Nexus repository manager to share artifacts within the company.

Used Jira as the ticket tracking and workflow tool.

Environment: Java, Shell Script, Git, Jenkins, Puppet, Artifactory, Linux, Maven, WebSphere, Jira.
