Cloud DevOps Engineer - Cloud Automation & Monitoring Expert

Location:

San Jose, CA

Posted:

December 01, 2025


Resume:

Chandrakanth

San Jose, CA 408-***-**** *********@*****.*** LinkedIn

Summary

Cloud DevOps Engineer with extensive experience in cloud infrastructure, CI/CD automation, and monitoring across AWS, GCP, and Azure. Skilled in orchestrating containerized applications with Kubernetes and Docker, while leveraging Terraform for scalable deployments. Proven track record of enhancing system performance and reliability through automation and robust observability solutions.

Skills

•Cloud Computing: AWS, GCP, Azure

•Containerization, Orchestration & IaC: Kubernetes, Docker, Terraform, Infrastructure as Code (IaC), CloudFormation

•CI/CD: Jenkins, Spinnaker, Azure DevOps, GitLab, GitHub, Git, Release Management, ITIL

•Scripting, Automation & Administration: Bash, Python, Ansible, Unix/Linux Systems Administration

•Monitoring Tools: Grafana, Kibana, Nmsys, AlertManager, Prometheus, Nagios

•Tools & Database: Splunk, JFrog, SQL, Java, Spring Boot, REST API, Swagger

•AI Tools: Cursor, Replit, n8n, Copilot

Work Experience

Apple

DevOps SRE

Nov 2021 - Mar 2025

Sunnyvale, CA

• Enhanced a CI/CD Spinnaker pipeline to reduce deployment times and boost efficiency, aligning with best practices in automated deployments.

• Developed and maintained Ansible playbooks to streamline automation processes, ensuring consistent and reliable deployments across cloud environments.

• Optimized existing Ansible playbooks for HSM appliance deployments across varied environments, improving infrastructure stability and performance.

• Authored Ansible playbooks for pre- and post-deployment validations, significantly improving deployment verification procedures.

• Automated routine tasks with Bash and Python scripts to minimize manual effort and reduce errors in system operations.

• Designed and refined observability dashboards using Grafana to enhance system performance monitoring and expedite troubleshooting.

• Collaborated with Operations and Development teams to troubleshoot deployment-related issues, ensuring stable production environments with minimal downtime.

• Configured and managed virtual machines for production and non-production environments, ensuring robust resource management and high performance in cloud setups.

• Utilized Git and Subversion for version control, managing code repositories and ensuring collaboration across development teams while maintaining code quality and integrity.

• Migrated virtual hosts from Apple infrastructure to the AWS cloud and configured core services including EC2, S3, VPC, RDS, and IAM.

• Configured VPC networks with public/private subnets, and established monitoring through CloudWatch.

• Coordinated with cross-functional teams to manage infrastructure services, supporting the rollout of new HSM features in line with agile practices.

• Documented comprehensive runbooks for first-level support teams, facilitating efficient knowledge transfer and operational readiness.

• Guided support team members on HSM feature updates and deployment activities, promoting continuous learning and process improvement.

• Engaged with third-party vendors to onboard HSM applications from development through infrastructure provisioning, ensuring seamless integration.

• Worked with onsite and offshore teams during Build, Test, and Deploy phases to consistently achieve project delivery timelines.

• Knowledge of deploying and automating HPC clusters using Kubernetes, Terraform, and Ansible.

MediaKind

DevOps SRE

Apr 2021 - Nov 2021

Santa Clara, CA

• Managed a cloud-native application deployment on Microsoft Azure, optimizing resource allocation and ensuring high availability while implementing robust security protocols in compliance with industry standards.

• Automated infrastructure provisioning and application deployment using Bash and Python scripts, collaborating with QA teams to validate functionality and performance of set-top box applications in CI/CD pipelines within dynamic environments.

• Enhanced system observability by integrating Grafana and Prometheus for real-time monitoring and alerting, facilitating advanced log aggregation and data visualization to enable proactive incident response and system optimization.

• Provided on-call support and engaged directly with clients to formulate and execute remediation strategies post-incident, ensuring minimal downtime and improved service reliability.

• Facilitated regular sync meetings with clients and coordinated with field engineers to troubleshoot infrastructure issues, refining cloud deployment methodologies and enhancing overall system architecture.

• Delivered comprehensive client status reports and conducted delivery assurance reviews for CI/CD pipeline releases, leading to improved service reliability and faster incident resolution processes.

Ericsson

SRE

Feb 2020 - Feb 2021

Santa Clara, CA

• Executed application deployments for each release cycle while managing the Method of Procedure (MOP), and conducted network trace analysis using Bastion to identify and resolve connectivity issues.

• Configured and optimized Prometheus for comprehensive monitoring, fine-tuning OpsGenie alerts to ensure timely notifications for critical application performance metrics.

• Performed log analysis with Google Stackdriver, establishing proactive alerting mechanisms for application monitoring, and implementing optimizations to troubleshoot and enhance system performance.

• Managed continuous integration and continuous delivery (CI/CD) pipelines for application deployments, and automated configuration management and orchestration.

• Developed a centralized team dashboard using Zendesk for incident tracking and utilized Jira for defect management and change tracking, enhancing team collaboration and workflow efficiency.

• Utilized Git, Maven, Docker, Kubernetes, and Jenkins to automate build and deployment processes, improving software delivery speed and reliability.

• Ensured service reliability and availability by providing 24/7 operational support and effectively managing incidents on a rotational basis.

MediaKind

SRE

Apr 2018 - Feb 2020

Santa Clara, CA

• Served as the Cloud Administrator for MediaKind services hosted in the cloud, managing IPTV and set-top box environments, including the configuration and maintenance of virtual machines, storage accounts, resource groups, and access control policies.

• Supported development teams in daily operations within the cloud environment, utilizing Prometheus and Nagios for server performance monitoring, and employing Grafana and Kibana for log aggregation and data analysis.

• Managed incident response and bug tracking processes, providing technical support, diagnostics, and follow-up with customers while collaborating closely with field engineers to resolve issues efficiently.

• Delivered comprehensive monthly and weekly reports to clients, conducted delivery assurance reviews for CI/CD pipeline releases, and implemented rapid response strategies to minimize service outages.

• Enhanced automation and streamlined processes by collaborating with the team to develop scripts in Bash and Python, improving operational efficiency and reducing manual intervention.

• Ensured continuous service availability by providing 24/7 on-call support on a rotational basis, while also mentoring both on-site and offshore team members to foster skill development and knowledge sharing.

Ericsson

SRE

Oct 2014 - Apr 2018

Santa Clara, CA

• Directed Media First Service Requests, including user provisioning, client invites, environmental requests, and deployments, among many other client requirements.

• Facilitated high-priority bridge moderation, live service monitoring, heads-up displays, manual service checks, and customer escalations to proper resources, ensuring adherence to the service-level agreement and minimizing downtime.

• Refined and improved internal tools and processes to aid service availability and performance.

Microsoft Corporation

Operations Engineer

Apr 2013 - Oct 2014

Mountain View, CA

• Monitored critical online services using Nagios, Cacti, Check, and SCOM, and managed customer incidents from initial call to resolution, improving server, network device, and application reliability and performance.

• Applied crisis management skills to coordinate complex incidents and outages, and documented troubleshooting guides and recovery procedures for future service improvements.

• Oversaw and tracked change requests in the operations center while also suggesting enhancements for tools and automation.

Education

Bengaluru University, India

Bachelor, Electronics & Telecommunications

Mar 2000
