Senior SRE - DevOps & Incident Manager

Location:

Hyderabad, Telangana, India

Posted:

December 11, 2025


Resume:

RADHIKA K

Email: ************@*****.***

Phone: 407-***-****

LinkedIn: Radhika Kailasapu

PROFESSIONAL SUMMARY

Dedicated Site Reliability Engineer (SRE) & Support Analyst with over 9 years of IT experience, including 7 years of specialized expertise in DevOps and SRE roles, focused on ensuring the stability, performance, and high availability of critical, high-volume production systems.

Expert in Incident Triage & Management, quickly identifying, isolating, and initiating resolution protocols for issues across applications, cloud infrastructure (AWS/Azure), and database tiers.

Strong technical problem solver with a disciplined, analytical approach to complex, ambiguous production failures, consistently driving issues to Root Cause Analysis (RCA) and permanent resolution.

Proficient in deep-dive analysis using advanced monitoring & logging tools (Splunk, AWS CloudWatch, Azure Monitor, Prometheus, Grafana) to extract insights from massive datasets and establish proactive alerting thresholds.

Skilled in continuous process improvement, specializing in developing automation scripts (Python, PowerShell, Bash) to eliminate recurring issues and reduce manual intervention in support operations.

Proven background in supporting enterprise CI/CD pipelines (Azure DevOps, Jenkins, GitHub Actions) and acting as the final technical gatekeeper for all production releases and deployments.

Extensive experience with Microsoft Azure (VMs, AKS, App Services, VNets) and AWS (EC2, S3, VPC, CloudFormation), enabling full-stack troubleshooting in hybrid and multi-cloud environments.

Strong technical aptitude for understanding and troubleshooting code-level issues in applications built on .NET and Java, facilitating effective escalation and collaboration with development teams.

Hands-on experience with configuration management tools (Ansible, Chef, Puppet) to ensure environment consistency and simplify the identification of configuration drift as a root cause of incidents.

Experienced in supporting database performance & connectivity issues across MySQL, Oracle, and SQL Server environments.

Demonstrated ability to handle high-severity incidents calmly and communicate clearly with both technical teams and business stakeholders.

Familiar with job scheduling concepts through experience with Autosys, aligning conceptually with Control-M requirements for batch process monitoring and troubleshooting.

Highly adaptable and comfortable learning diverse technical environments quickly, with a commitment to maintaining long-term stability, reliability, and system integrity.

Experienced in driving incident resolution by coordinating restoration steps and ensuring timely recovery of services while maintaining focus on service-level objectives (SLOs) and error budgets.

TECHNICAL SKILLS

Cloud Platforms: AWS (EC2, S3, ELB, VPC), Microsoft Azure (IaaS/PaaS, VMs, AKS, VNets, App Services, Traffic Manager)

DevOps & SCM: CI/CD (Azure DevOps Pipelines, Jenkins, IBM UDeploy), Containerization (Docker, Kubernetes), Git, TFS, Configuration Management (Ansible, Chef, Puppet)

Databases & Tools: MySQL, Oracle, DynamoDB, SQL Server, JIRA (Incident Tracking), Autosys (Job Scheduling)

Scripting & Automation: Python, Bash Shell, PowerShell, Perl, YAML, JSON, JavaScript, Infrastructure as Code (Terraform, ARM Templates)

Monitoring & Logging: Splunk, AWS CloudWatch, Grafana (Conceptual), Prometheus, Nagios, Datadog, AppDynamics, Azure Application Insights, Log Analytics

Production Support: Incident Management, Root Cause Analysis (RCA), Troubleshooting, Escalation Procedures, Stakeholder Communication, Change Management, Release Management

PROFESSIONAL EXPERIENCE

Client: Emigrant Bank, New York, NY March 2023 – Present

Role: DevOps/Cloud Engineer (Focus: Site Reliability & Automation)

Project Summary: Enterprise Application Stability and Cloud Migration. Responsible for the stability and support of core banking applications during a major transition from on-prem to Azure cloud infrastructure. Primary focus was on minimizing user impact during deployments, implementing proactive monitoring, and acting as the first line of defense for production incidents, requiring deep technical troubleshooting across network, application, and database tiers.

Responsibilities

Led initial L2 Triage for over 90% of all production incidents related to Web Apps, APIs, and critical Azure Services, consistently achieving isolation and mitigation within the target resolution window.

Executed systematic Root Cause Analysis (RCA) for complex, high-severity incidents, documenting failure patterns across Azure network, VM, and application layers to prevent recurrence.

Implemented and managed Proactive Monitoring using App Insights, Log Analytics, and Azure SQL DB metrics, establishing performance baselines and configuring critical alerts to pre-emptively detect system degradation.

Streamlined the Escalation Process by providing detailed technical notes, necessary log snippets, and preliminary findings to L3 teams, reducing the hand-off time by an estimated 30%.

Maintained and supported robust Azure DevOps Pipelines, acting as the final quality control point and primary troubleshooter for all application releases to Dev, UAT, and Production environments.

Automated infrastructure provisioning using ARM Templates and PowerShell, reducing the manual steps in provisioning new environments by 40% and improving consistency for easier troubleshooting.

Performed Advanced Troubleshooting utilizing Azure CLI and PowerShell to analyse VM health, connectivity issues, and intricate Azure AD integration problems for critical banking applications.

Partnered with ITIL process teams and diverse stakeholders to enhance environment stability and operational efficiency.

Ensured System Hardening by actively managing Network Security Groups (NSGs) and VNet configurations, ensuring secure and controlled application access while troubleshooting communication blocks.

Applied working knowledge of Java and .NET frameworks to diagnose and resolve application-level problems.

Utilized SQL and Oracle knowledge to optimize queries, maintain data integrity, and support application performance.

Supported Java and .NET application stability by monitoring resource utilization in Kubernetes (AKS) and mitigating containerization-related performance bottlenecks post-deployment.

Managed and maintained Windows and Linux servers across distributed environments, ensuring performance, security, and uptime.

Analyzed distributed application behaviour and dependencies to troubleshoot complex issues effectively.

Coordinated critical change deployments and maintenance windows with business stakeholders, ensuring clear communication and minimal service interruption.

Environment & Tools: Microsoft Azure (ARM, AKS, VNets, NSGs, App Insights, SQL DB), Azure DevOps Server (CI/CD, Pipelines, Repos), Jenkins, IBM UDeploy, Docker, Kubernetes, SonarQube, Fortify, JFrog Artifactory, .NET, Visual Studio.

Client: United Health Group, Basking Ridge, NJ Dec 2020 – Feb 2023

Role: DevOps/AWS Engineer (Focus: Service Transition and Incident Resolution)

Project Summary: CI/CD Transformation and Multi-Cloud Support. Provided expert-level support during a large-scale DevOps and CI/CD pipeline modernization initiative, migrating from older TFS versions to VSTS/Azure DevOps. Key responsibilities included administering, stabilizing, and troubleshooting application deployments across both Azure and AWS platforms, ensuring system integrity and performance post-migration.

Responsibilities

Employed advanced AWS CloudWatch solutions and Azure Monitor to track, report, and analyze system performance metrics, successfully diagnosing intermittent latency issues across multi-cloud deployments.

Designed and implemented a robust SCM process using VSTS/TFS to track code and environment changes, which was critical for rapid rollback and accurate Root Cause Analysis (RCA) during service interruptions.

Consistently resolved integration errors, build failures, and deployment conflicts for both .NET and Java applications within the CI/CD pipeline, maintaining a 99.5% pipeline success rate.

Managed container runtime environments (Docker/Kubernetes), providing L2 support for containerized application deployment and networking issues to maintain service availability.

Authored and used PowerShell scripts and Ansible playbooks to automate repetitive server configuration and patch management, reducing manual effort and minimizing configuration-related support tickets.

Provided direct support for the migration of TFS instances (2013 to 2017 to VSTS), specializing in troubleshooting configuration and data integrity post-upgrade to ensure uninterrupted developer productivity.

Coordinated with development teams to troubleshoot code-related issues identified during the build and deployment stages, facilitating rapid hotfixes and patches.

Acted as a primary resource for managing the AWS environment, specifically focusing on troubleshooting EC2 instance health, S3 access policies, and ELB load balancing configurations.

Tracked and documented all production issues and resolutions in JIRA, maintaining a centralized knowledge base that reduced the time-to-resolve for common incidents by 15%.

Environment & Tools: Microsoft Azure (ARM, VMs, VNets, AAD, App Services, Site Recovery), Azure DevOps/VSTS/TFS (2013–2017), Jenkins, Docker, Kubernetes, Helm, Maven, Ansible, Octopus Deploy, PowerShell, Git/GitHub, .NET, SQL Server.

Client: Micro1, FL Jan 2019 – Nov 2020

Role: DevOps/SCM Engineer (Focus: Environment Standardization & Log Management)

Project Summary: Automation-Driven Operational Improvement. Focused on standardizing server environments and automating manual administrative tasks to reduce human error and improve the reliability of the Test and Production environments. Played a critical role in integrating log analysis and monitoring tools to gain deeper insight into application behaviour and incident tracking.

Responsibilities

Led the implementation and administration of Splunk and Nagios for remote monitoring of Servers, Applications, and Databases, establishing centralized logging and enabling proactive alerting.

Used Puppet and Chef to enforce configuration management across Linux and Windows servers, eliminating configuration drift and simplifying troubleshooting efforts in critical production environments.

Provided L2 support for environment automation and scheduling using Autosys, acting as the primary point of contact for batch job failures and resolution.

Developed robust automation scripts using Perl and Shell for customizing the software release process and automating infrastructure tasks, significantly reducing deployment time and manual effort.

Participated in formal Incident and Release Management, handling troubleshooting requests across the environment and coordinating package releases under strict change control procedures.

Supported the use of Terraform in AWS VPC to automatically set up and modify cloud settings, ensuring environment repeatability and simplifying the recovery process.

Environment & Tools: Microsoft TFS 2013/2015, Git, GitHub, Subversion, Jenkins, Octopus Deploy, Rundeck, Maven, Ant, Puppet, Chef, Ansible, Terraform, Perl, Shell, PowerShell, Autosys, AWS (VPC), Splunk, Nagios, JIRA, Linux, Windows, Oracle.

Client: Genesis Infocom, India Dec 2015 – Sep 2018

Role: Build and Release Engineer (Focus: Process Documentation & Support Escalation)

Project Summary: Standardizing ALM and CI/CD Practices. Focused on establishing foundational Application Lifecycle Management (ALM) processes and standardizing CI/CD pipelines to create a repeatable, supportable framework for software delivery. Served as the subject matter expert (SME) for all TFS-related issues and documentation.

Responsibilities

Served as the Primary Lead for TFS Administration and R&D Infrastructure support, providing immediate technical resolution for issues such as Code Merging, Access Forbidden errors, and build failures.

Created standard process and procedure documentation for all TFS Build and deployment processes, training developers and managers on best practices to reduce recurring support requests.

Implemented and maintained end-to-end automation pipelines using Jenkins and Octopus Deploy, ensuring scheduled builds and deployments were stable and reliable across all environments.

Customized TFS process templates and build scripts, serving as the subject matter expert to ensure the ALM system supported the business's workflow requirements and minimized user friction.

Performed backup/recovery operations for TFS and related systems, guaranteeing data integrity and minimizing downtime in the event of a system failure.

Environment & Tools: Microsoft TFS 2013, Jenkins, Octopus Deploy, Maven, Ant, Git, Shell Scripting, Windows Server, SQL Server, Agile/Scrum.

EDUCATION

Bachelor of Engineering, Electrical and Electronics (EEE)

Sree Kavitha Engineering College, Khammam, Telangana — 2005–2009
