Site Reliability Incident Management

Location:
Prosper, TX
Posted:
June 30, 2025

Resume:

ARUN KUMAR KODALI

Email: ************@*****.***

Mobile: 346-***-****

Experienced Lead SRE with DevOps expertise in building and maintaining highly reliable, scalable, and secure systems and in managing engineering teams. Proficient in incident management and observability across cloud and on-premises environments, and in designing, developing, and optimizing large-scale distributed systems. Skilled at automating infrastructure, improving observability, and enhancing system resilience, with a proven record of reducing operational risk, optimizing performance, and ensuring seamless deployments in cloud-native environments. Demonstrated leadership in managing complex technical issues, mentoring engineers, and driving initiatives that improve platform reliability and stability. Collaborates across teams to improve service reliability, track performance, and resolve critical issues with minimal business impact. Strong technical skills in cloud platforms, containerization, and observability tooling, with a passion for optimizing performance and reducing toil.

Technical Skills

Cloud Platforms : AWS (Primary), Azure, GCP (Exposure)

Infrastructure as Code : Terraform, AWS CloudFormation, Ansible, GitHub Actions (Infrastructure Automation)

Containerization and Orchestration : Docker, Kubernetes, Amazon ECS

Build Tools : Jenkins, GitHub Actions, AWS CodePipeline

Deployment & Release Engineering : AWS CodeDeploy, CloudBees, Helm, Argo CD

DNS & Traffic Management : Cloudflare, Amazon Route 53, Azure DNS Zones, Fastly

Monitoring & Observability : Amazon CloudWatch, Grafana, Prometheus, Datadog, Splunk, Dynatrace, New Relic, Kibana, Azure Monitor, Loki

Logging & Telemetry : CloudWatch Logs, Application Insights, Splunk

APM Tools : AppDynamics, Raygun, and Site24x7

Databases : PostgreSQL, MySQL, Oracle, Microsoft SQL Server, Amazon RDS, DynamoDB, Redshift, Hive, HBase, Vertica

Scripting & Automation : Bash, Python

Architecture & Design Tools : ERwin Data Modeler, Microsoft Visio

Version Control & Collaboration : Git, GitHub

Big Data & Streaming Platforms : Hadoop (HDP, CDH, MapR), Apache Spark, Kafka, Flink, Amazon EMR

Networking & Security : TCP/IP, TLS, DNS, Load Balancers, VPC, IAM, Security Groups, Prisma Cloud, AWS Audit Manager, Azure Security Center, Ansible Tower, Cloudflare Security, Microsoft Defender, Azure Active Directory, AWS Config, Rapid7 InsightVM

Professional Experience

Wati, Frisco, TX 06/2024 –

Role: Lead Site Reliability Engineer

Led SRE initiatives, ensuring high availability, performance, and security of public cloud and on-premises environments.

Designed, built, and maintained scalable and reliable platform infrastructure. Established operational standards and best practices for site reliability, ensuring high availability and performance. Developed and executed disaster recovery and business continuity plans.

Designed robust alerting workflows to enable timely incident response and minimize downtime. Analyzed operational data to identify areas for platform optimization.
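
The threshold-based alerting workflows described above can be sketched as follows — a minimal illustration only; the metric names, thresholds, and severities are hypothetical placeholders, not the actual production rules:

```python
# Minimal sketch of a threshold-based alerting rule, as used in
# alerting workflows for timely incident response.
# Metric names and thresholds here are hypothetical examples.

def evaluate_alerts(samples, rules):
    """Return a list of alerts for samples that breach their rule's threshold.

    samples: dict of metric name -> latest observed value
    rules:   dict of metric name -> (threshold, severity)
    """
    alerts = []
    for metric, (threshold, severity) in rules.items():
        value = samples.get(metric)
        if value is not None and value > threshold:
            alerts.append({"metric": metric, "value": value, "severity": severity})
    return alerts

rules = {
    "cpu_utilization_pct": (90.0, "page"),     # page on-call above 90% CPU
    "p99_latency_ms":      (500.0, "ticket"),  # open a ticket above 500 ms p99
}
samples = {"cpu_utilization_pct": 97.2, "p99_latency_ms": 120.0}
print(evaluate_alerts(samples, rules))
```

In practice the same rule shape maps onto Prometheus alerting rules or CloudWatch alarms, with routing by severity handled by the paging tool.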

Led the investigation and resolution of production incidents, reducing mean time to resolution (MTTR). Developed and maintained incident response playbooks and escalation procedures. Partnered with engineering teams to proactively mitigate risks and enhance system resilience.

Developed and implemented automated testing frameworks to ensure software quality and reliability. Identified process inefficiencies and introduced automation solutions to streamline operations. Championed continuous improvement initiatives to enhance system performance and reduce manual effort.

Led a platform reliability and observability initiative that reduced downtime by 30% through automated failover mechanisms.

Expert in installing and administering Kubernetes, Docker, Hadoop, and database clusters in cloud and on-premises environments.

Orchestrated a cross-functional initiative that improved system latency by 30%, enhancing the user experience for core services on AWS by integrating comprehensive monitoring tools such as Prometheus and Grafana.

Led the successful migration of critical systems to a Kubernetes-based architecture, resulting in a 20% improvement in deployment efficiency and a 15% reduction in resource costs by leveraging containerization.

Implemented a centralized logging solution using ELK Stack, allowing for real-time analysis of system logs and a 25% quicker response to anomalies.
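
A sketch of the kind of log analysis a centralized logging stack such as ELK enables — scanning structured log lines and flagging sources whose error count exceeds a limit. The log format and service names are hypothetical:

```python
# Minimal sketch of centralized-log anomaly detection: parse
# structured log lines and flag services with elevated ERROR counts.
# Log format and service names are hypothetical placeholders.
import collections
import re

LOG_PATTERN = re.compile(r"^(?P<ts>\S+) (?P<level>\w+) (?P<service>\S+) (?P<msg>.*)$")

def error_hotspots(lines, limit=2):
    """Return services whose ERROR count exceeds `limit`."""
    counts = collections.Counter()
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m and m.group("level") == "ERROR":
            counts[m.group("service")] += 1
    return {svc: n for svc, n in counts.items() if n > limit}

logs = [
    "2024-01-01T00:00:01 INFO checkout started",
    "2024-01-01T00:00:02 ERROR checkout db timeout",
    "2024-01-01T00:00:03 ERROR checkout db timeout",
    "2024-01-01T00:00:04 ERROR checkout db timeout",
    "2024-01-01T00:00:05 ERROR search cache miss storm",
]
print(error_hotspots(logs))  # checkout exceeds the limit; search does not
```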

Refined the deployment process by automating workflows with Jenkins, cutting down release cycle times by 30% and facilitating continuous integration and delivery.

Crafted robust backup and recovery strategies that reduced potential data loss risk by 50%, ensuring high availability of content streaming services.

Supervised DevOps and Operational Engineering teams, fostering career growth and knowledge sharing, which reduced churn by 15% and increased team productivity by 25%.

Designed and conducted resilience testing workshops, reducing critical system vulnerabilities by 45% and ensuring systems exceeded the targeted 99.99% uptime.

Automated infrastructure provisioning with Terraform, slashing time spent on manual processes by 60%, resulting in a more agile deployment cycle and rapid scalability.
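
Terraform provisioning of the sort described above replaces manual console work with declarative resources. A hypothetical minimal fragment (resource names, the AMI ID, and CIDR blocks are placeholders, not the actual configuration):

```hcl
# Hypothetical minimal Terraform sketch of automated provisioning;
# names, AMI ID, and CIDR ranges are placeholders.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "app" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI
  instance_type = "t3.medium"
  subnet_id     = aws_subnet.app.id

  tags = {
    Name = "app-server"
  }
}
```

Because the configuration is version-controlled, `terraform plan` previews every change before `terraform apply` makes it, which is what removes the slow, error-prone manual steps.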

Mphasis Corp, Frisco, TX 08/2022 – 05/2024

Role: Site Reliability Engineer

Led 24x7 production support for complex distributed systems, proactively identifying and resolving performance and availability issues across applications, infrastructure, and networks.

Served as incident commander, leading cross-functional war rooms involving executives (VPs/SVPs), SMEs, and engineers to resolve critical outages and degradations.

Utilized advanced observability tools (Splunk, AppDynamics, Grafana, ThousandEyes, RED metrics) to proactively monitor system health and detect anomalies.

Correlated multi-source data including logs, alerts, dashboards, network traces, and recent deployment changes to rapidly identify root causes and restore service.

Performed diagnostics across the full stack: container platforms (Kubernetes, Docker), databases (Oracle, SQL Server, MongoDB, Cassandra), infrastructure and cloud environments.

Conducted synthetic monitoring and UEM configuration to improve end-user experience tracking and performance measurement.

Implemented self-healing and automated response playbooks using ServiceNow AIOps tools.
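
The self-healing pattern behind such playbooks — check health, attempt a bounded number of remediations, escalate to a human if the service stays unhealthy — can be sketched as below. The check, remediate, and escalate callables are hypothetical placeholders; ServiceNow AIOps specifics are omitted:

```python
# Minimal sketch of an automated-response ("self-healing") loop:
# run a health check, attempt bounded remediation, escalate on failure.
# The callables below are hypothetical placeholders for real actions
# (e.g. restarting a pod, paging the on-call engineer).

def self_heal(check, remediate, escalate, max_attempts=3):
    """Return True if the service ends healthy; escalate after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        if check():
            return True
        remediate(attempt)  # e.g. restart the pod, clear a stuck queue
    if check():
        return True
    escalate()  # page the on-call engineer
    return False

# Simulated service that recovers after one remediation.
state = {"healthy": False, "pages": 0}
self_heal(
    check=lambda: state["healthy"],
    remediate=lambda attempt: state.update(healthy=True),
    escalate=lambda: state.update(pages=state["pages"] + 1),
)
print(state)
```

Bounding the attempts matters: unbounded auto-remediation can mask a real failure, so the loop always falls through to a human escalation.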

Contributed to a culture of reliability through rigorous postmortem analysis and continuous improvement of incident response processes.

eSimplicity Inc, Houston, TX 11/2021 to 12/2021

Role: Senior Platform Administrator

Managed infrastructure on AWS, including EC2, S3, RDS, and VPC, ensuring high availability, scalability, and security.

Automated infrastructure provisioning and configuration using Terraform and Ansible, reducing deployment time by 50% and improving infrastructure consistency.

Implemented centralized logging and log analysis using the ELK Stack and Splunk, improving troubleshooting and monitoring capabilities.

Worked closely with development, Production Support, Infrastructure teams to implement performance monitoring and optimization strategies.

Collaborated with security teams to implement and maintain security controls and ensure compliance with industry standards.

Conducted disaster recovery planning and testing exercises to ensure business continuity.

Spearheaded infrastructure automation using Terraform and Ansible, reducing deployment time by 40%.

Designed Kubernetes clusters for microservices, enhancing scalability and resource utilization.

Implemented robust CI/CD pipelines, increasing deployment frequency and reducing downtime.

Developed security automation scripts to enforce compliance and mitigate vulnerabilities.

Monitored application performance and optimized cloud resources to improve cost efficiency.

CVP Corp, Houston, TX 04/2021 to 10/2021

Role: Technologist

Architected, installed, configured, customized, integrated, and supported the Ambari-managed Hadoop cluster, overseeing both internal and external applications.

Managed the administration, provisioning, and configuration of EC2 instances, S3 buckets, IAM roles, policies, encryption, EMR clusters, Hive, Athena views, and Redshift DB.

Established and maintained a Docker container cluster managed by Kubernetes, utilizing Linux, Bash, Git, and Docker. Employed Kubernetes and Docker as the CI/CD system's runtime environment to facilitate building, testing, and deployment.

Designed and migrated the existing streaming data pipeline to the Databricks platform, resulting in a 30% reduction in data processing time and improved data accuracy.

Implemented cloud services across IaaS, PaaS, and SaaS models, including OpenStack, Docker, and OpenShift.

Utilized Kubernetes for orchestrating the development, scaling, and management of Docker containers.

Created Ansible Playbooks for deploying OS patching components as per requirements.
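
An OS-patching playbook of the kind described might look like the following hypothetical sketch — the host group and rolling-update batch size are placeholders:

```yaml
# Hypothetical sketch of an OS-patching playbook; the host group
# and batch size are placeholders, not the actual configuration.
- name: Apply OS security patches
  hosts: hadoop_nodes
  become: true
  serial: 2            # patch two nodes at a time to preserve availability
  tasks:
    - name: Update all packages to the latest security release
      ansible.builtin.yum:
        name: "*"
        security: true
        state: latest

    - name: Reboot the node to pick up kernel updates
      ansible.builtin.reboot:
        reboot_timeout: 600
```

The `serial` keyword is what makes this safe for a live cluster: nodes are drained and patched in small batches rather than all at once.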

Conducted security scans of systems, addressing and remediating vulnerabilities. Automated system security scans using DevOps tools.

Optimized and fine-tuned Hive queries, integrated SAS with Hadoop, performed backups, documented processes, and analyzed costs to minimize expenses.

Provided stakeholders with secure, resilient, cost-efficient cloud solutions managed with operational excellence.

Provisioned EC2 instances using CloudFormation and Terraform for both master and slave nodes.

Monitored AWS CloudWatch, Datadog, Slack Channels, New Relic, and Ambari Dashboards for resource usage, job performance, and overall resource utilization.

Collaborated with application teams to install operating system and Hadoop updates, patches, and version upgrades when necessary.

Configured data encryption at rest, secured Ambari server web access with SSL/TLS and SSO.

Implemented data authorization policies by creating policies using Apache Ranger.

Managed CI/CD activities, conducted code reviews, and addressed issues reported by other organizations.

Contributed to sprint planning by creating user stories, defining tasks, and assigning story points.

ValueMomentum Inc, Pekin, IL 06/2018 to 05/2020

Role: Lead Hadoop Administrator

Estimated, planned, configured, and managed Hadoop environments.

Deployed and maintained large Hadoop clusters, configuring High Availability and Load balancing.

Commissioned and decommissioned slave nodes in the Hadoop cluster using Ambari.

Monitored Hadoop cluster connectivity and performance, managed and analyzed Hadoop log files, and oversaw the file system.

Implemented security measures using Kerberos, AD, LDAP, SSL, and SAML authentication.

Optimized YARN Capacity Scheduler queues for optimal resource utilization to meet SLAs.

Set up new users on the Hadoop cluster and assigned roles and YARN queues.

Addressed Service Desk request, problem, and incident management tasks within SLA.

Conducted necessary capacity planning and tracked usage using Ambari metrics.

Designed, developed, and tested Hadoop solutions to meet business customer needs.

Provided Hadoop coding support to ensure efficient execution of Hadoop jobs.

Assisted users with inquiries related to Hadoop coding issues and provided guidelines for best practices.

Oversaw new and existing administration of Hadoop infrastructure and performed data migrations between secured Hadoop clusters.

Offered consulting, guidance, and mentoring to business areas utilizing Hadoop for their processes.

Maintained documentation for best practices, standards, and procedures.

Performed on-call routines to support the Hadoop environment and coordinated with offshore teams.

Troubleshot issues with Hadoop users and researched/resolved production support issues as needed.

Planned and executed upgrades for Hadoop clusters from 2.X to 3.1.4 and Ambari from 2.6.X to 2.7.4.

Conducted vulnerability remediation, including security patching and upgrades.

Configured Disaster Recovery from on-premises HDFS to AWS S3 and EMR, performing recovery drills.

Built servers using AWS cloud services: imported volumes, launched EC2 instances, created security groups, and configured Auto Scaling, load balancers, Route 53, SES, and SNS within the defined Virtual Private Cloud (VPC).

Configured KMS for Data at Rest Encryption, SSL/TLS, and SSO.

Implemented Data governance policies on the Hadoop Cluster with Atlas.

Proactively worked with management for sprint planning, creating user stories, assigning story points, and planning iterations.

Collaborated and communicated with management, offshore and onsite teams, and stakeholders.

DXC Technologies 04/2005 to 07/2017

Role: Sr. Software Engineer

Led the supervision of Hadoop infrastructure implementation and ongoing administration, ensuring High Availability (HA) configuration.

Conducted Proof of Concepts (POCs).

Successfully implemented, configured, and maintained HDP and CDH clusters.

Managed databases including Vertica, MongoDB, Cassandra, MySQL, and Postgres.

Installed and configured HDP 2.2, monitoring pre-existing clusters and setting up Hortonworks platform clusters.

Proficiently managed Hadoop components such as HDFS, MapReduce, YARN, Sqoop, Spark, Kafka, Flume, Hive, Pig, HBase, Impala, NiFi, Oozie, Tez, Ranger, and Knox, as well as databases including Oracle, MS SQL, Vertica, MySQL, Postgres, MongoDB, and Cassandra.

Maintained active communication with development teams, participating in daily meetings, and addressing issues promptly.

Collaborated with data delivery teams to establish new Hadoop and Linux user accounts and conducted thorough testing of HDFS and Hive.

Deployed R statistical tools for statistical computing and graphics, performed cluster maintenance, and managed node additions and removals.

Ensured Hadoop cluster connectivity and security through vigilant monitoring and comprehensive log file reviews.

Executed file system management and monitoring, provided support for HA and HDFS, and fine-tuned Hadoop clusters and Hadoop MapReduce routines.

Fostered collaboration with infrastructure, network, database, application, and business intelligence teams, emphasizing data quality and availability while contributing to project scoping across design, development, testing, and implementation phases.

Worked closely with application teams to configure new applications and oversee updates, patches, and version upgrades within the Hadoop environment.

Certifications

Certified Kubernetes Administrator.

Google Professional Cloud Architect.

AWS Certified SysOps Administrator.

Hortonworks Certified Hadoop Administrator (HDPCA).

Oracle Certified Database Administrator.
