Observability Engineer

Location:

Harrison, NJ

Salary:

100000

Posted:

March 24, 2025


Resume:

Kasthur Reddy Padarthi

Email:**********************@*****.*** Harrison, NJ, 07029 Ph: 646-***-****

Observability Specialist

Professional Summary:

Observability Engineer and SRE with over 8 years of experience designing, deploying, and managing scalable monitoring solutions for cloud-based and containerized environments. Proficient in observability tools such as Datadog, Prometheus, Grafana, ELK Stack, and OpenTelemetry, with extensive experience in container orchestration (Docker, Kubernetes, OpenShift) and in CI/CD and infrastructure-as-code tooling (Jenkins, ArgoCD, Terraform). Adept at defining and meeting Service-Level Objectives (SLOs) and Service-Level Indicators (SLIs), developing real-time dashboards for actionable insights, and optimizing log management workflows. Proven track record of improving system reliability, automating monitoring pipelines, and fostering collaboration between DevOps and engineering teams to align observability strategies with business goals.

●Designed and implemented end-to-end observability solutions using tools like Datadog, ELK Stack, Prometheus, and Grafana to monitor multi-cloud and Kubernetes-based infrastructure.

●Developed real-time dashboards to track KPIs such as CPU, memory usage, request latency, and system health, providing actionable insights to stakeholders.

●Automated observability pipelines using Terraform, Ansible, and CI/CD tools like Jenkins, integrating Docker and Helm for streamlined deployments.

●Migrated legacy monitoring systems to cloud-native platforms, ensuring seamless scalability and visibility for mission-critical applications.

●Optimized log ingestion pipelines and distributed tracing with OpenTelemetry and Elastic Stack, improving troubleshooting and reducing resolution times.

●Defined SLOs and SLIs for critical applications, ensuring alignment with organizational reliability and performance objectives.

●Conducted RCA for critical incidents and implemented proactive monitoring enhancements to prevent recurrence, driving system resilience.

●Delivered training sessions on observability tools such as Datadog, Prometheus, and Grafana, promoting a culture of proactive monitoring across teams.

Technical Skills

Cloud-Based Environments: Amazon Web Services (AWS), Microsoft Azure

Configuration Management: Chef, Puppet

Build Tools: Maven, TeamCity

CI/CD Tools: Jenkins, Kubernetes, Docker

Monitoring Tools: Prometheus, Grafana, Datadog, Nagios

Container Tools: Docker, Kubernetes, OpenShift

Scripting Languages: Java, HTML, Python, YAML

Version Control Tools: Git, Bitbucket

Operating Systems: Windows, UNIX, Linux

Databases: SQL Server, MySQL, Oracle, NoSQL, Cassandra, PostgreSQL

Change Management: Remedy, ServiceNow

Education:

Bachelor's degree in Computer Science and Engineering from JNN Institute of Engineering, Chennai

Professional Experience:

State Farm – Illinois May 2021 – Present

Senior Observability Engineer

Responsibilities:

●Designed and deployed scalable observability solutions to monitor microservices, APIs, and cloud-based infrastructure in collaboration with cross-functional teams.

●Contributed to the integration of observability practices into DevOps workflows, enhancing system reliability and team efficiency.

●Implemented advanced log management strategies using tools like Splunk and the ELK Stack to centralize and analyze application logs effectively.

●Built and maintained real-time dashboards using Datadog, focusing on system health, key performance indicators (KPIs), application performance, and user experience metrics.

●Automated monitoring and alerting workflows with tools like Jenkins, ensuring timely detection and resolution of system issues.

●Collaborated with development and operations teams to standardize logging, tracing, and monitoring practices across the organization.

●Supported technical leadership during critical incidents, contributing to swift resolutions and documenting lessons learned for future improvements.

●Participated in evaluating and onboarding new platform and deployment tools, including OpenShift (Kubernetes), ArgoCD, and Helm, to improve system visibility and support organizational growth.

●Supported the implementation of end-to-end observability for mission-critical applications in a multi-cloud environment, utilizing Docker for containerized deployments.

●Played a significant role in migrating legacy systems to cloud-native observability platforms, ensuring seamless monitoring during the transition.

●Helped develop and promote best practices for monitoring, logging, and dashboarding, fostering collaboration between engineering and operations teams.

●Conducted training sessions to upskill teams on observability tools like Datadog, promoting a culture of proactive system monitoring and KPI tracking.

Myntra – BLR Aug 2018 – Apr 2021

APM Specialist/SRE

Responsibilities:

●Collaborated with teams to design and implement observability frameworks for Myntra’s large-scale e-commerce platform, ensuring system availability and optimal performance during peak sales events.

●Worked on deploying and maintaining tools like Prometheus, Grafana, Elastic Stack (ELK), and OpenTelemetry to collect and analyze application metrics, logs, and traces across distributed systems.

●Assisted in building and optimizing logging pipelines to handle high-volume transactional data, enabling real-time insights into application and infrastructure performance.

●Partnered with engineering and DevOps teams to define Service-Level Objectives (SLOs) and Service-Level Indicators (SLIs) for critical services, ensuring alignment with business goals.

●Contributed to the integration of distributed tracing tools, including Jaeger and OpenTelemetry, to monitor and debug microservices within a Kubernetes-based architecture.

●Supported the deployment and management of scalable applications using Docker and Helm within containerized environments.

●Automated alerting workflows with tools like PagerDuty and Opsgenie, improving incident response and reducing resolution times.

●Helped create custom dashboards in Grafana and Datadog, providing visibility into system health, user behavior, and sales trends during key events.

●Participated in conducting Root Cause Analyses (RCA) for incidents, helping identify areas for improvement in monitoring coverage and system reliability.

●Assisted in monitoring cloud-based infrastructure on AWS, leveraging CloudWatch, Lambda, DynamoDB Insights, and tools like ELK Stack for resource tracking and cost optimization.

L&T Finance – BLR May 2016 – Jul 2018

Cloud Monitoring Engineer

Responsibilities:

●Collaborated on the design and deployment of observability solutions for cloud infrastructure, enabling real-time monitoring, logging, and tracing across AWS.

●Configured and managed cloud-native monitoring tools, including AWS CloudWatch and Prometheus, to ensure system reliability and performance.

●Implemented Prometheus to monitor and collect metrics from cloud-based infrastructure components, such as AWS EC2 instances, Kubernetes clusters, databases, and networking, ensuring system health and performance.

●Developed and maintained real-time Grafana dashboards to monitor key performance indicators (KPIs) like CPU, memory, disk usage, and request latency across cloud infrastructure.

●Automated the deployment of observability pipelines using Terraform, Ansible, and CI/CD tools, integrating Docker to streamline processes and improve operational efficiency.

●Collaborated with teams to define Service-Level Objectives (SLOs) and Service-Level Indicators (SLIs), ensuring cloud services met organizational performance and reliability goals.

●Optimized log ingestion and storage using tools like Elastic Stack (ELK) and Fluentd, ensuring efficient log management across multiple environments.

●Monitored and resolved performance bottlenecks in cloud systems, utilizing Prometheus and Grafana to ensure seamless scalability and reliability.

●Created custom dashboards in Grafana and Datadog, backed by Prometheus metrics, to visualize key metrics and provide actionable insights for stakeholders.

●Participated in conducting Root Cause Analyses (RCA) for incidents and implemented proactive monitoring enhancements to prevent recurrence.
