
SRE Leader - Observability, Toil Reduction, Automation

Location:
United States
Posted:
April 10, 2026


Resume:

Gregory Fletcher Jr.

+1-302-***-**** *********@*****.*** linkedin.com/in/gregeng1 github.com/mmxxtdmk

Site Reliability Engineering Leader

Site Reliability Engineering leader with nearly a decade of experience owning reliability for mission-critical systems at JP Morgan Chase & Co. Drove measurable improvements in observability, MTTR reduction, toil automation, and platform availability (up to 99.9%) across RPA and distributed Java environments. Completed JPMC’s accelerated Google SRE training pilot program. Excel at mentoring cross-functional teams, standardizing SRE practices, and delivering zero-downtime outcomes through automation and collaboration.

Areas of Expertise

Site Reliability Engineering (SRE), Observability & Monitoring, Incident Response & Blameless Post-Mortems, Toil Reduction and Automation, High-Availability & Zero-Downtime Systems, Error Budgets & SLOs, Technical Mentorship & Knowledge Sharing, Cloud & Distributed Systems, DevOps & CI/CD, Agile Practices

Professional Experience

JP Morgan Chase & Co.

Remote/Wilmington, DE

April 2018 – June 2025

Site Reliability Engineer: Robotics, UiPath/C#

May 2023 – June 2025

Reliability owner for 60+ production robotic process automations supporting critical operations. Drove SRE best practices across stability, observability, incident response, and toil reduction after completing JPMC’s accelerated Google SRE training pilot program with Google SRE trainers.

Owned observability and monitoring strategy for the RPA platform; collaborated with SRE teams to optimize tools and workflows, reducing MTTR by 20%, improving service resiliency, and minimizing operational risk for business-critical automations.

Developed and maintained runbooks and a Confluence knowledge base that accelerated team onboarding by 30%, fostered blameless post-mortems, and strengthened knowledge sharing to improve incident response and reduce repeat issues.

Mentored 5 cross-functional engineering and operations teams on SRE principles, including error budgets, toil automation, and operational excellence, boosting resiliency of mission-critical RPA systems, reducing toil by 35%, and enabling faster, more reliable delivery in dynamic production environments.

Spearheaded integration of enterprise observability tooling, standardized reliability patterns across business units, and promoted reusable services that improved platform availability to 99.9%, reduced downtime by 25%, and supported scalable adoption of RPA capabilities.

Site Reliability Engineer: Distributed Computing, Java

April 2019 – April 2023

Reliability owner for business-critical distributed applications. Focused on high-reliability system design, automation of operations, and enterprise-level observability to achieve zero-downtime goals. Completed JPMC’s accelerated Google SRE training pilot program and partnered with product teams to prioritize foundational reliability capabilities.

Standardized operating procedures for 25% of business-critical products, boosting operational efficiency by 20% through Agile processes and cross-functional collaboration.

Developed technical playbooks and centralized monitoring tools with incident response frameworks that unified practices across teams, accelerated SRE adoption by 40%, and strengthened platform availability, reducing repeat incidents by 25%.

Executed zero-downtime data center migrations, delivering 160+ hours of annual operational savings via automation and enhanced monitoring while validating technology feasibility for product roadmaps.

Site Reliability Engineer / DevOps Engineer: Java

April 2018 – April 2019

Designed and supported CI/CD pipelines, observability solutions, and high-availability environments for Java-based microservices serving global user bases. Collaborated with engineers and product owners to modernize legacy systems.

Pivoted operations from a service model to a product model, scaling and securing global business and technical processes while prioritizing platform backlogs and managing dependencies to enable 30% faster delivery cycles.

Designed and deployed Java microservices to cloud infrastructure using Spring, REST, and cloud services, replacing legacy systems and saving $100k+ annually.

Automated DevOps workflows in Linux environments with Jenkins, Bitbucket, and CI/CD tools, saving 10+ developer hours weekly.

Reduced incident detection time by 15% through Splunk and Kibana dashboards, delivering reusable observability capabilities across teams.

Education

Bachelor’s Degree in Engineering

2010 – 2011

Virginia Polytechnic Institute and State University

Java Full Stack Development - Zip Code Wilmington

Built and delivered a mobile full-stack application in a team of five using Java, Spring, AngularJS, SASS and Agile practices.

Technical Skills

Languages & Frameworks:

Java, Python, C#, Spring, REST, Node.js, JavaScript

DevOps & Infrastructure:

Jenkins, Bitbucket, CI/CD, Terraform, Docker, Kubernetes, Ansible, Git, YAML, Infrastructure-as-Code

Observability & Monitoring:

Splunk, Grafana, Kibana, Logstash, Fluentd

Cloud & Databases:

AWS, Apache Kafka, Tomcat, MongoDB, Cassandra, MS SQL Server, Oracle DB

Other:

Jira, Confluence, ServiceNow, Agile/Scrum, JUnit

Awards

Leadership of the Internal Big Data Community – JP Morgan Chase & Co., June 2022

Led cross-functional collaboration and best practice sharing to drive adoption of scalable data solutions across engineering teams.

Leadership of the Internal Machine Learning Community, July 2019

Founded the community and led knowledge sharing sessions and technical workshops to foster innovation and platform alignment.


