Sr Manager, Site Reliability Engineering

Company:

47 Technology/IT

Location:

Chicago, IL

Posted:

May 22, 2025

Description:

Job overview and responsibilities

As the Senior Manager of Site Reliability Engineering, you are responsible for guiding a team dedicated to the instrumentation and analysis of vital business applications, ensuring their availability, and contributing to major incident resolution and root cause analysis. You hold accountability for devising the strategy, as well as the assessment, deployment, and management of IT operations tools and methodologies. Your leadership role involves steering technical experts who specialize in evaluating enterprise reliability and enhancing system efficiency. Furthermore, you are tasked with forging and upholding robust connections with digital technology and business executives at all tiers. You leverage your profound technical knowledge and outstanding leadership and analytical abilities to lead your team towards creating highly available applications, adhering to best practices, and promoting evidence-based system optimization in partnership with development teams through modern DevOps practices.

Design, Develop & Drive Outcomes:

Understand the potential impact of system requirements and design choices across multiple cloud and on-premise technologies

Embrace the role of developing and mentoring the Site Reliability Engineering team, fostering expertise in this critical area

Guide the team to devise solutions that not only meet long-term objectives but also effectively address urgent technical debt

Position yourself as a prominent thought leader in Site Reliability Engineering Principles, influencing others through your knowledge and experience

Regularly disseminate best practices and champion process improvements, both within your team and in collaboration with other teams, to drive collective success

Program Management & Delivery:

Track the team’s progress on projects and key performance indicators, while also offering concrete, actionable suggestions for further enhancing or influencing product or project delivery

Encourage cross-functional collaboration and gather input from technology teams to promote ongoing program enhancement

Regularly provide insights on critical Site Reliability Engineering metrics to showcase the program’s achievements and identify potential areas for improvement

Keep an updated collection of materials to communicate the current status, including progress, obstacles, opportunities, and the program’s strategic direction to Digital Technology leaders

Effectively manage both internal and external relationships to foster and sustain beneficial strategic partnerships, thereby advancing the success of the Site Reliability Engineering Program

Develop and roll out training initiatives to ensure that partners are well-equipped to fully utilize Observability programs

Oversee the 24/7 command center teams, ensuring they are adept at early detection, triage, and recovery for all applications and services, which contributes to a reduced mean time to recovery

Talent Management and People Development:

Initiate and facilitate the performance assessment process for your team, fostering an environment that encourages individuals at all performance tiers to excel

Establish and nurture relationships with team members to create a foundation of trust, recognizing areas where technical or analytical skills are lacking and devising strategies for improvement

Regularly encourage team members to exchange expertise about Site Reliability Engineering practices and embrace new technologies

Lead and inspire teams to tackle intricate challenges and champion the use of open-source technologies and solutions

Organizational Effectiveness / People:

Lead by example, drawing on robust technical expertise, leadership qualities, and a proven track record in Site Reliability Engineering

Your proficiency in driving the creation of multi-cloud infrastructure serves as a benchmark and motivates the team of developers and infrastructure engineers

Collaborate with your engineers to manage project dependencies, adeptly negotiate and plan for incremental delivery milestones with stakeholders, and achieve on-time project completion

Work closely with product teams to understand and address their performance and resilience concerns, and formulate sustainable strategies to resolve persistent challenges

Engineering Excellence and Practices:

Continuously work on enhancing the reliability, stability, and performance of our digital platforms, being at the forefront of promoting engineering excellence, implementing best practices, and overseeing the integration of fully automated telemetry within modern DevOps frameworks

Your work in advancing problem detection and service restoration processes is pivotal

Utilizing cutting-edge Site Reliability Engineering methods, coupled with automated alerting and self-healing mechanisms, you are instrumental in improving both cloud-based and on-premises systems, thereby fortifying our digital infrastructure’s robustness and efficiency

What’s needed to succeed (Minimum Qualifications):

Bachelor's degree in Information Technology, Business Administration, Computer Science, or a relevant field

7+ years of IT and business/industry work experience

5+ years of Site Reliability Engineering experience working with telemetry, observability, self-healing solutions, and platform automation

5+ years of experience leading projects and managing people

2-3 years of leadership experience managing cross-functional teams or projects and influencing senior-level management and key stakeholders

2+ years of experience with leading DevOps practices and tools (CI/CD pipelines, Jenkins, GitHub)

Recognized expertise in the field, in industry and/or within United

Proven expertise in leading and influencing technical staff or coordinating work across multiple technology teams

Proven experience with monitoring, logging, and telemetry tools such as Dynatrace, Splunk, Prometheus, and AWS CloudWatch

Proficiency with DevOps practices and tools (CI/CD pipelines, Jenkins, GitHub)

Ability to diagnose and troubleshoot issues effectively

Strong and effective communication skills and status reporting

Experience with AWS networking services like VPC, Route 53, and CloudFront, with understanding of cloud concepts like IaaS, PaaS, and SaaS

Experience with AWS services such as EC2 (Elastic Compute Cloud), S3 (Simple Storage Service), RDS (Relational Database Service), VPC (Virtual Private Cloud), Lambda, and CloudFormation

Experience in developing monitoring tools and log analysis tools to manage operations

Experience in one or more general-purpose programming languages: Python, JavaScript, shell scripting (Unix/Linux)

Dynatrace Associate Certification or AWS Certified DevOps Engineer is a plus

Must be legally authorized to work in the United States for any employer without sponsorship

Successful completion of interview required to meet job qualification

Reliable, punctual attendance is an essential function of the position
