Manager Site Reliability Engineering

Company:

Diversity Resource Staffing Inc

Location:

Sandy Springs, GA

Posted:

May 19, 2025

Description:

This is an exciting opportunity for a Manager in the Consumer Site Reliability Engineer (SRE) Team at IMT. IMT is a division of our client, which operates numerous financial and commodity marketplaces and exchanges, including the New York Stock Exchange (NYSE).

This position is for a hands-on technical manager to lead a team of SRE engineers focused on providing resilient, secure, scalable and supportable services for mortgage borrowers and lenders. You will contribute to the strategy and delivery of the team and manage its day-to-day workload. This role requires building close relationships with our customer support, operations, engineering, database and product organizations.

You will be involved in the design of resilient systems, the definition and monitoring of SLIs/SLOs, and the creation of proactive, actionable alerts, and you will also drive the response to production incidents. We operate in hybrid multi-cloud environments supporting Windows, Linux and container-based applications.

Responsibilities

Provide thought-leadership; set the technical direction for the SRE Team

Define and manage projects to meet Team objectives.

Set individual goals and manage personal growth of team members.

Manage and troubleshoot a diverse set of SaaS Applications and internal services

Serve as the face of a team responsible for the overall health, performance, and capacity of our business applications

Develop sustainable SRE practices around simplification and standardization

Drive the cultural standard for SRE, including defining ways of working, runbooks and accountability across people, processes and technology

Lead Incident Response and Root Cause Analysis.

Partner with other SRE teams and lead by example

Knowledge and Experience

3+ years of managing high-performance teams in

10+ years of Application/Systems engineering in 24x7 Production Services environments

BS in Computer Science, Computer Engineering, Math, or equivalent professional experience

Experience in designing, deploying and operating SaaS applications and cloud infrastructure (AWS or equivalent & On-Premise virtualized environments)

Excellent troubleshooter spanning systems, networks and code, utilizing a systematic problem-solving approach

Proven track record of decreasing MTTR (Mean-Time-To-Recovery), increasing MTTF (Mean-Time-To-Failure), and improving overall service quality

Demonstrated ability to lead Incident Response and Root Cause Analysis (RCA)

Fluency with one or more current-generation scripting languages used by SRE/DevOps professionals (PowerShell, Python, Perl, PHP, Ruby) + Java/.NET development

Strong communication skills
