Infrastructure Reliability Engineer (Operational Excellence)

Company:

STACK Infrastructure

Location:

Elk Grove, CA

Pay:

$170k - $190k

Posted:

June 22, 2025

Description:

Infrastructure Reliability Engineer

THE COMPANY:

STACK INFRASTRUCTURE (STACK) provides digital infrastructure to scale the world’s most innovative companies. We are an award-winning industry leader in building, owning, and operating highly efficient, cost-effective wholesale, colocation, and cloud data centers. Each of our national facilities meets or exceeds the highest industry standards in all operational categories of availability, security, connectivity, and physical resilience.

STACK offers the scale and geographic reach that rapidly growing hyperscale and enterprise companies need. The world runs on data. Data runs on STACK.

THE POSITION:

STACK is looking for an Infrastructure Reliability Engineer who will act as a key member of STACK’s Critical Operations team. This position will play a vital role in ensuring the ongoing performance, resiliency, and evolution of infrastructure systems across STACK’s portfolio. This role requires deep technical fluency in data center power and cooling systems, a forensic mindset for failure analysis, and a proactive approach to risk reduction.

RESPONSIBILITIES:

Lead deep-dive root cause analyses (RCAs) for critical incidents—connecting technical failures to design, process, and operational contributors.

Inform and influence the design review and turnover process by identifying gaps in infrastructure handoffs, system limitations, or commissioning practices.

Develop system-level failure mode mitigation strategies that improve uptime performance and reduce repeat incidents.

Partner with Operations, Engineering, and Construction to identify design improvements needed to enhance operational reliability.

Engage Original Equipment Manufacturers (OEMs) and vendors to challenge technical assumptions and advocate for long-term improvements.

Support the evolution of maintenance standards and asset strategy for high-risk or complex systems (e.g., power distribution, cooling).

Collaborate with Learning and Development to enhance technical training for site teams based on lessons from event investigations.

Contribute to availability reporting, event response improvement, and risk trend monitoring to ensure service level agreement (SLA) commitments are met.

THE DETAILS:

Location: Chicago (CHI) or Dallas-Fort Worth (DFW)

Compensation: $170K - $190K plus 10% bonus potential

Travel: 25% domestically

Must be eligible to work in the United States

Must pass a comprehensive background screening

MUST-HAVE QUALIFICATIONS:

Bachelor’s degree in engineering or equivalent experience with high technical competency.

5–8 years of experience in critical infrastructure environments (e.g., data centers, substations, power generation, or utility systems).

Strong technical fluency in electrical and/or mechanical systems—power distribution, uninterruptible power supply (UPS), generators, control systems, and heating, ventilation, and air conditioning (HVAC).

Hands-on experience with root cause analysis and reliability methodologies (e.g., failure mode and effects analysis (FMEA), reliability-centered maintenance (RCM)).

Demonstrated ability to work across disciplines to resolve complex technical issues.

Expertise with commissioning (Cx) and infrastructure design review processes.

Ability to analyze performance data and translate findings into practical improvements.

THIS MIGHT BE RIGHT FOR YOU IF:

You’re the person people call when “something went wrong”—and you love figuring out why.

You bring rigor and precision to every failure analysis and don’t settle for surface-level fixes.

You want to engineer reliability, not just react to issues.

You enjoy working cross-functionally and are a collaborator who builds trust and consensus.

You’re driven by impact, not ego—and you measure success by improved system resilience.

You thrive in the space between design intent and operational reality.

PREFERRED QUALIFICATIONS:

Experience reviewing or developing engineering specifications.

Background in vendor/OEM engagement and technical contract negotiation.

Familiarity with computerized maintenance management systems (CMMS), data center infrastructure management (DCIM) tools, or reliability-centered asset programs.

Understanding of availability metrics and SLA management frameworks.

Technical training or mentoring experience in field operations environments.

WHY STACK?

We offer a competitive compensation package with strong benefits, including medical, dental, and vision insurance, a 401K program, flexible spending accounts – even a cell phone subsidy.

We foster a culture of appreciation, including peer-to-peer recognition and rewards programs.

Fun is part of our DNA, with events, game nights, happy hours, and barbecues.

We’re growing – this is a great time to join and make an impact!

STACK is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity and expression, age, national origin, mental or physical disability, genetic information, veteran status, or any other status protected by federal, state, or local law.

Note to external agencies: We are not accepting any blind submissions or resumes/CVs from recruitment agencies. Any candidates sent to STACK Infrastructure will not be accepted or considered as a submission without a signed agreement in place.
