
Major Incident Manager

Company:
Robert Half
Location:
Clinton Township, OH, 43224
Posted:
May 21, 2025

Description:

This will be a contract-to-hire position, only available on a W2 basis.

What You’ll Do:

Incident Command & SRE: Lead P1/P2 bridges for ML/LLM and batch pipelines. Drive root-cause analysis, publish blameless post-mortems, and ensure fixes are automated, not repeated.

DevSecOps Automation: Patch CI/CD jobs, Helm charts, and Python utilities as part of incident follow-up. Embed vulnerability scans, rollback logic, and change-ticket integration.

Reliability Governance: Define and track MTTR, change-failure rate, and repeat-incident rate. Report trends to leadership in clear, metrics-first language.

People Leadership: Mentor engineers, set sprint priorities, and foster an SRE mindset in the offshore pod. Participate in hiring and onboarding.

Partnerships: Work daily with Solution Engineering, Platform Enablement, and Architecture to harden AWS deployments, review HA/DR designs, and close security gaps.

Must-Have Qualifications

Hands-on incident response in a machine-learning or data-platform environment (you have debugged Python code at 2 a.m.).

Strong Python & Bash; comfortable editing pipelines, writing quick-fix scripts, and reviewing pull requests.

AWS practitioner: IAM roles, ECR, EKS, S3 versioning, CloudWatch alarms.

Hands-on with Docker and Kubernetes.

Track record of converting Sev-1 incidents into durable controls (can share concrete examples).

Experience leading or coaching blended on-shore/offshore teams.

Familiar with DevSecOps practices—static scans (Snyk/Trivy), container runtime controls (Aqua Enforcer), SBOM generation.

Skilled in creating clear, actionable post-mortems for management audiences.

Nice to Have

Exposure to large language model operations (Bedrock, Vertex AI, or similar).

Financial services or other regulated industry background.

Terraform and Helm chart authoring.
