About Our Team
Our MLOps organization partners with data-science squads across the bank. We do not write models; we make them run safely, quickly, and continuously.
2025 Focus:
Velocity - shorten the path from "approved model" to production deployment.
Trust & Security - embed DevSecOps controls so every model meets bank-grade risk standards.
Why This Role Exists
We need a hands-on leader who can own 24/7 production health, turn incidents into permanent improvements, and coach a team of 8-10 engineers (a mix of on-site and offshore) without losing touch with the code.
Key Responsibilities
Incident Command & SRE - Lead P1/P2 bridges for ML/LLM and batch pipelines. Drive root-cause analysis, publish blameless post-mortems, and ensure fixes are automated, not repeated.
DevSecOps Automation - Patch CI/CD jobs, Helm charts, and Python utilities as part of incident follow-up. Embed vulnerability scans, rollback logic, and change-ticket integration.
Reliability Governance - Define and track MTTR, change-failure rate, and repeat-incident rate. Report trends to leadership in clear, metrics-first language.
People Leadership - Mentor engineers, set sprint priorities, and foster an SRE mindset in the offshore pod. Participate in hiring and onboarding.
Partnerships - Work daily with Solution Engineering, Platform Enablement, and Architecture to harden AWS deployments, review HA/DR designs, and close security gaps.
Must-Have Qualifications
Hands-on incident response in a machine-learning or data-platform environment (you have debugged Python code at 2 a.m.).
Strong Python & Bash; comfortable editing pipelines, writing quick-fix scripts, and reviewing pull requests.
AWS practitioner: IAM roles, ECR, EKS, S3 versioning, CloudWatch alarms.
Hands-on with Docker and Kubernetes.
Track record of converting Sev-1 incidents into durable controls (be ready to share concrete examples).
Experience leading or coaching blended on-shore/offshore teams.
Familiarity with DevSecOps practices: static scans (Snyk/Trivy), container runtime controls (Aqua Enforcer), and SBOM generation.
Skill in writing clear, actionable post-mortems for management audiences.
Nice to Have
Exposure to large language model operations (Bedrock, Vertex AI, or similar).
Financial services or other regulated industry background.
Terraform and Helm chart authoring.
How We Work
On-Call: Manager engages on all P1/P2 events; ICs rotate night/weekend coverage.
Culture: Every incident is an unplanned investment; each root-cause fix must harden code, docs, or infrastructure.
Collaboration: Teams for triage, Azure DevOps for work tracking, ServiceNow for change control.
Location: Hybrid model; three days a week in our Columbus office.
If you thrive on making complex ML systems reliable and secure, and you enjoy coaching engineers as much as solving hard technical problems, we'd like to meet you.