Description:
Design, deploy, and configure customer-facing infrastructure, applications, and services
Design and manage cloud infrastructure and services that meet enterprise-grade SLA standards
Resolve customer escalations and help prevent recurrence of those incidents by creating processes, procedures, and automation
Monitor, diagnose, and resolve urgent production issues, including outside normal business hours when needed
Create and deploy scalable monitoring systems for a rapidly growing global infrastructure
Write, extend, and maintain production documentation
Technical Skills
Incident Response & Production Support: Skilled in triaging live outages, assessing blast radius, and driving rapid mitigation.
Observability & Monitoring (Datadog): Experienced with metrics, logs, traces, alert tuning, and noise reduction.
Kubernetes Operations: Troubleshooting pods, deployments, restarts, resource constraints, and service health.
AWS Fundamentals: Proficient in EC2, IAM, networking basics, and queues/events.
Incident Management Platforms: Working knowledge of ServiceNow, Jira, and incident.io for incident lifecycle tracking.
Root Cause Investigation: Systems thinking across distributed services, dependencies, and failure modes.
Automation Mindset: Ability to script or build lightweight solutions to reduce repetitive operational toil.
AI-First Tool Adoption: Leveraging Bedrock/Claude to accelerate analysis, documentation, and operational workflows.
Professional Skills
Bridge Call Leadership: Confident in running live incident calls with structure, composure, and urgency.
Clear Written Communication: Strong in incident updates, stakeholder messaging, and postmortem documentation.
Autonomy in Ambiguity: Effective in dynamic, remote, multi-timezone environments.
Cross-Team Collaboration: Skilled at engaging partner engineering teams and driving alignment during response.
Operational Ownership: Accountable beyond mitigation, ensuring fixes, learnings, and improvements are implemented.
Service Desk Discipline: Familiar with high-volume ticket workflows and structured incident processes.