Mandatory Skills:
• Must have hands-on experience with Grafana, Sumo Logic, Splunk, or Kibana for developing end-to-end (E2E) dashboards.
• Should have strong knowledge of APM tools.
• Experience developing automation and observability solutions.
• Willing to work in a 24/7 support environment.
Job Responsibilities:
• Lead the 24/7 production support team.
• Debug production issues, including performance problems, data connectivity and concurrency issues, and runtime errors in Java-based applications.
• Drive the work of the L1 and L2 teams.
• Implement SRE practices such as automation and observability.
• Incident Management: Respond to and resolve critical incidents, minimizing system downtime and impact on business operations. Perform root cause analysis and implement preventive measures to avoid recurrence.
• System Reliability: Work with teams to design, implement, and maintain highly available and scalable systems to ensure optimal performance and reliability.
• Monitoring and Alerting: Work across teams to develop and maintain monitoring systems, ensuring comprehensive coverage of infrastructure, applications, and services. Create and fine-tune alerts to promptly detect and address potential issues.
• Observability and Predictive Operations: Use observability data (metrics, logs, and traces) to anticipate critical situations and act before they become major incidents or outages.
• Service Mapping: Understand service mapping concepts and visualize CI dependencies between components of the IT landscape, including applications, servers, databases, and networks.
• Automation and Tooling: Continuously improve operational efficiency by automating repetitive tasks and incident response, and by writing scripts that minimize toil.
• Performance Optimization: Identify performance bottlenecks, conduct capacity planning, and implement optimizations to enhance system response times, throughput, and scalability.
• Collaboration: Collaborate with development teams to provide guidance on building reliable and scalable software systems. Work closely with DevOps and security teams to ensure adherence to best practices and compliance requirements.
• Documentation and Knowledge Sharing: Maintain detailed documentation of system configurations, procedures, and troubleshooting guides. Share knowledge and provide training to teammates on new tools, technologies, and processes.
• Advanced Incident Forensics: Serve as a lead engineer on our incident RCA SWAT team, working to identify the most complex issues across our full-stack environment.
Full time