Java Production Support SRE

Company:
Cognizant Technology Solutions Asia Pacific Pte.
Location:
Gurgaon, Haryana, India
Posted:
May 12, 2024
Description:

Mandatory Skills:

• Must have hands-on experience with Grafana, Sumo Logic, Splunk, or Kibana for developing end-to-end (E2E) dashboards.

• Should have strong knowledge of APM (application performance monitoring) tools.

• Experience developing automation and observability solutions.

• Willing to work in a 24x7 support environment.

Job responsibilities:

• Lead the 24x7 production support team.

• Debug production issues in Java-based applications, including performance issues, data connectivity and concurrency issues, and runtime errors.

• Drive the work of the L1 and L2 teams.

• Implement SRE practices such as automation and observability.

• Incident Management: Respond to and resolve critical incidents, minimizing system downtime and impact on business operations. Perform root cause analysis and implement preventive measures to avoid recurrence.

• System Reliability: Work with teams to design, implement, and maintain highly available and scalable systems to ensure optimal performance and reliability.

• Monitoring and Alerting: Work across teams to develop and maintain monitoring systems, ensuring comprehensive coverage of infrastructure, applications, and services. Create and fine-tune alerts to promptly detect and address potential issues.

• Observability and Predictive Operations: Use observability metrics, logs, and traces to act ahead of critical situations, preventing major incidents and outages.

• Service Mapping: Understand service mapping concepts and visualize configuration item (CI) dependencies between components in the IT landscape, including applications, servers, databases, and networks.

• Automation and Tooling: Continuously improve operational efficiency by automating repetitive tasks, writing scripts, and automating incident response to minimize toil.

• Performance Optimization: Identify performance bottlenecks, conduct capacity planning, and implement optimizations to enhance system response times, throughput, and scalability.

• Collaboration: Collaborate with development teams to provide guidance on building reliable and scalable software systems. Work closely with DevOps and security teams to ensure adherence to best practices and compliance requirements.

• Documentation and Knowledge Sharing: Maintain detailed documentation of system configurations, procedures, and troubleshooting guides. Share knowledge and provide training to teammates on new tools, technologies, and processes.

• Advanced Incident Forensics: Serve as a lead engineer on our incident RCA SWAT team, working to identify the most complex issues across our full-stack environment.

Full time