Lead Site Reliability Engineer

Company: Coforge
Location: Noida, Uttar Pradesh, India
Posted: April 10, 2024

Description:

Be a thought leader in the SRE space to help design a strategy and roadmap that helps us mature as an organization

and translate business requirements to technical requirements, solution designing with commercial viability, and build business cases.

the sales team on shaping the opportunity. Respond to RFP / RFI with a technical solution along with costing SRE technical capabilities

an assessment of current state and summarize status for IT OPS Senior Leadership

and define what our north star should look like in the SRE space and how it fits into the organization model.

a series of recommendations and action plans that outline how customers can mature to reach that north star.

a run book/roadmap that outlines what a recommended series of next steps should be

an executive/Capability deck and present to IT OPS senior leadership an overview of current state and the recommended plan.

a communications plan for our senior leadership to help them sell the program to customers

to pair up and prototype what an SRE engagement looks like with an actual application/Infra team

Job purpose:

The Site Reliability Engineer Lead (SRE Lead) will manage a team of SREs to proactively ensure the stability, resilience and scale of our services through automation, testing and engineering. The role builds on expertise in systems/operations, cloud infrastructure (AWS), build and release engineering, software development and stress/load testing to make sure our services are available, cost optimised and fit for purpose early in the development lifecycle. The SRE Lead will also work alongside the development, architecture and service management teams to ensure that technical solutions are aligned to Coforge's architectural principles, designs and NFRs and deliver value to our customers, and to ensure consistent monitoring, logging and alerting. The SRE Lead reports into the Head of AWS & DevOps Practice and is responsible for building capability and maturing operational ways of working across multiple cross-functional delivery teams, with a focus on technical excellence and a high-performance culture.

Key accountabilities:

Provide leadership and guidance across the SRE team, acting as a subject matter expert and leading best practice techniques.

lead the SRE team in ensuring technical assurance in significant projects, for the delivery of quality technical deliverables, which may involve several teams or technologies.

oversee the SRE team to ensure they are involved in every step of the application software development lifecycle, development, testing, and transition into operation.

Provide coaching and mentoring to the SRE team to improve their skillset, increase knowledge and set the benchmark of quality and precision engineering.

the implementation of service transition and change and release process changes, ensuring that processes are reviewed and improved with onus on optimisation.

Contribute to the documentation of strategic DevOps operating model changes and to the definition of future/new 3rd party technical partnerships and subsequent onboarding/integration into the organisation.

upon in-depth understanding of the organisation and ability to leverage existing relationships, provide implementation support to the Heads of AWS & DevOps during the stages of transition to new strategic operating models.

Knowledge sharing and education of team members to enable them to contribute to backlog items related to infrastructure provisioning, monitoring and best practice

risks and defects, analysing specifications, and customising applications for specific customer needs.

Technology teams promote a culture of service support, identifying new ways of working e.g. out of hours/critical event support and any potential impact this may have on the workforce.

Act as the 'change lead' for DevOps, quality assurance practice implementation and project initiatives.

Continuously review capabilities and roles critical to evolving DevOps and quality assurance practices and be responsible for the acquisition, development and maturity of these.

on processes and projects as directed by the Heads of AWS & DevOps, related to integrating new DevOps ways of working and initiatives within the wider organisation.

With a focus on agile methodologies, test automation, test data management, and continuous integration, the SRE Lead will oversee the improvements and enhancements of the delivery and deployment process.

Responsible for producing and maintaining documentation relating to application design, integration processes, testing procedures, and deployment approach, as well as working with teams to create operational runbooks and playbooks.

work with stakeholders in the Enterprise, Solution and Development teams to produce and maintain standards, guidelines, and pattern catalogue.

Work with technical roles across the department to drive evolution of the DevOps toolchain, promoting improvements to streamline the software delivery process and showing improvements through metrics.

innovative prototypes and lead development teams to develop quality solutions, by translating architectural designs into lower level implementation details, helping implement user stories if required.

take highly complex and manual processes and work to simplify and automate them.

lead and influence teams to ensure quality and operational excellence, and to ensure teams are aligned to design patterns and design collateral

Skills, qualifications, and experience:

significant experience in DevOps implementation and in evolving practices and ways of working through multi-disciplinary teams, business frameworks and culture.

strong project management background and experience in leading technology change programmes.

individual who can perform highly in a multi-faceted role – facets that include a very strong technical knowledge, and awareness of emergent trends.

leadership skills to ensure scrum teams and co-workers are motivated and engaged to deliver against a roadmap.

very strong communicator, able to lead and facilitate discussions across many tiers at Coforge, including functions like architecture, technical specialists, business analysis, team leaders, senior management group, and executives.

Experience working with Windows and Linux Containers.

proficient with Kubernetes, Ansible, Terraform and AWS.

to get up to speed with domain knowledge.

in administration of ServiceNow, Jira

in Windows C# & Java build tools and practices including MSBuild.

in Git and GitOps philosophy.

in logging and monitoring tools (AppD, Splunk, ELK, Prometheus, Grafana), incorporating frameworks and instrumentation into C# & Java code.

experience in maintaining applications.

1. Independently designs, implements, productionizes and maintains site reliability guidelines, processes and systems

2. Service Level Definition, Configuration and Measurement:

SLIs, SLOs & SLAs specific to each application or system:

of monitoring & alerting tools suitable for each product and/or platform team

reliability & resilience (through pre-defined SLIs & SLOs) utilizing monitoring/alerting tools to drive continuous improvement based on data analysis

3. Incident Management

of incident response through the engagement of various teams and stakeholders, while providing robust communication and visibility to the organization during service interruptions

Root Cause Analysis for failures

with a modern incident management platform to effectively drive incident response and problem resolution

4. Monitoring & Alerting

defects as well as develop dashboards using modern monitoring tools (e.g. AppDynamics, Splunk) to enable a reduction in MTTD (detection time) and MTTR (resolution time). Experience with the Pega platform will be a plus.

monitors and alerts designed to manage SLAs, optimize performance, and minimize outages

E2E customer journey dashboards and alerts for customized transactions and applications

5. Automates reliability requirements into system and application implementations and updates, including the implementation of self-healing solutions (Ansible, Terraform, etc.).

6. Work with the product management team to contribute to (1) the identification of reliability features and requirements and (2) level-of-effort estimates

Full time