Site Reliability Engineer

Company:

Smash

Location:

Hospital, San Jose, 10103, Costa Rica

Posted:

April 23, 2024

Description:

SMASH: Who We Are

We are agents for tech professionals in Costa Rica and Colombia, helping them build careers in the United States.

We believe in long-lasting relationships with our talent. We invest time getting to know them as individuals and understanding what they are looking for as their professional next step.

We aim to find the perfect match. As agents, we pair our talent with our US clients based not only on technical skills but also on cultural fit. Our core competency is finding the right talent, fast.

We purposefully move away from the “contractor” or “outsourcing” type of relationship. Our clients don’t want contractors or “just a service.” Neither does our talent.

Our Benefits

Work from Home

English Academy for Employees and Relatives

Business Skills Coach – Certifications

Discounts with Tech Universities

Events and additional Perks

Job Description

The Site Reliability Engineer (SRE) is responsible for keeping all member-facing and internal production systems running smoothly. As an SRE, you will work with multiple teams to promote SRE principles, maintain the availability and reliability of systems, establish SLIs/SLOs, and develop tools and monitoring for operational visibility. SREs are members of the scrum teams and work closely with quality and software engineers to support services prior to general availability through activities such as launch reviews, performance reviews, and validation of logging in dev environments. The SRE is also responsible for ensuring quality releases to production environments and participates in an on-call rotation, working with internal and vendor teams to manage, troubleshoot, and resolve production issues.

To be effective, an individual must be able to perform each job duty successfully.

Keep current with emerging testing techniques and technologies, as well as emerging development practices.

Assist in diagnosing, finding the root cause of, reporting, and tracking production and non-production issues.

Continually research new ways of improving and scaling systems and services.

Lead initiatives to improve the reliability, scalability, and availability of production applications.

Build out tools, platforms, and processes to enable these goals.

Lead and contribute to the design, development, and improvement of SRE practices and procedures.

Create and maintain health dashboards, identify and measure health indicators and SLIs/SLOs, and provide tools for operational visibility into production systems.

Participate in and contribute to improving our incident response, acting as an escalation point for production incidents.

Perform root cause analysis (RCA), troubleshoot, and debug issues across our applications and services to identify and fix the root cause.

Enhance and maintain the software release procedures and processes.

A strong desire and aptitude for system automation to eliminate manual work in day-to-day operations.

Skilled with application monitoring practices and tools (New Relic, Azure Monitor, Datadog, Splunk, etc.).

Understanding of and experience with SRE and DevOps principles. Demonstrated experience working in Agile teams leveraging Scrum, Kanban, or other methodologies and/or understanding of Agile development concepts.

Meets the needs of the end user in a quality, consistent, and professional manner, using independent judgment where appropriate.

Mentors less experienced engineers.

Excellent communication skills (verbal and written) are critical, along with exceptional problem-solving skills and exceptionally professional behavior when interacting with and responding to other technical teams throughout the organization.

Take part in an on-call rotation.

Performs additional duties and responsibilities as assigned.

Experience

Minimum 4 years of professional experience in site reliability engineering, software development, or systems administration

Experience monitoring or troubleshooting web applications.

Experience with Scrum and associated tools such as Azure DevOps or Jira

Experience with some of the following tool sets:

Application monitoring tools (New Relic, Datadog, Splunk, etc.)

Automation tools (Pega, Microsoft Power Platform, Logic Apps, etc.)

API tools (Rest#, Postman, Swagger, etc.)

Front end tools (Selenium, Page Object Model, etc.)

Backend tools (SQL Server, Entity Framework, Dapper, etc.)

Build tools (Node, Docker, Azure Pipelines, etc.)

Infrastructure as Code (Terraform, Ansible, Chef, etc.)

Experience with automating, monitoring, and/or alerting on some of the following:

Web applications in Angular and React

Internal support tools

Third-party integrations

Database and API connections (REST and SOAP)

Cloud Solutions (AWS, Azure, or others)

Experience working in an agile CI/CD or rapid software testing environment.

Experience with and understanding of Git and source control concepts.
