Site Reliability Engineering - (SRE)

Company:

Snaphunt Pte Ltd

Location:

Colombia

Posted:

February 24, 2026

Description:

Job Description

TOP REQUIREMENTS:

Kubernetes (EKS)

Terraform

Intermediate to Advanced AWS

GitHub Actions

New Relic/Datadog

Containers (Docker)

Ideally some form of developer background - with an eye for automating recurring tasks

Excellent communicator

Above all else, someone who takes ownership, is a self-starter, loves learning new skills, and has a growth mindset.

What can you expect from this position?

Participate in initiatives involving system design and provisioning, reliability, observability and monitoring, self-service tool development, cost optimization, incident response, chaos engineering, and build and release

Hands-on design, analysis, development and troubleshooting of large-scale distributed systems

Build tools and automation to eliminate repetitive tasks, minimize downtime, achieve human-free operations, and provide self-service solutions to product development teams

Work to improve the observability and monitoring of our systems. Proactively monitor capacity, performance, and cost metrics to ensure quality and identify opportunities for improvement

Share an on-call rotation with your team where you will respond to incidents, lead triage efforts, and conduct blameless postmortems

Partner with engineering, security, and product teams to keep our services reliable, available, fast, and cost-efficient

Be a champion of the customer’s voice and ensure our solutions are built with customer empathy at the forefront

Promote SRE best practices within your team to ensure quality, stability, performance, resiliency, and maintainability of your solutions

Explore new technologies and solutions to push our capabilities forward

What can you bring to the role?

4+ years combined experience as a Software Engineer, Site Reliability Engineer or DevOps Engineer

Proven technical abilities in the areas of reliability, monitoring, self-service tool development, incident response, and build and release

Experience in one of these languages: Python, Go or Java. Prior software development experience preferred

Strong experience with Linux environments

Demonstrated expertise designing, building, and triaging highly scaled production infrastructure in AWS

Experience with infrastructure automation technologies like Terraform

Experience in container/container-fleet-orchestration technologies like AWS ECS or EKS

Approach your job with an automation and software engineering mindset

Passion for uptime, observability, and full stack monitoring

Experience participating in a team’s 24x7 incident response efforts

Experience building CI/CD pipelines that are fast and informative, drive quality, and achieve zero-downtime releases

Ability to work across functional and domain boundaries to improve system reliability and deliver solutions on time and with quality

Common Technologies In Our Ecosystem Include

Java, Go, Node, PHP

Databricks experience would be ideal

Linux-based, some Windows

Apache HTTP Server, Nginx, IIS, Apache Tomcat, Jetty

Docker, AWS ECS, AWS EKS, and home-grown Kubernetes

ELB, CloudFront, S3, EC2, RDS, IAM, SQS, SES, SNS, Lambda, API Gateway, Kinesis, ElastiCache, Elasticsearch, SSM, Control Tower, and much more

MySQL, Oracle, PostgreSQL, SQL Server

Artifactory, GitHub Enterprise, CircleCI, Jenkins, GitHub Actions, SonarQube, JFrog Xray

Terraform (preferred), CloudFormation

Packer, Puppet, Ansible

New Relic, CloudWatch, PagerDuty

As an education innovation company, we're proud to play our part by inspiring learners around the world. If you bring your curiosity, we'll help you grow in a collaborative environment where everyone shares a passion for success.
