
Senior SRE Engineer (Kubernetes)

Company:
GTN Technical Staffing
Location:
McKinney, TX
Pay:
$ - $160000
Posted:
June 05, 2025

Description:

Senior Site Reliability Engineer (Kubernetes)

Employment Type: Full-Time, Onsite

Location: McKinney, TX

Role Overview

We’re seeking a seasoned Site Reliability Engineer with deep expertise in Kubernetes to lead the design, deployment, and ongoing operations of a scalable cloud-native platform. This role focuses on two key areas: ensuring the overall stability and observability of the system, and architecting a flexible traffic management layer within a multi-tenant SaaS environment.

Key Responsibilities

Cloud Platform Reliability & Scaling

Operate and maintain highly available, multi-region Kubernetes environments powering real-time applications.

Ensure 99.99%+ uptime across complex, latency-sensitive workloads.

Design comprehensive observability solutions across infrastructure, networking, and application layers.

Seamlessly integrate Kubernetes with underlying infrastructure built on OpenStack technologies.

Evolve platform architecture to support growing demand while maintaining robust performance and security.

Reliability Engineering & Uptime Assurance

Take primary ownership of service availability and performance targets.

Develop fault-tolerant Kubernetes deployments spanning multiple availability zones and regions.

Lead proactive scaling and capacity planning initiatives.

Establish strong incident response procedures, including automation and root cause analysis workflows.

Implement structured change control processes to minimize update-related disruptions.

Conduct resilience testing and disaster recovery simulations regularly.

Observability & Monitoring

Build and maintain a metrics-driven observability stack for system, network, and app-level insights.

Set up alerting based on the four golden signals (latency, traffic, errors, saturation) and establish automated remediation paths.

Maintain dashboards and tracing tools to surface issues quickly and inform root cause analysis.

Define and enforce logging strategies and retention practices.

Kubernetes Ownership & Operations

Manage the full lifecycle of Kubernetes clusters—provisioning, upgrades, migrations, autoscaling, and hardening.

Champion GitOps practices using tools such as ArgoCD, Flux, Helm, and Terraform.

Lead incident investigations and platform stability efforts.

Optimize resource usage and implement cost governance strategies within Kubernetes environments.

Traffic Management & Control Plane Design

Design a Kubernetes-native control system to enforce tenant-level policies around bandwidth and session limits.

Use CRDs and custom controllers to support dynamic, usage-based enforcement logic.

Extend Kubernetes policies for global fairness and tenant isolation.

Leverage service mesh tools for routing, security, and traffic observability.

Network Telemetry & System Resilience

Deploy and operate telemetry pipelines using tools such as Cilium Hubble, WireGuard exporters, and Prometheus.

Feed flow data into OpenTelemetry systems and visualize metrics via Grafana.

Create event-triggered remediation workflows using real-time metrics.

Conduct chaos engineering to test system behavior under stress.

Leadership & Collaboration

Shape cloud-native infrastructure strategy and contribute to architectural decisions.

Mentor peers and drive best practices in Kubernetes and SRE disciplines.

Collaborate cross-functionally with product, security, and engineering teams.

Contribute knowledge to the broader tech community through documentation, talks, or open-source engagement.

Required Skills & Experience

Proven expertise managing high-availability Kubernetes platforms across multiple regions.

Strong background in observability tools (Prometheus, Grafana, OpenTelemetry, etc.).

Demonstrated ability to maintain uptime targets of 99.9% or higher for critical systems.

Deep understanding of Kubernetes internals, including CRDs, controllers, and operator patterns.

Hands-on experience integrating Kubernetes with OpenStack components (Nova, Neutron) and Ceph storage.

Knowledge of CNI technologies, ideally Cilium, and software-defined network enforcement.

Linux networking fundamentals: tc, nftables, conntrack, iptables, WireGuard.

Proficiency in Go, with scripting skills in Python or Bash.

Experience with GitOps and infrastructure-as-code tools.

Familiarity with overlay networking protocols and secure network architectures.

Preferred Qualifications

Familiarity with service mesh technologies (Istio, Linkerd).

Exposure to multi-cluster management tools (e.g., Cluster API, Fleet, Rancher).

Experience running Kubernetes-based edge computing environments.

Experience building platform abstractions for internal development teams.

Implementation of chaos testing frameworks like Litmus or Chaos Mesh.

Background in network function virtualization or SDN.

Experience managing stateful workloads (e.g., databases, queues) within Kubernetes.

Contributions to open-source or active engagement in tech communities.

What We Value

Strategic mindset with ability to execute at scale.

High sense of ownership and accountability.

Clear, calm communicator—especially under pressure.

Deep curiosity about systems reliability and performance.

Effective collaborator across technical and non-technical teams.

Strong troubleshooting skills, especially in complex networking scenarios.
