Sr. Site Reliability Engineer

Company:

Talent Groups

Location:

McKinney, TX, 75070

Posted:

June 28, 2025

Description:

Senior Site Reliability Engineer (Contract to Hire)

Location: McKinney, TX (Hybrid, 2–3 days onsite)

Must be authorized to work in the U.S.

Overview: Our client is seeking a Senior Site Reliability Engineer to lead platform reliability and traffic enforcement in a Kubernetes-hosted SASE (Secure Access Service Edge) environment.

This role ensures high availability, observability, and fair multi-tenant traffic handling across distributed systems.

Key Responsibilities:

Platform Reliability & Operations
- Own uptime (target: 99.99%) and stability of multi-region Kubernetes environments.
- Architect resilient, scalable infrastructure with proactive capacity planning and automated remediation.
- Lead incident response, root cause analysis, disaster recovery, and change management.

Observability & Monitoring
- Build a full-stack observability pipeline (Prometheus, OpenTelemetry, Grafana, etc.).
- Implement golden signals, tracing, and alerting to drive real-time performance insights.
- Develop automation for issue detection and resolution.

Kubernetes & Infrastructure
- Manage the full Kubernetes lifecycle (upgrades, autoscaling, GitOps automation).
- Integrate and optimize OpenStack-based infrastructure beneath Kubernetes.
- Enforce security compliance, resource efficiency, and FinOps best practices.

Traffic Enforcement & Networking
- Design a Kubernetes-native traffic control layer for per-tenant/session enforcement.
- Implement CRDs, custom controllers, and service mesh (e.g., Istio, Linkerd) for dynamic policy management.
- Operate SDN telemetry agents (Cilium Hubble, WireGuard) and integrate them with the observability stack.

Leadership & Strategy
- Contribute to infrastructure architecture and reliability strategy.
- Mentor team members and promote Kubernetes best practices.
- Partner cross-functionally across engineering, security, and product teams.

Required Skills:
- Kubernetes in production across multi-region architectures.
- Observability tools: Prometheus, OpenTelemetry, Grafana, Jaeger, Loki.
- Strong Linux networking (tc, nftables, WireGuard, iptables).
- Infrastructure automation: Helm, Terraform, ArgoCD/Flux (GitOps).
- Programming: Go (preferred), Python/Bash scripting.
- Familiarity with OpenStack (Nova, Neutron, Ceph) and CNI (Cilium preferred).

Preferred Experience:
- Service mesh deployment (Istio, Linkerd), multi-cluster tools (Fleet, Rancher).
- Chaos engineering frameworks (Chaos Mesh, Litmus).
- Developer platform abstraction on Kubernetes.
- FinOps cost optimization practices.
- Edge Kubernetes and NFV/SDN background.
- Active participation in the Kubernetes community.
