Job Summary
The Resiliency Engineer will play a key role in ensuring the reliability, performance, and scalability of enterprise applications.
This role involves designing and executing load and performance tests, analyzing system and infrastructure behavior, and supporting reliability engineering practices across cloud-native environments.
The engineer will contribute to internal toolsets, support performance investigations across Kubernetes and cloud platforms, and collaborate with cross-functional teams to drive system-wide reliability improvements.
Key Responsibilities
Design, develop, and execute load and performance tests across enterprise services.
Contribute to the development of internal tools, including API endpoints, analytics, and reporting functionality.
Apply object-oriented programming principles and coding best practices.
Build and maintain pipelines using Azure DevOps and Kubernetes.
Monitor and analyze performance using Datadog, Grafana, and Splunk.
Track performance and reliability metrics including resource utilization, latency, and error rates.
Diagnose issues across infrastructure layers, especially Kubernetes environments.
Investigate interactions between infrastructure components to identify anomalies and reliability risks.
Provide recommendations to Application Development and Cloud teams regarding Kubernetes resource sizing and optimization.
Partner with DevOps and SRE teams to performance-test new features and support product development efforts.
Collaborate with teams to understand the impact of backend changes and verify feature reliability.
Document and interpret complex architectural interactions, sharing findings with team members.
Required Qualifications
3+ years of experience in performance engineering, software development, SRE, or DevOps roles.
Strong coding and scripting experience with JavaScript, TypeScript, and Node.js (proficiency in other languages may also be considered).
Hands-on experience with Kubernetes and familiarity with containers; exposure to Terraform or Helm is a plus.
Experience working with cloud platforms; Azure preferred, though exposure to AWS or Google Cloud is also acceptable.
Working knowledge of observability tools such as Datadog, Grafana, or Splunk.
Understanding of core performance and reliability metrics including utilization, response times, and error rates.
Demonstrated ability to learn new technologies quickly.
Strong communication and interpersonal skills for cross-functional collaboration.
Preferred Qualifications
Experience with performance testing tools such as LoadRunner DevWeb, LoadRunner Cloud, Locust, or k6.
Strong Python experience.
Experience with Power BI or similar reporting tools.
Exposure to or interest in Chaos Engineering methodologies.
Understanding of or interest in using AI tools for workflow automation.