Role Overview
Our partner is a top academic research lab focused on advancing AI agents in real-world system environments. We're seeking high-performing software engineers based in Five Eyes countries to rigorously evaluate and improve terminal-based agents through the Terminal-Bench 2.0 benchmark suite. This is a short-term, high-intensity contract ideal for engineers with deep systems-level expertise and a passion for hands-on problem-solving. Due to the complexity of the tasks, high engagement and consistent weekly availability are critical.
Key Responsibilities:
Systematically analyze, solve, and document benchmark tasks involving Docker, shell scripting, and Linux system administration
Evaluate agent outputs for correctness, reproducibility, and reliability across complex multi-step CLI workflows
Provide detailed, evidence-based reasoning grounded in code structure and terminal behavior
Synthesize information across files and configurations to assess end-to-end architecture
Contribute high-quality reference solutions and diagnostic insights to improve agent performance metrics
Ideal Qualifications:
2+ years of hands-on experience at top-tier tech companies, quant firms, or elite startups
Bachelor’s or Master’s in Computer Science or a related field from a university ranked in the global top 50–100
Based in one of the Five Eyes countries: United States, United Kingdom, Canada, Australia, or New Zealand
Deep familiarity with terminal workflows, Linux environments, and shell scripting
Strong knowledge of Docker, Git, Python, and distributed systems concepts
Demonstrated ability to trace, debug, and explain complex system behaviors across multiple files
Commitment to intellectual honesty, clarity, and rigorous methodology
Application Process:
Submit your resume and a brief summary of your experience through Mercor
Qualified applicants will be invited to complete a short-form technical assessment
We typically follow up within 3–5 business days with next steps