Job Description
Senior Platform/Infrastructure Engineer
Location: Fully remote (HQ Cambridge, MA)
Hours: 9–5 EST, with 2-day on-site visits every 6 weeks
You’ll be responsible for designing, scaling, and maintaining the infrastructure and internal developer platforms that power a real-time learning AI at a seed-stage startup. The role blends infrastructure ownership with platform engineering to enable AI/product teams to ship quickly and reliably.
Key Responsibilities - Infrastructure
Maintain production health: performance, reliability, cost efficiency, and security.
Manage GCP Kubernetes clusters (GKE), networking, storage, and compute resources.
Handle scaling, resource allocation, and high availability for growing customer demand.
Refine observability: logs, traces, metrics, dashboards, and alerts.
Perform security hardening and cost optimization.
Key Responsibilities - Platform Engineering
Build internal tooling and abstractions for developer productivity.
Design CI/CD pipelines using GitHub Actions workflows and ArgoCD.
Provide self-service environments, internal portals, and deployment systems.
Collaboration & Communication
Work closely with AI and full-stack teams to optimize system architecture.
Explain technical concepts and trade-offs clearly to engineers and non-engineers.
Troubleshoot issues across systems written in Python, JavaScript, and SQL.
Requirements
5+ years of experience operating production cloud environments at scale.
Strong familiarity with GCP (primary) and some AWS experience.
Experience running Kubernetes (GKE), including managing node pools and memory-intensive jobs.
Working knowledge of CI/CD systems (GitHub Actions workflows + ArgoCD).
Exposure to observability tools (Datadog), databases (Cloud SQL, ClickHouse, Bigtable), and cloud services.
Skills & Qualities
Strong analytical and problem-solving ability.
Clear, collaborative communication.
Curiosity and ownership mentality.
Fluent in reading and debugging code in Python, JavaScript, and SQL.
Technical Stack
Cloud: GCP (primary), AWS (secondary)
Kubernetes: GKE, multiple node pools
CI/CD: GitHub Actions workflows + ArgoCD
Data: Cloud SQL, ClickHouse, Bigtable, GCS, Dataflow
Networking: Cloudflare Workers, Durable Objects, WebSocket communication
Monitoring: Datadog
Environments: Production, Staging, Integration, Development
Employment type: Full-time, fully remote