Software Engineer (Platform)

Join Sakana AI as a Platform Engineer to build and maintain the foundational infrastructure that powers all of our AI research and production services. You will design and operate systems for GPU compute orchestration, model serving at scale, data pipelines for training runs, and the internal developer platforms that let our engineering and research teams move fast without sacrificing reliability.

Your work will be the invisible backbone that makes everything else at Sakana AI possible. When a researcher needs to launch 10,000 parallel evolutionary experiments across a GPU cluster, your systems handle the scheduling, monitoring, and resource allocation. When an enterprise customer needs sub-100ms inference latency, your serving infrastructure delivers it. You will build abstractions that hide infrastructure complexity while giving teams the control they need.

Day to day, you will work on Kubernetes cluster operations, GPU scheduling and multi-tenancy, infrastructure-as-code with Terraform, observability stack management (Prometheus, Grafana, distributed tracing), and CI/CD pipelines that handle everything from model training to production deployment. You will also build internal CLIs and dashboards that make infrastructure self-service for other teams.

We are looking for someone with deep infrastructure expertise who genuinely enjoys building developer tools and platforms. You should have strong opinions about system design, observability, and reliability engineering, and be excited about the unique challenges of AI infrastructure, where a single training run can cost hundreds of thousands of dollars and a misconfigured scheduler can waste an entire GPU cluster.

Requirements:

5+ years of platform or infrastructure engineering experience at a technology company
Strong proficiency in Go, Python, or Rust for building infrastructure tooling and services
Deep hands-on experience with Kubernetes, Terraform, and cloud-native architectures on AWS or GCP
Experience building internal developer platforms, CLIs, or self-service infrastructure tooling
Understanding of GPU cluster management, ML serving infrastructure (Triton Inference Server, vLLM), and compute scheduling
Strong background in observability (Prometheus, Grafana, OpenTelemetry, structured logging)
Experience with high-availability systems design, incident response, and SRE practices
