Sakana AI

Nature-Inspired AI Research

Join a world-class team building AI that draws inspiration from nature: evolutionary algorithms, swarm intelligence, and collective behavior. Co-founded by one of the co-inventors of the Transformer architecture.

Tokyo, Japan · ~150 employees · $412M raised · $2.65B+ valuation
Sakana AI · Full-time · Tokyo, Japan

Software Engineer (Platform)

Join Sakana AI as a Platform Engineer to build and maintain the foundational infrastructure that powers all of our AI research and production services. You will design and operate systems for GPU compute orchestration, model serving at scale, data pipelines for training runs, and the internal developer platforms that let our engineering and research teams move fast without sacrificing reliability.

Your work will be the invisible backbone that makes everything else at Sakana AI possible. When a researcher needs to launch 10,000 parallel evolutionary experiments across a GPU cluster, your systems handle the scheduling, monitoring, and resource allocation. When an enterprise customer needs sub-100ms inference latency, your serving infrastructure delivers it. You will build abstractions that hide infrastructure complexity while giving teams the control they need.

Day to day, you will work on Kubernetes cluster operations, GPU scheduling and multi-tenancy, infrastructure-as-code with Terraform, observability stack management (Prometheus, Grafana, distributed tracing), and CI/CD pipelines that handle everything from model training to production deployment. You will also build internal CLIs and dashboards that make infrastructure self-service for other teams.

We are looking for someone with deep infrastructure expertise who genuinely enjoys building developer tools and platforms. You should have strong opinions about system design, observability, and reliability engineering, and be excited about the unique challenges of AI infrastructure — where a single training run can cost hundreds of thousands of dollars and a misconfigured scheduler can waste an entire GPU cluster.

Requirements:

  • 5+ years of platform or infrastructure engineering experience at a technology company
  • Strong proficiency in Go, Python, or Rust for building infrastructure tooling and services
  • Deep hands-on experience with Kubernetes, Terraform, and cloud-native architectures on AWS or GCP
  • Experience building internal developer platforms, CLIs, or self-service infrastructure tooling
  • Understanding of GPU cluster management, ML serving infrastructure (Triton, vLLM), and compute scheduling
  • Strong background in observability (Prometheus, Grafana, OpenTelemetry, structured logging)
  • Experience with high-availability systems design, incident response, and SRE practices
