
Member of Technical Staff - ML Infrastructure Engineer

Black Forest Labs
San Francisco, CA
Full-time


About Black Forest Labs

We're the team behind Latent Diffusion, Stable Diffusion, and FLUX: foundational technologies that changed how the world creates images and video. Our generative models power tools used by millions of creators, developers, and businesses worldwide. Our FLUX models are among the most advanced in the world, and we're just getting started.

Headquartered in Freiburg, Germany, with a growing presence in San Francisco, we're scaling fast while staying true to what makes us different: research excellence, open science, and building technology that expands human creativity.

Why This Role

You'll design, deploy, and maintain the ML infrastructure backbone that makes frontier AI research possible. This isn't abstract systems work: every decision you make directly affects whether a multi-week training run succeeds, whether inference stays fast enough for production, and whether researchers can iterate quickly or must wait hours for resources.

What You'll Work On

You'll be the person who:

* Designs, deploys, and maintains cloud-based ML training clusters (Slurm) and inference clusters (Kubernetes) that researchers and products depend on

* Implements and manages network-based cloud file systems and blob/S3 storage solutions optimized for ML workloads at scale

* Develops and maintains Infrastructure as Code (IaC) for resource provisioning—because manual configuration doesn't scale and configuration drift breaks things

* Implements and optimizes CI/CD pipelines for ML workflows, making it easy for researchers to go from experiment to production

* Designs and implements custom autoscaling solutions for ML workloads where standard approaches fall short

* Ensures security best practices across the ML infrastructure stack without creating friction that slows down research

* Provides developer-friendly tools and practices that make ML operations efficient—because infrastructure that's hard to use doesn't get used

What We're Looking For

You've built and managed ML infrastructure at scale and understand that supporting AI research is fundamentally different from traditional cloud infrastructure. You've been paged because a training run failed. You've debugged why storage became the bottleneck. You know the difference between infrastructure that works in demos and infrastructure that works when researchers depend on it for months-long experiments.

You likely have:

* Strong proficiency in cloud platforms (AWS, Azure, or GCP) with focus on ML/AI services—you know which services matter and which are marketing

* Extensive experience with Kubernetes and Slurm cluster management in production environments

* Expertise in Infrastructure as Code tools (Terraform, Ansible, etc.) and the discipline to actually use them

* Proven track record managing and optimizing network-based cloud file systems and object storage for ML workloads

* Experience with CI/CD tools and practices (CircleCI, GitHub Actions, ArgoCD, etc.) in ML contexts

* Strong understanding of security principles and best practices in cloud environments—without making security the enemy of velocity

* Experience with monitoring and observability tools (Prometheus, Grafana, Loki, etc.) that help you understand what's actually happening

* Familiarity with ML workflows and GPU infrastructure management—you understand what researchers need

* Demonstrated ability to handle complex migrations and breaking changes in production environments without losing data or disrupting experiments

We'd be especially excited if you:

* Have experience building custom autoscaling solutions for ML workloads that standard tools can't handle

* Bring knowledge of cost optimization strategies for cloud-based ML infrastructure (because GPU hours add up)

* Are familiar with MLOps practices and tools

* Have experience with high-performance computing (HPC) environments

* Understand data versioning and experiment tracking for ML

* Know network optimization techniques for distributed ML training

* Have worked with multi-cloud or hybrid cloud architectures

* Are familiar with container security and vulnerability scanning tools

How We Work Together

We're a distributed team with real offices that people actually use. Depending on your role, you'll either join us in Freiburg or San Francisco at least two days a week (or one full week every other week), or work remotely with a monthly in-person week to stay connected. We'll cover reasonable travel costs to make this possible. We think in-person time matters, and we've structured things to make it accessible to everyone. We'll discuss what this will look like for your role during the interview process.

Everything we do is grounded in four values:

* Obsessed. We are a frontier research lab. The science has to be right, the understanding deep, the product beautiful.

* Low Ego. The work speaks. The best idea wins, no matter who said it. Credit is shared. Nobody is above any task.

* Bold. We take the ambitious bet. We ship, we do not wait for conditions to be perfect.

* Kind. People over politics. We treat each other with genuine warmth. Agency without empathy creates chaos.

If this sounds like work you'd enjoy, we'd love to hear from you.

Base Annual Salary: $180,000–$300,000 USD

We're based in Europe and value depth over noise, collaboration over hero culture, and honest technical conversations over hype. Our models have been downloaded hundreds of millions of times, but we're still a ~50-person team learning what's possible at the edge of generative AI.

Applications go to the hiring team directly