AI Trace Generation Engineer

You've been deep enough in ML infrastructure to know that most people training LLMs have no real visibility into what their GPUs are doing. You want to fix that.

This is a trace generation role at an early-stage AI infrastructure startup. The company builds simulations that model distributed AI workloads at scale, and a simulation is only as good as the traces feeding it. You'd design the collection system from scratch, capturing compute ops, communication primitives, memory usage, and cluster topology across multi-GPU, multi-node LLM workloads.

You'd instrument frameworks like vLLM, TensorRT-LLM, DeepSpeed, and Megatron-LM without killing performance. You'd validate that traces reflect real execution (timing, operation completeness, data integrity) across training and inference. And you'd work at every level, from kernel execution to cluster-wide communication.

Small international team across Germany and the USA, remote-flexible.

You'll need:

* 3+ years in AI systems or ML infrastructure

* Python and C++, plus a real understanding of GPU architecture

* Hands-on experience with at least one major LLM framework

* Knowledge of distributed communication and parallelism strategies

Drop me a message if you'd like to hear more.
