AI Infrastructure Engineer (Model Training & Inference)
Xcede
Full Description
The Mission
My client is moving beyond the theoretical. While many are reading papers about generative AI, this team is building the foundational substrate that makes those models a reality. Having recently secured a colossal Seed funding round, the organisation is scaling its capacity to train frontier-level generative models for video and audio.
The mission is clear: solve the "hard tech" problems of the training stack. This isn't about high-level API calls; it’s about understanding what is happening at the hardware level to squeeze every possible TFLOP out of the cluster.
The Role
As an AI Infrastructure Engineer, you will own the full training stack. The role is designed for a systems expert who thrives on profiling GPU behaviour, debugging complex training pipelines, and designing the architecture that allows models to iterate at scale.
Primary Objectives:
* Architect SLURM Clusters: Design, deploy, and maintain large-scale ML training clusters using SLURM for distributed workload orchestration.
* Hardware-Level Optimization: Use tools like Nsight and stack trace viewers to profile single- and multi-GPU operations, identifying bottlenecks in the memory hierarchy and compute units.
* Strategy & Parallelism: Determine the ideal training strategies—including parallelism approaches and precision trade-offs—for various model sizes and compute loads.
* End-to-End Pipeline Performance: Optimize the entire lifecycle, from efficient data storage (VAST/blob storage) and high-throughput loading to checkpointing and artifact saving.
* PyTorch Mastery: Dive into the code to optimize PyTorch operations, dealing with both memory-bound and compute-bound constraints.
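To give a flavour of the parallelism and precision trade-offs above: choosing a training strategy often starts with back-of-envelope memory accounting. The sketch below is illustrative only; it assumes bf16 training with fp32 master weights and Adam-style optimizer state, and the function names (`training_bytes_per_param`, `fits_data_parallel`) are hypothetical, not from any particular codebase.

```python
# Back-of-envelope estimate of per-GPU training state, used to decide whether
# plain data parallelism fits or whether sharding (e.g. ZeRO/FSDP) is needed.
# Assumptions (illustrative, not a definitive recipe):
#   - bf16 parameters and gradients (2 bytes each)
#   - fp32 master weights plus two Adam moment buffers (4 bytes x 3)

def training_bytes_per_param() -> int:
    """Bytes per parameter for params, grads, and optimizer state."""
    bf16_params = 2
    bf16_grads = 2
    fp32_master = 4
    adam_moments = 2 * 4  # exp_avg and exp_avg_sq in fp32
    return bf16_params + bf16_grads + fp32_master + adam_moments  # 16

def fits_data_parallel(n_params: float, gpu_mem_gib: float,
                       activation_headroom: float = 0.3) -> bool:
    """Rough check: does a full model replica fit on one GPU while leaving
    `activation_headroom` of memory free for activations?"""
    state_bytes = n_params * training_bytes_per_param()
    budget = gpu_mem_gib * (1 - activation_headroom) * 1024**3
    return state_bytes <= budget

# A 7B-parameter model needs ~104 GiB of state alone, so even an 80 GiB GPU
# forces sharding; a 1.3B model fits comfortably as a plain replica.
print(fits_data_parallel(7e9, 80.0))    # False: shard params/optimizer
print(fits_data_parallel(1.3e9, 80.0))  # True: plain data parallel is fine
```

Real decisions also weigh activation checkpointing, sequence length, and interconnect bandwidth, but this kind of arithmetic is usually the first cut.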
The Profile
The team is looking for a "builder" with deep technical intuition regarding how data moves through a GPU.
* Implementation-First Mindset: You have a track record of implementing effective techniques in training and inference optimization, not just researching them.
* GPU Fluency: You possess a deep understanding of GPU memory hierarchy and the specific constraints that prevent hardware from achieving its theoretical peak.
* Distributed Expertise: Experience with SLURM at scale and an understanding of attention algorithms and their performance characteristics.
* The "Nice to Haves": If you have implemented custom GPU kernels or have specific experience with diffusion and autoregressive model optimization, you will be a standout candidate.
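The "memory-bound vs compute-bound" intuition referenced above can be made concrete with a roofline-style estimate. The peak figures below are illustrative numbers for an A100-class GPU (assumptions for the sketch, not specifications to rely on), and the helper names are hypothetical:

```python
# Roofline-style check of whether a kernel is memory- or compute-bound.
# Assumed illustrative peaks: ~312 TFLOP/s bf16 tensor-core compute and
# ~2.0 TB/s HBM bandwidth, giving a machine balance of ~156 FLOP/byte.

PEAK_FLOPS = 312e12                      # FLOP/s (assumed)
PEAK_BW = 2.0e12                         # bytes/s (assumed)
MACHINE_BALANCE = PEAK_FLOPS / PEAK_BW   # ~156 FLOP/byte

def matmul_intensity(m: int, n: int, k: int, bytes_per_el: int = 2) -> float:
    """Arithmetic intensity (FLOP/byte) of an (m x k) @ (k x n) matmul,
    counting one read of each operand and one write of the output."""
    flops = 2 * m * n * k
    traffic = bytes_per_el * (m * k + k * n + m * n)
    return flops / traffic

def is_compute_bound(intensity: float) -> bool:
    return intensity >= MACHINE_BALANCE

# Large square matmul: high operand reuse, compute-bound.
print(is_compute_bound(matmul_intensity(8192, 8192, 8192)))  # True
# Skinny matmul (e.g. batch-1 decode): ~1 FLOP/byte, memory-bound.
print(is_compute_bound(matmul_intensity(1, 8192, 8192)))     # False
```

The same reasoning underlies modern fused attention kernels: naive attention streams large intermediates through HBM and sits far below the machine balance, which is why kernels that keep the working set in on-chip SRAM perform so much better.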
Why Consider This?
* True Ownership: My client offers genuine autonomy. Your ideas will directly shape the foundation of the company's generative models from day one.
* Pivotal Growth: Join a team that has the capital and the traction to win, at a stage where your individual contribution can fundamentally change the company's trajectory.
* Complex Problem Solving: Work on the specific challenges of high-performance storage and video/audio data pipelines—some of the most demanding workloads in the industry.
* Competitive Rewards: Expect a strong compensation package and equity that ensures you share in the success you help create.
How to Apply
If you are ready to stop building at the surface and start building at the hardware level, reach out for a confidential conversation about the team and their roadmap.