Machine Learning Engineer

Pilotcrew AI
San Francisco Bay Area
Full-time
AI tools:
PyTorch
TensorFlow
LLMs
Applications go directly to the hiring team

Full Description

Machine Learning Engineer - Applied Research

Location: Hybrid

Company: Pilotcrew AI

Type: Full-Time

Experience: 3-5 Years

About Pilotcrew AI

Pilotcrew AI builds infrastructure for AI Agent Evaluation. We benchmark large language models, run automated agent evaluations, power human-in-the-loop assessments, and host AI arenas for competitive testing.

Our mission is to make AI agents measurable, reliable, and production-ready through structured, scalable evaluation systems.

Role Overview

We are hiring a Machine Learning Engineer (Applied Research) to bridge cutting-edge AI research with production-grade systems for evaluating LLMs and AI agents.

In this role, you will read, interpret, and implement ideas from the latest research across large language models, multimodal systems, and agent architectures. You will translate these insights into scalable evaluation pipelines, new benchmarking methodologies, and improved model performance.

You will work closely with engineering and product teams to turn research concepts into real-world systems used for measuring, debugging, and improving AI agents.

This is a research-driven, execution-heavy role requiring strong fundamentals, curiosity, and the ability to operate in a fast-paced startup environment.

Key Responsibilities

• Read and synthesize research papers in LLMs, multimodal AI, and agent systems

• Implement and adapt state-of-the-art methods into production-ready systems

• Design and improve evaluation methodologies (benchmarking, grading, scoring)

• Build experimental pipelines to test model behavior, robustness, and generalization

• Analyze model performance, failure modes, and edge cases

• Develop novel metrics for reliability, reasoning quality, and tool usage

• Contribute to adversarial testing and stress-testing frameworks

• Work on multimodal systems (text, vision, tool interactions) where relevant

• Collaborate with engineering teams to productionize research ideas

• Document findings and communicate insights clearly to technical stakeholders

Required Skills

• Strong Python programming skills

• Solid foundation in machine learning and deep learning

• Hands-on experience with PyTorch or TensorFlow

• Experience working with LLMs, transformers, or multimodal models

• Ability to read and understand research papers and implement them effectively

• Strong analytical thinking and experimentation skills

• Experience designing experiments and interpreting results

• Familiarity with evaluation metrics and benchmarking methodologies

Preferred Skills

• Experience with LLM evaluation, benchmarking, or alignment

• Familiarity with agent architectures (ReAct, tool-calling, planning systems)

• Experience with multimodal models (vision-language systems, CLIP, etc.)

• Knowledge of RLHF, reward modeling, or preference learning

• Experience with retrieval systems, search, or re-ranking

• Exposure to distributed systems or large-scale experimentation pipelines

• Background in applied ML research (industry or academia)

What We Value

• Strong curiosity and research mindset

• Ability to translate theory into practical systems

• Ownership and bias toward execution

• Comfort working with ambiguity and evolving problem spaces

• Clear and structured technical communication

• Ability to thrive in a fast-paced startup environment with high ownership

Why Join Pilotcrew AI

• Work on cutting-edge problems in AI evaluation and reliability

• Bridge research and real-world AI systems

• High ownership and autonomy in a fast-moving team

• Opportunity to shape how AI agents are evaluated at scale

• Exposure to both research-driven innovation and production systems
