Machine Learning Engineer
Pilotcrew AI
Machine Learning Engineer - Applied Research
Location: Hybrid
Company: Pilotcrew AI
Type: Full-Time
Experience: 3-5 Years
About Pilotcrew AI
Pilotcrew AI builds infrastructure for AI agent evaluation. We benchmark large language models, run automated agent evaluations, power human-in-the-loop assessments, and host AI arenas for competitive testing.
Our mission is to make AI agents measurable, reliable, and production-ready through structured, scalable evaluation systems.
Role Overview
We are hiring a Machine Learning Engineer focused on applied research to bridge cutting-edge AI research with production-grade systems for evaluating LLMs and AI agents.
In this role, you will read, interpret, and implement ideas from the latest research across large language models, multimodal systems, and agent architectures. You will translate these insights into scalable evaluation pipelines, new benchmarking methodologies, and improved model performance.
You will work closely with engineering and product teams to turn research concepts into real-world systems used for measuring, debugging, and improving AI agents.
This is a research-driven, execution-heavy role requiring strong fundamentals, curiosity, and the ability to operate in a fast-paced startup environment.
Key Responsibilities
• Read and synthesize research papers in LLMs, multimodal AI, and agent systems
• Implement and adapt state-of-the-art methods into production-ready systems
• Design and improve evaluation methodologies (benchmarking, grading, scoring)
• Build experimental pipelines to test model behavior, robustness, and generalization
• Analyze model performance, failure modes, and edge cases
• Develop novel metrics for reliability, reasoning quality, and tool usage
• Contribute to adversarial testing and stress-testing frameworks
• Work on multimodal systems (text, vision, tool interactions) where relevant
• Collaborate with engineering teams to productionize research ideas
• Document findings and communicate insights clearly to technical stakeholders
Required Skills
• Strong Python programming skills
• Solid foundation in machine learning and deep learning
• Hands-on experience with PyTorch or TensorFlow
• Experience working with LLMs, transformers, or multimodal models
• Ability to read and understand research papers and implement them effectively
• Strong analytical thinking and experimentation skills
• Experience designing experiments and interpreting results
• Familiarity with evaluation metrics and benchmarking methodologies
Preferred Skills
• Experience with LLM evaluation, benchmarking, or alignment
• Familiarity with agent architectures (ReAct, tool-calling, planning systems)
• Experience with multimodal models (vision-language systems, CLIP, etc.)
• Knowledge of RLHF, reward modeling, or preference learning
• Experience with retrieval systems, search, or re-ranking
• Exposure to distributed systems or large-scale experimentation pipelines
• Background in applied ML research (industry or academia)
What We Value
• Strong curiosity and research mindset
• Ability to translate theory into practical systems
• Ownership and bias toward execution
• Comfort working with ambiguity and evolving problem spaces
• Clear and structured technical communication
• Ability to thrive in a fast-paced startup environment with high ownership
Why Join Pilotcrew AI
• Work on cutting-edge problems in AI evaluation and reliability
• Bridge research and real-world AI systems
• High ownership and autonomy in a fast-moving team
• Opportunity to shape how AI agents are evaluated at scale
• Exposure to both research-driven innovation and production systems