STELLA
Systematic evaluation should be a prerequisite to deployment, not an afterthought. The stakes are too high for anything less. Frontier models have made huge leaps in conversational ability, but safety evaluation hasn't kept up. Most benchmarks and one-and-done red teaming miss the reality of how these systems are used: multi-turn, high-stakes conversations in areas like crisis support, domestic violence, child abuse reporting, and mental health.

STELLA (Simulation-based TEsting for LLM Applications) is built as a clinical-grade test harness to close the gap between "it works" and "it's safe". Developed by a multidisciplinary team from Mass General Brigham, Harvard Medical School, Oxford, and Cornell, STELLA adds a simulation-based validation layer that stress-tests LLMs with realistic patient cohorts and clinically grounded, multi-turn scenarios.

Why teams use STELLA:

+ Beyond benchmarks: failure modes often appear after 5+ turns.
+ Model-agnostic: works via API with GPT-4, Claude,