GAMMA Lab researchers collaborated with Apple Machine Learning Research to develop AMUSE (Audio-Visual Benchmark and Alignment framework for Agentic Multi-Speaker Understanding), a new benchmark designed to evaluate and improve multimodal AI systems operating in complex, real-world conversational settings.
AMUSE focuses on agentic multi-speaker reasoning: models must track who is speaking over time, ground dialogue in visual context, and generate coherent multimodal summaries. The benchmark reveals significant limitations in existing multimodal large language models when reasoning jointly across audio, vision, and language.
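To make the task concrete, here is a minimal, hypothetical sketch in Python of what a multi-speaker example and a per-turn speaker-tracking score might look like. The field names (`speaker_id`, `visual_refs`, `reference_summary`) and the accuracy helper are illustrative assumptions, not AMUSE's actual schema or official metrics.

```python
from dataclasses import dataclass, field

@dataclass
class SpeakerTurn:
    # Hypothetical fields; not the actual AMUSE schema.
    speaker_id: str          # who is speaking during this turn
    start_s: float           # turn start time (seconds)
    end_s: float             # turn end time (seconds)
    transcript: str          # what was said
    visual_refs: list[str] = field(default_factory=list)  # objects/people visible during the turn

@dataclass
class MultiSpeakerExample:
    video_id: str
    turns: list[SpeakerTurn]
    reference_summary: str   # gold multimodal summary of the conversation

def speaker_tracking_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of turns whose speaker the model identified correctly.

    An illustrative per-turn metric, not AMUSE's official scoring.
    """
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Toy usage: two turns, with the second speaker misattributed.
example = MultiSpeakerExample(
    video_id="clip_0001",
    turns=[
        SpeakerTurn("alice", 0.0, 2.4, "Did you see the chart?", ["whiteboard"]),
        SpeakerTurn("bob", 2.5, 4.0, "Yes, the trend is up.", ["whiteboard", "laptop"]),
    ],
    reference_summary="Alice and Bob discuss an upward trend on the whiteboard chart.",
)
gold = [t.speaker_id for t in example.turns]
print(speaker_tracking_accuracy(["alice", "alice"], gold))  # 0.5
```

Even this toy framing shows why the task is hard: a model must keep speaker identities consistent across time, link each turn to what is visible in the scene, and then compress all of it into a single faithful summary.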