GAMMA Lab & Apple Develop AMUSE to Advance Agentic Multimodal Reasoning

GAMMA Lab researchers collaborated with Apple Machine Learning Research to develop AMUSE (Audio-Visual Benchmark and Alignment framework for Agentic Multi-Speaker Understanding), a new benchmark designed to evaluate and improve multimodal AI systems operating in complex, real-world conversational settings.

AMUSE targets agentic multi-speaker reasoning: models must track who is speaking over time, ground the dialogue in visual context, and generate coherent multimodal summaries. The benchmark reveals significant limitations in existing multimodal large language models when they must reason across audio, vision, and language simultaneously.
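To make the task concrete, the sketch below shows one way a multi-speaker audio-visual evaluation item could be represented. The class and field names (SpeakerTurn, MultiSpeakerExample, reference_summary, and so on) are illustrative assumptions, not the published AMUSE schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SpeakerTurn:
    """One speaker's turn in the conversation (hypothetical structure)."""
    speaker_id: str      # e.g. "speaker_2"
    start_time: float    # seconds into the audio track
    end_time: float
    transcript: str      # what was said during this turn


@dataclass
class MultiSpeakerExample:
    """A single audio-visual evaluation item (illustrative, not the real schema)."""
    audio_path: str                          # multi-speaker audio recording
    video_frames: List[str]                  # paths to sampled visual frames
    turns: List[SpeakerTurn] = field(default_factory=list)
    reference_summary: str = ""              # target multimodal summary

    def speakers(self) -> List[str]:
        """Unique speakers a model must track across the conversation."""
        return sorted({t.speaker_id for t in self.turns})
```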

Alongside the benchmark, the team introduces RAFT, a data-efficient alignment framework that combines reward optimization with intrinsic multimodal self-evaluation and substantially improves performance on agentic audio-visual tasks.
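As a rough illustration of the general pattern of combining a reward signal with a model's own self-evaluation (a generic sketch under assumed names and weighting, not RAFT's actual procedure, which the announcement does not detail), an alignment loop might score candidate responses with both signals and keep the top-ranked outputs for further training:

```python
from typing import Callable, List, Tuple


def rank_candidates(
    candidates: List[str],
    external_reward: Callable[[str], float],   # e.g. a learned reward model (assumed)
    self_evaluation: Callable[[str], float],   # the model scoring its own output (assumed)
    mix: float = 0.5,                          # assumed weighting between the two signals
) -> List[Tuple[str, float]]:
    """Score candidate responses and sort them for preference-style training.

    Generic illustration of mixing reward optimization with self-evaluation;
    not the published RAFT algorithm.
    """
    scored = [
        (c, mix * external_reward(c) + (1.0 - mix) * self_evaluation(c))
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The top-ranked candidates could then serve as fine-tuning targets or preference pairs, depending on the alignment objective chosen.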

Together, AMUSE and RAFT provide a new foundation for advancing multimodal AI systems capable of sustained, structured reasoning across modalities.

Learn more:
https://machinelearning.apple.com/research/amuse
