Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

Paper Overview

Research area: CV | Authors: Michael Finkelson, Daniel Segal, Eitan Richardson | Published: 2026-06-19 | arXiv: 2506.14971

Summary (translated from Chinese)

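Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems run in speech-only pipelines and produce clean speech sequences, but they lack the ambient texture of real conversations.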

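This paper takes a different approach. The method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes the entire multi-speaker audio scene. Building on such a foundation model lets ScenA inherit its capacity for natural, non-studio audio (background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events) while adding multi-speaker control without any per-turn structure.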

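Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, the authors identify a critical obstacle to this approach: the Reference Shortcut. When trained under a standard noise schedule, the model can identify the matching reference by its acoustic similarity to the noisy target, bypassing the text prompt entirely. The authors address this with a high-noise-biased timestep distribution, which forces the model to rely on the text prompt for speaker assignment.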
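The high-noise bias can be pictured with a small sampling sketch. The source does not state which distribution ScenA uses, so everything here is an assumption for illustration: a logit-normal sampler with a positive mean shift, the convention that a noise level of 1.0 means pure noise, the linear interpolation between clean latent and noise, and the names `sample_noise_level` and `noisy_target`.

```python
import math
import random


def sample_noise_level(rng, mean=1.5, std=1.0):
    """Draw a noise level in (0, 1) from a logit-normal distribution.

    Assumed convention: 1.0 is pure noise, 0.0 is the clean target.
    A positive `mean` pushes most samples toward the high-noise end.
    The parameter values are illustrative, not taken from the paper.
    """
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def noisy_target(clean, noise, sigma):
    """Linearly interpolate between a clean latent and noise (assumed path)."""
    return [(1.0 - sigma) * c + sigma * n for c, n in zip(clean, noise)]


rng = random.Random(0)
biased = [sample_noise_level(rng) for _ in range(10_000)]
uniform = [rng.random() for _ in range(10_000)]

# Fraction of training steps in which the target is mostly noise.
frac_high_biased = sum(s > 0.5 for s in biased) / len(biased)
frac_high_uniform = sum(s > 0.5 for s in uniform) / len(uniform)
```

Under this sketch, most training steps see a target that is largely noise, so acoustic similarity between the noisy target and a reference carries little signal. That is the intuition the summary gives for why the model must fall back on the text prompt to assign speakers.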

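Evaluation on the CoVoMix2-Dialogue benchmark shows that ScenA outperforms existing systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound.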

Original Abstract

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundational model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.
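The token-sequence conditioning described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sinusoidal form of the identity code, the choice to add it to each reference frame, the ordering of references before the target, and the names `identity_encoding` and `build_token_sequence` are all hypothetical.

```python
import math


def identity_encoding(speaker_idx, dim):
    """Hypothetical sinusoidal code that depends only on the speaker slot."""
    code = []
    for i in range(dim):
        pair = i - (i % 2)  # even/odd dimensions share one frequency
        angle = (speaker_idx + 1) / (10000 ** (pair / dim))
        code.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return code


def build_token_sequence(target_tokens, reference_latents):
    """Concatenate reference latents and target tokens into one sequence.

    Each reference frame gets its speaker's identity code added, so frames
    from different references stay distinguishable after concatenation.
    Returns the sequence and a parallel list of segment ids
    (speaker index for reference frames, -1 for target tokens).
    """
    dim = len(target_tokens[0])
    sequence, segment_ids = [], []
    for idx, ref in enumerate(reference_latents):
        code = identity_encoding(idx, dim)
        for frame in ref:
            sequence.append([x + c for x, c in zip(frame, code)])
            segment_ids.append(idx)
    sequence.extend(list(tok) for tok in target_tokens)  # target left unmarked
    segment_ids.extend([-1] * len(target_tokens))
    return sequence, segment_ids


# Toy example: two reference voices (2 and 5 frames) and a 3-token target.
zero = [0.0, 0.0, 0.0, 0.0]
refs = [[zero] * 2, [zero] * 5]
target = [zero] * 3
seq, ids = build_token_sequence(target, refs)
```

In this toy setup the reference frames are all zeros, so any difference between `seq[0]` and `seq[2]` comes purely from the identity code. Which speaker says what is still left to the text prompt, as the abstract describes.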

--- *Automatically collected on 2026-06-19*

#paper #arXiv #CV #小凯
