[Paper] LeVo 2: Stable and Melodious Song Generation via Hierarchical Represen...

Paper Overview

Research area: Audio generation | Authors: Shun Lei, Huaicheng Zhang, Dapeng Wu | Published: 2026-07-01 | arXiv: 2507.00002

Abstract (translated from Chinese)

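Full-length song generation must maintain coherence and musicality, render detailed vocal and accompaniment acoustics, and follow lyrics and prompts. Existing language-model-based systems face a structural trade-off: mixed-token modeling preserves vocal-instrument coordination but obscures track-specific details, whereas dual-track prediction improves acoustics but requires longer sequences and weakens global planning. This paper presents LeVo 2, a hybrid LLM-diffusion framework for controllable full-length song generation. LeVo 2 formulates this trade-off as hierarchical modeling: LeLM first predicts mixed tokens for semantic planning, then predicts vocal and accompaniment tokens in parallel for track-specific refinement, while a diffusion-based music codec reconstructs the full-length waveform. The core contribution of this extended version is an aesthetics-guided alignment training schedule. During pre-training, an automated music aesthetics evaluation framework assigns musicality-tier conditions to large-scale data, providing a musicality prior before preference alignment. Progressive post-training applies SFT, large-scale offline DPO, and closed-loop semi-online DPO to improve generation quality, controllability, and musicality, respectively. A modular extension then trains track-specific LMs for acoustic refinement while preserving the already aligned semantic planner. The schedule separates musicality learning, controllability alignment, and acoustic refinement, mitigating optimization conflicts and the limitations of static offline preference pairs. Expert listening tests and objective evaluations show that LeVo 2 outperforms open-source baselines on six subjective dimensions and approaches leading commercial systems on several listening metrics.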
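For readers unfamiliar with DPO, which the post-training schedule above relies on, here is a minimal sketch of the generic Direct Preference Optimization loss for a single preference pair. It is the textbook formulation, not necessarily the variant used in LeVo 2: the function name `dpo_loss`, the `beta` value, and the scalar log-probability inputs are illustrative assumptions, and the abstract does not say how the offline and semi-online stages differ in practice.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trainable policy and
    a frozen reference model. All names and the beta default are
    hypothetical; they are not taken from the LeVo 2 paper.
    """
    # How much more the policy favors each sample than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(x)), computed in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy matches the reference, the loss equals log 2; it falls below that as the policy shifts probability toward the preferred sample relative to the reference.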

Original Abstract

Full-length song generation must preserve coherence and musicality, render detailed vocal and accompaniment acoustics, and follow lyrics and prompts. Existing language model-based systems face a structural trade-off: mixed-token modeling preserves vocal-instrument coordination but obscures track-specific details, whereas dual-track prediction improves acoustics but requires longer sequences and weakens global planning. We present LeVo 2, a hybrid LLM-Diffusion framework for controllable full-length song generation. LeVo 2 formulates this trade-off as hierarchical modeling: LeLM first predicts mixed tokens for semantic planning, then predicts vocal and accompaniment tokens in parallel for track-specific refinement, while a diffusion-based Music Codec reconstructs full-length waveforms. A ce...

--- *Automatically collected on 2026-07-01*

#Paper #arXiv #AudioGeneration #小凯
