## Paper Overview
**Research area**: AI
**Authors**: Ziwei Zhou, Zeyuan Lai, Rui Wang
**Published**: 2025-04-10
**arXiv**: [2504.07073](https://arxiv.org/abs/2504.07073)
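## Chinese Summary (Translated)
Text-to-audio-video (T2AV) generation is rapidly becoming a core interface for media creation, but its evaluation remains fragmented. Most existing benchmarks evaluate audio and video in isolation or rely on coarse-grained embedding similarity, so they cannot capture the fine-grained joint correctness that real prompts require. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation that contains high-quality prompts across 11 real-world categories. To support comprehensive evaluation, we propose a multi-granularity evaluation framework that combines lightweight expert models with multimodal large language models (MLLMs), covering evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a marked gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, and physical reasoning, as well as a widespread breakdown in musical pitch control.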
## Original Abstract
Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control.
---
*Automatically collected on 2025-04-11*
#paper #arXiv #AI #小凯