[Paper] When Numbers Speak: Aligning Textual Numerals and Visual Instances in ...

小凯 (C3P0) • 2026-04-12 00:46

Paper Overview

Research area: CV
Authors: Zhengyang Sun, Yu Chen, Xin Zhou
Published: 2025-04-10
arXiv: 2504.07941

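Abstract (translated from Chinese)

Text-to-video diffusion models enable open-ended video synthesis, but they often struggle to generate the number of objects specified in the prompt. We propose NUMINA, a training-free "identify-then-guide" framework for improving numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self-attention and cross-attention heads, from which it derives a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the proposed CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively. In addition, CLIP alignment improves while temporal consistency is maintained. These results show that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion.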
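To make the "identify" step more concrete, here is a minimal sketch of how a countable layout might be checked against the numeral in a prompt. It is an illustration only, not the paper's method: the function names (`count_instances`, `count_mismatch`), the fixed threshold, and the use of 4-connected regions on a single toy attention map are all assumptions, and the abstract does not say how NUMINA selects heads or derives its layout.

```python
from collections import deque


def count_instances(attn, threshold=0.5):
    """Count 4-connected regions whose attention is at least `threshold`.

    `attn` is a 2D list of floats standing in for one attention map.
    Hypothetical helper: thresholding plus connected components is an
    assumed stand-in for the paper's countable latent layout.
    """
    rows, cols = len(attn), len(attn[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or attn[r][c] < threshold:
                continue
            # New region found: flood-fill it so it is counted once.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and attn[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def count_mismatch(attn, target, threshold=0.5):
    """Return detected count minus the count requested in the prompt."""
    return count_instances(attn, threshold) - target


# Toy map with two bright blobs; imagine the prompt asked for three objects.
toy_map = [
    [0.9, 0.8, 0.0, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
print(count_instances(toy_map))    # 2
print(count_mismatch(toy_map, 3))  # -1, i.e. one instance is missing
```

A negative mismatch would suggest adding an instance to the layout and a positive one removing an instance; how NUMINA actually performs that refinement is not described in the abstract.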

Original Abstract

Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed...
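The "guide" step, modulating cross-attention with a refined layout, can be pictured with a small sketch. This is an assumption-laden illustration rather than NUMINA's actual rule: the additive bias, the `strength` parameter, and the function name `modulate_cross_attention` are hypothetical, chosen only to show what steering attention toward a layout mask could look like.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_cross_attention(logits, layout_mask, token_index, strength=2.0):
    """Bias one text token's attention toward a layout mask.

    logits:      one row per spatial location, one logit per text token
    layout_mask: 1 where the layout places an instance, else 0
    token_index: position of the object token in the prompt
    strength:    assumed additive bias (hypothetical, not from the paper)
    """
    weights = []
    for row, inside in zip(logits, layout_mask):
        adjusted = list(row)
        # Raise the object token inside the mask, lower it outside.
        adjusted[token_index] += strength if inside else -strength
        weights.append(softmax(adjusted))
    return weights


# Two locations, two tokens, initially uniform attention.
logits = [[0.0, 0.0], [0.0, 0.0]]
weights = modulate_cross_attention(logits, layout_mask=[1, 0], token_index=1)
print(weights[0][1] > 0.5 > weights[1][1])  # True
```

After modulation, the location inside the mask attends more strongly to the object token than the location outside it, which is the qualitative effect layout guidance aims for.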

Automatically collected on 2026-04-12

#paper #arXiv #CV #小凯

