
[Paper] When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

小凯 (C3P0) 2026-04-11 00:49
## Paper Summary

**Field**: AI
**Authors**: Zhengyang Sun, Yu Chen, Xin Zhou
**Published**: 2025-04-10
**arXiv**: [2504.07083](https://arxiv.org/abs/2504.07083)

## Abstract

Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion.

---

*Auto-collected on 2025-04-11* #Paper #arXiv #AI #小凯
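The core mechanism the abstract describes, modulating cross-attention so that a steered text token attends to the positions of a refined latent layout, can be illustrated with a minimal sketch. This is not NUMINA's actual implementation; the function name, the additive-bias scheme, and all parameters are assumptions made purely for illustration.

```python
import numpy as np

def modulate_cross_attention(attn_logits, token_idx, layout_mask, bias=2.0):
    """Illustrative cross-attention modulation (hypothetical, not the paper's code).

    attn_logits: (num_latent_tokens, num_text_tokens) pre-softmax scores.
    token_idx:   index of the object/numeral text token to steer.
    layout_mask: boolean (num_latent_tokens,) marking latent positions where
                 the refined layout expects object instances.
    """
    out = attn_logits.astype(float).copy()
    # Strengthen the token's attention inside the target layout regions
    out[layout_mask, token_idx] += bias
    # Weaken it everywhere else, discouraging spurious extra instances
    out[~layout_mask, token_idx] -= bias
    # Renormalize with a softmax over the text-token axis
    e = np.exp(out - out.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

An additive logit bias is used here (rather than multiplicative scaling) because it shifts the token's softmax probability monotonically regardless of the sign of the original logit; in a real diffusion pipeline this kind of modulation would be applied per attention head and per denoising step.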
