Paper Overview
Research area: CV
Authors: Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, et al.
Published: 2026-04-30
arXiv: 2604.28185
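Summary
Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, but they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. The authors argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visual content grounded in structure, dynamics, domain knowledge, and causal relations.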
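To frame this shift, the paper introduces a five-level taxonomy:
1. Atomic Generation
2. Conditional Generation
3. In-Context Generation
4. Agentic Generation
5. World-Modeling Generation
The levels progress from passive renderers to interactive, agentic, world-aware generators. The paper also analyzes key technical drivers: flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration.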
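The ordering of the five levels can be sketched as a small ordered enum, which may be handy when tagging models or benchmark tasks by level. This is only an illustration: the names `GenerationLevel` and `meets_level` are hypothetical and not from the paper, and treating a higher level as satisfying a lower-level requirement is an assumption the abstract does not state.

```python
from enum import IntEnum


class GenerationLevel(IntEnum):
    """The five level names from the abstract, numbered in the order listed."""
    ATOMIC = 1
    CONDITIONAL = 2
    IN_CONTEXT = 3
    AGENTIC = 4
    WORLD_MODELING = 5


def meets_level(system_level: GenerationLevel, required: GenerationLevel) -> bool:
    """Return True if a system tagged at `system_level` reaches `required`.

    Assumption (not from the paper): the levels are cumulative, so a
    higher level counts as satisfying any lower one.
    """
    return system_level >= required


if __name__ == "__main__":
    # Hypothetical tagging of a system as Agentic Generation.
    tagged = GenerationLevel.AGENTIC
    print(meets_level(tagged, GenerationLevel.CONDITIONAL))
    print(meets_level(tagged, GenerationLevel.WORLD_MODELING))
```

An `IntEnum` is used here so that levels compare with the ordinary `<` and `>=` operators; a plain `Enum` would need an explicit ordering method.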
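The authors note that current evaluation often overstates progress by emphasizing perceptual quality while overlooking structural, temporal, and causal failures. By combining benchmark evaluation, in-the-wild stress tests, and expert-constrained case studies, the roadmap offers a capability-centered view for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.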
Original Abstract
Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation m...
--- *Automatically collected on 2026-05-03*
#paper #arXiv #CV #小凯