IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Paper Overview

Field: CV Authors: Zixuan Li, Haokun Lin, Yicheng Xiao Published: 2026-06-24 arXiv: 2506.14664

Abstract (translated from the Chinese)

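Unified multimodal large language models (MLLMs) have made substantial progress in text-to-image generation quality, but they still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to the entanglement of structural planning and appearance rendering within a single conditioning stream. To address this, we propose Implicit Visual Chain-of-Thought (IV-CoT), a latent visual reasoning framework for query-conditioned image generation. IV-CoT decomposes the visual conditioning queries into a structural-to-semantic cascade: structural queries first form a latent visual plan, and semantic queries then render appearance conditioned on that plan. To guide the structural queries, we introduce training-time-only sketch supervision, which encourages them to capture structure from sketches without requiring sketch extraction or intermediate decoding at inference. IV-CoT performs implicit CoT reasoning in a single forward pass and achieves strong results on GenEval and T2I-CompBench. Visualizations and analysis show that the learned structural and semantic queries play complementary roles in structure-aware generation.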
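The structural-to-semantic cascade described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the abstract does not specify the attention mechanism, the dimensions, or how the two query sets reach the generator, so the parameter-free cross-attention, the residual updates, the shapes, and the final concatenation below are all assumptions chosen only to show the ordering (plan first, appearance second, one forward pass). The training-time sketch supervision is omitted. The names `cross_attend` and `cascade` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    # Single-head scaled dot-product cross-attention without learned
    # projections (an assumption made to keep the sketch short).
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context


def cascade(text_tokens, structural_queries, semantic_queries):
    # Stage 1: structural queries read the prompt and form a latent plan.
    plan = structural_queries + cross_attend(structural_queries, text_tokens)
    # Stage 2: semantic queries read the prompt AND the latent plan.
    context = np.concatenate([text_tokens, plan], axis=0)
    appearance = semantic_queries + cross_attend(semantic_queries, context)
    # Assumption: both query sets are concatenated as the conditioning signal.
    return np.concatenate([plan, appearance], axis=0)


rng = np.random.default_rng(0)
text = rng.normal(size=(12, 64))  # 12 prompt tokens, width 64 (made-up sizes)
s_q = rng.normal(size=(8, 64))    # 8 structural queries
a_q = rng.normal(size=(16, 64))   # 16 semantic queries
cond = cascade(text, s_q, a_q)
print(cond.shape)  # (24, 64)
```

In this sketch the plan rows depend only on the prompt and the structural queries, while the appearance rows depend on the plan as well, which mirrors the one-directional cascade the abstract describes.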

Original Abstract

Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to the entanglement of structural planning and appearance rendering within a single conditioning stream. To address this issue, we propose Implicit Visual Chain-of-Thought (IV-CoT), a latent visual reasoning framework for query-conditioned image generation. IV-CoT decomposes the visual conditioning queries into a structural-to-semantic cascade, where structural queries first form a latent visual plan and semantic queries then render appearance conditioned on this plan. To guide the structural quer...

--- *Automatically collected on 2026-06-25*

#paper #arXiv #CV #小凯
