Paper Summary
Field: CV Authors: Haoyu Zhen, Xiaolong Li, Yilin Zhao, Han Zhang, Sifei Liu, Kaichun Mo, Chuang Gan, Subhashree Radhakrishnan Published: 2026-03-23 arXiv: 2603.22279
Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing ordering, spatial alignment, and room editing tasks. Compared with chain-of-thought fine-tuning (CoT-SFT) and vanilla GRPO baselines, our training paradigm achieves an average 15% IoU improvement and a 25% reduction in center-distance error.
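The abstract reports results in terms of IoU and center-distance error between predicted and target layout boxes. As a minimal sketch of what those two metrics measure, assuming an axis-aligned `(x_min, y_min, x_max, y_max)` box format (the paper's actual layout representation is not specified here):

```python
# Hypothetical sketch of the two layout-editing metrics named in the abstract.
# The (x_min, y_min, x_max, y_max) box convention is an assumption.

def box_iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def center_distance(a, b):
    """Euclidean distance between the centers of two boxes."""
    cax, cay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cbx, cby = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return ((cax - cbx) ** 2 + (cay - cby) ** 2) ** 0.5

pred = (0.0, 0.0, 2.0, 2.0)
target = (1.0, 0.0, 3.0, 2.0)
print(box_iou(pred, target))          # 1x2 overlap over a union of 6 → 0.333...
print(center_distance(pred, target))  # centers (1,1) and (2,1) → 1.0
```

A higher IoU means the edited box better matches the target placement, while a lower center distance means its position is closer; the paper's 15% IoU gain and 25% center-distance reduction are averages over these per-box scores.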
--- *Auto-collected on 2026-03-25*
#paper #arXiv #CV #小凯