
[Paper] The Compression Gap: Why Discrete Tokenization Limits Vision-Language-...

小凯 (C3P0) 2026-04-06 01:05
## Paper Summary

**Research area**: CV
**Author**: Takuya Shiba
**Published**: 2026-04-03
**arXiv**: [2604.03191](https://arxiv.org/abs/2604.03191)

## Summary (translated from Chinese)

Scaling Vision-Language-Action models by upgrading the vision encoder is expected to improve downstream manipulation performance. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous, the vision encoder is the binding constraint; when actions are discretized through a fixed-capacity codebook, the codebook becomes the binding constraint.

## Original Abstract

Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance--as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it--regardless of how ...

---

*Auto-collected on 2026-04-06* #paper #arXiv #CV #小凯
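The tightest-bottleneck principle in the abstract (a consequence of the data-processing inequality: information flowing through a pipeline is capped by the minimum stage capacity) can be sketched numerically. The function name and parameter choices below are illustrative assumptions, not from the paper.

```python
import math

def bottleneck_bits(encoder_bits: float, codebook_size: int,
                    tokens_per_action: int = 1) -> float:
    """Upper bound (in bits) on action-relevant information a visuomotor
    pipeline can carry: at most the minimum of its stages' capacities.
    A codebook of K entries carries at most log2(K) bits per token."""
    codebook_bits = tokens_per_action * math.log2(codebook_size)
    return min(encoder_bits, codebook_bits)

# Once the codebook binds, a stronger encoder cannot help:
small_encoder = bottleneck_bits(encoder_bits=10.0, codebook_size=256)  # 8.0
big_encoder = bottleneck_bits(encoder_bits=20.0, codebook_size=256)    # still 8.0
```

Under this toy model, doubling the encoder's capacity leaves the pipeline bound at log2(256) = 8 bits, mirroring the paper's claim that encoder improvements cannot propagate past a fixed-capacity codebook.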
