[Paper] ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

Paper Overview

Field: CV
Authors: Xumin Yu, Zuyan Liu, Zhenyu Yang, Yuhao Dong, Shengsheng Qian, Jiwen Lu, Han Hu, Yongming Rao
Published: 2026-06-25
arXiv: 2606.27313

Abstract (translated from the Chinese)

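Building a unified representation for text and vision is a natural goal, since it enables simpler multimodal modeling and more efficient training. However, representing images as discrete signals in the same way as text inevitably introduces severe information loss. Existing work struggles to balance low-level details and high-level semantics in discrete representations: reconstruction-oriented representations often lack semantic information, whereas semantically stronger features typically suffer severe loss of detail. We present ViQ, a visual quantized representation framework designed to balance semantics and details in discrete representations while supporting native-resolution inputs, so that it can serve as a unified, general discrete representation for arbitrary visual inputs. Our method splits quantization learning into two stages: text-aligned pre-training and feature discretization. Through text-aligned pre-training, we use semantically rich supervision from a pre-trained language model to strengthen the visual encoder and enable it to handle native-resolution visual inputs. In the discretization stage, we propose a proximal representation learning strategy that progressively compresses the feature space, and a position-aware head quantization mechanism that flexibly handles arbitrary resolutions. Extensive experiments on multimodal tasks show that ViQ is competitive with state-of-the-art multimodal visual encoders that use continuous high-dimensional visual features, while maintaining high accuracy in low-level reconstruction. We also show that multimodal training with visual quantized representations substantially improves efficiency, achieving speedups of 20% to 70% across different base LLMs and training recipes.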
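
To make the idea of feature discretization concrete, here is a minimal sketch of generic head-wise (product-style) vector quantization. It is not ViQ's mechanism: the abstract does not describe how the position-aware quantizer or the proximal learning strategy work, so neither is modeled. The function names, the nearest-neighbor lookup, and the toy codebook values are all assumptions chosen for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(chunk: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codeword closest to `chunk` (squared L2 distance)."""
    def sq_dist(code: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(chunk, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def multi_head_quantize(feature: Vector,
                        codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Split `feature` into one equal chunk per head and quantize each chunk
    against that head's own codebook. Returns one discrete index per head."""
    num_heads = len(codebooks)
    if len(feature) % num_heads != 0:
        raise ValueError("feature length must be divisible by the number of heads")
    head_dim = len(feature) // num_heads
    return [
        nearest_code(feature[h * head_dim:(h + 1) * head_dim], codebooks[h])
        for h in range(num_heads)
    ]


def dequantize(indices: Sequence[int],
               codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Rebuild an approximate feature by concatenating the selected codewords."""
    out: List[float] = []
    for idx, codebook in zip(indices, codebooks):
        out.extend(codebook[idx])
    return out


# Hypothetical toy codebooks: 2 heads, 2 codewords per head, 2 dims per codeword.
TOY_CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[-1.0, 0.0], [0.0, 2.0]],
]

tokens = multi_head_quantize([0.9, 0.8, 0.1, 1.7], TOY_CODEBOOKS)
approx = dequantize(tokens, TOY_CODEBOOKS)
```

In this toy setup each continuous feature becomes a short tuple of integer indices, which is what lets an image patch be treated like a text token. The gap between the input feature and `approx` is the information loss the abstract refers to.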

Original Abstract

A unified representation for text and vision is a natural pursuit, as it enables simpler multimodal modeling and more efficient training. However, representing images as discrete signals in the same way as text inevitably introduces severe information loss. Existing work struggles to balance low-level details and high-level semantics in discrete representations: reconstruction-oriented representations often lack semantic information, whereas semantically stronger features typically suffer from severe loss of detail. We present ViQ, a Visual Quantized Representations framework, which is designed to balance semantics and details in discrete representations while supporting inputs at native resolutions, thereby enabling it to serve as a unified and general discrete representation for arbitrar...

--- *Automatically collected on 2026-06-28*

#Paper #arXiv #CV #小凯
