[Paper] ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Re...

Paper Overview

Research area: CV
Authors: Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu, Feng Li, Jingxiang Sun, Chaorui Deng, Zilong Chen, Yunpeng Chen, Kaibin Tian, Matthew Gwilliam, Hao Chen, Danhui Guan, Kun Xu, Weilin Huang, Zuxuan Wu, Haoqi Fan, Yu-Gang Jiang, Zhenheng Yang
Published: 2026-06-09
arXiv: 2606.11188

Summary

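ARM is an autoregressive multimodal model built on discrete representations. It unifies image understanding, generation, and editing within a next-token prediction framework. The work has three core parts: 1) training a discrete semantic visual tokenizer that maps images into compact token sequences; 2) jointly training a 7B-parameter autoregressive model on text and image tokens; 3) using RL to optimize text-to-image generation and instruction-guided editing. The results indicate that RL not only improves the target tasks (WISE rises from 0.50 to 0.56) but also induces cross-task synergy.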
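To make the idea of "text and image tokens in one next-token prediction framework" concrete, here is a minimal sketch of one common way to place discrete image codes and text ids in a shared vocabulary. It is an illustration only, not the paper's implementation: the vocabulary sizes, the `BOI`/`EOI` marker ids, and all function names are hypothetical, and the abstract does not say how ARM lays out its sequences.

```python
from typing import List, Tuple

# Hypothetical sizes and special ids, chosen only for illustration.
TEXT_VOCAB_SIZE = 1000
IMAGE_CODEBOOK_SIZE = 512
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # begin-of-image marker (assumed)
EOI = BOI + 1                                # end-of-image marker (assumed)


def image_to_shared_ids(codes: List[int]) -> List[int]:
    """Shift codebook indices past the text vocabulary and wrap them in markers."""
    for code in codes:
        if not 0 <= code < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"code {code} is outside the codebook")
    return [BOI] + [TEXT_VOCAB_SIZE + code for code in codes] + [EOI]


def build_sequence(text_ids: List[int], image_codes: List[int]) -> List[int]:
    """Concatenate a text prompt and an image into one flat token sequence."""
    for token in text_ids:
        if not 0 <= token < TEXT_VOCAB_SIZE:
            raise ValueError(f"text id {token} is outside the text vocabulary")
    return list(text_ids) + image_to_shared_ids(image_codes)


def next_token_pairs(seq: List[int]) -> List[Tuple[List[int], int]]:
    """List the (context, target) pairs a next-token objective would train on."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


if __name__ == "__main__":
    sequence = build_sequence([5, 7], [0, 3])
    for context, target in next_token_pairs(sequence):
        print(context, "->", target)
```

Because every position is just an id in one vocabulary, the same prediction objective covers both text targets and image targets in this sketch; which tokens come first would depend on the task (for example, image first for understanding, text first for generation).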

Original Abstract

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided ...

--- *Automatically collected on 2026-06-11*

#Paper #arXiv #CV #小凯

