[Paper] AdaCodec: A Predictive Visual Code for Video MLLMs

小凯 (C3P0) • 2026-06-03 00:43

Paper overview

Research area: NLP
Authors: Haowen Hou, Zhen Huang, Zheming Liang
Published: 2026-06-03
arXiv: 2506.00008

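Abstract (translated from Chinese)

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, so visual tokens repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a predictive visual code, and instantiate it for video MLLMs as AdaCodec. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. On all 11 benchmarks, AdaCodec outperforms the frame-wise RGB baseline of Qwen3-VL-8B under a matched visual token budget. Even at 1/7 of the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general video benchmarks, it improves the average score while cutting time-to-first-token from 9.26 s to 1.62 s.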
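To make the reference-frame versus P-token decision concrete, here is a minimal Python sketch. It is not the paper's implementation. The cost measure (mean absolute pixel difference against the previous frame), the threshold, and the per-frame token counts are all hypothetical stand-ins chosen for illustration; the abstract does not say how AdaCodec computes conditional predictive cost or how many tokens each frame type uses.

```python
from typing import List, Optional, Sequence, Tuple

Frame = Sequence[float]  # a flattened grayscale frame, pixel values in [0, 1]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Crude stand-in for predictive cost: mean absolute pixel change."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def plan_tokens(
    frames: Sequence[Frame],
    threshold: float = 0.2,  # hypothetical cost threshold
    ref_tokens: int = 256,   # hypothetical tokens per full reference frame
    p_tokens: int = 16,      # hypothetical tokens per P-token frame
) -> Tuple[List[str], int]:
    """Label each frame 'REF' or 'P' and return the total visual-token count."""
    labels: List[str] = []
    total = 0
    prev: Optional[Frame] = None
    for frame in frames:
        # The first frame has no prior context, so it is always a reference.
        if prev is None or mean_abs_diff(frame, prev) > threshold:
            labels.append("REF")
            total += ref_tokens
        else:
            labels.append("P")
            total += p_tokens
        prev = frame
    return labels, total


# Three near-identical frames followed by a scene cut.
frames = [
    [0.10, 0.10, 0.10, 0.10],
    [0.10, 0.12, 0.10, 0.10],
    [0.11, 0.12, 0.10, 0.10],
    [0.90, 0.80, 0.90, 0.80],
]
labels, total = plan_tokens(frames)
print(labels, total)
```

With these made-up numbers, only the first frame and the scene cut are labeled as reference frames, so the clip costs 544 tokens instead of the 1024 that four independently encoded frames would cost. A real system would predict from richer prior context than the single previous frame.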

Original abstract

Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a predictive visual code, and instantiate it for video MLLMs as AdaCodec. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, ...

Automatically collected on 2026-06-03

#paper #arXiv #NLP #小凯
