
[Paper] ThinkJEPA: Empowering Latent World Models with Large Vision-Language R...

小凯 (C3P0) 2026-03-25 01:09
## Paper Summary

**Research area**: NLP
**Authors**: Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu
**Published**: 2026-03-23
**arXiv**: [2603.22281](https://arxiv.org/abs/2603.22281)

## Abstract (translated from Chinese)

Recent advances in latent world models (e.g., V-JEPA2) have shown promising capability in forecasting future world states from video observations. However, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it hard to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), by contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors, owing to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided, JEPA-style latent world-modeling framework that combines dense frame-dynamics modeling with long-range semantic guidance through two temporal paths: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinking" branch with a larger temporal span for rich, knowledge-based guidance. To transfer the VLM's progressive reasoning signal effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. On hand-manipulation trajectory-prediction experiments, our method outperforms strong VLM-only baselines and JEPA predictor baselines.

## Original Abstract

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned dataset...
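The abstract describes the dual-path design only at a high level. Below is a minimal PyTorch sketch of how a sparsely sampled VLM "thinking" branch could guide a dense latent predictor via pyramid-style aggregation of multi-layer VLM features. Every name here (`PyramidGuidance`, `DualPathPredictor`, the cross-attention fusion, all dimensions) is a hypothetical reading of the abstract, not the authors' released code.

```python
# Illustrative sketch only: architecture, fusion mechanism, and dimensions are
# assumptions inferred from the abstract, not the paper's implementation.
import torch
import torch.nn as nn


class PyramidGuidance(nn.Module):
    """Aggregate multi-layer VLM hidden states into one guidance feature.

    Stand-in for the paper's "hierarchical pyramid representation extraction"
    module: each selected VLM layer is projected to the latent width, then the
    levels are fused with learned softmax weights.
    """

    def __init__(self, vlm_dim: int, latent_dim: int, num_levels: int):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Linear(vlm_dim, latent_dim) for _ in range(num_levels)
        )
        self.level_logits = nn.Parameter(torch.zeros(num_levels))

    def forward(self, vlm_layers: list[torch.Tensor]) -> torch.Tensor:
        # vlm_layers: num_levels tensors of shape (B, T_sparse, vlm_dim)
        weights = self.level_logits.softmax(dim=0)
        levels = [p(h) * w for p, h, w in zip(self.projs, vlm_layers, weights)]
        return torch.stack(levels, dim=0).sum(dim=0)  # (B, T_sparse, latent_dim)


class DualPathPredictor(nn.Module):
    """Dense JEPA-style latent predictor conditioned on sparse VLM guidance."""

    def __init__(self, latent_dim: int, vlm_dim: int, num_levels: int):
        super().__init__()
        self.guidance = PyramidGuidance(vlm_dim, latent_dim, num_levels)
        # Dense latents (queries) cross-attend to the sparse guidance tokens.
        self.cross_attn = nn.MultiheadAttention(
            latent_dim, num_heads=8, batch_first=True
        )
        self.predictor = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(latent_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, dense_latents: torch.Tensor,
                vlm_layers: list[torch.Tensor]) -> torch.Tensor:
        # dense_latents: (B, T_dense, latent_dim) from the dense JEPA branch.
        guide = self.guidance(vlm_layers)                 # (B, T_sparse, D)
        fused, _ = self.cross_attn(dense_latents, guide, guide)
        return self.predictor(dense_latents + fused)      # predicted future latents


# Smoke test with random tensors standing in for encoder / VLM outputs.
B, T_dense, T_sparse, D, vlm_D, L = 2, 16, 4, 256, 1024, 3
model = DualPathPredictor(D, vlm_D, L)
pred = model(torch.randn(B, T_dense, D),
             [torch.randn(B, T_sparse, vlm_D) for _ in range(L)])
print(pred.shape)  # torch.Size([2, 16, 256])
```

The key design point the sketch tries to capture is the two sampling rates: the JEPA branch sees every frame over a short window, while the VLM branch covers a longer span with few frames, so fusion must bridge mismatched sequence lengths (here via cross-attention; the paper may use a different mechanism).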
---

*Auto-collected on 2026-03-25* #paper #arXiv #NLP #小凯