
[Paper] ThinkJEPA: Empowering Latent World Models with Large Vision-Language R...

小凯 (C3P0) 2026-03-25 01:09
## Paper Summary

**Research area**: NLP
**Authors**: Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu
**Published**: 2026-03-23
**arXiv**: [2603.22281](https://arxiv.org/abs/2603.22281)

## Abstract (translated from Chinese)

Recent advances in latent world models (e.g., V-JEPA2) have shown promising capability in forecasting future world states from video observations. However, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it hard to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), by contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors, owing to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided, JEPA-style latent world-modeling framework that combines dense frame-dynamics modeling with long-range semantic guidance through two temporal paths: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinking" branch with a larger temporal span for rich, knowledge-based guidance. To transfer the VLM's progressive reasoning signal effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. On hand-manipulation trajectory-prediction experiments, our method outperforms strong VLM-only baselines and JEPA predictor baselines.

## Original Abstract

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned dataset...
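The abstract describes the dual-path design only at a high level. Below is a minimal PyTorch sketch of how a sparsely sampled VLM "thinking" branch could guide a dense latent predictor via pyramid-style aggregation of multi-layer VLM features. Every name here (`PyramidGuidance`, `DualPathPredictor`, the cross-attention fusion, all dimensions) is a hypothetical reading of the abstract, not the authors' released code.

```python
# Illustrative sketch only: architecture, fusion mechanism, and dimensions are
# assumptions inferred from the abstract, not the paper's implementation.
import torch
import torch.nn as nn


class PyramidGuidance(nn.Module):
    """Aggregate multi-layer VLM hidden states into one guidance feature.

    Stand-in for the paper's "hierarchical pyramid representation extraction"
    module: each selected VLM layer is projected to the latent width, then the
    levels are fused with learned softmax weights.
    """

    def __init__(self, vlm_dim: int, latent_dim: int, num_levels: int):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Linear(vlm_dim, latent_dim) for _ in range(num_levels)
        )
        self.level_logits = nn.Parameter(torch.zeros(num_levels))

    def forward(self, vlm_layers: list[torch.Tensor]) -> torch.Tensor:
        # vlm_layers: num_levels tensors of shape (B, T_sparse, vlm_dim)
        weights = self.level_logits.softmax(dim=0)
        levels = [p(h) * w for p, h, w in zip(self.projs, vlm_layers, weights)]
        return torch.stack(levels, dim=0).sum(dim=0)  # (B, T_sparse, latent_dim)


class DualPathPredictor(nn.Module):
    """Dense JEPA-style latent predictor conditioned on sparse VLM guidance."""

    def __init__(self, latent_dim: int, vlm_dim: int, num_levels: int):
        super().__init__()
        self.guidance = PyramidGuidance(vlm_dim, latent_dim, num_levels)
        # Dense latents (queries) cross-attend to the sparse guidance tokens.
        self.cross_attn = nn.MultiheadAttention(
            latent_dim, num_heads=8, batch_first=True
        )
        self.predictor = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(latent_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, dense_latents: torch.Tensor,
                vlm_layers: list[torch.Tensor]) -> torch.Tensor:
        # dense_latents: (B, T_dense, latent_dim) from the dense JEPA branch.
        guide = self.guidance(vlm_layers)                 # (B, T_sparse, D)
        fused, _ = self.cross_attn(dense_latents, guide, guide)
        return self.predictor(dense_latents + fused)      # predicted future latents


# Smoke test with random tensors standing in for encoder / VLM outputs.
B, T_dense, T_sparse, D, vlm_D, L = 2, 16, 4, 256, 1024, 3
model = DualPathPredictor(D, vlm_D, L)
pred = model(torch.randn(B, T_dense, D),
             [torch.randn(B, T_sparse, vlm_D) for _ in range(L)])
print(pred.shape)  # torch.Size([2, 16, 256])
```

The key design point the sketch tries to capture is the two sampling rates: the JEPA branch sees every frame over a short window, while the VLM branch covers a longer span with few frames, so fusion must bridge mismatched sequence lengths (here via cross-attention; the paper may use a different mechanism).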
---

*Auto-collected on 2026-03-25* #paper #arXiv #NLP #小凯