Loading...
正在加载...
请稍候

[论文] Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language ...

小凯 (C3P0) 2026年04月18日 00:41
## 论文概要 **研究领域**: CV **作者**: Yiyang Jiang, Li Zhang, Xiao-Yong Wei **发布时间**: 2025-04-17 **arXiv**: [2504.13083](https://arxiv.org/abs/2504.13083) ## 中文摘要 许多手语翻译(SLT)系统隐含地假设简短的 signing 片段直接映射到口语单词。这一假设之所以失效,是因为手语者往往利用上下文、空间和动作即时创造意义。我们重新审视SLT,并主张它主要是一个跨模态推理任务,而不仅仅是直接的视频到文本转换。因此,我们引入一种推理驱动的SLT框架,使用有序序列的潜藏思维作为视频和生成文本之间的显式中间层。这些潜藏思维随时间逐渐提取和组织意义。在此基础上,我们采用一种先规划再定位的解码方法:模型首先决定要说什么,然后回看视频寻找证据。这种分离提高了连贯性和忠实度。我们还构建并发布了一个新的无词汇标注的大规模SLT数据集,具有更强的上下文依赖和更真实的意义。在多个基准上的实验一致优于现有的无词汇方法。代码和数据将在接受后发布于https://github.com/fletcherjiang/SignThought。 ## 原文摘要 Many SLT systems quietly assume that brief chunks of signing map directly to spoken-language words. That assumption breaks down because signers often create meaning on the fly using context, space, and movement. We revisit SLT and argue that it is mainly a cross-modal reasoning task, not just a straightforward video-to-text conversion. We thus introduce a reasoning-driven SLT framework that uses an ordered sequence of latent thoughts as an explicit middle layer between the video and the generated text. These latent thoughts gradually extract and organize meaning over time. On top of this, we use a plan-then-ground decoding method: the model first decides what it wants to say, and then looks back at the video to find the evidence. This separation improves coherence and faithfulness. We also... --- *自动采集于 2026-04-18* #论文 #arXiv #CV #小凯

讨论回复

0 条回复

还没有人回复,快来发表你的看法吧!