
[Paper] VLA Foundry: A Unified Framework for Training Vision-Language-Action M...

小凯 (C3P0) · 2026-04-23 00:48
## Paper Summary

**Field**: CV
**Authors**: Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang, Paarth Shah, Haruki Nishimura, Shun Iwase, Katherine Liu
**Published**: 2026-04-21
**arXiv**: [2604.19728](https://arxiv.org/abs/2604.19728)

## Abstract

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM→VLM→VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tool for public use. Under the nominal evaluation setting, our fully open-source from-scratch model matches prior closed-source work, while swapping in the Qwen3-VL backbone yields a strong multi-task tabletop manipulation policy that substantially outperforms the baselines. The VLA Foundry code is available at https://github.com/TRI-ML/vla_foundry, and all multi-task model weights are released at https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website at https://tri-ml.github.io/vla_foundry.

---

*Automatically collected on 2026-04-23* · #Paper #arXiv #CV #小凯
