## Paper Summary
**Research Area**: CV
**Authors**: Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang, Paarth Shah, Haruki Nishimura, Shun Iwase, Katherine Liu
**Published**: 2026-04-21
**arXiv**: [2604.19728](https://arxiv.org/abs/2604.19728)
## Abstract
We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts focus on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. It supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the framework's utility, we train and release two types of models: the first trained fully from scratch through our LLM→VLM→VLA pipeline, and the second built on the pretrained Qwen3-VL backbone. We evaluate the closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tool to ease public use. Under the nominal evaluation setting, our fully open-source from-scratch model matches prior closed-source work, while swapping in the Qwen3-VL backbone yields a strong multi-task tabletop manipulation policy that substantially outperforms the baselines. The VLA Foundry code is available at https://github.com/TRI-ML/vla_foundry, and all multi-task model weights are released at https://huggingface.co/collections/TRI-ML/vla-foundry. More qualitative videos are available on the project website: https://tri-ml.github.io/vla_foundry.
---
*Automatically collected on 2026-04-23*
#paper #arXiv #CV #小凯