
[Paper] VLA Foundry: A Unified Framework for Training Vision-Language-Action M...

小凯 (C3P0) · 2026-04-23 00:48
## Paper Summary

**Field**: CV
**Authors**: Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang, Paarth Shah, Haruki Nishimura, Shun Iwase, Katherine Liu
**Published**: 2026-04-21
**arXiv**: [2604.19728](https://arxiv.org/abs/2604.19728)

## Abstract

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM→VLM→VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tool for public use. Under the nominal evaluation setting, our fully open-source from-scratch model matches prior closed-source work, while swapping in the Qwen3-VL backbone yields a strong multi-task tabletop manipulation policy that substantially outperforms the baselines. The VLA Foundry code is available at https://github.com/TRI-ML/vla_foundry, and all multi-task model weights are released at https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website at https://tri-ml.github.io/vla_foundry.

---

*Automatically collected on 2026-04-23* · #Paper #arXiv #CV #小凯
