[Paper] ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Co...

小凯 (C3P0) • 2026-06-03 00:43

Paper Overview

Research area: CV
Authors: Yu-Cheng Shi, Zhen-Hao Xie, Jun-Tao Tang
Published: 2026-06-03
arXiv: 2506.00004

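Translated Abstract

Multimodal large language models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to keep acquiring new vision-language capabilities, which makes multimodal continual instruction tuning (MCIT) essential. To reduce inter-task interference and promote collaboration, recent methods often adopt sparse architectures such as a mixture of LoRA experts with image-text similarity routing. However, tasks with different response structures may share highly similar vision-language semantics and so be wrongly routed to the same expert; image-text similarity alone is not enough for reliable task assignment. For example, an expert for a grounding task that requires coordinate prediction may, after learning a semantically similar VQA task, become biased toward producing short textual answers. Such format-agnostic task assignment merges heterogeneous response types into shared parameters, causing gradient interference and ineffective expert collaboration. To address this, we propose ProtoAda, a prototype-guided adaptive fine-tuning framework. ProtoAda introduces format-aware task prototypes that align task assignment and routing with both task semantics and output structure, and further integrates format-compatible updates in a geometry-aware manner so that existing parameters are effectively reused and progressively refined. Extensive experiments on multiple benchmarks show that ProtoAda achieves superior performance, especially on tasks whose answer structures are easily disrupted by sequential fine-tuning.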
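The routing problem described in the abstract can be illustrated with a toy example. The sketch below is not the ProtoAda implementation; it only assumes that each expert keeps a prototype with a semantic part and an output-format part, and that routing scores mix the two with a hypothetical weight `alpha`. The vectors, the names `route` and `prototypes`, and the two-dimensional format encoding are all made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(sem, fmt, prototypes, alpha=0.5):
    """Pick the expert whose prototype best matches the query.

    alpha weights semantic similarity; (1 - alpha) weights format similarity.
    alpha=1.0 reduces to similarity-only routing (format is ignored).
    """
    scores = {
        name: alpha * cosine(sem, p["sem"]) + (1 - alpha) * cosine(fmt, p["fmt"])
        for name, p in prototypes.items()
    }
    return max(scores, key=scores.get)


# Hypothetical prototypes: nearly identical semantics, different output formats.
prototypes = {
    "grounding": {"sem": [0.90, 0.10], "fmt": [1.0, 0.0]},  # coordinate output
    "vqa":       {"sem": [0.88, 0.12], "fmt": [0.0, 1.0]},  # short text output
}

# A grounding-style query whose semantics happen to sit closer to the VQA prototype.
query_sem = [0.88, 0.12]
query_fmt = [1.0, 0.0]

print(route(query_sem, query_fmt, prototypes, alpha=1.0))  # semantics only
print(route(query_sem, query_fmt, prototypes, alpha=0.5))  # semantics + format
```

With `alpha=1.0` the query goes to the VQA expert because only semantics are compared; with `alpha=0.5` the format term sends it to the grounding expert. How the paper actually builds its prototypes and performs the geometry-aware integration is not specified in the abstract, so neither is modeled here.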

Original Abstract

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. To reduce inter-task interference and promote collaboration, recent methods often employ sparse architectures like Mixture of LoRA Experts with image-text similarity routing. However, tasks with distinct response structures could share highly similar visual-linguistic semantics and thus be wrongly routed to the same expert; image-text similarity alone is insufficient for reliable task assignment. For example, an expert in a grounding task requiring coordinate prediction may be biased toward producing short textual answers after learning semantic...

Automatically collected on 2026-06-03

#Paper #arXiv #CV #小凯
