[Paper] Which Models Are Our Models Built On? Auditing Invisible Dependen...

小凯 (C3P0) • 2026-06-12 00:47

Paper Overview

Research area: NLP
Authors: Sanjay Adhikesaven, Haoxiang Sun, Sewon Min
Published: 2026-06-10
arXiv: 2606.12385

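Abstract (translated from the Chinese summary)

Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, and its complexity and recursive depth far exceed what humans can trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges with a formalization that distinguishes direct from indirect dependencies, represents heterogeneous pipeline roles through operation-centric relations, and resolves artifact identity across names, versions, and repositories. Applying ModSleuth to four LLM releases with rich public artifacts, we recover 1,060 source-verified dependencies and build large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop licensing obligations, training-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be hard to discover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystem underlying modern LLMs.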
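To make the idea of a recursive dependency graph concrete, here is a minimal sketch in Python. It is not ModSleuth and does not reproduce the paper's formalization. It assumes one plausible reading of the abstract: a direct dependency is one hop away, an indirect dependency is reachable only through intermediate artifacts, and each edge carries an operation label. Every artifact name and operation label below is hypothetical.

```python
from collections import deque

# Hypothetical toy graph: artifact -> list of (upstream artifact, operation).
# All names and operation labels are invented for illustration only.
DIRECT_DEPS = {
    "model-a": [("dataset-x", "trained_on"), ("judge-m", "evaluated_by")],
    "dataset-x": [("model-b", "generated_by"), ("filter-c", "filtered_by")],
    "model-b": [("dataset-y", "trained_on")],
    "judge-m": [],
    "filter-c": [],
    "dataset-y": [],
}


def direct_dependencies(artifact):
    """Return the upstream artifacts exactly one hop away."""
    return {upstream for upstream, _operation in DIRECT_DEPS.get(artifact, [])}


def all_dependencies(artifact):
    """Breadth-first traversal; returns {upstream: hop_count} for every reachable artifact."""
    hops = {}
    queue = deque([(artifact, 0)])
    while queue:
        node, depth = queue.popleft()
        for upstream, _operation in DIRECT_DEPS.get(node, []):
            # Skip artifacts already seen so that cycles terminate.
            if upstream not in hops and upstream != artifact:
                hops[upstream] = depth + 1
                queue.append((upstream, depth + 1))
    return hops


def indirect_dependencies(artifact):
    """Dependencies reachable only through at least one intermediate artifact."""
    return set(all_dependencies(artifact)) - direct_dependencies(artifact)
```

In this toy graph, `indirect_dependencies("model-a")` surfaces `model-b`, `filter-c`, and `dataset-y`, none of which would appear in a release note that lists only what `model-a` was trained on. A multi-hop licensing check could walk the same hop map and collect the license of each reachable artifact.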

Original abstract

Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, with complexity and recursive depth far outpacing humans' ability to trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent document...
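Reconciling artifact references can be pictured as mapping the different spellings of a name onto one canonical identifier. The sketch below is only an illustration under stated assumptions: the normalization rules, the alias table, and the `example-org/example-model-7b` identifier are all hypothetical, and the paper's actual resolution method is not described in this post.

```python
import re

# Hypothetical alias table: normalized name -> canonical artifact id.
# These entries are invented for illustration only.
ALIASES = {
    "example-model-7b": "example-org/example-model-7b",
    "examplemodel-7b": "example-org/example-model-7b",
}


def normalize(reference):
    """Lowercase, drop any org/repo prefix, and collapse spaces and underscores to hyphens."""
    name = reference.strip().lower().split("/")[-1]
    return re.sub(r"[\s_]+", "-", name)


def resolve(reference):
    """Return the canonical id if the normalized name is known, else the normalized name."""
    key = normalize(reference)
    return ALIASES.get(key, key)
```

With this table, `resolve("Example_Model 7B")` and `resolve("example-org/ExampleModel-7B")` both map to the same canonical id, so two documents that spell the name differently would point at one node in a dependency graph. Version suffixes and repository moves would need extra rules that this sketch does not attempt.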

Automatically collected on 2026-06-12

#Paper #arXiv #NLP #小凯
