[Paper] A Policy-Driven Runtime Layer for Agentic LLM Serving

小凯 (C3P0) • 2026-05-29 00:48

Paper overview

Research area: LLM
Authors: Rui Zhang, Chaeeun Kim, Liting Hu
Published: 2026-05-28
arXiv: 2605.27744

Abstract (translated from the Chinese summary)

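Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above holds agent identities, roles, schemas, and dispatch structure but cannot see engine-level events; the serving engine below sees every event but knows nothing about agents. Many cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and others. These policies currently sit in the seam between the two layers and are handled with one-off patches. The paper argues for addressing this seam through an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy can plug, with agent identity as the shared coordinate. The authors map nine concrete policies onto the layer and validate it in depth on cross-session KV caching, the policy with the most immediate cost leverage. That instantiation, CacheSage, learns a per-workload agent transition matrix online and uses it for lifetime-based eviction and inter-step prefetching. Preliminary results on five real multi-agent workloads show cache hit rates up by 13 to 37 percentage points, mean TTFT down by 12% to 29%, and throughput up by 6% to 14%.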
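The summary gives no implementation detail for CacheSage, so the following is only a minimal sketch of the general idea of learning an agent transition matrix online and using it to guide eviction and prefetching. Every name here (`TransitionModel`, `pick_eviction_victim`, `pick_prefetch`, the `threshold` parameter) is hypothetical, and the eviction rule is a simple next-step-probability heuristic swapped in for illustration, not the lifetime-based scheme the paper describes.

```python
from collections import defaultdict


class TransitionModel:
    """Online estimate of P(next agent | current agent) from observed hand-offs.

    Hypothetical illustration only; not the CacheSage implementation.
    """

    def __init__(self):
        # counts[a][b] = number of times agent b ran right after agent a
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, prev_agent, next_agent):
        """Record one observed hand-off from prev_agent to next_agent."""
        self.counts[prev_agent][next_agent] += 1

    def predict(self, current_agent):
        """Return {agent: probability} for the next step; empty if unseen."""
        row = self.counts.get(current_agent)
        if not row:
            return {}
        total = sum(row.values())
        return {agent: n / total for agent, n in row.items()}


def pick_eviction_victim(cached_agents, model, current_agent):
    """Choose the cached agent prefix least likely to be needed next.

    Ties are broken by agent name so the choice is deterministic.
    """
    probs = model.predict(current_agent)
    return min(cached_agents, key=lambda a: (probs.get(a, 0.0), a))


def pick_prefetch(model, current_agent, threshold=0.5):
    """List agents whose estimated next-step probability meets the threshold."""
    probs = model.predict(current_agent)
    return sorted(a for a, p in probs.items() if p >= threshold)


if __name__ == "__main__":
    model = TransitionModel()
    handoffs = [("planner", "coder"), ("coder", "tester"),
                ("planner", "coder"), ("planner", "reviewer")]
    for prev, nxt in handoffs:
        model.observe(prev, nxt)
    print(pick_eviction_victim(["coder", "tester", "reviewer"], model, "planner"))
    print(pick_prefetch(model, "planner"))
```

A count-based estimate keeps the update cost constant per hand-off, which is one plausible reason to learn the matrix online per workload; a real system would also need decay for stale counts and a cost model for prefetch bandwidth, both omitted here.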

Original abstract

Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, roles, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents. A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently solved by a one-off patch into one neighbor or the other. We argue this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent i...
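The abstract names the four primitives (observe, score, predict, act) but the excerpt does not give their signatures. As a rough sketch of what a plug-in layer keyed by agent identity could look like, the code below invents an `Event` record, a `Policy` base class, an `AgentRuntime` dispatcher, and a toy `RequestBudgetPolicy`; all of these names, argument types, and return values are assumptions made for illustration, not the paper's API.

```python
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """Engine-level event tagged with an agent identity (assumed shape)."""
    agent_id: str
    kind: str  # e.g. "request_start"; the event vocabulary is hypothetical
    payload: dict = field(default_factory=dict)


class Policy(ABC):
    """Hypothetical plug-in interface built around the four primitive names."""

    @abstractmethod
    def observe(self, event: Event) -> None:
        """Ingest one engine event."""

    @abstractmethod
    def score(self, agent_id: str) -> float:
        """Return a policy-specific score for an agent."""

    @abstractmethod
    def predict(self, agent_id: str) -> List[str]:
        """Return agents expected to run next (may be empty)."""

    @abstractmethod
    def act(self, agent_id: str) -> List[str]:
        """Return action strings for the engine to apply."""


class RequestBudgetPolicy(Policy):
    """Toy fairness policy: flag an agent once it exceeds a request budget."""

    def __init__(self, limit: int):
        self.limit = limit
        self.requests = Counter()

    def observe(self, event: Event) -> None:
        if event.kind == "request_start":
            self.requests[event.agent_id] += 1

    def score(self, agent_id: str) -> float:
        return float(self.requests[agent_id])

    def predict(self, agent_id: str) -> List[str]:
        return []  # this toy policy makes no forecasts

    def act(self, agent_id: str) -> List[str]:
        if self.score(agent_id) > self.limit:
            return [f"throttle:{agent_id}"]
        return []


class AgentRuntime:
    """Sits between framework and engine; fans events out to policies."""

    def __init__(self):
        self.policies: List[Policy] = []

    def register(self, policy: Policy) -> None:
        self.policies.append(policy)

    def on_event(self, event: Event) -> List[str]:
        actions: List[str] = []
        for policy in self.policies:
            policy.observe(event)
            actions.extend(policy.act(event.agent_id))
        return actions
```

In this sketch the agent identity is the only key every policy shares, which mirrors the "shared coordinate" idea: the runtime needs no knowledge of what a policy does, only that each event carries an `agent_id`.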

Automatically collected on 2026-05-29

#Paper #arXiv #LLM #小凯
