Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

Paper Summary

Research areas: cs.AI, cs.LG, cs.SE
Authors: Yao Fehlis, Benjamin Bengfort, Zhangzhang Si
Published: 2026-05-21
arXiv: 2505.01251

Abstract

Academic research tends to focus on new models for document understanding, creating a wide gap in the literature between model definition and running models at production scale. To close that gap, we present a microservice architecture that encapsulates pipelines of multiple models for classification, optical character recognition (OCR), and large language model structured field extraction, as well as our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions, including hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, use of asynchronous processing for the many IO-bound operations in the pipeline, and an independent, horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark, effectively operationalizing models in production.
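The second finding in the abstract (saturation set by shared GPU-inference capacity rather than worker count) can be illustrated with a small simulation. This is only a sketch and not the authors' implementation: the names `GpuInferenceService`, `orchestration_worker`, and `run_batch` are hypothetical, the GPU call is faked with `asyncio.sleep`, and the capacity limit is modeled with a semaphore. It shows asynchronous CPU-side workers awaiting a shared inference service and records how many inference calls are ever in flight at once.

```python
import asyncio


class GpuInferenceService:
    """Hypothetical stand-in for a shared GPU-backed OCR service.

    `capacity` models how many requests the GPU can serve at once;
    the inference itself is simulated with a sleep (an assumption).
    """

    def __init__(self, capacity: int, seconds_per_call: float) -> None:
        self._slots = asyncio.Semaphore(capacity)
        self._seconds_per_call = seconds_per_call
        self.in_flight = 0
        self.peak_in_flight = 0

    async def infer(self, page: int) -> str:
        # Requests beyond the capacity wait here, whatever the worker count.
        async with self._slots:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                await asyncio.sleep(self._seconds_per_call)
            finally:
                self.in_flight -= 1
        return f"text-of-page-{page}"


async def orchestration_worker(queue: asyncio.Queue, gpu: GpuInferenceService,
                               results: list) -> None:
    """CPU-side worker: pulls pages and awaits the shared GPU service."""
    while True:
        try:
            page = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        results.append(await gpu.infer(page))


async def run_batch(num_pages: int, num_workers: int, gpu_capacity: int,
                    seconds_per_call: float = 0.01) -> tuple:
    """Process a batch and return (pages_done, peak_concurrent_gpu_calls)."""
    queue: asyncio.Queue = asyncio.Queue()
    for page in range(num_pages):
        queue.put_nowait(page)
    # Created inside the running loop so the semaphore binds to it.
    gpu = GpuInferenceService(gpu_capacity, seconds_per_call)
    results: list = []
    await asyncio.gather(
        *(orchestration_worker(queue, gpu, results) for _ in range(num_workers))
    )
    return len(results), gpu.peak_in_flight


if __name__ == "__main__":
    for workers in (2, 4, 16):
        done, peak = asyncio.run(run_batch(12, workers, gpu_capacity=4))
        print(f"workers={workers} pages={done} peak_gpu_calls={peak}")
```

In this toy model, adding workers beyond the GPU capacity does not raise the number of concurrent inference calls, so the extra workers only queue. That is one plausible reading of why the abstract argues for scaling inference and orchestration independently.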

--- *Automatically collected on 2026-05-21*

#Paper #arXiv #AI #小凯
