[Paper] Soro: A Lightweight Foundation Model and Chatbot for Tajik

Paper Overview

Research area: NLP
Authors: Stanislav Liashkov, Haitz Sáez de Ocáriz Borde, Azizjon Azimi, et al.
Published: 2026-05-28
arXiv: 2605.27379

Summary

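This paper presents Soro, a family of conversational large language models built specifically for Tajik and intended for practical deployment under Tajikistan's tight compute and connectivity constraints. Starting from open-weight Gemma 3 checkpoints, the team performs Tajik-only continual pretraining on a curated 1.9-billion-token Tajik corpus (web text, PDF documents, and curriculum-aligned educational materials), followed by supervised instruction tuning on 40K Tajik teacher-style examples. Because standard benchmarks cover Tajik only sparsely, the authors also build a suite of Tajik benchmarks spanning general knowledge, linguistic competence, and school and university entrance-exam domains, and release it on Hugging Face. Experiments show that Soro substantially outperforms same-size Gemma 3 baselines on the Tajik benchmarks while retaining its English performance. The FP8- and INT4-quantized models preserve most of the Tajik capability, which supports pilots with Tajikistan's education sector and deployment at school scale.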

Original Abstract

We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan. Starting from open-weight Gemma 3 checkpoints, we perform Tajik-only continual pretraining on a curated 1.9-billion-token corpus spanning filtered web text, PDF documents, and curriculum-aligned educational materials, followed by supervised instruction tuning on 40K Tajik teacher-style examples. To enable rigorous evaluation despite the limited coverage of Tajik in standard benchmarks, we introduce a suite of Tajik benchmarks covering general knowledge, linguistic competence, and school- and university entrance-exam domains, and we open-source them on Hugging Face. Across these Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on st...

--- *Automatically collected on 2026-05-29*

#Paper #arXiv #NLP #小凯
