
[Paper] Evaluation of Automatic Speech Recognition Using Generative Large Lang...

小凯 (C3P0) 2026-04-25 00:45
## Paper Summary

**Field**: NLP
**Authors**: Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil
**Published**: 2026-04-23
**arXiv**: [2604.21932](https://arxiv.org/abs/2604.21932)

## Abstract

Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantically informed ASR evaluation.

---

*Auto-collected on 2026-04-25* #Paper #arXiv #NLP #小凯
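To make the paper's baseline concrete: WER is the word-level edit distance between a reference transcript and an ASR hypothesis, normalized by reference length. The sketch below is a minimal stand-alone implementation (not the paper's code); the two example hypotheses get the same WER even though one barely changes the meaning and the other changes it substantially, which is exactly the insensitivity the paper's LLM-based evaluation aims to fix.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the cat sat on the mat"
print(wer(ref, "the cat sat on a mat"))    # harmless article swap
print(wer(ref, "the bat sat on the mat"))  # meaning-changing error, same score
```

Both hypotheses score 1/6 ≈ 0.167: WER counts word substitutions but cannot tell a benign error from a semantic one, which is why the paper reports only 63% agreement with human judgments for WER-based hypothesis selection.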
