## Paper Overview
**Research area**: NLP
**Authors**: Hermawan Manurung, Ibrahim Al-Kahfi, Ahmad Rizqi
**Published**: 2025-04-29
**arXiv**: [2504.20612](https://arxiv.org/abs/2504.20612)
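## Summary
Indonesian marketplace reviews mix standard vocabulary with slang, regional loanwords, numeric shorthand, and emoji, which makes lexicon-based sentiment tools unreliable. To address this, the paper describes a two-track classification pipeline. The first track applies TF-IDF vectorization with a PyCaret AutoML sweep; the second is a PyTorch bidirectional LSTM network with a shared encoder and two task-specific output heads. A preprocessing module applies 14 sequential cleaning steps, including a 140-entry slang dictionary compiled from a marketplace corpus.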
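To make the preprocessing idea concrete, here is a minimal sketch of a few cleaning steps followed by a slang-dictionary lookup. It is not the paper's module: the `SLANG` entries, the `normalize` function, and the choice and order of steps are assumptions for illustration, and the sketch simply drops emoji and punctuation, which may differ from how the paper handles them.

```python
import re

# Hypothetical entries for illustration only; the paper's 140-entry
# slang dictionary is not reproduced here.
SLANG = {
    "gk": "tidak",
    "ga": "tidak",
    "bgt": "banget",
    "brg": "barang",
    "yg": "yang",
    "udh": "sudah",
    "bgs": "bagus",
}

URL_RE = re.compile(r"https?://\S+")
REPEAT_RE = re.compile(r"([a-z])\1{2,}")  # a letter repeated three or more times
TOKEN_RE = re.compile(r"[a-z0-9]+")       # keeps letters and digits, drops the rest


def normalize(text, slang=SLANG):
    """Apply a few sequential cleaning steps, then expand slang tokens."""
    text = text.lower()                 # step 1: lowercase
    text = URL_RE.sub(" ", text)        # step 2: strip URLs
    text = REPEAT_RE.sub(r"\1", text)   # step 3: collapse elongated letters
    tokens = TOKEN_RE.findall(text)     # step 4: tokenize, dropping emoji and punctuation
    return " ".join(slang.get(tok, tok) for tok in tokens)  # step 5: slang lookup


print(normalize("Brg bgs bgttt, gk nyesel!"))
```

Elongated letters are collapsed before the lookup so that a variant such as `bgttt` still matches the dictionary key `bgt`; digits are left untouched so prices and quantities survive.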
## Original Abstract
Indonesian marketplace reviews mix standard vocabulary with slang, regional loanwords, numeric shorthands, and emoji, making lexicon-based sentiment tools unreliable in practice. This paper describes a two-track classification pipeline applied to the PRDECT-ID dataset, which contains 5,400 product reviews from 29 Indonesian e-commerce categories, each labeled for binary sentiment (Positive/Negative) and five-class emotion (Happy, Sad, Fear, Love, Anger). The first track applies TF-IDF vectorization with a PyCaret AutoML sweep across standard classifiers. The second track is a PyTorch Bidirectional Long Short-Term Memory (BiLSTM) network with a shared encoder and two task-specific output heads. A preprocessing module applies 14 sequential cleaning steps, including a 140-entry slang dictiona...
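The shared-encoder, two-head design described in the abstract can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not the authors' implementation: the class name `TwoHeadBiLSTM`, the layer sizes, the single LSTM layer, and the unweighted sum of the two cross-entropy losses in `joint_loss` are all hypothetical choices. Only the two label spaces (2 sentiment classes, 5 emotion classes) come from the abstract.

```python
import torch
from torch import nn


class TwoHeadBiLSTM(nn.Module):
    """One BiLSTM encoder shared by a sentiment head and an emotion head."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=32,
                 n_sentiment=2, n_emotion=5, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # Both heads read the same 2 * hidden_dim feature vector.
        self.sentiment_head = nn.Linear(2 * hidden_dim, n_sentiment)
        self.emotion_head = nn.Linear(2 * hidden_dim, n_emotion)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq, embed_dim)
        _, (h_n, _) = self.encoder(embedded)      # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        # Sequences are not packed here, so padded positions are also
        # processed; a real pipeline would likely pack or mask them.
        features = torch.cat([h_n[0], h_n[1]], dim=1)
        return self.sentiment_head(features), self.emotion_head(features)


def joint_loss(sent_logits, emo_logits, sent_y, emo_y):
    """Assumed objective: unweighted sum of the two cross-entropy losses."""
    ce = nn.functional.cross_entropy
    return ce(sent_logits, sent_y) + ce(emo_logits, emo_y)


model = TwoHeadBiLSTM(vocab_size=100)
batch = torch.randint(1, 100, (4, 12))            # 4 reviews, 12 token ids each
sent_logits, emo_logits = model(batch)
print(sent_logits.shape, emo_logits.shape)
```

Sharing the encoder means a single backward pass through `joint_loss` updates the embedding and LSTM weights from both tasks, while each linear head stays task-specific.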
---
*Automatically collected on 2026-04-29*
#paper #arXiv #NLP #小凯