LI Baolin,LIU Yutao.Research on Word Segmentation of Normative Text based on Re-Perceptron-CRF[J].Journal of Chengdu University of Information Technology,2023,38(03):298-305.[doi:10.16836/j.cnki.jcuit.2023.03.008]
- Title:
- Research on Word Segmentation of Normative Text based on Re-Perceptron-CRF
- Article ID:
- 2096-1618(2023)03-0298-08
- Keywords:
- management science and engineering; textual analysis; Chinese word segmentation; Re-Perceptron-CRF; part-of-speech tagging
- CLC number:
- TP391.1
- Document code:
- A
- Abstract:
- This paper uses a combined Re-Perceptron-CRF method that exploits the characteristics of normative (specification) documents to segment keywords. Four algorithms, Viterbi, Perceptron, CRF, and Re-Perceptron-CRF, are each applied to the segmentation of normative text. Specifically, regular expressions informed by syntactic analysis are used to standardize the normative text, yielding preprocessed text suitable for analysis, and the Perceptron and CRF algorithms then each return their own optimal result. Experiments show that the Re-Perceptron-CRF algorithm markedly improves segmentation, reaching an accuracy of 94.36% and a recall of 97.02%. The method offers a research approach for Chinese word segmentation of normative text and provides sound data support for subsequent applications. However, because the dataset is small, the method is applicable only to specific domains, such as building inspection.
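The abstract names two steps that a short sketch can make concrete: regular-expression standardization before segmentation, and scoring a segmentation against a reference. The Python below is only an illustration under stated assumptions. The cleanup rules in `normalize` are hypothetical (the paper's actual expressions are not given here), the sample sentence is invented, and treating the reported "accuracy" as word-level precision is an assumption rather than something the abstract confirms.

```python
import re


def normalize(text: str) -> str:
    """Hypothetical cleanup rules; not the paper's actual expressions."""
    text = re.sub(r"\s+", "", text)           # drop whitespace, incl. full-width spaces
    text = re.sub(r"^\d+(\.\d+)*", "", text)  # strip a leading clause number such as 3.2.1
    return text


def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def precision_recall(gold, pred):
    """Word-level precision and recall from exactly matching spans.

    Assumes both word lists are non-empty and cover the same text.
    """
    g, p = to_spans(gold), to_spans(pred)
    hit = len(g & p)
    return hit / len(p), hit / len(g)


if __name__ == "__main__":
    print(normalize("3.2.1 混凝土 强度检测"))
    gold = ["混凝土", "强度", "检测"]   # invented reference segmentation
    pred = ["混凝土", "强度检测"]       # invented system output
    print(precision_recall(gold, pred))
```

A predicted word counts as correct only when both of its boundaries match the reference, so one missed boundary in the example lowers precision and recall together.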
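Memo
Received: 2022-07-05
Funding: Sichuan Province Science and Technology Service Industry Demonstration Project (2021GFW015); Key Project of the Sichuan Research Center for Electronic Commerce and Modern Logistics (DSWL21-3)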