LI Min, TAO Hongcai. Research on Automatic Digest Algorithm of Web Blog Based on Keyword Extraction[J]. Journal of Chengdu University of Information Technology, 2020, 35(02): 158-162. [doi:10.16836/j.cnki.jcuit.2020.02.006]
Research on an Automatic Digest Algorithm for Web Blogs Based on Keyword Extraction
- Title:
- Research on Automatic Digest Algorithm of Web Blog based on Keyword Extraction
- Article ID:
- 2096-1618(2020)02-0158-05
- Keywords:
- automatic digest; TextRank; keywords; BM25; ROUGE
- CLC number:
- TP391.1
- Document code:
- A
- Abstract:
- The TextRank algorithm is graph-based and takes the overall structure of a text into account, while keywords are closely tied to the topic of the text. Web blogs, as an emerging form of publishing, differ from news articles and academic papers: they are edited more casually and follow no conventional format. This paper combines keyword extraction with the TextRank algorithm and proposes a keyword-extraction-based automatic digest algorithm suited to blog text. First, keywords are extracted with the TextRank algorithm, and sentence similarity is computed with the BM25 algorithm. Next, a weighted graph is built using the sentence similarities as edge weights, and TextRank scores are obtained by iterative computation. The TextRank score and the keyword score are then added to give each sentence's final score; the i highest-scoring sentences are selected and output in their original order to form the automatic digest. Comparative experiments evaluated with the ROUGE toolkit show that the algorithm performs well.
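The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the formulas, so the whitespace tokenizer, the non-negative BM25 IDF variant, the parameters `k1`, `b`, `d`, `window`, and `top_k`, the keyword score (fraction of extracted keywords present in a sentence), and the max-normalization of TextRank scores before summing are all assumptions made here. Chinese blog text would additionally need a word segmenter.

```python
import math
from collections import Counter


def bm25_similarity(query, doc, docs, k1=1.5, b=0.75):
    """BM25 score of token list `doc` for the terms of token list `query`."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in docs if term in d)
        # Assumed non-negative IDF variant; the paper's exact form is unknown.
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score


def textrank(weights, d=0.85, iters=50):
    """Weighted TextRank; weights[j][i] is the weight of the edge j -> i."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iters):
        scores = [
            (1 - d) + d * sum(weights[j][i] / out_sum[j] * scores[j]
                              for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def keywords_by_textrank(token_lists, top_k=5, window=2):
    """Rank words on a co-occurrence graph and return the top_k words."""
    vocab = sorted({w for s in token_lists for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    adj = [[0.0] * len(vocab) for _ in vocab]
    for s in token_lists:
        for i, w in enumerate(s):
            for v in s[i + 1:i + 1 + window]:
                if v != w:
                    adj[index[w]][index[v]] = adj[index[v]][index[w]] = 1.0
    scores = textrank(adj)
    return sorted(vocab, key=lambda w: -scores[index[w]])[:top_k]


def summarize(sentences, top_i=2, top_k=5):
    """Return the top_i sentences by combined score, in original order."""
    tokens = [s.lower().split() for s in sentences]  # assumed tokenizer
    keywords = set(keywords_by_textrank(tokens, top_k))
    n = len(tokens)
    # Edge j -> i is weighted by the BM25 similarity of sentence i to sentence j.
    weights = [[bm25_similarity(tokens[j], tokens[i], tokens) if i != j else 0.0
                for i in range(n)] for j in range(n)]
    tr = textrank(weights)
    peak = max(tr)
    # Assumed combination: normalized TextRank score + keyword coverage.
    final = [tr[i] / peak + len(keywords & set(tokens[i])) / len(keywords)
             for i in range(n)]
    chosen = sorted(sorted(range(n), key=lambda i: -final[i])[:top_i])
    return [sentences[i] for i in chosen]


posts = [
    "blogs are informal texts without a fixed format",
    "keywords reflect the topic of a blog",
    "TextRank ranks sentences on a weighted graph",
    "the weather was nice yesterday",
]
print(summarize(posts, top_i=2))
```

Sorting the selected indices before output is what keeps the digest in the order of the original text, as the abstract specifies; the relative weighting of the two score components is a design choice that the paper itself would settle.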
Memo
Received: 2019-12-26. Funding: National Natural Science Foundation of China (Grant No. 61806170)