WANG Zhen, ZHANG Haiqing, PENG Li, et al. Information Extraction and Classification Method of Medical Data based on Singular Value Decomposition[J]. Journal of Chengdu University of Information Technology, 2020, 35(05): 537-541. [doi:10.16836/j.cnki.jcuit.2020.05.010]
Information Extraction and Classification Method of Medical Data based on Singular Value Decomposition
- Title:
- Information Extraction and Classification Method of Medical Data based on Singular Value Decomposition
- Article ID:
- 2096-1618(2020)05-0537-05
- Keywords:
- medical data set; missing value imputation; singular value decomposition; k nearest neighbor algorithm
- CLC number:
- TP312
- Document code:
- A
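- Abstract (translated from the Chinese):
- Improving prediction accuracy when medical data contain missing and redundant information has long been a highly challenging problem. To address it, most prediction models either delete the missing and redundant instances outright or fill in the missing data with the mean or some other method. Building on the weighted k-nearest neighbor (WKNN) algorithm, this paper proposes an improved classification method for medical data. The method first uses k-nearest neighbor imputation (KNNI) to pre-fill the data set containing missing data, then applies singular value decomposition (SVD) to extract the effective information from the completed data, and finally uses a WKNN algorithm with revised weights for classification prediction. Experiments show that imputation followed by information extraction significantly improves classification accuracy. On 5 medical data sets, classification accuracy improves by about 10% over the traditional KNN algorithm. On all 8 medical data sets, experimental comparisons were run against the random forest, naive Bayes, and support vector machine algorithms, and the method achieved good classification accuracy in each case.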
- Abstract:
- Missing value imputation and redundant information reduction have proved to be significant challenges for improving prediction accuracy on medical data sets. Traditional prediction models tend to delete instances with missing values from the data set or fill the missing values with the mean, neither of which captures the complex internal relationships among objects. To address these problems, this paper proposes an improved medical data classification method based on the Weighted k-Nearest Neighbor (WKNN) algorithm. The proposed method first pre-fills the incomplete data set with k-Nearest Neighbor Imputation (KNNI), then extracts the effective information from the completed data set with Singular Value Decomposition (SVD), and finally uses the WKNN algorithm with revised weights for classification prediction. On 5 medical data sets, the classification accuracy of this method is approximately 10% higher than that of the traditional KNN algorithm. On 8 medical data sets, its classification accuracy is also higher than that of the benchmark Random Forest, Naïve Bayes, and Support Vector Machine algorithms.
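The three stages named in the abstract (KNNI pre-filling, SVD information extraction, WKNN classification) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the KNNI distance, the number of retained singular vectors, or the revised weighting scheme, so the sketch substitutes a distance over shared observed features, a fixed `rank`, and plain inverse-distance vote weights. All function names and the toy data are hypothetical, and train and test rows are imputed together for brevity.

```python
import numpy as np


def knn_impute(X, k=2):
    """Fill each NaN with the mean of that feature over the k nearest donor rows.

    Assumption: distance is the RMS difference over features observed in both rows.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        dists = np.full(len(X), np.inf)
        for j in range(len(X)):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                dists[j] = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
        order = np.argsort(dists)
        for f in np.where(np.isnan(X[i]))[0]:
            # Donors are the nearest rows that actually observe feature f.
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not np.isnan(X[j, f])][:k]
            # Fall back to the column mean when no donor is available.
            filled[i, f] = X[donors, f].mean() if donors else np.nanmean(X[:, f])
    return filled


def svd_components(X, rank):
    """Return the top-`rank` right singular vectors of X as columns (no centering)."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:rank].T


def wknn_predict(train_X, train_y, query_X, k=3, eps=1e-9):
    """Classify each query by inverse-distance-weighted votes of its k nearest neighbors.

    Assumption: inverse-distance weights stand in for the paper's revised weights.
    """
    preds = []
    for q in query_X:
        d = np.linalg.norm(train_X - q, axis=1)
        votes = {}
        for j in np.argsort(d)[:k]:
            votes[train_y[j]] = votes.get(train_y[j], 0.0) + 1.0 / (d[j] + eps)
        preds.append(max(votes, key=votes.get))
    return np.array(preds)


def classify(train_X, train_y, test_X, impute_k=2, rank=2, k=3):
    """Run imputation, SVD projection, and weighted kNN in sequence."""
    n_train = len(train_X)
    filled = knn_impute(np.vstack([train_X, test_X]), k=impute_k)
    components = svd_components(filled[:n_train], rank)
    projected = filled @ components
    return wknn_predict(projected[:n_train], train_y, projected[n_train:], k=k)


# Toy data (hypothetical): two well-separated groups with a few missing entries.
train_X = np.array([
    [1.0, 1.1, 0.9],
    [0.9, np.nan, 1.0],
    [1.1, 1.0, 1.2],
    [5.0, 5.2, 4.9],
    [5.1, 4.9, np.nan],
    [4.8, 5.0, 5.1],
])
train_y = np.array([0, 0, 0, 1, 1, 1])
test_X = np.array([[1.0, np.nan, 1.1], [np.nan, 5.1, 5.0]])

predictions = classify(train_X, train_y, test_X)
print(predictions)
```

Imputation draws donor values only from originally observed entries, so an already-imputed value is never used to fill another gap. In practice the SVD basis and the neighbor search would be fit on training data alone and `rank` and `k` would be tuned per data set.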
Memo
Received: 2020-05-21