YOU Feng,LI Daiwei,ZHANG Haiqing,et al.A Random Forest Approach for Missing Data Imputation based on Normalized KNNI[J].Journal of Chengdu University of Information Technology,2021,36(01):32-40.[doi:10.16836/j.cnki.jcuit.2021.01.006]
基于归一化KNNI的随机森林填补算法
- Title:
- A Random Forest Approach for Missing Data Imputation based on Normalized KNNI
- 文章编号:
- 2096-1618(2021)01-0032-09
- Keywords:
- incomplete information system; missing data imputation; normalization k nearest neighbor imputation; random forest imputation; evaluation of imputation method
- 分类号:
- TP301.6
- 文献标志码:
- A
- 摘要:
- 随机森林填补算法在对不完备信息系统填补时具有可靠的填补性能,同时由于填补时需要多次进行随机森林建模导致算法计算量大。为了缩短算法的运行时间,提出了NKNNI-RFI(normalization k nearest neighbor imputation-random forest imputation)缺失数据填补算法。通过改变RFI算法中预填补,即使用填补更为准确的归一化KNNI(normalization k nearest neighbor imputation,NKNNI)作为预填补,为RFI算法中使用随机森林模型预测填补值提供了更接近于原始数据集的数据,使RFI算法能够在更短的时间内完成填补任务且保持良好的填补效果。实验中使用10个UCI标准数据集,将提出的算法与RFI、NKNNI、SVMI和ROUSTIDA算法进行比较并使用NRMSE、PFC和ART填补评价方法对算法效果进行评价。实验结果表明:提出算法的NRMSE和PFC与RFI算法相同,NRMSE比NKNNI、SVMI和ROUSTIDA算法约低0.02~0.8,PFC比NKNNI、SVMI和ROUSTIDA算法约低0.01~0.6,ART相比RFI算法最大减少程度达53%。
- Abstract:
- The random forest imputation algorithm has reliable imputation performance when imputes incomplete information systems. At the same time, it needs to carry out random forest modeling for many times, which results in heavy computation. In order to shorten the running time of the algorithm, the NKNNI-RFI(normalization knearest neighbor imputation-random forest imputation)algorithm is proposed. By changing the pre-imputation in RFI, normalized KNNI(NKNNI)with more accurate is used as the initial imputation, which provides data closer to the original data set for the prediction of the imputation value using the random forest model in RFI, enabling RFI to complete the imputation task in a shorter time and maintain a good effect. In the experiment, 10 UCI standard data sets were used to compare the proposed algorithm with algorithms including RFI, NKNNI, SVMI and ROUSTIDA, and the effectiveness of the algorithm was evaluated using NRMSE, PFC and ART evaluation methods for imputation methods. The experimental results show that the NRMSE and PFC of this algorithm are the same as RFI. NRMSE is 0.02-0.8 lower than NKNNI, SVMI and ROUSTIDA, and PFC is0.01-0.6 lower than NKNNI, SVMI and ROUSTIDA.ART has a maximum reduction of 53% compared to RFI.
参考文献/References:
[1] Garc,A-laencina P J,Sancho-G,et al.K nearest neighbours with mutual information for simultaneous classification and missing data imputation[J].Genetika,1999,72(7):1483-1493.
[2] 王凤梅,胡丽霞.一种基于近邻规则的缺失数据填补方法[J].计算机工程,2012,38(21):53-55.
[3] 金勇进.缺失数据的插补调整[J].数理统计与管理,2001,20(6):47-53.
[4] 吴小姣,李高明,易大莉,等.基因表达谱的非参缺失森林填补算法研究[J].中国卫生统计,2016,33(6):1068-1070.
[5] 谷峪,于戈,李晓静,等.基于动态概率路径事件模型的RFID数据填补算法[J].软件学报,2010,21(3):438-451.
[6] Hartley H O.Maximum Likelihood Estimation from Incomplete Data[J].Biometrics,1958,14(2):174-194.
[7] Stekhoven D J,Buhlmann P.MissForest-non-parametric missing value imputation for mixed-type data[J].Bioinformatics,2012,28(1):112-118.
[8] Troyanskaya O,Cantor M,Sherlock G,et al.Missing value estimation methods for DNA microarrays[J].Bioinformatics,2001,17(6):520-525.
[9] Wang X,Li A,Jiang Z,et al.Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme[J].Bmc Bioinformatics,2006,7(1):32.
[10] Weihua Zhu,Wei Zhang,Yunqing Fu.An incomplete data analysis approach using rough set theory[C].2004 International Conference on Intelligent Mechatronics and Automation,2004.Proceedings.Chengdu,China:IEEE,2004:332-338.
[11] Arruda M P,Brown P J,Lipka A E,et al.Genomic Selection for Predicting Head Blight Resistance in a Wheat Breeding Program[J].The Plant Genome,2015,8(3):1-12.
[12] Rutkoski J E,Poland J,Jannink J L,et al.Imputation of unordered markers and the impact on genomic selection accuracy[J].G3 Genesgenetics,2013,3(3):427-439.
[13] Dixon,John K.Pattern Recognition with Partly Missing Data[J].IEEE Transactions on Systems,Man and Cybernetics,1979,9(10):617-621.
[14] Cover T,Hart P.Nearest neighbor pattern classification[J].IEEE Transactions on Information Theory,2003,13(1):21-27.
[15] Wenfeng Hou,Daiwei Li,Haiqing Zhang,et al.An Advanced k Nearest Neighbor Classification Algorithm Based on KD-tree[C].2018 IEEE International Conference of Safety Produce Informatization(IICSPI).Chongqing:IEEE,2019:902-905.
[16] Chao Xu,Daiwei Li,Haiqing Zhang,et al.A Weighted Fuzzy Rough Nearest Neighbor Classification Algorithm Based on Multiple Interpolation and Similarity Attribute Analysis[C].2018 IEEE International Conference of Safety Produce Informatization(IICSPI).Chongqing: IEEE,2019:906-910.
[17] Breiman L.Random Forests[J].Machine Learning,2001,45(1):5-32.
[18] 任家东,刘新倩,王倩,等.基于KNN离群点检测和随机森林的多层入侵检测方法[J].计算机研究与发展,2019,56(3):116-125.
[19] Dua D,and Graff C.UCI machine learning repository[DB/OL].http://archive.ics.uci.edu/ml,2019-10-19/2019-10-19.
[20] 陈慧佳.基于Random Forest的缺失数据补全策略研究[D].南昌:南昌大学,2016.
[21] Oba S,Sato M A,Takemasa I,et al.A Bayesian missing value estimation method for gene expression profile data[J].Bioinformatics,2003,19(16):2088-2096.
备注/Memo
收稿日期:2020-09-02
基金项目:国家自然科学基金资助项目(61602064); 四川省科技厅资助项目(2018JY0273、2019YFG0398); 欧盟资助项目(598649-EPP-1-2018-1-FR-EPPKA2-CBHE-JP)