Journal of Xidian University ›› 2021, Vol. 48 ›› Issue (4): 176-183. doi: 10.19665/j.issn1001-2400.2021.04.023

• Computer Science and Technology & Cyberspace Security •

Unbalanced data weighted boundary point integration undersampling method

HE Yunbin, LENG Xin, WAN Jing

  1. School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150000, China
  • Received: 2020-05-11 Online: 2021-08-30 Published: 2021-08-31
  • About the authors: HE Yunbin (b. 1972), male, professor, Ph.D., E-mail: hybha@163.com | LENG Xin (b. 1994), female, master's student at Harbin University of Science and Technology, E-mail: 1012419440@qq.com | WAN Jing (b. 1972), female, professor, Ph.D., E-mail: wanjha@163.com
  • Supported by:
    National Natural Science Foundation of China (61872105); Science and Technology Research Project of the Heilongjiang Provincial Department of Education (12531z004)

Abstract:

To address the problem that boundary points are deleted outright when undersampling unbalanced data, and to preserve the information carried by the majority class, a clustering-based weighted boundary point ensemble undersampling algorithm is proposed. First, the algorithm clusters the majority class data set, using the number of data points in the minority class data set as the initial number of cluster centers. Next, the coefficient of variation is introduced to identify boundary points, and the identified boundary points are weighted so that they can take part in the processing of the unbalanced data. Cluster density is then used to divide the majority class clusters into high-density and low-density clusters, and the low-density clusters are deleted, which yields a reduced majority class sample set. Finally, the reduced majority class samples are combined with the minority class samples to form a balanced data set, on which AdaBoost is trained to obtain the final classification model. The method reduces the data set and improves execution efficiency. The results show that the proposed method handles unbalanced data effectively and improves both the execution efficiency and the accuracy of the results.

Key words: sampling, clustering, unbalanced data, weighted boundary point
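
The undersampling steps named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: the boundary rule (a high coefficient of variation of the distances to the `k` nearest neighbours), the density measure (cluster size divided by mean distance to the centre), the median cutoff, the `boundary_weight` value, the choice to retain weighted boundary points from deleted clusters, and all function names. The final AdaBoost training step is omitted.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 when the mean is 0)."""
    values = list(values)
    mean = statistics.fmean(values)
    return statistics.pstdev(values) / mean if mean else 0.0


def is_boundary(point, points, k=3, cv_threshold=0.5):
    """Assumed rule: a point is a boundary point when the distances to its
    k nearest neighbours are uneven, i.e. their coefficient of variation is high."""
    dists = sorted(math.dist(point, q) for q in points if q is not point)[:k]
    return len(dists) >= 2 and coefficient_of_variation(dists) > cv_threshold


def assign(points, centres):
    """Index of the nearest centre for every point."""
    return [min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
            for p in points]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, seeded with the first k points (an assumption)."""
    centres = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        labels = assign(points, centres)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = tuple(statistics.fmean(dim) for dim in zip(*members))
    return centres


def cluster_density(members, centre):
    """Assumed density: member count divided by mean distance to the centre."""
    if len(members) < 2:
        return 0.0  # treat singleton clusters as low density
    mean_dist = statistics.fmean(math.dist(p, centre) for p in members)
    return len(members) / mean_dist if mean_dist else float("inf")


def undersample_majority(majority, n_minority, boundary_weight=2.0):
    """Return (point, weight) pairs retained from the majority class."""
    k = min(n_minority, len(majority))  # cluster count = minority class size
    centres = kmeans(majority, k)
    labels = assign(majority, centres)
    clusters = {c: [p for p, lab in zip(majority, labels) if lab == c]
                for c in range(k)}
    densities = {c: cluster_density(m, centres[c])
                 for c, m in clusters.items() if m}
    cutoff = statistics.median(densities.values())  # assumed high/low split
    kept = []
    for c, members in clusters.items():
        for p in members:
            boundary = is_boundary(p, majority)
            # Low-density clusters are dropped, but boundary points survive
            # with a larger weight (an assumed reading of the abstract).
            if densities[c] >= cutoff or boundary:
                kept.append((p, boundary_weight if boundary else 1.0))
    return kept


majority = [(0, 0), (10, 10), (0, 1), (1, 0), (1, 1), (14, 10), (10, 14), (14, 14)]
kept = undersample_majority(majority, n_minority=2)
```

On this toy input the tight cluster near the origin is retained and the sparse cluster is dropped. In a full pipeline the retained points would be merged with the minority class, and the weights could be supplied as per-sample weights to a boosting classifier.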

CLC number:

  • TP311.13