Electronic Science and Technology, 2022, Vol. 35, Issue (12): 78-83. doi: 10.16180/j.cnki.issn1007-7820.2022.12.011


K-Anonymity Data Publishing Algorithm Based on Hybrid Clustering

FANG Kai1, SHI Zhicai1,2, JIA Yuanyuan1

  1. School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
  2. Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai 200240, China
  • Received: 2021-05-19 Online: 2022-12-15 Published: 2022-12-13
  • About the authors: FANG Kai (b. 1995), male, master's student. Research interests: network security, privacy protection. | SHI Zhicai (b. 1964), male, Ph.D., professor. Research interests: computer networks, privacy protection, Internet of Things and embedded systems. | JIA Yuanyuan (b. 1995), female, master's student. Research interests: privacy protection, network information security.
  • Supported by:
    National Natural Science Foundation of China (61802252)

Abstract:

To reduce information loss in data publishing and to address the low data utility of existing clustering-based anonymization schemes, this paper proposes a k-anonymity data publishing algorithm based on hybrid clustering. Unlike traditional methods that rely on a single clustering technique, the proposed algorithm combines density-based clustering with partition-based clustering: it selects the initial cluster centers according to the density characteristics of the data set and then iterates with partition-based clustering to reach the optimal clustering. The method also removes some of the outlier noise from the data set to reduce its influence on the clustering result. For records with mixed attributes, a distance measure combining k-means and k-modes is adopted, and a bucket generalization algorithm is introduced to reduce the information loss caused by generalization. Experimental results show that, compared with existing methods, the proposed algorithm effectively reduces the information loss of data anonymization and improves the quality of the published data.

Key words: privacy preserving, data publishing, k-anonymity, clustering, bucket generalization algorithm, mixed attributes, network security, information loss
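The abstract names two ingredients without giving formulas: a distance that combines k-means and k-modes for mixed records, and density-based selection of initial cluster centers. The following is a minimal sketch of one plausible reading, assuming a k-prototypes-style measure (squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches) and a simple neighbour-count density. The names `mixed_distance`, `densest_record`, `gamma`, and `radius` are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence, Tuple

# A record is (numeric attributes, categorical attributes).
# Numeric values are assumed to be normalized to [0, 1] beforehand.
Record = Tuple[Sequence[float], Sequence[str]]


def mixed_distance(a: Record, b: Record, gamma: float = 1.0) -> float:
    """Squared Euclidean distance on the numeric part (k-means style)
    plus gamma times the number of categorical mismatches (k-modes style).
    gamma is an assumed weighting parameter."""
    numeric = sum((x - y) ** 2 for x, y in zip(a[0], b[0]))
    mismatches = sum(1 for x, y in zip(a[1], b[1]) if x != y)
    return numeric + gamma * mismatches


def densest_record(records: Sequence[Record], radius: float,
                   gamma: float = 1.0) -> int:
    """Return the index of the record with the most neighbours within
    `radius` under mixed_distance. This is an assumed density criterion
    for picking an initial cluster center; records with very few
    neighbours could be treated as outliers in the same way."""
    def neighbour_count(i: int) -> int:
        return sum(
            1
            for j, other in enumerate(records)
            if j != i and mixed_distance(records[i], other, gamma) <= radius
        )

    return max(range(len(records)), key=neighbour_count)


if __name__ == "__main__":
    data = [
        ((0.0,), ("clerk",)),
        ((0.1,), ("clerk",)),
        ((0.2,), ("clerk",)),
        ((0.9,), ("nurse",)),
    ]
    seed = densest_record(data, radius=0.02)
    print("initial center candidate:", data[seed])
```

A partition-based step would then assign each record to its nearest center under the same distance and iterate, keeping every cluster at size k or larger so that the generalized output satisfies k-anonymity.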

CLC Number:

  • TP309