J4 ›› 2015, Vol. 42 ›› Issue (4): 159-164.doi: 10.3969/j.issn.1001-2400.2015.04.026

• 研究论文 • 上一篇    下一篇

数据依赖的多索引哈希算法

马艳萍1,2;姬光荣1;邹海林2;谢洪涛3   

  1. (1. 中国海洋大学 信息科学与工程学院,山东 青岛  266100;
    2. 鲁东大学 信息与电气工程学院,山东 烟台  264025;
    3. 中国科学院信息工程研究所 信息内容安全技术国家工程实验室,北京  100093)
  • 收稿日期:2014-06-06 出版日期:2015-08-20 发布日期:2015-10-12
  • 通讯作者: 马艳萍
  • 作者简介:马艳萍(1975-), 女, 中国海洋大学博士研究生, E-mail:myp74920@126.com.
  • 基金资助:

    国家自然科学基金资助项目(61170161,61303171,61271406)

Data-oriented multi-index Hashing

MA Yanping1,2;JI Guangrong1;ZOU Hailin2;XIE Hongtao3   

  1. (1. School of Information Science and Engineering, Ocean Univ. of China, Qingdao  266100, China;
    2. Scholol of Information and Electrical Engineering, Ludong Univ., Yantai  264025, China;
    3. Institute of Information Engineering, Chinese Academy of Sciences, Beijing  100093, China)
  • Received:2014-06-06 Online:2015-08-20 Published:2015-10-12
  • Contact: MA Yanping

摘要:

由于多索引哈希基于数据集中的二进制码呈均匀分布这一假设,不能有效地处理非均匀分布的数据集,故针对这一问题提出数据依赖的多索引哈希算法. 首先把二进制码划分为多个连续不重合的子串,并通过计算二进制码每位之间的相关性为每一个子串学习得到自适应投影向量;在为每个子串建立哈希表时,使用投影向量对子串进行投影,从而得到哈希表中的下标;采用自适应投影的方法可以使得哈希表中的元素接近于均匀分布,进而提升了查询速度. 此外,提出一个基于熵的分布度量方法,以评价哈希表中数据元素的分布情况.在大规模数据集上的实验表明,与多索引哈希算法相比,数据依赖的多索引哈希算法可以使查询速度提升36.9%~87.4%.

关键词: 最近邻查询, 二进制码, 索引, 多索引哈希

Abstract:

The multi-index hashing (MIH) is the state-of-the-art method for indexing binary codes. However, it is based on the dataset codes uniform distribution assumption, and will lower efficiency in dealing with non-uniformly distributed codes. In this paper, we propose a data-oriented multi-index hashing method. We first compute the correlations between bits and learn adaptive projection vector for each binary substring. Then, instead of using substrings as direct indices into hash tables, we project them with corresponding projection vectors to generate new indices. With adaptive projection, the indices in each hash table are nearly uniformly distributed. Besides, we put forward an entropy based measurement to evaluate the distribution of data items in each hash table. Experiments conducted on reference large scale datasets show that compared to the MIH the time performance of our method can be 36.9%~87.4% better .

Key words: nearest neighbor search, binary codes, indexing, multi-index Hashing

中图分类号: 

  • TP391