2016, Vol. 29, Issue (9): 7-.

Chinese Character String Matching Based on Improved Edit Distance and Similarity

SHAO Qing, YE Kun

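  1. School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
  • Online: 2016-09-15 Published: 2016-09-26
  • About the authors: SHAO Qing (b. 1970), female, PhD, associate professor. Research interests: network intelligence, among others. YE Kun (b. 1993), female, master's student. Research interests: network intelligence.
  • Supported by:

    National Natural Science Foundation of China (61170277); Scientific Research Innovation Fund of the Shanghai Municipal Education Commission (02120557)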

Chinese Character String Matching Algorithm Based on Improved Edit Distance and Similarity

SHAO Qing, YE Kun   

  1. (School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)
  • Online: 2016-09-15 Published: 2016-09-26

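Abstract:

To address the low accuracy of Chinese string matching, an approximate matching algorithm for Chinese character strings based on improved edit distance and similarity is proposed. To reflect the characteristics of Chinese character strings, the computation uses the Pinyin and Wubi codes of the characters. An improved dynamic programming algorithm raises both the accuracy and the execution efficiency of the edit distance computation. A normalization algorithm that accounts for transpositions is then introduced, taking the ratio of the semantic edit distance to the length of the longer sentence as the normalized result, which improves the accuracy of approximate matching. Experimental results show that the similarity computed by the improved algorithm is of higher quality than that of the unimproved algorithm, with clear improvements in efficiency, recall, precision, and time performance, demonstrating the feasibility and effectiveness of the algorithm.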
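
As a rough illustration of how Pinyin and Wubi codes could feed into an edit distance, the Python sketch below derives a substitution cost between two characters from the similarity of their codes. The `CODES` table, the `w_pinyin` weight, and all function names are assumptions made for this example. The abstract does not give the paper's actual cost model, and the table entries are hand-entered for illustration and not checked against an official code table.

```python
# Hypothetical mini code table: character -> (Pinyin, Wubi code).
# Entries are illustrative placeholders, not an authoritative table.
CODES = {
    "清": ("qing", "igeg"),
    "请": ("qing", "ygeg"),
    "情": ("qing", "ngeg"),
    "京": ("jing", "yiu"),
}


def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def code_similarity(x, y):
    """Similarity in [0, 1] between two code strings."""
    longer = max(len(x), len(y))
    if longer == 0:
        return 1.0
    return 1.0 - levenshtein(x, y) / longer


def substitution_cost(c1, c2, w_pinyin=0.5):
    """Assumed cost in [0, 1] of replacing c1 with c2.

    Characters missing from CODES fall back to the full cost of 1.0.
    """
    if c1 == c2:
        return 0.0
    if c1 not in CODES or c2 not in CODES:
        return 1.0
    p1, s1 = CODES[c1]
    p2, s2 = CODES[c2]
    sim = (w_pinyin * code_similarity(p1, p2)
           + (1.0 - w_pinyin) * code_similarity(s1, s2))
    return 1.0 - sim
```

Under these assumptions, `substitution_cost("清", "请")` comes out lower than `substitution_cost("清", "京")`, because the first pair shares its Pinyin and most of its Wubi code in the illustrative table.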

Key words: edit distance, similarity, normalization, Chinese character string, approximate matching

Abstract:

An approximate matching algorithm for Chinese character strings, based on improved edit distance and similarity, is proposed to raise the accuracy of Chinese string matching. First, Pinyin and Wubi codes are used in the computation to reflect the characteristics of Chinese character strings. The dynamic programming algorithm is then improved to increase the accuracy and execution efficiency of the edit distance calculation. Next, a normalization algorithm that accounts for transpositions is introduced, taking the ratio of the semantic edit distance to the length of the longer sentence as the normalized result, which improves the accuracy of approximate matching. Experimental results show that the improved algorithm yields better similarity results than the original algorithm, with marked improvements in efficiency, recall, precision, and time cost.
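
The normalization step can be pictured with a minimal Python sketch: an edit distance that also counts a swap of two adjacent characters as one operation, divided by the length of the longer string. This uses the standard optimal string alignment variant of Damerau-Levenshtein distance as a stand-in. The function names, the unit costs, and the choice to report `1 - ratio` as the similarity are assumptions, since the abstract does not specify the paper's exact formulation.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute,
    and transposition of two adjacent characters, each with cost 1."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent transposition, e.g. "海理" <-> "理海".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


def normalized_similarity(a, b):
    """Assumed similarity: 1 - distance / length of the longer string."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 1.0
    return 1.0 - osa_distance(a, b) / longer
```

For example, `osa_distance("上海理工", "上理海工")` counts the swapped middle pair as a single operation, so `normalized_similarity` gives 0.75 for that pair, where a plain Levenshtein distance of 2 would give 0.5.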

Key words: edit distance, similarity, normalization, Chinese character string, approximate matching

CLC Number:

  • TP391.41