2012, Vol. 25, Issue (1): 105-.

• Article

Application Research of Web Page Purification Based on DOM and Neural Network

LI Jian

  1. (Battle Laboratory, Nanchang Army College, Nanchang 330103, China)
  • Online: 2012-01-15 Published: 2012-01-10
  • About the author: LI Jian (b. 1980), male, master's degree, lecturer. Research interests: computer networks and computer network security.

Abstract:

To filter noise out of web pages efficiently, this paper proposes a web page purification method based on an improved DOM tree and a BP neural network. Using the features of the DOM tree and of the page content, a content block tree is built with HTMLParser. This splits the page content into several sub-blocks according to their relevance, so that processing the whole content block reduces to processing the individual sub-blocks. Statistics show that the content of the sub-blocks has distinct numerical features, and these features can serve as the learning source for the BP neural network. In this way, web page purification is converted into the problem of building a filtering model through learning. Experimental results show that the method achieves satisfactory results on Chinese web pages that have a theme.

Key words: web page purification; DOM tree; content block; neural network
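
The pipeline outlined in the abstract (split the page into content blocks, describe each block by numerical features, then learn a filter with a BP network) can be sketched roughly as below. This is only an illustration under stated assumptions, not the paper's implementation: it uses Python's standard-library `html.parser` in place of the HTMLParser library named in the abstract, and the block-level tag set, the two features (normalized text length and link-text ratio), the network size, and the names `BlockSplitter`, `block_features`, `TinyBP` and `mse` are all hypothetical choices made for this example.

```python
import math
import random
from html.parser import HTMLParser

# Tags treated as content-block boundaries (an assumption for this sketch).
BLOCK_TAGS = {"div", "td", "p", "ul", "table"}


class BlockSplitter(HTMLParser):
    """Collects [text_chars, link_chars] per block; assumes well-formed markup."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished blocks that contain text
        self.stack = []      # currently open blocks, innermost last
        self.link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append([0, 0])
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            block = self.stack.pop()
            if block[0] > 0:  # drop blocks that hold no text of their own
                self.blocks.append(block)
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            self.stack[-1][0] += n  # text counts toward the innermost block
            if self.link_depth:
                self.stack[-1][1] += n


def block_features(html, max_chars=200):
    """Return (normalized text length, link-text ratio) for each block."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return [(min(text / max_chars, 1.0), link / text)
            for text, link in splitter.blocks]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyBP:
    """One-hidden-layer network trained with plain backpropagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight row carries a trailing bias weight.
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                    for _ in range(n_hidden)]
        self.w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w_h]
        out = sigmoid(sum(w * v for w, v in zip(self.w_o, hidden + [1.0])))
        return hidden, out

    def predict(self, x):
        return self.forward(x)[1]

    def train(self, samples, epochs=500, lr=0.5):
        for _ in range(epochs):
            for x, target in samples:
                hidden, out = self.forward(x)
                xb = list(x) + [1.0]
                hb = hidden + [1.0]
                # Gradient of the squared error through the output sigmoid.
                delta_o = (out - target) * out * (1.0 - out)
                # Hidden deltas use the output weights before their update.
                delta_h = [delta_o * self.w_o[j] * h * (1.0 - h)
                           for j, h in enumerate(hidden)]
                self.w_o = [w - lr * delta_o * h
                            for w, h in zip(self.w_o, hb)]
                for j, row in enumerate(self.w_h):
                    self.w_h[j] = [w - lr * delta_h[j] * v
                                   for w, v in zip(row, xb)]


def mse(net, samples):
    """Mean squared error of the network over labeled samples."""
    return sum((net.predict(x) - t) ** 2 for x, t in samples) / len(samples)
```

In use, one would label a set of blocks by hand (1.0 for content, 0.0 for noise such as navigation bars), call `TinyBP(2, 3).train(samples)` on the `(features, label)` pairs, and then keep only those blocks of a new page whose `predict` score exceeds a chosen threshold such as 0.5. The threshold and the labeling scheme are likewise assumptions of this sketch.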

CLC Number:

  • TP393.07