Focused Crawler Method Based on Wang-Landau Sampling

刘景发; 陈靖岚; 赵鹏

doi:10.12178/1001-0548.2022183


Abstract: Traditional focused crawlers tend to get trapped in local optima during search and rarely use historical crawling experience to correct the crawling path. To address these shortcomings, a focused crawling method based on Wang-Landau (WL) sampling is proposed. The method uses the vector space model (VSM) and the PageRank algorithm to evaluate the relevance and the importance of links, respectively, and applies a regional competition strategy to select target links from the set of links that are topic-relevant or of potential value. Based on a probability density function, the WL sampling algorithm then decides whether to accept each target link selected from the candidate set, so that historical statistics guide the subsequent crawl and the search path is optimized. Experimental results show that the proposed method retrieves more topic-relevant webpages than the other focused crawling methods compared, and that it has a clear advantage in crawling precision and in the standard deviation of the topic relevance of all downloaded pages.
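The abstract names the building blocks without giving formulas, so the following is only a minimal sketch of two of them, not the authors' implementation. It assumes term-frequency vectors for the VSM relevance score, relevance scores in [0, 1] divided into equal-width bins, and the standard Wang-Landau acceptance rule min(1, g(current)/g(candidate)). The names `cosine_similarity` and `WangLandauSampler`, the bin count, and the modification factor are hypothetical choices for illustration; PageRank and the regional competition strategy are omitted.

```python
import math
import random
from collections import Counter


def cosine_similarity(doc_tokens, topic_tokens):
    """VSM relevance: cosine between two term-frequency vectors (assumed weighting)."""
    a, b = Counter(doc_tokens), Counter(topic_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class WangLandauSampler:
    """Generic Wang-Landau acceptance over relevance bins (illustrative only)."""

    def __init__(self, n_bins=10, ln_f=1.0, seed=None):
        self.n_bins = n_bins
        self.ln_f = ln_f              # log of the modification factor (assumed value)
        self.ln_g = [0.0] * n_bins    # running estimate of ln g for each bin
        self.hist = [0] * n_bins      # how often each bin has been visited
        self.rng = random.Random(seed)

    def bin_of(self, score):
        # Map a score in [0, 1] to a bin index; 1.0 falls into the last bin.
        return min(int(score * self.n_bins), self.n_bins - 1)

    def accept(self, current_score, candidate_score):
        cur, cand = self.bin_of(current_score), self.bin_of(candidate_score)
        # Accept with probability min(1, g(cur) / g(cand)), computed in log space.
        log_ratio = self.ln_g[cur] - self.ln_g[cand]
        accepted = log_ratio >= 0 or self.rng.random() < math.exp(log_ratio)
        # Update the density estimate and histogram of the bin the walker ends in.
        visited = cand if accepted else cur
        self.ln_g[visited] += self.ln_f
        self.hist[visited] += 1
        return accepted


if __name__ == "__main__":
    topic = ["focused", "crawler", "sampling"]
    page = ["focused", "crawler", "news"]
    sampler = WangLandauSampler(seed=0)
    score = cosine_similarity(page, topic)
    print(score, sampler.accept(0.2, score))
```

Because bins that have been visited often accumulate a larger density estimate, links whose relevance falls in rarely visited bins are more likely to be accepted, which is one way accumulated history can steer a crawler away from a local optimum.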
