Abstract:
Traditional crawling methods tend to become trapped in local optima of the search and rarely adjust the crawling path based on historical crawling experience. To address this, a focused crawler method based on Wang-Landau (WL) sampling is proposed. The method uses the vector space model (VSM) and the PageRank algorithm to evaluate the relevance and importance of links, respectively. A regional competition strategy selects target links from a link set that contains both topic-related links and links with potential value. Based on a probability density function, the WL algorithm samples the selected target links and uses accumulated historical statistics to guide subsequent crawling, thereby optimizing the search path. Experimental results show that the WL-based focused crawling method retrieves more topic-relevant webpages than other methods in the literature, and that the crawling accuracy and the standard deviation of topic relevance over all downloaded pages are also significantly improved.
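
To make the VSM relevance evaluation concrete, the following is a minimal sketch and not the paper's implementation. It assumes raw term-frequency weights and whitespace tokenization; the weighting scheme actually used is not stated in the abstract, and the function names `term_vector` and `cosine_relevance` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Raw term-frequency vector (assumed weighting; TF-IDF is a common alternative)."""
    return Counter(text.lower().split())


def cosine_relevance(topic, page):
    """Cosine similarity between the topic and page term vectors, in [0, 1]."""
    a, b = term_vector(topic), term_vector(page)
    # Only terms shared by both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    # An empty text has a zero vector; treat it as irrelevant.
    return dot / norm if norm else 0.0
```

In a crawler of this kind, such a score would typically be computed between a topic description and the text around a link, then combined with an importance measure such as PageRank to rank candidate links.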