Efficient Algorithm for Crawling Ajax Web Pages
Abstract
Ajax web pages are generated and navigated by executing client-side JavaScript, so traditional crawling algorithms cannot extract the complete content of an Ajax page. This paper analyzes the working mode of Ajax, elaborates the problems of crawling Ajax web pages, and proposes an effective algorithm for crawling them. The algorithm dynamically generates Ajax web content in the client browser, navigates between Ajax web pages, and assigns an identification number to each crawled page for which a static page can be generated. Experimental results show that the proposed algorithm crawls markedly more Ajax pages than traditional algorithms, and that the presented replica-detection policies effectively reduce the algorithm's running time.
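The general idea of exploring Ajax page states, numbering them, and skipping replicas can be illustrated with a minimal sketch. It is not the algorithm proposed in this paper, whose details are not given in the abstract. It assumes a hypothetical stand-in for the browser: a page state is plain DOM text, each event is a function from one state to the next, and replicas are detected by hashing normalised content. The names `content_hash`, `crawl_states`, `next_page`, and `prev_page` are illustrative only.

```python
import hashlib
from collections import deque
from typing import Callable, Dict, List, Tuple

# Assumption: a "state" is the DOM text of the page, and an event maps the
# current DOM text to the DOM text after the event fires. A real crawler
# would drive a browser engine instead.
Event = Callable[[str], str]


def content_hash(dom: str) -> str:
    """Fingerprint a page state after collapsing whitespace."""
    normalised = " ".join(dom.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def crawl_states(
    initial_dom: str, events: List[Event], max_states: int = 100
) -> Tuple[Dict[int, str], int]:
    """Breadth-first exploration of the states reachable by firing events.

    Returns the states keyed by their assigned ID and the number of
    replicas that were detected and skipped.
    """
    seen: Dict[str, int] = {content_hash(initial_dom): 0}
    states: Dict[int, str] = {0: initial_dom}
    queue = deque([0])
    duplicates = 0
    while queue:
        state_id = queue.popleft()
        for fire in events:
            if len(states) >= max_states:
                return states, duplicates
            new_dom = fire(states[state_id])
            key = content_hash(new_dom)
            if key in seen:
                # Replica of a state already crawled: do not expand it again.
                duplicates += 1
                continue
            new_id = len(states)  # IDs are assigned in discovery order
            seen[key] = new_id
            states[new_id] = new_dom
            queue.append(new_id)
    return states, duplicates


# Toy three-page pager used as a hypothetical Ajax application.
def next_page(dom: str) -> str:
    n = int(dom.split()[-1])
    return f"page {min(n + 1, 3)}"


def prev_page(dom: str) -> str:
    n = int(dom.split()[-1])
    return f"page {max(n - 1, 1)}"


states, duplicates = crawl_states("page 1", [next_page, prev_page])
```

In this toy run the crawler discovers three distinct states, numbered 0 to 2, and skips four replicas. Each skipped replica is a state that would otherwise be expanded again, which is where detection of this kind saves time.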