CN103279507B

CN103279507B - Webpage spider operational method and system

Info

Publication number: CN103279507B
Application number: CN201310181364.XA
Authority: CN
Inventors: 许大伦; 毛颖; 黄明军
Original assignee: BEIJING SHANGYOU TONGDA INFORMATION TECHNOLOGY Co Ltd
Current assignee: Lele Kaihang (beijing) Education Technology Co Ltd
Priority date: 2013-05-16
Filing date: 2013-05-16
Publication date: 2016-12-28
Anticipated expiration: 2033-05-16
Also published as: CN103279507A

Abstract

The invention discloses a webpage crawler operation method and system. The method mainly comprises: capturing URLs of a website according to the parameters of a preset mode and adding them to a memory queue; the memory queue judging whether a URL it already stores overlaps with the URL that has just been added; capturing data from the webpage under that URL, traversing the lower link URLs in the webpage, and judging whether they overlap; capturing data from the webpages under the lower link URLs, then judging whether any unprocessed URL remains and, if none remains, parsing the captured data according to preset conditions, extracting it, and transmitting it to a data processing queue; and the data processing queue analyzing the data against existing data and modifying the capture frequency in the parameters of the preset mode according to the analysis result information. The invention solves the problems in the prior art that a web crawler places an excessive additional burden on a website and cannot acquire website information accurately and effectively.

Description

Webpage crawler operation method and system

Technical Field

The invention relates to the technical field of networks, in particular to a webpage crawler operation method and a webpage crawler operation system.

Background

A search engine is a system that uses a specific computer program to collect information from the Internet according to a certain policy, organizes and processes that information, and then provides a search service that displays the information relevant to a user's search. The search engine's collection of information from the Internet relies on web crawlers crawling the relevant website information.

The web crawler is a program for automatically acquiring web page contents, and is an important component of a search engine.

In the prior art, a traditional crawler for a general-purpose search engine starts from the URLs of one or more initial webpages, obtains the URLs on those pages, and, while capturing webpages, continuously extracts new URLs from the current webpage and puts them into a queue until certain stop conditions of the system are met.

In the prior art, web crawlers have poor ability to analyze webpage content. They can only capture website information mechanically, repeating tens or hundreds of requests in a loop, so the crawling frequency and crawling pressure are very high. This consumes a large amount of website resources, burdens the website, and can even crash it. Meanwhile, such crawlers cannot accurately and efficiently capture the useful information in a website.

Therefore, how to solve the technical problems that, in the prior art, web crawlers place an excessive extra burden on websites and cannot acquire website information accurately and efficiently has become an urgent problem.

Disclosure of Invention

The invention aims to solve the technical problem of providing a webpage crawler operation method and a webpage crawler operation system to solve the problems that in the prior art, a network crawler causes excessive extra burden on a website and website information cannot be accurately and efficiently acquired.

In order to solve the technical problem, the invention provides a web crawler operation method, which is characterized by comprising the following steps:

capturing a URL of a website through parameters in a preset mode and adding the URL to a memory queue;

the memory queue judges whether the URL stored in the memory queue is overlapped with the URL which is just added and entered, if so, the URL is ignored; if not, capturing data of the webpage under the URL, traversing lower link URLs related in the webpage, judging whether the lower link URLs are overlapped or not, and if so, ignoring; if not, data are captured for the webpage under the lower link URL, then the memory queue judges whether unprocessed URLs exist or not, and if not, the captured data are analyzed according to preset conditions and extracted and transmitted to the data processing queue;

and the data processing queue compares and analyzes the data with the existing data and modifies the parameters of the preset mode according to the analysis result information.

Preferably, the parameters of the preset mode further include: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages.

Preferably, the preset conditions further include: support for parsing DOM data with jQuery-like syntax, and support for parsing JSON data and/or script data.

Preferably, the capturing the URL of the website and adding the URL to the memory queue according to the parameters in the preset manner further includes: setting a preset template according to the initialization environment of the system, the performance of the server, the network broadband condition and the capturing process number, setting parameters in a preset mode through the preset template, capturing the URL of the website through the parameters of the preset mode and adding the URL to a memory queue.

Preferably, parsing the captured data according to the preset conditions, extracting it, and transmitting it to the data processing queue further comprises: parsing the captured data according to the preset conditions (support for parsing DOM data with jQuery-like syntax, and support for parsing JSON data and/or script data), extracting the information, packaging the information, and transmitting it to the data processing queue.

Preferably, the modifying the parameter of the preset mode according to the analysis result information further includes: and modifying the content in the preset template according to the analysis result information, and modifying the capture frequency in the parameters of the preset mode through the preset template.

In order to solve the above technical problem, the present invention further provides a web crawler operating system, comprising: a capture module, a memory module, and a data analysis processing module; characterized in that:

the capturing module is used for capturing the URL of the website through parameters in a preset mode and transmitting and adding the URL to a memory queue in the memory module;

the memory module is used for receiving the website URL transmitted by the capturing module and storing the website URL in a memory queue, judging whether the URL stored in the memory queue is overlapped with the URL which is just added, and if so, ignoring the URL; if not, capturing data of the webpage under the URL, traversing lower link URLs related in the webpage, judging whether the lower link URLs are overlapped or not, and if so, ignoring; if not, data are captured from the webpage under the lower link URL, then the memory queue judges whether unprocessed URLs exist or not, and if not, the captured data are analyzed according to preset conditions and extracted and transmitted to the data analysis processing module;

the data analysis processing module is used for receiving the extracted data transmitted by the memory module and putting the data into its internal data processing queue; the data processing queue compares and analyzes the data with the existing data and modifies the parameters of the preset mode in the capture module according to the analysis result information.

Preferably, the parameters of the preset mode in the capture module further include: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages.

Preferably, the preset conditions in the memory module further include: support for parsing DOM data with jQuery-like syntax, and support for parsing JSON data and/or script data.

Preferably, the capturing module is further configured to set a preset template according to an initialization environment of the system, server performance, network broadband conditions, and the number of capturing processes, and set parameters in a preset manner through the preset template.

Preferably, the memory module is further configured to parse the captured data according to the preset conditions (support for parsing DOM data with jQuery-like syntax, and support for parsing JSON data and/or script data), extract the information, package the information, and transmit it to a data processing queue in the data analysis processing module.

Preferably, the data analysis processing module is further configured to modify, according to the analysis result information, content in the preset template in the capture module, and modify, through the preset template, capture frequency in the parameter of the preset mode.

Compared with the prior art, the webpage crawler operation method and the webpage crawler operation system achieve the following effects:

1) the method and the system reduce the crawling frequency and the crawling pressure of the web crawler and effectively reduce excessive extra burden on the website.

2) The invention realizes large-scale distributed concurrent acquisition, and greatly improves the efficiency of data acquisition and the efficiency of task customization.

3) The invention adopts the cloud technology, and realizes high accuracy of acquiring the required content.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

fig. 1 is a schematic block diagram of a process of a web crawler operation method according to a first embodiment of the present invention;

fig. 2 is a block diagram of a specific structure of a web crawler operating system according to a second embodiment of the present invention.

Detailed Description

As used in the specification and in the claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This specification and claims do not intend to distinguish between components that differ in name but not function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to". "Substantially" means within an acceptable error range, within which a person skilled in the art can solve the technical problem and substantially achieve the technical result. Furthermore, the term "coupled" is intended to encompass any direct or indirect electrical coupling. Thus, if a first device couples to a second device, that connection may be through a direct electrical coupling or through an indirect electrical coupling via other devices and couplings. The following description is of the preferred embodiment for carrying out the invention, and is made for the purpose of illustrating the general principles of the invention and not for the purpose of limiting the scope of the invention. The scope of the present invention is defined by the appended claims.

The present invention will be described in further detail below with reference to the accompanying drawings, but the present invention is not limited thereto.

Example one

Fig. 1 shows a process of a web crawler operation method according to a first embodiment of the present invention.

Step 101, capturing a URL (Uniform Resource Locator, i.e. a webpage address) of a website through the parameters of a preset mode and adding the URL to a memory queue;

Step 102, the memory queue judges whether a URL stored in the memory queue overlaps with the URL that has just been added; if so, the URL is ignored; if not, data is captured from the webpage under the URL, the lower link URLs in the webpage are traversed, and it is judged whether each lower link URL overlaps; if so, it is ignored; if not, data is captured from the webpage under the lower link URL, and the memory queue then judges whether any unprocessed URL exists; if not, the captured data is parsed according to preset conditions, extracted, and transmitted to the data processing queue;

Capturing data from the webpage under the URL and traversing the lower link URLs in that webpage is implemented by parsing the lower links in the webpage under the URL according to the preset conditions and performing a traversal search up to the traversal depth value set in the preset conditions. This is not repeated below.

Step 103, the data processing queue compares and analyzes the data with the existing data, and modifies the capture frequency in the parameters of the preset mode according to the analysis result information. (In this and the subsequent embodiments of the present invention the capture frequency is modified, but the modification is not limited to this parameter; other parameters of the preset mode may also be modified, which is not described again here.)
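As an illustration of the overlap check and the depth-limited traversal in step 102, the following minimal Python sketch walks a small in-memory link table instead of issuing real requests. The names `traverse`, `get_links`, `max_depth` and the `SITE` table are assumptions made for this example and do not appear in the original; breadth-first order is likewise an assumed choice, since the text only requires a traversal bounded by the traversal depth value.

```python
from collections import deque
from typing import Callable, Iterable, List


def traverse(seed_url: str,
             get_links: Callable[[str], Iterable[str]],
             max_depth: int) -> List[str]:
    """Visit pages breadth-first, skipping URLs already seen, down to max_depth."""
    seen = {seed_url}                 # plays the role of the deduplicating memory queue
    pending = deque([(seed_url, 0)])  # unprocessed URLs with their link depth
    fetched = []
    while pending:                    # ends when no unprocessed URL remains
        url, depth = pending.popleft()
        fetched.append(url)           # stand-in for capturing the page under this URL
        if depth >= max_depth:
            continue                  # do not follow links below the configured depth
        for link in get_links(url):
            if link in seen:
                continue              # overlapping URL: ignore it
            seen.add(link)
            pending.append((link, depth + 1))
    return fetched


# Hypothetical site: each key is a page, each value its lower link URLs.
SITE = {
    "/": ["/news", "/blog", "/"],
    "/news": ["/news/1", "/blog"],
    "/blog": ["/blog/a"],
    "/news/1": ["/deep"],
}

order = traverse("/", lambda url: SITE.get(url, []), max_depth=2)
```

With `max_depth=2`, the page `/deep`, which sits three links below the seed, is never fetched, and `/blog` is fetched only once even though two pages link to it.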

The parameters of the preset mode in step 101 include: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages.

Further, the preset mode is set as follows: a preset template is set according to the initialization environment of the system, the performance of the server, the network bandwidth conditions and the number of capture processes, and the parameters of the preset mode are set through the preset template.
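One way to picture the relation between the preset template and the preset-mode parameters is the sketch below. The classes `PresetTemplate` and `CrawlParams`, their field names, and the formulas inside `params_from_template` are all assumptions chosen for illustration; the original names the four parameters but does not state how they are derived from the environment.

```python
from dataclasses import dataclass


@dataclass
class PresetTemplate:
    """Environment facts the template is built from (hypothetical fields)."""
    seed_url: str
    process_count: int
    bandwidth_mbps: float
    base_interval_s: float = 3600.0


@dataclass
class CrawlParams:
    """The four preset-mode parameters named in the text."""
    initial_address: str
    crawl_interval_s: float    # inverse of the capture frequency
    page_delay_s: float        # capture delay between pages
    storage_queue_size: int    # data storage queue condition


def params_from_template(template: PresetTemplate) -> CrawlParams:
    # Assumed heuristic: slower links get a longer per-page delay, with a floor of 0.5 s.
    page_delay = max(0.5, 10.0 / template.bandwidth_mbps)
    # Assumed heuristic: queue capacity grows with the number of capture processes.
    queue_size = 100 * template.process_count
    return CrawlParams(template.seed_url, template.base_interval_s, page_delay, queue_size)


params = params_from_template(
    PresetTemplate("http://example.com/", process_count=4, bandwidth_mbps=20.0)
)
```

Keeping the template separate from the derived parameters means a later adjustment can edit the template and regenerate the parameters, which matches the order of operations described for step 103.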

In step 103, the capturing frequency in the preset mode is modified according to the analysis result information, and the method further includes: and modifying the content in the preset template according to the analysis result information, and modifying the capture frequency in the parameters of the preset mode through the preset template.

Further, the memory queue may also be referred to as a deduplication queue in this embodiment, and for those skilled in the art, it can be completely understood that the meanings expressed by the deduplication queue and the memory queue are consistent, and details are not described later.

Further, the preset conditions in step 102 include: support for parsing DOM data with jQuery-like syntax, and support for parsing JSON data and/or script data.

In step 102, parsing the captured data according to the preset conditions, extracting it, and transmitting it to the data processing queue further comprises: parsing the captured data according to the preset conditions (support for parsing DOM data with jQuery-like syntax, and support for parsing JSON data and/or script data), extracting the information (including data information and binary files such as pictures and/or Flash), packaging the information, and transmitting it to the data processing queue.
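The parsing and packaging described here can be sketched with the Python standard library alone. "jQuery-like syntax" is read below as CSS-style selectors, and only the simplest form, `tag` or `tag.class`, is handled. The names `SelectText`, `select`, `package` and the sample `PAGE` are hypothetical, and the sketch assumes markup in which every element is explicitly closed.

```python
import json
from html.parser import HTMLParser


class SelectText(HTMLParser):
    """Collect the text of elements matching a simple 'tag' or 'tag.class' selector."""

    def __init__(self, selector: str):
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self.stack = []      # one flag per open element: does it match the selector?
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        matches = tag == self.tag and (not self.cls or self.cls in classes)
        if matches:
            self.results.append("")
        self.stack.append(matches)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text belongs to a match if any enclosing open element matched.
        if any(self.stack) and self.results:
            self.results[-1] += data


def select(html: str, selector: str) -> list:
    parser = SelectText(selector)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


def package(url: str, html: str, json_text: str) -> dict:
    """Bundle the extracted DOM text and JSON payload into one record for the queue."""
    return {"url": url, "titles": select(html, "h2.title"), "data": json.loads(json_text)}


PAGE = ('<div><h2 class="title">Hello <b>world</b></h2>'
        '<p class="body">text</p><h2 class="title big">Second</h2></div>')
record = package("http://example.com/", PAGE, '{"price": 12.5, "tags": ["a", "b"]}')
```

A production crawler would more likely rely on a full selector engine; the hand-written parser is used here only to keep the example free of third-party dependencies.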

The purpose of the analysis in step 103 is to analyze the change frequency of the website data to correct the capturing frequency, so as to solve the problems that in the prior art, a web crawler causes excessive burden on a website and cannot accurately and efficiently acquire website information.
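A minimal sketch of such a correction is given below, assuming that a change is detected by comparing content fingerprints and that the capture interval is doubled when nothing changed and halved when something did. The function names, the doubling and halving policy, and the bounds `min_s` and `max_s` are assumptions for illustration; the original only states that the change frequency of the website data is used to correct the capture frequency.

```python
import hashlib
from typing import Optional, Tuple


def fingerprint(content: str) -> str:
    """Stable digest used to compare newly captured data with existing data."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def next_interval(interval_s: float,
                  old_fp: Optional[str],
                  new_content: str,
                  min_s: float = 300.0,
                  max_s: float = 86400.0) -> Tuple[float, str]:
    """Return the corrected capture interval and the fingerprint to store."""
    new_fp = fingerprint(new_content)
    if old_fp is None:
        pass                    # first capture: nothing to compare against yet
    elif new_fp == old_fp:
        interval_s *= 2         # unchanged content: capture less often
    else:
        interval_s /= 2         # changed content: capture more often
    return min(max(interval_s, min_s), max_s), new_fp


interval, fp = next_interval(3600.0, None, "v1")      # first capture
interval_same, fp = next_interval(interval, fp, "v1")  # content unchanged
interval_changed, fp = next_interval(interval_same, fp, "v2")  # content changed
```

The lower bound keeps a rapidly changing site from being requested too often, which is the burden the method sets out to reduce.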

The specific operation of the first embodiment of the present invention may be:

firstly, a preset template is set according to the initialization environment of the system, the performance of the server, the network broadband condition and the number of grabbing processes, parameters in a preset mode are set through the preset template, the number of simultaneously and concurrently grabbing processes and the number of websites are set, and the grabbing processes are evenly distributed to each website, so that the websites can be reasonably grabbed, and the pressure of dense access to the grabbed websites caused by crawlers is avoided, and the efficiency is not lost;

secondly, the URL of the website is captured according to the parameters of the preset mode and added to the memory queue. A website has many pages, and without a template the crawler would capture every page in the website; since not every page is useful to the user, this wastes network resources. With the preset template, the crawler captures only the data the user is interested in. The parameters of the preset mode include: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages;

thirdly, the memory queue judges whether a URL stored in it overlaps with the URL that has just been added; if so, the URL is ignored; if not, data is captured from the webpage under the URL, the lower link URLs in the webpage are traversed, and it is judged whether each lower link URL overlaps; if so, it is ignored; if not, data is captured from the webpage under the lower link URL. The memory queue then judges whether any unprocessed URL remains; if not, the captured data is parsed according to the preset conditions (which include support for parsing DOM data with jQuery-like syntax, and support for parsing JSON data and/or script data), the information (including data information and binary files such as pictures and/or Flash) is extracted, packaged, and transmitted to the data processing queue;

and finally, the data processing queue compares and analyzes the data with the existing data, modifies the content in a preset template according to the analysis result information, and modifies the capture frequency in the parameters of the preset mode through the preset template.
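The first of the steps above distributes the capture processes evenly across the websites so that no single site receives dense access. A small sketch of one such allocation follows; the function name `distribute_processes` and the host names are hypothetical, and giving the remainder to the first sites in the list is an assumed tie-breaking rule.

```python
def distribute_processes(process_count: int, sites: list) -> dict:
    """Split capture processes across sites so that counts differ by at most one."""
    if not sites:
        raise ValueError("at least one website is required")
    base, extra = divmod(process_count, len(sites))
    # The first `extra` sites receive one additional process each.
    return {site: base + (1 if index < extra else 0)
            for index, site in enumerate(sites)}


plan = distribute_processes(8, ["site-a.example", "site-b.example", "site-c.example"])
```

Here eight processes over three sites yield an allocation of 3, 3 and 2, so the total is preserved and the load on any one site stays bounded.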

Example two

As shown in fig. 2, a web crawler operating system according to a second embodiment of the present invention includes: a capture module 201, a memory module 202 and a data analysis processing module 203; wherein,

The capture module 201 is coupled to the memory module 202 and the data analysis processing module 203, and is configured to capture a URL (Uniform Resource Locator) of a website according to the parameters of a preset mode, and to transmit and add the URL to a memory queue in the memory module 202.

The memory module 202 is coupled to the capture module 201 and the data analysis processing module 203, and is configured to receive a URL of a website transmitted by the capture module 201 and store the URL in its internal memory queue, and then determine whether a URL stored in the memory queue overlaps with the URL that has just been added; if so, the URL is ignored; if not, data is captured from the webpage under the URL, the lower link URLs in the webpage are traversed, and it is judged whether each lower link URL overlaps; if so, it is ignored; if not, data is captured from the webpage under the lower link URL, and the memory queue then judges whether any unprocessed URL exists; if not, the captured data is parsed according to preset conditions, extracted, and transmitted to the data processing queue of the data analysis processing module 203.

The preset conditions further include: support for parsing DOM data with jQuery-like syntax, and support for parsing JSON data and/or script data.

Further, the memory module 202 parses the captured data according to the preset conditions (support for parsing DOM data with jQuery-like syntax, and support for parsing JSON data and/or script data), extracts the information (including data information and binary files such as pictures and/or Flash), packages the information, and transmits it to a data processing queue in the data analysis processing module 203.

The data analysis processing module 203 is coupled to the capture module 201 and the memory module 202, and is configured to receive the extracted data transmitted by the memory module 202 and place the data into its internal data processing queue, where the data processing queue compares and analyzes the data with existing data, and modifies the capture frequency in the parameters of the preset mode in the capture module 201 according to the analysis result information.

In this embodiment, the parameters of the preset mode further include: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages.

Wherein, for the setting process of the preset mode, the method further comprises the following steps: the capture module 201 sets a preset template according to the initialization environment of the system, the performance of the server, the network broadband condition, and the number of capture processes, and sets parameters in a preset mode through the preset template.

Further, the data analysis processing module 203 modifies the content in the preset template in the capture module 201 according to the analysis result information, and modifies the capture frequency in the parameter of the preset mode according to the preset template.
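The coupling between the three modules can be pictured with the following sketch, in which the analysis module holds a reference to the capture module so that it can change the capture interval. The class names mirror the module names, but every method, the `PAGES` table, and the doubling and halving policy are assumptions for illustration; depth limiting and the preset template are omitted to keep the wiring visible.

```python
from collections import deque


class CaptureModule:
    """Holds the preset-mode parameters that the analysis module may modify."""
    def __init__(self, seed_url, interval_s):
        self.seed_url = seed_url
        self.interval_s = interval_s


class AnalysisModule:
    def __init__(self, capture):
        self.capture = capture
        self.queue = deque()     # data processing queue
        self.existing = {}       # existing data: url -> content

    def receive(self, url, content):
        self.queue.append((url, content))

    def process(self):
        compared = changed = 0
        while self.queue:
            url, content = self.queue.popleft()
            if url in self.existing:
                compared += 1
                changed += self.existing[url] != content
            self.existing[url] = content
        if changed:
            self.capture.interval_s /= 2     # assumed policy: data changed, capture sooner
        elif compared:
            self.capture.interval_s *= 2     # assumed policy: nothing changed, back off
        return changed


class MemoryModule:
    def __init__(self, analysis):
        self.analysis = analysis

    def run(self, seed_url, fetch):
        seen, pending, captured = {seed_url}, deque([seed_url]), []
        while pending:
            url = pending.popleft()
            content, links = fetch(url)
            captured.append((url, content))
            for link in links:
                if link not in seen:         # overlapping URLs are ignored
                    seen.add(link)
                    pending.append(link)
        for url, content in captured:        # forwarded once no unprocessed URL remains
            self.analysis.receive(url, content)


# Hypothetical site: url -> (content, lower link URLs).
PAGES = {"/": ("home v1", ["/a"]), "/a": ("a v1", ["/"])}
capture = CaptureModule("/", 3600.0)
analysis = AnalysisModule(capture)
memory = MemoryModule(analysis)


def crawl_round():
    memory.run(capture.seed_url, lambda url: PAGES[url])
    return analysis.process()


crawl_round()
interval_after_first = capture.interval_s
crawl_round()
interval_after_unchanged = capture.interval_s
PAGES["/a"] = ("a v2", [])
crawl_round()
interval_after_change = capture.interval_s
```

In this sketch the first round only populates the existing data, the second round finds no change and doubles the interval, and the third round detects the edited page and halves it again.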

The transmission and collection process can be realized through a cloud technology, so that large-scale distributed concurrent collection can be performed, the data collection efficiency is improved, required contents are accurately obtained, and efficient customization of tasks is facilitated to the maximum extent; meanwhile, the webpage crawler operating system flexibly collects all structural contents seen by the browser through configuring a template, and supports various page types including news, forums, blogs, pictures and the like.

The foregoing description shows and describes several preferred embodiments of the invention. As stated above, however, the invention is not limited to the forms disclosed herein; this should not be construed as excluding other embodiments, and the invention can be used in various other combinations, modifications, and environments and can be changed within the scope of the inventive concept described herein, in accordance with the above teachings or the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention shall fall within the protection of the appended claims.

Claims

1. A web crawler operation method, comprising:

the memory queue judges whether a URL stored in the memory queue overlaps with the URL which has just been added; if so, the URL which has just been added is ignored; if not, data is captured from the webpage under the URL which has just been added, the lower link URLs in the webpage are traversed, and it is judged whether each lower link URL overlaps; if so, it is ignored; if not, data is captured from the webpage under the lower link URL, and the memory queue then judges whether an unprocessed URL exists; if not, the captured data is parsed according to preset conditions, extracted, and transmitted to a data processing queue; wherein capturing data from the webpage under the URL which has just been added and traversing the lower link URLs in the webpage comprises parsing the lower links in the webpage under the URL which has just been added according to the preset conditions, and performing a traversal search according to the traversal depth value in the preset conditions;

the data processing queue compares and analyzes the extracted data with the existing data, and modifies the parameters of the preset mode according to the analysis result information, wherein the parameters of the preset mode further comprise: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages.

2. The web crawler operation method of claim 1, wherein the preset conditions further comprise: support for parsing DOM data with jQuery-like syntax, and support for parsing JSON data and/or script data.

3. The web crawler operation method of claim 2, wherein the capturing of the URL of the website and the adding to the memory queue by the parameters of the preset manner further comprises: setting a preset template according to the initialization environment of the system, the performance of the server, the network broadband condition and the capturing process number, setting parameters in a preset mode through the preset template, capturing the URL of the website through the parameters of the preset mode and adding the URL to a memory queue.

4. The web crawler operation method of claim 3, wherein parsing the captured data according to the preset conditions, extracting it, and transmitting it to the data processing queue further comprises: parsing the captured data according to the preset conditions (support for parsing DOM data with jQuery-like syntax, and support for parsing JSON data and/or script data), extracting the data, packaging the extracted data, and transmitting the packaged data to the data processing queue.

5. The web crawler operation method of claim 4, wherein the modifying the parameter of the preset manner according to the analysis result information further comprises: and modifying the content in the preset template according to the analysis result information, and modifying the capture frequency in the parameters of the preset mode through the preset template.

6. A web crawler operating system, comprising: a capture module, a memory module, and a data analysis processing module; characterized in that:

the capturing module is used for capturing the URL of the website through parameters in a preset mode and transmitting and adding the URL of the website to a memory queue in the memory module;

the memory module is used for receiving the URL of the website transmitted by the capture module, storing the URL in its internal memory queue, and judging whether a URL stored in the memory queue overlaps with the URL which has just been added; if so, the URL which has just been added is ignored; if not, data is captured from the webpage under the URL which has just been added, the lower link URLs in the webpage are traversed, and it is judged whether each lower link URL overlaps; if so, it is ignored; if not, data is captured from the webpage under the lower link URL, and the memory queue then judges whether any unprocessed URL exists; if not, the captured data is parsed according to preset conditions, extracted, and transmitted to the data analysis processing module; wherein capturing data from the webpage under the URL which has just been added and traversing the lower link URLs in the webpage comprises parsing the lower links in the webpage under the URL which has just been added according to the preset conditions, and performing a traversal search according to the traversal depth value in the preset conditions;

the data analysis processing module is configured to receive the extracted data transmitted by the memory module and place the extracted data into its internal data processing queue, where the data processing queue compares and analyzes the extracted data with existing data, and modifies the parameters of the preset mode in the capture module according to analysis result information, where the parameters of the preset mode in the capture module further include: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages.

7. The web crawler operating system of claim 6, wherein the preset conditions in the memory module further comprise: support for parsing DOM data with jQuery-like syntax, and support for parsing JSON data and/or script data.

8. The web crawler operating system of claim 7, wherein the crawling module is further configured to set a preset template according to an initialization environment of the system, a server performance, a network broadband condition, and a crawling process number, and set parameters in a preset manner through the preset template.

9. The web crawler operating system of claim 8,

the memory module is further configured to parse the captured data according to the preset conditions (support for parsing DOM data with jQuery-like syntax, and support for parsing JSON data and/or script data), extract the data, package the extracted data, and transmit the packaged data to a data processing queue in the data analysis processing module.

10. The web crawler operating system of claim 9,

the data analysis processing module is further configured to modify the content in the preset template in the capture module according to the analysis result information, and modify the capture frequency in the parameter of the preset mode through the preset template.