
CN103279507A - Webpage spider operational method and system - Google Patents

Publication number
CN103279507A
Authority
CN
China
Prior art keywords
data
url
preset
queue
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310181364XA
Other languages
Chinese (zh)
Other versions
CN103279507B (en)
Inventor
许大伦
毛颖
黄明军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lele Kaihang (beijing) Education Technology Co Ltd
Original Assignee
BEIJING SHANGYOU TONGDA INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SHANGYOU TONGDA INFORMATION TECHNOLOGY Co Ltd
Priority to CN201310181364.XA
Publication of CN103279507A
Application granted
Publication of CN103279507B
Legal status: Active

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a webpage spider operational method and system. The method mainly comprises the following steps that a URL of a website is captured according to parameters of a preset mode and appended to an internal storage queue; the internal storage queue judges whether a URL stored in the internal storage queue overlaps with the URL which is just appended; data of a webpage under the URL are captured, traversal of lower link URLs related to the webpage is conducted, and whether the URLs overlap is judged; data of webpages under the lower link URLs are captured, and then whether unprocessed URLs exist is judged; if no unprocessed URL exists, the captured data are analyzed according to preset conditions, and the captured data are extracted and transmitted to a data processing queue; the data processing queue contrasts the data with existing data for analysis, and capture frequency of the parameters of the preset mode is modified according to analysis result information. The webpage spider operational method and system is used for solving the problems in the prior art that a webpage spider causes an excessively heavy added burden to the website, and information of the website can not be captured accurately and efficiently.

Description

Webpage crawler operation method and system
Technical Field
The invention relates to the technical field of networks, in particular to a webpage crawler operation method and a webpage crawler operation system.
Background
A search engine is a system that collects information from the Internet with a specific computer program according to a certain policy, organizes and processes that information, provides a search service to users, and displays the information relevant to each user's search. To collect information from the Internet, the search engine relies on a web crawler to crawl the relevant websites.
The web crawler is a program for automatically acquiring web page contents, and is an important component of a search engine.
In the prior art, the traditional crawler of a general-purpose search engine starts from the URLs of one or more initial webpages, obtains the URLs on those pages, and, while capturing webpages, continuously extracts new URLs from the current webpage and puts them into a queue until a stop condition of the system is met.
In the prior art, the web crawler analyzes webpage content poorly. It can only capture website information mechanically and continuously, often issuing dozens or hundreds of repeated requests in a loop. The resulting crawl frequency and crawl pressure are very high, so website resources are heavily consumed, the website is burdened, and it may even crash. At the same time, the web crawler cannot accurately and efficiently capture the useful information in a website.
Therefore, how to keep a web crawler from placing an excessive extra burden on a website while still acquiring website information accurately and efficiently has become a technical problem that urgently needs to be solved.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a web crawler operation method and a web crawler operating system that address the problems in the prior art that a web crawler places an excessive extra burden on a website and cannot acquire website information accurately and efficiently.
In order to solve the technical problem, the invention provides a web crawler operation method, which is characterized by comprising the following steps:
capturing a URL of a website through parameters in a preset mode and adding the URL to a memory queue;
the memory queue judges whether the newly added URL duplicates a URL already stored in the memory queue, and if so, ignores it; if not, data are captured from the webpage under the URL, the lower-level link URLs contained in the webpage are traversed, and each lower-level link URL is checked for duplication and ignored if it is a duplicate; if it is not a duplicate, data are captured from the webpage under the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data are parsed according to preset conditions, extracted, and transmitted to the data processing queue;
and the data processing queue compares and analyzes the data with the existing data and modifies the parameters of the preset mode according to the analysis result information.
Preferably, the parameters of the preset mode further include: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages.
Preferably, the preset conditions further include: parsing DOM data with support for jQuery-like syntax, and parsing JSON data and/or script data.
Preferably, capturing the URL of the website according to the parameters of the preset mode and adding it to the memory queue further comprises: setting a preset template according to the initialization environment of the system, the server performance, the network bandwidth condition, and the number of capture processes; setting the parameters of the preset mode through the preset template; and capturing the URL of the website according to those parameters and adding it to the memory queue.
Preferably, parsing the captured data according to the preset conditions and extracting and transmitting them to the data processing queue further comprises: parsing the captured data according to the preset conditions (parsing DOM data with support for jQuery-like syntax, and parsing JSON data and/or script data), extracting the information, packaging it, and transmitting it to the data processing queue.
Preferably, modifying the parameters of the preset mode according to the analysis result information further comprises: modifying the content of the preset template according to the analysis result information, and modifying the capture frequency in the parameters of the preset mode through the preset template.
In order to solve the above technical problem, the present invention further provides a web crawler operating system, comprising a capture module, a memory module, and a data analysis processing module, characterized in that:
the capture module is used for capturing the URL of a website according to the parameters of a preset mode and transmitting and adding the URL to a memory queue in the memory module;
the memory module is used for receiving the website URL transmitted by the capture module and storing it in the memory queue, and for judging whether the newly added URL duplicates a URL already stored in the memory queue, ignoring it if so; if not, data are captured from the webpage under the URL, the lower-level link URLs contained in the webpage are traversed, and each lower-level link URL is checked for duplication and ignored if it is a duplicate; if it is not a duplicate, data are captured from the webpage under the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data are parsed according to preset conditions, extracted, and transmitted to the data analysis processing module;
the data analysis processing module is used for receiving the extracted data transmitted by the memory module and putting the data into its internal data processing queue; the data processing queue compares and analyzes the data against the existing data and modifies the parameters of the preset mode in the capture module according to the analysis result information.
Preferably, the parameters of the preset mode in the capture module further include: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages.
Preferably, the preset conditions in the memory module further include: parsing DOM data with support for jQuery-like syntax, and parsing JSON data and/or script data.
Preferably, the capture module is further configured to set a preset template according to the initialization environment of the system, the server performance, the network bandwidth condition, and the number of capture processes, and to set the parameters of the preset mode through the preset template.
Preferably, the memory module is further configured to parse the captured data according to the preset conditions (parsing DOM data with support for jQuery-like syntax, and parsing JSON data and/or script data), extract the information, package it, and transmit it to the data processing queue in the data analysis processing module.
Preferably, the data analysis processing module is further configured to modify, according to the analysis result information, the content of the preset template in the capture module, and to modify, through the preset template, the capture frequency in the parameters of the preset mode.
Compared with the prior art, the webpage crawler operation method and the webpage crawler operation system achieve the following effects:
1) The method and the system reduce the crawling frequency and the crawling pressure of the web crawler and effectively reduce excessive extra burden on the website.
2) The invention realizes large-scale distributed concurrent acquisition, and greatly improves the efficiency of data acquisition and the efficiency of task customization.
3) The invention adopts the cloud technology, and realizes high accuracy of acquiring the required content.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
Fig. 1 is a schematic block diagram of the process of a web crawler operation method according to a first embodiment of the present invention;
Fig. 2 is a block diagram of the specific structure of a web crawler operating system according to a second embodiment of the present invention.
Detailed Description
As used in the specification and in the claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This specification and the claims do not intend to distinguish between components that differ in name but not in function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to". "Substantially" means within an acceptable error range, within which a person skilled in the art can solve the technical problem and substantially achieve the technical result. Furthermore, the term "coupled" is intended to encompass any direct or indirect electrical coupling. Thus, if a first device is coupled to a second device, that connection may be through a direct electrical coupling or through an indirect electrical coupling via other devices and couplings. The following description is of the preferred embodiment for carrying out the invention, and is made for the purpose of illustrating the general principles of the invention and not for the purpose of limiting the scope of the invention. The scope of the present invention is defined by the appended claims.
The present invention will be described in further detail below with reference to the accompanying drawings, but the present invention is not limited thereto.
Embodiment one
Fig. 1 shows a process of a web crawler operation method according to a first embodiment of the present invention.
Step 101, capturing a URL (Uniform Resource Locator, i.e. the web page address) of a website according to the parameters of a preset mode and adding the URL to a memory queue;
Step 102, the memory queue judges whether the newly added URL duplicates a URL already stored in the memory queue, and if so, ignores it; if not, data are captured from the webpage under the URL, the lower-level link URLs contained in the webpage are traversed, and each lower-level link URL is checked for duplication and ignored if it is a duplicate; if it is not a duplicate, data are captured from the webpage under the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data are parsed according to preset conditions, extracted, and transmitted to the data processing queue;
Capturing data from the webpage under the URL and traversing the lower-level link URLs contained in it is done by parsing the lower-level links in the webpage according to the preset conditions and performing a traversal search up to the traversal depth value given in the preset conditions. The same content is not described again below.
Step 103, the data processing queue compares and analyzes the data against the existing data and modifies the capture frequency in the parameters of the preset mode according to the analysis result information. (In the first and subsequent embodiments of the present invention, the capture frequency is modified, but the modification is not limited to this parameter; other parameters of the preset mode may also be modified, and details are not repeated here.)
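Steps 101 and 102 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `crawl` and the callables `fetch` and `extract_links` are hypothetical placeholders supplied by the caller, and the breadth-first order is an assumption, since the text only requires deduplication and a traversal depth value.

```python
from collections import deque


def crawl(start_urls, fetch, extract_links, max_depth):
    """Crawl from start_urls, skipping duplicate URLs, up to max_depth link levels.

    fetch(url) returns the page data; extract_links(page) returns the
    lower-level link URLs found in that page. Both are caller-supplied.
    """
    seen = set()        # URLs already added to the memory (deduplication) queue
    pending = deque()   # unprocessed (url, depth) pairs
    captured = {}       # url -> captured page data

    for url in start_urls:
        if url not in seen:          # a duplicate of a stored URL is ignored
            seen.add(url)
            pending.append((url, 0))

    while pending:                   # stop when no unprocessed URL remains
        url, depth = pending.popleft()
        page = fetch(url)
        captured[url] = page
        if depth >= max_depth:       # traversal depth value reached
            continue
        for link in extract_links(page):
            if link not in seen:     # duplicate lower-level links are ignored
                seen.add(link)
                pending.append((link, depth + 1))

    return captured
```

For a quick check without network access, `fetch` can be a dictionary lookup over an in-memory link graph and `extract_links` can return the stored list of links unchanged.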
The parameters of the preset mode in step 101 include: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages.
Further, the preset mode is set as follows: a preset template is set according to the initialization environment of the system, the server performance, the network bandwidth condition, and the number of capture processes, and the parameters of the preset mode are set through the preset template.
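One way to picture the relationship between the preset template and the parameters of the preset mode is the sketch below. All field names, the default interval, and the formulas in `build_parameters` are illustrative assumptions; the text does not state how the template values map to parameter values.

```python
from dataclasses import dataclass


@dataclass
class PresetTemplate:
    cpu_cores: int          # assumed stand-in for "server performance"
    bandwidth_mbps: float   # network bandwidth condition
    process_count: int      # number of capture processes


@dataclass
class PresetParameters:
    start_url: str              # initial capture address
    capture_interval_s: float   # capture frequency, as seconds between visits
    page_delay_s: float         # capture delay between webpages
    queue_capacity: int         # data storage queue condition


def build_parameters(template, start_url):
    """Derive preset-mode parameters from a template (heuristics are assumptions)."""
    if template.bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    # Never run more workers than the server has cores, and at least one.
    workers = max(1, min(template.process_count, template.cpu_cores))
    # Slower links get a longer delay between pages, with a 0.5 s floor.
    page_delay = max(0.5, 10.0 / template.bandwidth_mbps)
    return PresetParameters(
        start_url=start_url,
        capture_interval_s=3600.0,      # assumed default: revisit hourly
        page_delay_s=page_delay,
        queue_capacity=workers * 100,   # assumed: 100 queue slots per worker
    )
```

Keeping the template separate from the derived parameters mirrors the description: later analysis can change the template, and the parameters are then regenerated from it.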
In step 103, modifying the capture frequency in the parameters of the preset mode according to the analysis result information further comprises: modifying the content of the preset template according to the analysis result information, and modifying the capture frequency in the parameters of the preset mode through the preset template.
Further, the memory queue may also be referred to as a deduplication queue in this embodiment, and for those skilled in the art, it can be completely understood that the meanings expressed by the deduplication queue and the memory queue are consistent, and details are not described later.
Further, the preset conditions in step 102 include: parsing DOM data with support for jQuery-like syntax, and parsing JSON data and/or script data.
In step 102, parsing the captured data according to the preset conditions and extracting and transmitting them to the data processing queue further comprises: parsing the captured data according to the preset conditions (parsing DOM data with support for jQuery-like syntax, and parsing JSON data and/or script data), extracting the information (including data information and binary files such as pictures and/or Flash), packaging the information, and transmitting it to the data processing queue.
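The parsing step can be sketched with the Python standard library alone. This is a simplified stand-in, not the parser the patent uses: `select_text` supports only a single `.class` selector as a rough analogue of jQuery-like syntax, it assumes well-formed markup, and the names `ClassTextExtractor`, `select_text`, and `parse_payload` are hypothetical.

```python
import json
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collects the text of every element whose class list contains css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0       # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1              # nested element inside a match
        elif self.css_class in classes:
            self.depth = 1               # a new matching element starts
            self.results.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.results[-1] += data


def select_text(html, selector):
    """Return the stripped text of elements matching a '.class' selector."""
    if not selector.startswith("."):
        raise ValueError("only '.class' selectors are supported in this sketch")
    parser = ClassTextExtractor(selector[1:])
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


def parse_payload(kind, body, selector=".title"):
    """Dispatch on the payload kind: 'json' is decoded, anything else is treated as HTML."""
    if kind == "json":
        return json.loads(body)
    return select_text(body, selector)
```

A production crawler would more likely use a full selector engine; the point here is only that DOM content and JSON content go through different parsers before the extracted information is packaged for the data processing queue.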
The purpose of the analysis in step 103 is to analyze the change frequency of the website data to correct the capturing frequency, so as to solve the problems that in the prior art, a web crawler causes excessive burden on a website and cannot accurately and efficiently acquire website information.
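A minimal sketch of this feedback loop follows, under stated assumptions: content change is detected by comparing SHA-256 fingerprints, and the interval between captures is halved when the content changed and doubled when it did not. The halving and doubling policy, the bounds, and the function names are illustrative choices, not taken from the text.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Stable fingerprint of captured data, used to detect changes."""
    return hashlib.sha256(data).hexdigest()


def adjust_interval(interval_s, old_fp, new_data, min_s=60.0, max_s=86400.0):
    """Return (new_interval_s, new_fingerprint) after comparing with existing data.

    Changed (or never seen) content shortens the interval so the site is
    revisited sooner; unchanged content lengthens it to reduce load.
    """
    new_fp = fingerprint(new_data)
    if old_fp is None or new_fp != old_fp:
        interval_s = max(min_s, interval_s / 2)
    else:
        interval_s = min(max_s, interval_s * 2)
    return interval_s, new_fp
```

With this policy a page that rarely changes drifts toward the maximum interval, which is how the crawler avoids repeated requests that return nothing new.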
The specific operation of the first embodiment of the present invention may be:
First, a preset template is set according to the initialization environment of the system, the server performance, the network bandwidth condition, and the number of capture processes, and the parameters of the preset mode are set through the preset template. The number of concurrent capture processes and the number of websites are set, and the capture processes are distributed evenly across the websites, so that each website is captured at a reasonable rate and the crawler avoids putting dense-access pressure on the captured websites without losing efficiency;
Second, the URL of the website is captured according to the parameters of the preset mode and added to the memory queue. A website has many pages, and without a template the crawler would capture all of them, although not every page is useful to the user, which wastes network resources. With the preset template, the crawler captures only the data the user is interested in. The parameters of the preset mode include: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages;
Third, the memory queue judges whether the newly added URL duplicates a URL already stored in the memory queue, and if so, ignores it; if not, data are captured from the webpage under the URL, the lower-level link URLs contained in the webpage are traversed, and each lower-level link URL is checked for duplication and ignored if it is a duplicate; if it is not a duplicate, data are captured from the webpage under the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data are parsed according to preset conditions (the preset conditions include parsing DOM data with support for jQuery-like syntax, and parsing JSON data and/or script data), the information (including data information and binary files such as pictures and/or Flash) is extracted, and the information is packaged and transmitted to the data processing queue;
Finally, the data processing queue compares and analyzes the data against the existing data, modifies the content of the preset template according to the analysis result information, and modifies the capture frequency in the parameters of the preset mode through the preset template.
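The even distribution of capture processes across websites mentioned in the first operation can be sketched as a round-robin assignment. The function name `assign_processes` and the use of integer process indices are assumptions made for illustration; the text does not specify the distribution algorithm.

```python
def assign_processes(process_count, websites):
    """Spread process indices 0..process_count-1 evenly over the websites.

    Returns a dict mapping each website to the list of process indices
    assigned to it; counts differ by at most one between websites.
    """
    if not websites:
        raise ValueError("at least one website is required")
    assignment = {site: [] for site in websites}
    for pid in range(process_count):
        site = websites[pid % len(websites)]   # round-robin over the sites
        assignment[site].append(pid)
    return assignment
```

Because no site receives more than one process above its fair share, no single website absorbs the whole crawl load at once.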
Embodiment two
As shown in fig. 2, a web crawler operating system according to a second embodiment of the present invention includes: a capture module 201, a memory module 202 and a data analysis processing module 203; wherein,
the capture module 201 is coupled to the memory module 202 and the data analysis processing module 203, and is configured to capture a URL (Uniform Resource Locator) of a website according to the parameters of a preset mode and to transmit and add the URL to a memory queue in the memory module 202.
The memory module 202, coupled to the capture module 201 and the data analysis processing module 203, is configured to receive the URL of a website transmitted by the capture module 201 and store it in its memory queue, and then to judge whether the newly added URL duplicates a URL already stored in the memory queue, ignoring it if so; if not, data are captured from the webpage under the URL, the lower-level link URLs contained in the webpage are traversed, and each lower-level link URL is checked for duplication and ignored if it is a duplicate; if it is not a duplicate, data are captured from the webpage under the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data are parsed according to preset conditions, extracted, and transmitted to the data processing queue in the data analysis processing module 203.
The preset conditions further include: parsing DOM data with support for jQuery-like syntax, and parsing JSON data and/or script data.
Further, the memory module 202 parses the captured data according to the preset conditions (parsing DOM data with support for jQuery-like syntax, and parsing JSON data and/or script data), extracts the information (including data information and binary files such as pictures and/or Flash), packages the information, and transmits it to the data processing queue in the data analysis processing module 203.
The data analysis processing module 203 is coupled to the capture module 201 and the memory module 202, and is configured to receive the extracted data transmitted by the memory module 202 and place the data into its internal data processing queue, where the data processing queue compares and analyzes the data against the existing data and modifies the capture frequency in the parameters of the preset mode in the capture module 201 according to the analysis result information.
In this embodiment, the parameters of the preset mode further include: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages.
The preset mode is set as follows: the capture module 201 sets a preset template according to the initialization environment of the system, the server performance, the network bandwidth condition, and the number of capture processes, and sets the parameters of the preset mode through the preset template.
Further, the data analysis processing module 203 modifies the content of the preset template in the capture module 201 according to the analysis result information, and modifies the capture frequency in the parameters of the preset mode through the preset template.
The transmission and collection process can be realized through a cloud technology, so that large-scale distributed concurrent collection can be performed, the data collection efficiency is improved, required contents are accurately obtained, and efficient customization of tasks is facilitated to the maximum extent; meanwhile, the webpage crawler operating system flexibly collects all structural contents seen by the browser through configuring a template, and supports various page types including news, forums, blogs, pictures and the like.
Compared with the prior art, the webpage crawler operation method and the webpage crawler operation system achieve the following effects:
1) The method and the system reduce the crawling frequency and the crawling pressure of the web crawler and effectively reduce excessive extra burden on the website.
2) The invention realizes large-scale distributed concurrent acquisition, and greatly improves the efficiency of data acquisition and the efficiency of task customization.
3) The invention adopts the cloud technology, and realizes high accuracy of acquiring the required content.
The foregoing description shows and describes several preferred embodiments of the invention, but as aforementioned, it is to be understood that the invention is not limited to the forms disclosed herein, but is not to be construed as excluding other embodiments and is capable of use in various other combinations, modifications, and environments and is capable of changes within the scope of the inventive concept as expressed herein, commensurate with the above teachings, or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (12)

1. A web crawler operation method, comprising:
capturing a URL of a website through parameters in a preset mode and adding the URL to a memory queue;
the memory queue judges whether the newly added URL duplicates a URL already stored in the memory queue, and if so, ignores it; if not, data are captured from the webpage under the URL, the lower-level link URLs contained in the webpage are traversed, and each lower-level link URL is checked for duplication and ignored if it is a duplicate; if it is not a duplicate, data are captured from the webpage under the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data are parsed according to preset conditions, extracted, and transmitted to the data processing queue;
and the data processing queue compares and analyzes the data with the existing data and modifies the parameters of the preset mode according to the analysis result information.
2. The web crawler operation method of claim 1, wherein the parameters of the preset mode further comprise: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages.
3. The web crawler operation method of claim 2, wherein the preset conditions further comprise: parsing DOM data with support for jQuery-like syntax, and parsing JSON data and/or script data.
4. The web crawler operation method of claim 3, wherein capturing the URL of the website according to the parameters of the preset mode and adding it to the memory queue further comprises: setting a preset template according to the initialization environment of the system, the server performance, the network bandwidth condition, and the number of capture processes; setting the parameters of the preset mode through the preset template; and capturing the URL of the website according to those parameters and adding it to the memory queue.
5. The web crawler operation method of claim 4, wherein parsing the captured data according to the preset conditions and extracting and transmitting them to the data processing queue further comprises: parsing the captured data according to the preset conditions (parsing DOM data with support for jQuery-like syntax, and parsing JSON data and/or script data), extracting the information, packaging it, and transmitting it to the data processing queue.
6. The web crawler operation method of claim 5, wherein the modifying the parameter of the preset manner according to the analysis result information further comprises: and modifying the content in the preset template according to the analysis result information, and modifying the capture frequency in the parameters of the preset mode through the preset template.
7. A web crawler operating system, comprising a capture module, a memory module, and a data analysis processing module, characterized in that:
the capture module is used for capturing the URL of a website according to the parameters of a preset mode and transmitting and adding the URL to a memory queue in the memory module;
the memory module is used for receiving the website URL transmitted by the capture module and storing it in the memory queue, and for judging whether the newly added URL duplicates a URL already stored in the memory queue, ignoring it if so; if not, data are captured from the webpage under the URL, the lower-level link URLs contained in the webpage are traversed, and each lower-level link URL is checked for duplication and ignored if it is a duplicate; if it is not a duplicate, data are captured from the webpage under the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data are parsed according to preset conditions, extracted, and transmitted to the data analysis processing module;
the data analysis processing module is used for receiving the extracted data transmitted by the memory module and putting the data into its internal data processing queue; the data processing queue compares and analyzes the data against the existing data and modifies the parameters of the preset mode in the capture module according to the analysis result information.
8. The web crawler operating system of claim 7, wherein the parameters of the preset mode in the capture module further comprise: an initial capture address, a capture frequency, capture delay conditions for webpages, and data storage queue conditions in the webpages.
9. The web crawler operating system of claim 8, wherein the preset conditions in the memory module further comprise: parsing DOM data with support for jQuery-like syntax, and parsing JSON data and/or script data.
10. The web crawler operating system of claim 9, wherein the capture module is further configured to set a preset template according to the initialization environment of the system, the server performance, the network bandwidth condition, and the number of capture processes, and to set the parameters of the preset mode through the preset template.
11. The web crawler operating system of claim 10,
the memory module is further configured to parse the captured data according to the preset conditions (parsing DOM data with support for jQuery-like syntax, and parsing JSON data and/or script data), extract the information, package it, and transmit it to the data processing queue in the data analysis processing module.
12. The web crawler operating system of claim 11,
the data analysis processing module is further configured to modify the content in the preset template in the capture module according to the analysis result information, and modify the capture frequency in the parameter of the preset mode through the preset template.
CN201310181364.XA 2013-05-16 2013-05-16 Webpage spider operational method and system Active CN103279507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310181364.XA CN103279507B (en) 2013-05-16 2013-05-16 Webpage spider operational method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310181364.XA CN103279507B (en) 2013-05-16 2013-05-16 Webpage spider operational method and system

Publications (2)

Publication Number Publication Date
CN103279507A true CN103279507A (en) 2013-09-04
CN103279507B CN103279507B (en) 2016-12-28

Family

ID=49062027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310181364.XA Active CN103279507B (en) 2013-05-16 2013-05-16 Webpage spider operational method and system

Country Status (1)

Country Link
CN (1) CN103279507B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942309A (en) * 2014-04-18 2014-07-23 乐得科技有限公司 Network data acquisition device and method and implementation method of acquisition process
CN104252530A (en) * 2014-09-10 2014-12-31 北京京东尚科信息技术有限公司 Single-computer crawler grabbing method and system
CN104391917A (en) * 2014-11-19 2015-03-04 四川长虹电器股份有限公司 Method for incrementally capturing webpage contents
CN104572901A (en) * 2014-12-25 2015-04-29 小米科技有限责任公司 Method and device for downloading webpage data
CN106126747A (en) * 2016-07-14 2016-11-16 北京邮电大学 Crawler-based data capture method and device
CN106202300A (en) * 2016-06-30 2016-12-07 浪潮软件集团有限公司 Network information acquisition method and device
CN106649720A (en) * 2016-12-22 2017-05-10 北京览群智数据科技有限责任公司 Data processing method and server
CN107480264A (en) * 2017-08-17 2017-12-15 北京知道创宇信息技术有限公司 Web crawler deduplication method and computing device
CN108763279A (en) * 2018-04-11 2018-11-06 北京中科闻歌科技股份有限公司 A kind of web data distribution template acquisition method and system
CN109063144A (en) * 2018-08-07 2018-12-21 广州金猫信息技术服务有限公司 Visual network crawler method and device
CN109213824A (en) * 2017-06-29 2019-01-15 北京京东尚科信息技术有限公司 Data grabber system, method and apparatus
CN110851746A (en) * 2018-07-27 2020-02-28 北京国双科技有限公司 Crawler seed generation method and device
CN118820566A (en) * 2024-06-13 2024-10-22 国网山西省电力公司长治供电公司 A data intelligent crawling method and system based on big data product development and screening

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101635718A (en) * 2009-08-26 2010-01-27 中兴通讯股份有限公司 Network crawler system and method for acquiring resource as well as network resource gripping device
US20100268701A1 (en) * 2007-11-08 2010-10-21 Li Zhang Navigational ranking for focused crawling
CN102930059A (en) * 2012-11-26 2013-02-13 电子科技大学 Design method for a focused crawler
CN103067521A (en) * 2013-01-08 2013-04-24 中国科学院声学研究所 Distributed-type nodes and distributed-type system in a crawler cluster
CN103092999A (en) * 2013-02-22 2013-05-08 人民搜索网络股份公司 Webpage crawling cycle adjusting method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100268701A1 (en) * 2007-11-08 2010-10-21 Li Zhang Navigational ranking for focused crawling
CN101635718A (en) * 2009-08-26 2010-01-27 中兴通讯股份有限公司 Network crawler system and method for acquiring resource as well as network resource gripping device
CN102930059A (en) * 2012-11-26 2013-02-13 电子科技大学 Design method for a focused crawler
CN103067521A (en) * 2013-01-08 2013-04-24 中国科学院声学研究所 Distributed-type nodes and distributed-type system in a crawler cluster
CN103092999A (en) * 2013-02-22 2013-05-08 人民搜索网络股份公司 Webpage crawling cycle adjusting method and device

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942309A (en) * 2014-04-18 2014-07-23 乐得科技有限公司 Network data acquisition device and method and implementation method of acquisition process
CN103942309B (en) * 2014-04-18 2017-06-30 网易乐得科技有限公司 A kind of implementation method of Network Data Capture equipment, method and acquisition process
CN104252530B (en) * 2014-09-10 2017-09-15 北京京东尚科信息技术有限公司 A kind of unit crawler capturing method and system
CN104252530A (en) * 2014-09-10 2014-12-31 北京京东尚科信息技术有限公司 Single-computer crawler grabbing method and system
CN104391917A (en) * 2014-11-19 2015-03-04 四川长虹电器股份有限公司 Method for incrementally capturing webpage contents
CN104572901A (en) * 2014-12-25 2015-04-29 小米科技有限责任公司 Method and device for downloading webpage data
CN104572901B (en) * 2014-12-25 2018-12-18 小米科技有限责任公司 The method for down loading and device of web data
CN106202300A (en) * 2016-06-30 2016-12-07 浪潮软件集团有限公司 Network information acquisition method and device
CN106126747A (en) * 2016-07-14 2016-11-16 北京邮电大学 Crawler-based data capture method and device
CN106649720A (en) * 2016-12-22 2017-05-10 北京览群智数据科技有限责任公司 Data processing method and server
CN109213824A (en) * 2017-06-29 2019-01-15 北京京东尚科信息技术有限公司 Data grabber system, method and apparatus
CN107480264A (en) * 2017-08-17 2017-12-15 北京知道创宇信息技术有限公司 Web crawler deduplication method and computing device
CN107480264B (en) * 2017-08-17 2019-11-15 北京知道创宇信息技术股份有限公司 Web crawler deduplication method and computing device
CN108763279A (en) * 2018-04-11 2018-11-06 北京中科闻歌科技股份有限公司 A kind of web data distribution template acquisition method and system
CN108763279B (en) * 2018-04-11 2020-12-15 北京中科闻歌科技股份有限公司 Webpage data distributed template acquisition method and system
CN110851746A (en) * 2018-07-27 2020-02-28 北京国双科技有限公司 Crawler seed generation method and device
CN110851746B (en) * 2018-07-27 2022-08-12 北京国双科技有限公司 Crawler seed generation method and device
CN109063144A (en) * 2018-08-07 2018-12-21 广州金猫信息技术服务有限公司 Visual network crawler method and device
CN118820566A (en) * 2024-06-13 2024-10-22 国网山西省电力公司长治供电公司 A data intelligent crawling method and system based on big data product development and screening

Also Published As

Publication number Publication date
CN103279507B (en) 2016-12-28

Similar Documents

Publication Publication Date Title
CN103279507B (en) Webpage spider operational method and system
CN106126693B (en) Method and device for sending related data of webpage
CN102739663A (en) Detection method and scanning engine of web pages
CN107885777A (en) A control method and system for crawling web page data based on collaborative crawler
CN101984429A (en) Method and device for acquiring destination page, search engine and browser
CN109600385B (en) Access control method and device
CN102831252A (en) Method and device for updating index database and search method and system
CN101441629A (en) Automatic acquiring method of non-structured web page information
CN102982162A (en) System for acquiring webpage information
US20150120692A1 (en) Method, device, and system for acquiring user behavior
CN105069011A (en) Webpage favorite management method, device and system
CN103294717B (en) Web page opening method and device based on double-kernel browser
CN104361067A (en) Method and system for intelligent loading of browser webpage information
CN104158697B (en) A kind of dead chain detection method and device
CN104281629B (en) The method, apparatus and client device of picture are extracted from webpage
KR102009020B1 (en) Method and apparatus for providing website authentication data for search engine
CN102624910B (en) Method, the Apparatus and system of the web page contents that process user chooses
CN104363237B (en) Method and system for processing metadata of Internet media resources
CN104486333A (en) Debug method and debug device for mobile application programs
CN105721519B (en) A kind of webpage data acquiring method, apparatus and system
CN106412003A (en) Information pushing method and device, and information request device
CN103440281A (en) Method, device and equipment for acquiring download file
CN103246675A (en) Method and equipment for capturing data of website
CN109246069B (en) Webpage login method and device and readable storage medium
CN113407193B (en) System deployment method, device and equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190923

Address after: 100088 Beijing Haidian District Garden Road No. 13 Courtyard 7 Floor 12, 1203-1

Patentee after: Lele Kaihang (Beijing) Education Technology Co., Ltd.

Address before: 100085, room 2, building 5, building 1, No. 516, ten Street, Haidian District, Beijing

Patentee before: Beijing Shangyou Tongda Information Technology Co., Ltd.
