
CN103279507B - Webpage spider operational method and system - Google Patents

Publication number
CN103279507B
CN103279507B CN201310181364.XA CN201310181364A CN103279507B CN 103279507 B CN103279507 B CN 103279507B CN 201310181364 A CN201310181364 A CN 201310181364A CN 103279507 B CN103279507 B CN 103279507B
Authority
CN
China
Prior art keywords
data
url
preset
webpage
queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310181364.XA
Other languages
Chinese (zh)
Other versions
CN103279507A (en)
Inventor
许大伦
毛颖
黄明军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lele Kaihang (beijing) Education Technology Co Ltd
Original Assignee
BEIJING SHANGYOU TONGDA INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SHANGYOU TONGDA INFORMATION TECHNOLOGY Co Ltd
Priority to CN201310181364.XA
Publication of CN103279507A
Application granted
Publication of CN103279507B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a web crawler operation method and system. The method mainly comprises: capturing a URL of a website according to the parameters of a preset mode and adding the URL to a memory queue; the memory queue judging whether the newly added URL duplicates a URL it already stores; if not, capturing data from the web page at that URL, traversing the lower-level link URLs found in the page, and judging whether each of them is a duplicate; capturing data from the web page at each non-duplicate lower-level link URL, then judging whether any unprocessed URL remains and, if none remains, parsing the captured data according to preset conditions, extracting it, and transmitting it to a data processing queue; and the data processing queue comparing the data with existing data and modifying the capture frequency in the parameters of the preset mode according to the analysis result information. The invention addresses the prior-art problems that a web crawler places an excessive additional burden on a website and cannot acquire website information accurately and efficiently.

Description

Webpage crawler operation method and system
Technical Field
The invention relates to the technical field of networks, in particular to a webpage crawler operation method and a webpage crawler operation system.
Background
A search engine is a system that uses a specific computer program to collect information from the Internet according to a certain policy, organizes and processes that information, and then provides a search service that displays information relevant to the user's query. To collect information from the Internet, a search engine relies on web crawlers that crawl the relevant websites.
A web crawler is a program that automatically acquires web page content, and it is an important component of a search engine.
In the prior art, for a general-purpose search engine, a traditional crawler starts from the URLs of one or more initial web pages, obtains the URLs on those pages, and, while capturing web pages, continuously extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met.
Prior-art web crawlers have poor ability to analyze web page content. They can only capture website information mechanically and continuously, repeatedly issuing dozens or hundreds of requests in a loop. The resulting crawl frequency and crawl pressure are very high, so website resources are heavily consumed, the website is burdened, and it may even crash. Meanwhile, such crawlers cannot accurately and efficiently crawl the useful information in a website.
Therefore, solving the technical problems that a prior-art web crawler places an excessive extra burden on a website and cannot acquire website information accurately and efficiently has become an urgent need.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a web crawler operation method and system that solve the prior-art problems that a web crawler places an excessive extra burden on a website and cannot acquire website information accurately and efficiently.
To solve this technical problem, the invention provides a web crawler operation method, characterized by comprising the following steps:
capturing a URL of a website according to the parameters of a preset mode and adding the URL to a memory queue;
the memory queue judges whether the newly added URL duplicates a URL already stored in the memory queue, and if so, ignores it; if not, data is captured from the web page at that URL and the lower-level link URLs found in the page are traversed, each being judged for duplication and ignored if it is a duplicate; if it is not a duplicate, data is captured from the web page at the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data is parsed according to preset conditions, extracted, and transmitted to a data processing queue;
and the data processing queue compares and analyzes the data against the existing data and modifies the parameters of the preset mode according to the analysis result information.
Preferably, the parameters of the preset mode further include: an initial capture address, a capture frequency, a capture delay condition for web pages, and a data storage queue condition within web pages.
Preferably, the preset conditions further include: parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data.
Preferably, capturing the URL of the website according to the parameters of the preset mode and adding it to the memory queue further includes: setting a preset template according to the initialization environment of the system, the server performance, the network bandwidth conditions, and the number of capture processes; setting the parameters of the preset mode through the preset template; and capturing the URL of the website according to those parameters and adding it to the memory queue.
Preferably, parsing the captured data according to the preset conditions, extracting it, and transmitting it to the data processing queue further includes: parsing the captured data according to the preset conditions (parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data), extracting the information, packaging it, and transmitting it to the data processing queue.
Preferably, modifying the parameters of the preset mode according to the analysis result information further includes: modifying the content of the preset template according to the analysis result information, and modifying the capture frequency in the parameters of the preset mode through the preset template.
To solve the above technical problem, the invention further provides a web crawler operating system, comprising a capture module, a memory module, and a data analysis processing module, characterized in that:
the capture module is configured to capture a URL of a website according to the parameters of a preset mode and to transmit the URL to, and add it to, a memory queue in the memory module;
the memory module is configured to receive the website URL transmitted by the capture module and store it in its memory queue, and to judge whether the newly added URL duplicates a URL already stored in the memory queue, ignoring it if so; if not, data is captured from the web page at that URL and the lower-level link URLs found in the page are traversed, each being judged for duplication and ignored if it is a duplicate; if it is not a duplicate, data is captured from the web page at the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data is parsed according to preset conditions, extracted, and transmitted to the data analysis processing module;
the data analysis processing module is configured to receive the extracted data transmitted by the memory module and place it into its internal data processing queue; the data processing queue compares and analyzes the data against the existing data and modifies the parameters of the preset mode in the capture module according to the analysis result information.
Preferably, the parameters of the preset mode in the capture module further include: an initial capture address, a capture frequency, a capture delay condition for web pages, and a data storage queue condition within web pages.
Preferably, the preset conditions in the memory module further include: parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data.
Preferably, the capture module is further configured to set a preset template according to the initialization environment of the system, the server performance, the network bandwidth conditions, and the number of capture processes, and to set the parameters of the preset mode through the preset template.
Preferably, the memory module is further configured to parse the captured data according to the preset conditions (parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data), to extract the information, and to package it and transmit it to the data processing queue in the data analysis processing module.
Preferably, the data analysis processing module is further configured to modify the content of the preset template in the capture module according to the analysis result information, and to modify the capture frequency in the parameters of the preset mode through the preset template.
Compared with the prior art, the webpage crawler operation method and the webpage crawler operation system achieve the following effects:
1) The method and the system reduce the crawl frequency and crawl pressure of the web crawler and effectively reduce the excessive extra burden on the website.
2) The invention achieves large-scale distributed concurrent collection and greatly improves the efficiency of data collection and of task customization.
3) The invention adopts cloud technology and achieves high accuracy in acquiring the required content.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
Fig. 1 is a schematic flow diagram of a web crawler operation method according to a first embodiment of the present invention;
Fig. 2 is a block diagram of the specific structure of a web crawler operating system according to a second embodiment of the present invention.
Detailed Description
As used in the specification and in the claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This specification and the claims do not distinguish between components that differ in name but not in function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to." "Substantially" means within an acceptable error range, within which a person skilled in the art can solve the technical problem and substantially achieve the technical result. Furthermore, the term "coupled" is intended to encompass any direct or indirect electrical coupling. Thus, if a first device is coupled to a second device, that connection may be through a direct electrical coupling or through an indirect electrical coupling via other devices and couplings. The following description is of the preferred embodiments for carrying out the invention, and is made for the purpose of illustrating the general principles of the invention and not for the purpose of limiting its scope. The scope of the present invention is defined by the appended claims.
The present invention will be described in further detail below with reference to the accompanying drawings, but the present invention is not limited thereto.
Example one
Fig. 1 shows a process of a web crawler operation method according to a first embodiment of the present invention.
Step 101: capture a URL (Uniform Resource Locator, i.e., a web page address) of a website according to the parameters of a preset mode and add the URL to a memory queue.
Step 102: the memory queue judges whether the newly added URL duplicates a URL already stored in the memory queue, and if so, ignores it; if not, data is captured from the web page at that URL and the lower-level link URLs found in the page are traversed, each being judged for duplication and ignored if it is a duplicate; if it is not a duplicate, data is captured from the web page at the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data is parsed according to preset conditions, extracted, and transmitted to the data processing queue.
In step 102, capturing data from the web page at the URL and traversing the lower-level link URLs found in the page means parsing the lower-level links in that page according to the preset conditions and performing a traversal search up to the traversal depth value given in the preset conditions. This is not repeated below.
Step 103: the data processing queue compares and analyzes the data against the existing data and modifies the capture frequency in the parameters of the preset mode according to the analysis result information. (In this and the subsequent embodiments the capture frequency is the parameter modified, but the modification is not limited to this parameter; other parameters of the preset mode may also be modified, which is not described again here.)
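The traversal described for step 102 can be sketched as a depth-limited breadth-first search with a set of already seen URLs. This is only an illustration under stated assumptions: the function name `crawl`, the callback `fetch_links`, and the in-memory link graph `SITE` are hypothetical and do not come from the patent, and no real HTTP request is made.

```python
from collections import deque
from typing import Callable, Iterable, List, Set


def crawl(start_url: str,
          fetch_links: Callable[[str], Iterable[str]],
          max_depth: int) -> List[str]:
    """Breadth-first traversal that ignores URLs already seen.

    fetch_links is a hypothetical callback returning the lower-level
    link URLs found on a page; max_depth plays the role of the
    traversal depth value in the preset conditions.
    """
    seen: Set[str] = {start_url}           # URLs already added to the queue
    pending = deque([(start_url, 0)])      # unprocessed (url, depth) pairs
    visited: List[str] = []

    while pending:                         # stop when no unprocessed URL remains
        url, depth = pending.popleft()
        visited.append(url)                # page data would be captured here
        if depth >= max_depth:
            continue                       # do not follow links any deeper
        for link in fetch_links(url):
            if link in seen:
                continue                   # duplicate URL: ignore it
            seen.add(link)
            pending.append((link, depth + 1))
    return visited


# Usage with an in-memory link graph instead of real pages (assumption).
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": ["/d"],
}
order = crawl("/", lambda u: SITE.get(u, []), max_depth=2)
```

With a depth limit of 2, the page `/d` is never visited because it sits three links away from the start page.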
The parameters of the preset mode in step 101 include: an initial capture address, a capture frequency, a capture delay condition for web pages, and a data storage queue condition within web pages.
Further, the preset mode is set as follows: a preset template is set according to the initialization environment of the system, the server performance, the network bandwidth conditions, and the number of capture processes, and the parameters of the preset mode are set through the preset template.
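One way to picture the relationship between the preset template and the parameters of the preset mode is as two small records and a function that derives one from the other. The sketch below is an assumption throughout: the class names, field names, and the heuristics in `parameters_from_template` are invented for illustration, since the patent does not state how the template values map to parameter values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PresetTemplate:
    """Hypothetical template describing the crawler's environment."""
    start_urls: List[str]
    bandwidth_mbps: float      # network bandwidth available to the crawler
    cpu_cores: int             # stand-in for "server performance"
    capture_processes: int     # number of concurrent capture processes


@dataclass
class PresetParameters:
    """The four parameters named in the text."""
    initial_addresses: List[str]
    capture_interval_s: float  # capture frequency, expressed as an interval
    page_delay_s: float        # delay between page requests
    storage_queue_size: int    # capacity of the data storage queue


def parameters_from_template(t: PresetTemplate) -> PresetParameters:
    # Illustrative heuristics only (not from the patent): more workers
    # means a longer delay per worker, so the combined request rate
    # against a site stays bounded.
    workers = max(1, min(t.capture_processes, t.cpu_cores))
    return PresetParameters(
        initial_addresses=list(t.start_urls),
        capture_interval_s=3600.0,
        page_delay_s=0.5 * workers,
        storage_queue_size=int(100 * t.bandwidth_mbps),
    )


template = PresetTemplate(["http://example.com/"], bandwidth_mbps=10.0,
                          cpu_cores=4, capture_processes=8)
params = parameters_from_template(template)
```

Keeping the template separate from the derived parameters mirrors the text: later feedback changes the template, and the parameters are regenerated from it.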
In step 103, modifying the capture frequency of the preset mode according to the analysis result information further includes: modifying the content of the preset template according to the analysis result information, and modifying the capture frequency in the parameters of the preset mode through the preset template.
Further, in this embodiment the memory queue may also be called a deduplication queue; those skilled in the art will understand that the two terms have the same meaning, which is not repeated below.
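A deduplication queue of this kind could look roughly like the following. The class `DedupQueue` and its method names are hypothetical, and dropping the URL fragment before comparison is an added assumption; the patent only says that a duplicate URL is ignored.

```python
from collections import deque
from typing import Deque, Optional, Set
from urllib.parse import urldefrag


class DedupQueue:
    """In-memory URL queue that rejects URLs it has already accepted."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self._pending: Deque[str] = deque()

    def add(self, url: str) -> bool:
        """Return True if the URL was new and queued, False if ignored."""
        key = urldefrag(url).url       # drop "#fragment" (an assumption)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(key)
        return True

    def next_url(self) -> Optional[str]:
        """Pop the next unprocessed URL, or None when none remain."""
        return self._pending.popleft() if self._pending else None

    def has_unprocessed(self) -> bool:
        return bool(self._pending)


queue = DedupQueue()
first = queue.add("http://example.com/a")       # new URL, queued
second = queue.add("http://example.com/a#top")  # same page, ignored
```

The `has_unprocessed` check corresponds to the point in step 102 where the queue decides whether captured data can be handed on for parsing.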
Further, the preset conditions in step 102 include: parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data.
In step 102, parsing the captured data according to the preset conditions, extracting it, and transmitting it to the data processing queue further includes: parsing the captured data according to the preset conditions (DOM parsing with jQuery-like syntax, JSON parsing, and/or script parsing), extracting the information (including data information and binary files such as pictures and/or Flash), packaging it, and transmitting it to the data processing queue.
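As a rough illustration of the parse, extract, and package step, the sketch below uses only Python's standard library. It supports a single `tag.class` selector, which is a very small stand-in for jQuery-like syntax, and it reads JSON embedded in `<script type="application/json">` elements. The names `Extractor` and `package` and the sample page are assumptions, and text spread across nested elements is not handled.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List, Optional


class Extractor(HTMLParser):
    """Collects text for a 'tag.class' selector and JSON from script tags."""

    def __init__(self, selector: str) -> None:
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self.texts: List[str] = []
        self.json_blobs: List[Any] = []
        self._mode: Optional[str] = None   # "text", "json" or None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = (a.get("class") or "").split()
        if tag == self.tag and (not self.cls or self.cls in classes):
            self._mode = "text"
        elif tag == "script" and a.get("type") == "application/json":
            self._mode = "json"

    def handle_data(self, data):
        if self._mode == "text" and data.strip():
            self.texts.append(data.strip())
        elif self._mode == "json" and data.strip():
            self.json_blobs.append(json.loads(data))

    def handle_endtag(self, tag):
        self._mode = None                  # simplification: any end tag resets


def package(html: str, selector: str) -> Dict[str, Any]:
    """Parse the page and bundle the extracted pieces into one record."""
    parser = Extractor(selector)
    parser.feed(html)
    parser.close()
    return {"texts": parser.texts, "json": parser.json_blobs}


PAGE = ('<html><body><h1 class="title">Price list</h1>'
        '<p class="title">ignored</p>'
        '<script type="application/json">{"price": 42}</script>'
        '</body></html>')
record = package(PAGE, "h1.title")
```

A production crawler would more likely rely on a full selector engine; the point here is only that DOM text and embedded JSON end up in one packaged record for the data processing queue.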
The purpose of the analysis in step 103 is to determine how frequently the website's data changes and to correct the capture frequency accordingly, thereby solving the prior-art problems that a web crawler places an excessive burden on a website and cannot acquire website information accurately and efficiently.
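The feedback from change frequency to capture frequency might be sketched as follows. The patent does not give a formula, so the halve-or-double rule, the bounds, and the function names `content_changed` and `next_interval` are all assumptions chosen for illustration.

```python
import hashlib


def content_changed(previous: bytes, current: bytes) -> bool:
    """Compare newly captured data with the existing data by digest."""
    return hashlib.sha256(previous).digest() != hashlib.sha256(current).digest()


def next_interval(interval_s: float, changed: bool,
                  min_s: float = 60.0, max_s: float = 86400.0) -> float:
    """Halve the capture interval when the page changed, double it otherwise.

    The result is clamped to [min_s, max_s] so the crawler neither
    hammers a site nor stops visiting it altogether.
    """
    proposed = interval_s / 2 if changed else interval_s * 2
    return max(min_s, min(max_s, proposed))


# A page that did not change is visited less often next time.
slower = next_interval(3600.0, content_changed(b"v1", b"v1"))
# A page that changed is visited more often next time.
faster = next_interval(3600.0, content_changed(b"v1", b"v2"))
```

Under this rule a static page drifts toward the maximum interval, which is one way the crawl pressure on the website could be reduced.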
The specific operation of the first embodiment of the present invention may be:
First, a preset template is set according to the initialization environment of the system, the server performance, the network bandwidth conditions, and the number of capture processes, and the parameters of the preset mode are set through the preset template. The number of concurrent capture processes and the number of websites are set, and the capture processes are distributed evenly across the websites, so that each website is captured at a reasonable rate: the crawler avoids putting dense-access pressure on the captured websites without losing efficiency;
Second, the URL of a website is captured according to the parameters of the preset mode and added to the memory queue. A website has many pages; without a template, the crawler would capture every page of the website, yet not every page is useful to the user, which wastes network resources. With the preset template, the crawler captures only the data the user is interested in. The parameters of the preset mode include: an initial capture address, a capture frequency, a capture delay condition for web pages, and a data storage queue condition within web pages;
Third, the memory queue judges whether the newly added URL duplicates a URL already stored in the memory queue, and if so, ignores it; if not, data is captured from the web page at that URL and the lower-level link URLs found in the page are traversed, each being judged for duplication and ignored if it is a duplicate; if it is not a duplicate, data is captured from the web page at the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data is parsed according to the preset conditions (which include parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data), the information (including data information and binary files such as pictures and/or Flash) is extracted, and the information is packaged and transmitted to the data processing queue;
Finally, the data processing queue compares and analyzes the data against the existing data, modifies the content of the preset template according to the analysis result information, and modifies the capture frequency in the parameters of the preset mode through the preset template.
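The even distribution of capture processes across websites mentioned in the first step can be illustrated with a round-robin assignment. The function name `assign_processes` and the site names are hypothetical; the patent does not specify the distribution algorithm.

```python
from typing import Dict, List


def assign_processes(sites: List[str], process_count: int) -> Dict[str, List[int]]:
    """Assign process ids to sites round-robin so the load stays even."""
    if not sites:
        return {}
    plan: Dict[str, List[int]] = {site: [] for site in sites}
    for pid in range(process_count):
        plan[sites[pid % len(sites)]].append(pid)
    return plan


# Seven capture processes spread over three sites (illustrative values).
plan = assign_processes(["a.example", "b.example", "c.example"], 7)
```

With round-robin assignment no site receives more than one process above any other, which matches the stated goal of avoiding dense access to a single website.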
Example two
As shown in Fig. 2, a web crawler operating system according to a second embodiment of the present invention includes: a capture module 201, a memory module 202, and a data analysis processing module 203; wherein,
the capture module 201 is coupled to the memory module 202 and the data analysis processing module 203, and is configured to capture a URL (Uniform Resource Locator) of a website according to the parameters of a preset mode and to transmit the URL to, and add it to, a memory queue in the memory module 202.
The memory module 202 is coupled to the capture module 201 and the data analysis processing module 203, and is configured to receive the website URL transmitted by the capture module 201 and store it in its memory queue, and then to judge whether the newly added URL duplicates a URL already stored in the memory queue, ignoring it if so; if not, data is captured from the web page at that URL and the lower-level link URLs found in the page are traversed, each being judged for duplication and ignored if it is a duplicate; if it is not a duplicate, data is captured from the web page at the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data is parsed according to preset conditions, extracted, and transmitted to the data processing queue of the data analysis processing module 203.
The preset conditions further include: parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data.
Further, the memory module 202 parses the captured data according to the preset conditions (parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data), extracts the information (including data information and binary files such as pictures and/or Flash), packages it, and transmits it to the data processing queue in the data analysis processing module 203.
The data analysis processing module 203 is coupled to the capture module 201 and the memory module 202, and is configured to receive the extracted data transmitted by the memory module 202 and place it into its internal data processing queue; the data processing queue compares and analyzes the data against the existing data and modifies the capture frequency in the parameters of the preset mode in the capture module 201 according to the analysis result information.
In this embodiment, the parameters of the preset mode further include: an initial capture address, a capture frequency, a capture delay condition for web pages, and a data storage queue condition within web pages.
The preset mode is set as follows: the capture module 201 sets a preset template according to the initialization environment of the system, the server performance, the network bandwidth conditions, and the number of capture processes, and sets the parameters of the preset mode through the preset template.
Further, the data analysis processing module 203 modifies the content of the preset template in the capture module 201 according to the analysis result information, and modifies the capture frequency in the parameters of the preset mode through the preset template.
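The coupling between the three modules, including the feedback from the data analysis processing module to the capture module, might be wired together roughly as below. The class names follow the module names in the text, but every attribute, method, and the halve-or-double adjustment are assumptions, and the captured page content is simulated with fixed strings.

```python
from typing import Dict, List, Set


class CaptureModule:
    """Holds the preset parameters and supplies the start URLs."""
    def __init__(self, start_urls: List[str], interval_s: float) -> None:
        self.start_urls = start_urls
        self.interval_s = interval_s       # capture frequency as an interval


class MemoryModule:
    """Deduplicates URLs within one capture pass."""
    def __init__(self) -> None:
        self.seen: Set[str] = set()

    def accept(self, url: str) -> bool:
        if url in self.seen:
            return False                   # duplicate URL: ignore it
        self.seen.add(url)
        return True


class AnalysisModule:
    """Compares new data with existing data and tunes the capture module."""
    def __init__(self, capture: CaptureModule) -> None:
        self.capture = capture
        self.existing: Dict[str, str] = {}

    def process(self, url: str, data: str) -> None:
        unchanged = self.existing.get(url) == data
        self.existing[url] = data
        # Feedback loop: unchanged data means capture less often.
        self.capture.interval_s *= 2 if unchanged else 0.5


capture = CaptureModule(["http://example.com/"], interval_s=600.0)
memory = MemoryModule()
analysis = AnalysisModule(capture)

for round_data in ["v1", "v1", "v1"]:      # three passes, same content
    memory.seen.clear()                    # each pass starts fresh
    for url in capture.start_urls:
        if memory.accept(url):
            analysis.process(url, round_data)
```

In this run the first pass counts as a change (600 s becomes 300 s), and the two unchanged passes then double the interval twice, ending at 1200 s.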
The transmission and collection processes can be implemented with cloud technology, which enables large-scale distributed concurrent collection, improves data collection efficiency, obtains the required content accurately, and makes task customization as efficient as possible. Meanwhile, by configuring templates, the web crawler operating system can flexibly collect all structured content visible in a browser, and it supports many page types, including news, forums, blogs, and pictures.
Compared with the prior art, the webpage crawler operation method and the webpage crawler operation system achieve the following effects:
1) The method and the system reduce the crawl frequency and crawl pressure of the web crawler and effectively reduce the excessive extra burden on the website.
2) The invention achieves large-scale distributed concurrent collection and greatly improves the efficiency of data collection and of task customization.
3) The invention adopts cloud technology and achieves high accuracy in acquiring the required content.
The foregoing description shows and describes several preferred embodiments of the invention. As stated above, however, it is to be understood that the invention is not limited to the forms disclosed herein; these should not be construed as excluding other embodiments, and the invention is capable of use in various other combinations, modifications, and environments, and of changes within the scope of the inventive concept expressed herein, in line with the above teachings or with the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the invention fall within the protection of the appended claims.

Claims (10)

1. A web crawler operation method, comprising:
capturing a URL of a website according to the parameters of a preset mode and adding the URL to a memory queue;
the memory queue judges whether the newly added URL duplicates a URL already stored in the memory queue, and if so, ignores the newly added URL; if not, data is captured from the web page at the newly added URL and the lower-level link URLs found in the page are traversed, each being judged for duplication and ignored if it is a duplicate; if it is not a duplicate, data is captured from the web page at the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data is parsed according to preset conditions, extracted, and transmitted to a data processing queue, wherein capturing data from the web page at the newly added URL and traversing the lower-level link URLs found in the page comprises parsing the lower-level links found in the web page at the newly added URL according to the preset conditions and performing a traversal search according to the traversal depth value in the preset conditions;
the data processing queue compares and analyzes the extracted data against the existing data and modifies the parameters of the preset mode according to the analysis result information, wherein the parameters of the preset mode further comprise: an initial capture address, a capture frequency, a capture delay condition for web pages, and a data storage queue condition within web pages.
2. The web crawler operation method of claim 1, wherein the preset conditions further comprise: parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data.
3. The web crawler operation method of claim 2, wherein capturing the URL of the website according to the parameters of the preset mode and adding it to the memory queue further comprises: setting a preset template according to the initialization environment of the system, the server performance, the network bandwidth conditions, and the number of capture processes; setting the parameters of the preset mode through the preset template; and capturing the URL of the website according to the parameters of the preset mode and adding the URL to the memory queue.
4. The web crawler operation method of claim 3, wherein parsing the captured data according to the preset conditions, extracting it, and transmitting it to the data processing queue further comprises: parsing the captured data according to the preset conditions (parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data), extracting the data, packaging the extracted data, and transmitting the packaged data to the data processing queue.
5. The web crawler operation method of claim 4, wherein modifying the parameters of the preset mode according to the analysis result information further comprises: modifying the content of the preset template according to the analysis result information, and modifying the capture frequency in the parameters of the preset mode through the preset template.
6. A web crawler operating system, comprising a capture module, a memory module, and a data analysis processing module, characterized in that:
the capture module is configured to capture a URL of a website according to the parameters of a preset mode and to transmit the URL to, and add it to, a memory queue in the memory module;
the memory module is configured to receive the URL of the website transmitted by the capture module, store the URL in its memory queue, and judge whether the newly added URL duplicates a URL already stored in the memory queue, ignoring the newly added URL if so; if not, data is captured from the web page at the newly added URL and the lower-level link URLs found in the page are traversed, each being judged for duplication and ignored if it is a duplicate; if it is not a duplicate, data is captured from the web page at the lower-level link URL; the memory queue then judges whether any unprocessed URL remains, and if none remains, the captured data is parsed according to preset conditions, extracted, and transmitted to the data analysis processing module, wherein capturing data from the web page at the newly added URL and traversing the lower-level link URLs found in the page comprises parsing the lower-level links found in the web page at the newly added URL according to the preset conditions and performing a traversal search according to the traversal depth value in the preset conditions;
the data analysis processing module is configured to receive the extracted data transmitted by the memory module and place it into its internal data processing queue, where the data processing queue compares and analyzes the extracted data against the existing data and modifies the parameters of the preset mode in the capture module according to the analysis result information, wherein the parameters of the preset mode in the capture module further comprise: an initial capture address, a capture frequency, a capture delay condition for web pages, and a data storage queue condition within web pages.
7. The web crawler operating system of claim 6, wherein the preset conditions in the memory module further comprise: parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data.
8. The web crawler operating system of claim 7, wherein the capture module is further configured to set a preset template according to the initialization environment of the system, the server performance, the network bandwidth conditions, and the number of capture processes, and to set the parameters of the preset mode through the preset template.
9. The web crawler operating system of claim 8, wherein
the memory module is further configured to parse the captured data according to the preset conditions (parsing DOM data with support for jQuery-like syntax, parsing JSON data, and/or parsing script data), extract the data, package the extracted data, and transmit the packaged data to the data processing queue in the data analysis processing module.
10. The web crawler operating system of claim 9, wherein
the data analysis processing module is further configured to modify the content of the preset template in the capture module according to the analysis result information, and to modify the capture frequency in the parameters of the preset mode through the preset template.
CN201310181364.XA 2013-05-16 2013-05-16 Webpage spider operational method and system Active CN103279507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310181364.XA CN103279507B (en) 2013-05-16 2013-05-16 Webpage spider operational method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310181364.XA CN103279507B (en) 2013-05-16 2013-05-16 Webpage spider operational method and system

Publications (2)

Publication Number Publication Date
CN103279507A CN103279507A (en) 2013-09-04
CN103279507B true CN103279507B (en) 2016-12-28

Family

ID=49062027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310181364.XA Active CN103279507B (en) 2013-05-16 2013-05-16 Webpage spider operational method and system

Country Status (1)

Country Link
CN (1) CN103279507B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942309B (en) * 2014-04-18 2017-06-30 网易乐得科技有限公司 A kind of implementation method of Network Data Capture equipment, method and acquisition process
CN104252530B (en) * 2014-09-10 2017-09-15 北京京东尚科信息技术有限公司 A kind of unit crawler capturing method and system
CN104391917A (en) * 2014-11-19 2015-03-04 四川长虹电器股份有限公司 Method for incrementally capturing webpage contents
CN104572901B (en) * 2014-12-25 2018-12-18 小米科技有限责任公司 The method for down loading and device of web data
CN106202300A (en) * 2016-06-30 2016-12-07 浪潮软件集团有限公司 Network information acquisition method and device
CN106126747A (en) * 2016-07-14 2016-11-16 北京邮电大学 Data capture method based on reptile and device
CN106649720B (en) * 2016-12-22 2020-10-13 北京一览群智数据科技有限责任公司 Data processing method and server
CN109213824B (en) * 2017-06-29 2022-03-04 北京京东尚科信息技术有限公司 Data capture system, method and device
CN107480264B (en) * 2017-08-17 2019-11-15 北京知道创宇信息技术股份有限公司 A kind of web crawlers De-weight method and calculate equipment
CN108763279B (en) * 2018-04-11 2020-12-15 北京中科闻歌科技股份有限公司 Webpage data distributed template acquisition method and system
CN110851746B (en) * 2018-07-27 2022-08-12 北京国双科技有限公司 Crawler seed generation method and device
CN109063144A (en) * 2018-08-07 2018-12-21 广州金猫信息技术服务有限公司 Visual network crawler method and device
CN118820566A (en) * 2024-06-13 2024-10-22 国网山西省电力公司长治供电公司 A data intelligent crawling method and system based on big data product development and screening

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101635718A (en) * 2009-08-26 2010-01-27 中兴通讯股份有限公司 Network crawler system and method for acquiring resource as well as network resource gripping device
CN102930059A (en) * 2012-11-26 2013-02-13 电子科技大学 Method for designing a focused crawler
CN103067521A (en) * 2013-01-08 2013-04-24 中国科学院声学研究所 Distributed-type nodes and distributed-type system in a crawler cluster
CN103092999A (en) * 2013-02-22 2013-05-08 人民搜索网络股份公司 Webpage crawling cycle adjusting method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009059481A1 (en) * 2007-11-08 2009-05-14 Shanghai Hewlett-Packard Co., Ltd Navigational ranking for focused crawling

Also Published As

Publication number Publication date
CN103279507A (en) 2013-09-04

Similar Documents

Publication Publication Date Title
CN103279507B (en) Webpage spider operational method and system
US8799262B2 (en) Configurable web crawler
CN106126693B (en) Method and device for sending related data of webpage
CN107885777A (en) A control method and system for crawling web page data based on collaborative crawler
CN110020062B (en) A customizable web crawler method and system
CN102739663A (en) Detection method and scanning engine of web pages
CN101984429A (en) Method and device for acquiring destination page, search engine and browser
CN103455600B (en) A kind of video URL grasping means, device and server apparatus
CN102982162A (en) System for acquiring webpage information
US9785710B2 (en) Automatic crawling of encoded dynamic URLs
CN101441629A (en) Automatic acquiring method of non-structured web page information
CN106790593B (en) A page processing method and device
US20150120692A1 (en) Method, device, and system for acquiring user behavior
CN104158697B (en) A kind of dead chain detection method and device
CN104281629B (en) The method, apparatus and client device of picture are extracted from webpage
KR102009020B1 (en) Method and apparatus for providing website authentication data for search engine
CN104363237B (en) Method and system for processing metadata of Internet media resources
CN106412003A (en) Information pushing method and device, and information request device
CN103440281A (en) Method, device and equipment for acquiring download file
CN103246675A (en) Method and equipment for capturing data of website
CN109246069B (en) Webpage login method and device and readable storage medium
CN109450742B (en) Method for monitoring network data, physical machine virtual device and network system
CN104704495B (en) Method and device for information search
CN114048400A (en) Method, device, system and medium for acquiring abnormal application program
CN105930385A (en) Data crawling method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190923

Address after: 100088 Beijing Haidian District Garden Road No. 13 Courtyard 7 Floor 12, 1203-1

Patentee after: Lele Kaihang (Beijing) Education Technology Co., Ltd.

Address before: 100085, room 2, building 5, building 1, No. 516, ten Street, Haidian District, Beijing

Patentee before: Beijing Shangyou Tongda Information Technology Co., Ltd.
