
CN105224691B - Information processing method and device


Info

Publication number
CN105224691B
Authority
CN
China
Prior art keywords
log
access
logs
domain name
user
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510729292.7A
Other languages
Chinese (zh)
Other versions
CN105224691A (en)
Inventor
才华
肖春天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING NETENTSEC Inc
Original Assignee
BEIJING NETENTSEC Inc
Application filed by BEIJING NETENTSEC Inc
Priority to CN201510729292.7A
Publication of CN105224691A
Application granted
Publication of CN105224691B

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an information processing method, which comprises: collecting web page access logs from the Internet behavior audit devices of N sampling points, where N is a positive integer; classifying and analyzing the domain names of the web page access logs according to a first predetermined period to generate domain name classification information; obtaining web page access logs from the Internet behavior audit device of a first user; and analyzing the web page access logs from the Internet behavior audit device of the first user according to a second predetermined period, based on the domain name classification information, to identify the web page access logs that characterize the true access behavior of the first user. The invention also discloses an information processing device. With the technical solution of the invention, the actual access behavior of a user can be accurately identified.

Description

Information processing method and device
Technical Field
The present invention relates to the field of network management and security technologies, and in particular, to an information processing method and apparatus.
Background
At present, an Internet behavior management device can obtain a user's web page access logs by auditing the user's Internet traffic. It generally does so by parsing and reassembling the user's Internet Protocol (IP) packets to recover the information carried in the user's Hypertext Transfer Protocol (HTTP) requests and responses. Each web page access log records one access by a specific user to a specific Uniform Resource Locator (URL).
However, analyzing user behavior from web page access logs runs into a noise problem: removing noise and identifying the user's actual web page access behavior is the basis for any subsequent user behavior analysis. Noise comes from several sources:
1. When a user accesses a page, the user's click targets the main URL of the page and triggers a request for it. On receiving the reply, however, the browser also requests the internal resources (e.g., icons and pictures) and external resources (e.g., advertisements) referenced by the main page. These requests for resource URLs are also audited by the Internet behavior management device and become part of the web page access log. Typically, each visit to a large website produces tens or even hundreds of web page access logs, but only one of them represents the true behavior of the user.
2. When a user leaves the browser open after accessing a page, scripts in the page can automatically issue heartbeat, status-update, and similar requests, which also generate web page access logs.
3. Some software, such as antivirus software and terminal management software, communicates with its application server over HTTP to support services such as upgrades; this kind of automatic software behavior also produces web page access logs.
In the prior art, noise is usually identified from the field contents of the HTTP request and its response, for example from the way a browser fills in the Accept field, or from the Content-Type in the request and response. However, standards such as the RFCs (Request For Comments, a series of numbered documents) do not constrain the values of these fields; they are defined by software implementers, so they cannot fundamentally indicate whether the initiator is a user or a piece of software. Even where browser implementations give requests and responses some regularity, that regularity is likely to stop holding as versions are updated. Moreover, browsers are not the only software that can initiate HTTP requests, so this approach can hardly cover the full variety of software.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide an information processing method and apparatus, which can accurately identify the actual access behavior of a user.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the invention provides an information processing method, which comprises the following steps:
collecting a webpage access log from the Internet access behavior auditing equipment of the N sampling points; wherein N is a positive integer;
classifying and analyzing the domain name of the webpage access log according to a first preset period to generate domain name classification information;
acquiring a webpage access log from the internet behavior auditing equipment of a first user;
and analyzing the webpage access log from the internet behavior auditing equipment of the first user based on the domain name classification information according to a second preset period so as to identify the webpage access log for representing the real access behavior of the first user.
In the foregoing solution, preferably, the classifying and analyzing the domain name of the web page access log according to a first predetermined period to generate domain name classification information includes:
for all logs that access the same domain name,
checking whether the number of the logs exceeds a first threshold, and if not, exiting the analysis;
if the number of the logs exceeds the first threshold, checking whether the number of the users who initiate access in all the logs exceeds a second threshold, and if the number of the users does not exceed the second threshold, exiting the analysis;
if the number exceeds the second threshold, checking whether the subject field of each log contains an abnormal field, and excluding the logs containing the abnormal field in the subject field;
calculating the proportion of the logs containing the effective topics, if the proportion of the logs containing the effective topics exceeds a third threshold, calculating the distribution of topic length weights in all the logs containing the effective topics, if the weighted average of the topic length weights exceeds a fourth threshold, calculating the information quantity of the topics in all the logs containing the effective topics, and if the information quantity exceeds a fifth threshold, judging that the domain name is a content domain name;
otherwise, if the proportion of the logs containing the effective topics does not exceed the third threshold, or if the weighted average of the length weights of the topics does not exceed the fourth threshold, or if the information quantity does not exceed the fifth threshold, the domain name is judged as the resource domain name.
In the foregoing solution, preferably, the analyzing, according to the second predetermined period and based on the domain name classification information, the web page access log from the internet access behavior auditing device of the first user includes:
analyzing the domain name of a webpage access log from the online behavior auditing equipment of the first user based on the domain name classification information, and dividing the webpage access log into access to a content domain name and access to a resource domain name;
analyzing the subject information of the log of the access content domain name, and finding out the log of the access behavior belonging to the first user;
performing time sequence analysis on the logs of the access content domain names, and finding out the logs of the access behaviors belonging to the first user;
performing periodic analysis on the logs belonging to the access behaviors of the first user based on the URL, judging whether periodic characteristics exist or not, and if so, cleaning as noise;
for the log which is still determined as the access behavior of the first user after periodic analysis, performing access frequency analysis based on the domain name, judging whether the access frequency exceeds a sixth threshold, and if so, cleaning as noise;
and determining the log which is still determined as the access behavior of the first user after the frequency analysis as the real access behavior of the first user.
In the foregoing solution, preferably, the analyzing the subject information of the log of the access content domain name to find out the log of the access behavior belonging to the first user includes:
for each log of access to a content domain name,
calculating the weighted length of the subject;
calculating the information amount of the subject;
and judging a log whose weighted length and information amount are both greater than the corresponding thresholds to be a log belonging to the access behavior of the first user.
In the foregoing solution, preferably, the performing a time sequence analysis on the log of accessing the content domain name includes:
classifying the logs of the access content domain names according to website names;
sequencing all logs in each class according to time, and dividing the sequenced logs into sets according to a preset rule;
and selecting the log in the set according with the time sequence model according to the domain name type, the URL information, the subject information and the log quantity in the set as the webpage access log of the real access behavior of the first user.
In the foregoing solution, preferably, the selecting a log in a set conforming to a time sequence model according to a domain name type, URL information, topic information, and a log number in the set as a web page access log of a real access behavior of a first user includes:
performing duplicate removal processing on the logs in the set according to the URL;
judging whether a log which is judged as the access behavior of the first user exists or not;
if the log exists, cleaning other logs in the set as noise;
if the log does not exist, acquiring the number of the logs in the set; if the number of the logs does not exceed the first threshold, cleaning the logs in the set as noise; and if the first threshold value is exceeded, determining the log at the beginning in the set as a webpage access log representing the real access behavior of the first user.
The invention also provides an information processing device, which comprises a collecting module, a domain name classifying module, an obtaining module and a log analyzing module; wherein,
the collection module is used for collecting webpage access logs from the Internet surfing behavior auditing equipment of the N sampling points; wherein N is a positive integer;
the domain name classification module is used for classifying and analyzing the domain names of the webpage access logs according to a first preset period to generate domain name classification information;
the acquisition module is used for acquiring a webpage access log from the internet behavior auditing equipment of the first user;
and the log analysis module is used for analyzing the webpage access log from the internet behavior auditing equipment of the first user according to a second preset period and based on the domain name classification information so as to identify the webpage access log used for representing the real access behavior of the first user.
In the foregoing solution, preferably, the domain name classification module is further configured to:
for all logs that access the same domain name,
checking whether the number of the logs exceeds a first threshold, and if not, exiting the analysis;
if the number of the logs exceeds the first threshold, checking whether the number of the users who initiate access in all the logs exceeds a second threshold, and if the number of the users does not exceed the second threshold, exiting the analysis;
if the number exceeds the second threshold, checking whether the subject field of each log contains an abnormal field, and excluding the logs containing the abnormal field in the subject field;
calculating the proportion of the logs containing the effective topics, if the proportion of the logs containing the effective topics exceeds a third threshold, calculating the distribution of topic length weights in all the logs containing the effective topics, if the weighted average of the topic length weights exceeds a fourth threshold, calculating the information quantity of the topics in all the logs containing the effective topics, and if the information quantity exceeds a fifth threshold, judging that the domain name is a content domain name;
otherwise, if the proportion of the logs containing the effective topics does not exceed the third threshold, or if the weighted average of the length weights of the topics does not exceed the fourth threshold, or if the information quantity does not exceed the fifth threshold, the domain name is judged as the resource domain name.
In the foregoing solution, preferably, the log analysis module includes:
the domain name analysis sub-module is used for analyzing the domain name of a webpage access log from the internet behavior auditing equipment of the first user based on the domain name classification information, and dividing the webpage access log into access to a content domain name and access to a resource domain name;
the topic analysis submodule is used for analyzing the topic information of the logs of the access content domain names and finding out the logs of the access behaviors of the first user;
the time sequence analysis submodule is used for carrying out time sequence analysis on the logs of the access content domain names and finding out the logs of the access behaviors of the first user;
the periodic analysis submodule is used for carrying out periodic analysis on the logs belonging to the access behaviors of the first user based on the URL, judging whether periodic characteristics exist or not, and if so, cleaning the logs as noise;
the frequency analysis submodule is used for carrying out access frequency analysis on the log which is still judged as the access behavior of the first user after periodic analysis based on the domain name, judging whether the access frequency exceeds a sixth threshold, and if so, cleaning the log as noise;
and the determining submodule is used for determining the log which is still determined as the access behavior of the first user after the frequency analysis as the real access behavior of the first user.
In the foregoing solution, preferably, the topic analysis submodule is further configured to:
for a log of access to a domain name of a content class,
calculating the weighted length of the subject;
calculating the information amount of the subject;
and judging the log with the weighted length and the information quantity larger than the corresponding threshold value as the log belonging to the access behavior of the first user.
In the foregoing solution, preferably, the timing analysis submodule is further configured to:
classifying the logs of the access content domain names according to website names;
sequencing all logs in each class according to time, and dividing the sequenced logs into sets according to a preset rule;
and selecting the log in the set according with the time sequence model according to the domain name type, the URL information, the subject information and the log quantity in the set as the webpage access log of the real access behavior of the first user.
In the foregoing solution, preferably, the timing analysis submodule is further configured to:
performing duplicate removal processing on the logs in the set according to the URL;
judging whether a log which is judged as the access behavior of the first user exists or not;
if the log exists, cleaning other logs in the set as noise;
if the log does not exist, acquiring the number of the logs in the set; if the number of the logs does not exceed the first threshold, cleaning the logs in the set as noise; and if the first threshold value is exceeded, determining the log at the beginning in the set as a webpage access log representing the real access behavior of the first user.
According to the information processing method and device provided by the invention, web page access logs are collected from the Internet behavior auditing devices of a plurality of sampling points; the domain names of the web page access logs are classified and analyzed according to a first predetermined period to generate domain name classification information; web page access logs are acquired from the Internet behavior auditing device of a first user; and those logs are analyzed according to a second predetermined period, based on the domain name classification information, to identify the web page access logs that represent the real access behavior of the first user. In this way, noise can be effectively cleaned from the web page access logs, and the actual access behavior of the user can be accurately identified.
Drawings
FIG. 1 is a schematic diagram of an implementation flow of an information processing method provided by the present invention;
FIG. 2 is a schematic diagram illustrating an implementation process of classifying and analyzing domain names of collected web page access logs according to the present invention;
FIG. 3 is a schematic view of an implementation process of analyzing a web access log from a first user's internet behavior auditing device based on domain name classification information according to the present invention;
FIG. 4 is a schematic diagram of a process for analyzing topic information of a log of an access content domain name according to the present invention;
FIG. 5 is a schematic diagram illustrating an implementation process of performing timing analysis on a log of access content domain names according to the present invention;
FIG. 6 is a schematic diagram of a composition structure of an information processing apparatus according to the present invention.
Detailed Description
So that the features and technical aspects of the present invention can be understood in detail, the invention, briefly summarized above, is described more particularly below with reference to embodiments, some of which are illustrated in the accompanying drawings.
Example one
FIG. 1 is a schematic diagram of an implementation flow of the information processing method provided by the present invention. As shown in FIG. 1, the information processing method mainly includes the following steps:
Step 101: Collect web page access logs from the Internet behavior auditing devices of N sampling points, where N is a positive integer.
In this embodiment, the Internet behavior auditing devices in step 101 and step 103 obtain a user's web page access logs by auditing the user's Internet traffic.
Step 102: Classify and analyze the domain names of the web page access logs according to a first predetermined period to generate domain name classification information.
Preferably, the classifying and analyzing the domain name of the web page access log according to the first predetermined period to generate domain name classification information may include:
for all logs that access the same domain name,
checking whether the number of the logs exceeds a first threshold, and if not, exiting the analysis;
if the number of the logs exceeds the first threshold, checking whether the number of the users who initiate access in all the logs exceeds a second threshold, and if the number of the users does not exceed the second threshold, exiting the analysis;
if the number exceeds the second threshold, checking whether the subject field of each log contains an abnormal field, and excluding the logs containing the abnormal field in the subject field;
calculating the proportion of the logs containing the effective topics, if the proportion of the logs containing the effective topics exceeds a third threshold, calculating the distribution of topic length weights in all the logs containing the effective topics, if the weighted average of the topic length weights exceeds a fourth threshold, calculating the information quantity of the topics in all the logs containing the effective topics, and if the information quantity exceeds a fifth threshold, judging that the domain name is a content domain name;
otherwise, if the proportion of the logs containing the effective topics does not exceed the third threshold, or if the weighted average of the length weights of the topics does not exceed the fourth threshold, or if the information quantity does not exceed the fifth threshold, the domain name is judged as the resource domain name.
Specifically, the content domain name is mainly used for storing a URL pointing to a news page, a video page, and the like to provide browsing content for a user; the resource domain name is mainly used for storing URL pointing to resources such as advertisements and pictures.
Specifically, the exception field generally represents that the returned page is a wrong page or does not have valid content; for example, the code is "304", or "404", or "error" or other error information.
Here, the log containing the valid subject may be understood as: logs other than logs containing no topics or logs containing anomalous topics.
The values of the first threshold, the second threshold, the third threshold, the fourth threshold and the fifth threshold can be set according to actual conditions.
Step 103: Acquire web page access logs from the Internet behavior auditing device of the first user.
Here, the first user refers to a specific user, i.e., a user to be analyzed by the system.
Step 104: Analyze the web page access logs from the Internet behavior auditing device of the first user according to a second predetermined period, based on the domain name classification information, to identify the web page access logs that represent the real access behavior of the first user.
The first predetermined period and the second predetermined period may be the same or different.
Preferably, the analyzing, according to the second predetermined period and based on the domain name classification information, the web page access log from the internet behavior auditing device of the first user may include:
analyzing the domain name of a webpage access log from the online behavior auditing equipment of the first user based on the domain name classification information, and dividing the webpage access log into access to a content domain name and access to a resource domain name;
analyzing the subject information of the log of the access content domain name, and finding out the log of the access behavior belonging to the first user;
performing time sequence analysis on the logs of the access content domain names, and finding out the logs of the access behaviors belonging to the first user;
performing periodic analysis on the logs belonging to the access behaviors of the first user based on the URL, judging whether periodic characteristics exist or not, and if so, cleaning as noise;
for the log which is still determined as the access behavior of the first user after periodic analysis, performing access frequency analysis based on the domain name, judging whether the access frequency exceeds a sixth threshold, and if so, cleaning as noise;
and determining the log which is still determined as the access behavior of the first user after the frequency analysis as the real access behavior of the first user.
Here, it should be noted that, if the domain name of the web page access log from the internet behavior auditing device of the first user is not within the range of the domain name classification information obtained in step 102, the log is marked as access to the content class domain name.
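The periodic analysis and frequency analysis described above can be sketched as follows. The text does not say how periodic characteristics are detected, so this sketch assumes one possible test: a URL is treated as periodic when it is requested at least `min_hits` times at nearly constant intervals (coefficient of variation of the gaps no greater than `max_cv`). Access frequency is assumed to mean the number of remaining logs per domain name within the analysis period. The log fields `url`, `domain`, and `time`, and all function and parameter names, are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev
from typing import Dict, List


def is_periodic(times: List[float], min_hits: int = 4, max_cv: float = 0.1) -> bool:
    """Assumed test: enough hits, with nearly constant gaps between them."""
    if len(times) < min_hits:
        return False
    ordered = sorted(times)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    avg = mean(gaps)
    if avg == 0:
        return False  # identical timestamps carry no period information
    return pstdev(gaps) / avg <= max_cv


def drop_noise(logs: List[Dict], max_hits_per_domain: int) -> List[Dict]:
    """Remove logs of periodic URLs, then logs of over-frequent domains."""
    times_by_url = defaultdict(list)
    for log in logs:
        times_by_url[log["url"]].append(log["time"])
    periodic = {url for url, ts in times_by_url.items() if is_periodic(ts)}
    kept = [log for log in logs if log["url"] not in periodic]

    # max_hits_per_domain plays the role of the sixth threshold.
    hits = Counter(log["domain"] for log in kept)
    return [log for log in kept if hits[log["domain"]] <= max_hits_per_domain]
```

Logs that survive both filters correspond to what the text calls the real access behavior of the first user.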
Preferably, the analyzing the subject information of the log of accessing the content domain name to find out the log of the access behavior belonging to the first user may include:
for each log of access to a content domain name,
calculating the weighted length of the subject;
calculating the information amount of the subject;
and judging a log whose weighted length and information amount are both greater than the corresponding thresholds to be a log belonging to the access behavior of the first user.
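A minimal sketch of this per-log check is given below. The text only states that a Chinese character weighs more than an English one and that the information amount may be called an entropy value, so the concrete weights (2 and 1), the use of character-level Shannon entropy, and the default thresholds are all assumptions chosen for illustration. The function names are hypothetical.

```python
import math
from collections import Counter

# Assumed weights: the text only says Chinese characters weigh more.
CJK_WEIGHT = 2.0
OTHER_WEIGHT = 1.0


def is_cjk(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def weighted_length(topic: str) -> float:
    """Sum of per-character weights, ignoring whitespace."""
    return sum(CJK_WEIGHT if is_cjk(ch) else OTHER_WEIGHT
               for ch in topic if not ch.isspace())


def information_amount(topic: str) -> float:
    """Shannon entropy, in bits per character, of the topic string."""
    chars = [ch for ch in topic if not ch.isspace()]
    if not chars:
        return 0.0
    total = len(chars)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(chars).values())


def is_user_access(topic: str, min_length: float = 8.0,
                   min_info: float = 2.0) -> bool:
    """Both measures must exceed their (assumed) thresholds."""
    return (weighted_length(topic) > min_length
            and information_amount(topic) > min_info)
```

Under these assumptions a short or repetitive title such as "ok" or "aaaaaaaaaa" fails the check, while a descriptive page title passes.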
Preferably, the performing a time sequence analysis on the log of the access content class domain name may include:
classifying the logs of the access content domain names according to website names;
sequencing all logs in each class according to time, and dividing the sequenced logs into sets according to a preset rule;
and selecting the log in the set according with the time sequence model according to the domain name type, the URL information, the subject information and the log quantity in the set as the webpage access log of the real access behavior of the first user.
Preferably, the dividing the sorted logs into sets according to a preset rule may include:
for the sorted logs, taking the earliest log as a starting point, and forming a set from that log and all later logs whose time interval does not exceed a preset threshold T.
Of course, the preset rule is not limited to the form given above; other forms are not enumerated here.
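The grouping rule can be sketched as below. The text leaves open whether the interval is measured from the starting log of the set or between consecutive logs; this sketch assumes it is measured from the starting log. The `time` field (seconds) and the function name are hypothetical.

```python
from typing import Dict, List


def split_into_sets(logs: List[Dict], t_seconds: float) -> List[List[Dict]]:
    """Group time-sorted logs into sets anchored at each set's first log."""
    ordered = sorted(logs, key=lambda log: log["time"])
    sets: List[List[Dict]] = []
    for log in ordered:
        # Assumption: the interval is measured from the set's starting log.
        if sets and log["time"] - sets[-1][0]["time"] <= t_seconds:
            sets[-1].append(log)
        else:
            sets.append([log])  # this log starts a new set
    return sets
```

Measuring the gap between consecutive logs instead would only change the comparison to use `sets[-1][-1]["time"]`.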
Preferably, the selecting a log from the set according to the domain name type, the URL information, the subject information, and the number of logs in the set, which is used as the web page access log of the real access behavior of the first user, may include:
performing duplicate removal processing on the logs in the set according to the URL;
judging whether a log which is judged as the access behavior of the first user exists or not;
if the log exists, cleaning other logs in the set as noise;
if the log does not exist, acquiring the number of the logs in the set; if the number of the logs does not exceed the first threshold, cleaning the logs in the set as noise; and if the first threshold value is exceeded, determining the log at the beginning in the set as a webpage access log representing the real access behavior of the first user.
Here, the first threshold may be set according to actual conditions.
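The selection within one set can be sketched as follows. It assumes each log is a dictionary with a `url` field and an optional boolean `user_access` flag set earlier by the subject analysis; both field names, the function name, and the choice to keep the first occurrence when deduplicating are hypothetical.

```python
from typing import Dict, List


def select_from_set(log_set: List[Dict], count_threshold: int) -> List[Dict]:
    """Return the logs of one set that are kept as real user accesses."""
    # Deduplicate by URL, keeping the first occurrence (an assumption).
    seen = set()
    unique = []
    for log in log_set:
        if log["url"] not in seen:
            seen.add(log["url"])
            unique.append(log)

    # If subject analysis already flagged a log, the rest is noise.
    flagged = [log for log in unique if log.get("user_access")]
    if flagged:
        return flagged

    # Otherwise a small set is all noise; a large one keeps its first log.
    if len(unique) <= count_threshold:
        return []
    return [unique[0]]
```

The intuition, as far as the text suggests, is that a burst of many requests usually starts with the page the user actually clicked, followed by its resources.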
In this embodiment, the method may be applied to a device for analyzing, counting, or managing actual internet surfing behavior of a user.
In the embodiment of the invention, web page access logs are collected from the Internet behavior auditing devices of a plurality of sampling points; the domain names of the web page access logs are classified and analyzed according to a first predetermined period to generate domain name classification information; web page access logs are acquired from the Internet behavior auditing device of a first user; and those logs are analyzed according to a second predetermined period, based on the domain name classification information, to identify the web page access logs that represent the real access behavior of the first user. In this way, noise can be effectively cleaned from the web page access logs, and the actual access behavior of the user can be accurately identified.
Example two
FIG. 2 is a schematic view of an implementation process for classifying and analyzing domain names of collected web page access logs provided by the present invention. As shown in FIG. 2, the process mainly includes the following steps:
Step 201: For all logs accessing the same domain name, check whether the number of logs exceeds a first threshold. If not, the sample is insufficient and the analysis exits; if the first threshold is exceeded, go to step 202.
Step 202: Check whether the number of users initiating access in the logs exceeds a second threshold. If not, the sample is insufficient and the analysis exits; if the second threshold is exceeded, go to step 203.
Step 203: Filter out the logs with abnormal topics, then go to step 204.
Specifically, check whether the topic field of each log contains an abnormal field, and if so, exclude that log.
Here, an abnormal field generally indicates that the returned page is an error page or has no valid content, for example error information such as "304", "404", or "error".
Step 204: Calculate the proportion of logs containing valid topics. If the proportion exceeds a third threshold, go to step 205; otherwise, go to step 208.
Step 205: Calculate the distribution of topic length weights over all logs containing valid topics. If the weighted average of the topic length weights exceeds a fourth threshold, go to step 206; otherwise, go to step 208.
In particular, the length weight depends on the length of the topic field and also on the characters it contains; for example, a Chinese character has a greater length weight than an English character.
Specifically, a weighted average of the topic length weights is calculated, and if the weighted average is lower than the fourth threshold, the domain name is judged to be a resource domain name.
Step 206: Calculate the information amount of the topics in all logs containing valid topics. If the information amount exceeds a fifth threshold, go to step 207; otherwise, go to step 208.
Here, the information amount may also be referred to as an entropy value.
Specifically, the information amount of the topics in all logs containing valid topics is calculated, and if the information amount is lower than the fifth threshold, the domain name is judged to be a resource domain name.
Step 207: Judge a domain name that passes all the checks to be a content domain name.
Step 208: Judge a domain name that fails a check to be a resource domain name.
The execution subject of the above steps 201 to 208 may be a domain name classification subsystem.
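The decision flow of steps 201 to 208 can be sketched roughly as follows. This is only an illustrative sketch under several assumptions that the text does not fix: each log is a dict with hypothetical keys `user` and `subject`; a "valid subject" is taken to be a non-empty subject that survives the abnormal-field filter; the weighted average is approximated by a plain mean of per-subject weighted lengths (weight 2 for a CJK ideograph, 1 otherwise); and the information amount is taken to be the Shannon entropy of the distribution of distinct subjects. The names `classify_domain`, `weighted_length`, `subject_entropy` and `ABNORMAL_MARKERS` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical marker list; the text names "304", "404" and "error" as examples.
ABNORMAL_MARKERS = ("304", "404", "error")


def weighted_length(subject):
    """Assumed weighting: 2 per CJK ideograph, 1 per any other character."""
    return sum(2 if "\u4e00" <= ch <= "\u9fff" else 1 for ch in subject)


def subject_entropy(subjects):
    """Shannon entropy (bits) of the distribution of distinct subject strings."""
    counts = Counter(subjects)
    total = len(subjects)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def classify_domain(logs, t1, t2, t3, t4, t5):
    """Return "content", "resource", or None when the sample is insufficient."""
    if len(logs) <= t1:                                    # step 201
        return None
    if len({log["user"] for log in logs}) <= t2:           # step 202
        return None
    kept = [log for log in logs                            # step 203
            if not any(m in log["subject"].lower() for m in ABNORMAL_MARKERS)]
    valid = [log["subject"] for log in kept if log["subject"].strip()]
    if not kept or len(valid) / len(kept) <= t3:           # step 204 -> step 208
        return "resource"
    mean_length = sum(weighted_length(s) for s in valid) / len(valid)
    if mean_length <= t4:                                  # step 205 -> step 208
        return "resource"
    if subject_entropy(valid) <= t5:                       # step 206 -> step 208
        return "resource"
    return "content"                                       # step 207
```

The five thresholds are passed in as parameters because, as noted for the analysis model, their effective values would be determined from labeled data rather than fixed in advance.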
It should be noted that the above analysis model may be combined with human knowledge as input: the domain name classification subsystem is trained on manually labeled data to finally determine the effective threshold for each step.
In addition, for domain names to which the analysis model is not applicable, the classification result needs to be supplemented and corrected manually, and such domain names are manually marked as resource class domain names or content class domain names.
Example three
Fig. 3 is a schematic diagram of the implementation process of analyzing, based on the domain name classification information, the web page access logs from the internet behavior auditing device of the first user. As shown in Fig. 3, the process mainly includes the following steps:
Step 301: analyze the domain names of the web page access logs from the internet behavior auditing device of the first user based on the domain name classification information, and divide the web page access logs into accesses to content class domain names and accesses to resource class domain names.
Here, it should be noted that if the domain name of a web page access log from the internet behavior auditing device of the first user is not covered by the domain name classification information obtained in step 102, the log is marked as an access to a content class domain name.
Step 302: analyze the subject information of the logs that access content class domain names, and find the logs that belong to the access behavior of the first user.
Step 303: perform time sequence analysis on the logs that access content class domain names, and find the logs that belong to the access behavior of the first user.
Specifically, the logs are grouped by website domain name and sorted by time; the sorted logs are divided into sets according to the time sequence; and, according to the domain name type, the URL information, the subject information and the number of logs in each set, the logs in the set that conform to the time sequence model are selected as the real access logs of the user.
Step 304: for the logs determined by the above flow to be the access behavior of the first user, perform periodic analysis based on the URL and determine whether a log has periodic characteristics; if so, clean it as noise.
Step 305: for the logs still determined to be the access behavior of the first user after the periodic analysis, perform access frequency analysis based on the domain name and determine whether the access frequency exceeds a sixth threshold; if so, clean them as noise.
Step 306: the logs still determined to be the access behavior of the first user after the frequency analysis are determined to be the real access behavior of the first user.
The execution subject of the above steps 301 to 306 may be a log analysis subsystem.
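Steps 304 to 306 can be illustrated with the following sketch. The text does not specify how periodicity is detected, so the test used here is purely an assumption: a URL is treated as periodic when it is requested at least `min_hits` times and the gaps between requests are nearly constant (coefficient of variation at most `max_jitter`). Each log is assumed to be a dict with hypothetical keys `url` and `ts` (seconds), the input is assumed to cover one analysis window, and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev
from urllib.parse import urlsplit


def periodic_urls(logs, min_hits=3, max_jitter=0.1):
    """URLs requested at nearly constant intervals (assumed periodicity test)."""
    times = defaultdict(list)
    for log in logs:
        times[log["url"]].append(log["ts"])
    flagged = set()
    for url, stamps in times.items():
        if len(stamps) < min_hits:
            continue
        stamps.sort()
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        average = mean(gaps)
        if average > 0 and pstdev(gaps) / average <= max_jitter:
            flagged.add(url)
    return flagged


def overfrequent_domains(logs, sixth_threshold):
    """Domains whose log count within the window exceeds the sixth threshold."""
    counts = defaultdict(int)
    for log in logs:
        counts[urlsplit(log["url"]).hostname] += 1
    return {domain for domain, n in counts.items() if n > sixth_threshold}


def clean(logs, sixth_threshold):
    """Drop periodic URLs (step 304), then over-frequent domains (step 305)."""
    periodic = periodic_urls(logs)
    survivors = [log for log in logs if log["url"] not in periodic]
    noisy = overfrequent_domains(survivors, sixth_threshold)
    # Whatever remains corresponds to step 306: real access behavior.
    return [log for log in survivors
            if urlsplit(log["url"]).hostname not in noisy]
```

Running the periodic filter first mirrors the order of the steps: the frequency count is taken only over logs that survived the periodic analysis.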
Example four
Fig. 4 is a schematic diagram of the implementation process of analyzing the subject information of a log that accesses a content class domain name. As shown in Fig. 4, the process mainly includes the following steps:
Step 401: when the log is marked as an access to a content class domain name, calculate the weighted length of the log subject.
Step 402: calculate the information amount of the log subject.
Step 403: determine whether the weighted length and the information amount are both larger than the corresponding thresholds; if so, go to step 404.
Here, the threshold for the weighted length is the fourth threshold, and the threshold for the information amount is the fifth threshold.
Step 404: determine a log whose weighted length and information amount are both larger than the corresponding thresholds to be a log belonging to the access behavior of the first user.
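The per-log check of steps 401 to 404 might look like the sketch below. Two points are assumptions rather than statements of the method: the character weights (2 for a CJK ideograph, 1 for anything else, as in one implementation mentioned in the worked example) and the way the information amount is computed, which the text does not define. Here it is approximated by the Shannon entropy of the characters in the subject, so the values will not necessarily match entropy figures quoted elsewhere. The function names are hypothetical.

```python
import math
from collections import Counter


def weighted_length(subject, cjk_weight=2, other_weight=1):
    """Step 401: sum per-character weights; CJK ideographs weigh more."""
    return sum(cjk_weight if "\u4e00" <= ch <= "\u9fff" else other_weight
               for ch in subject)


def char_entropy(subject):
    """Step 402 (assumed measure): Shannon entropy in bits over the characters."""
    if not subject:
        return 0.0
    counts = Counter(subject)
    n = len(subject)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def is_user_access(subject, fourth_threshold, fifth_threshold):
    """Steps 403-404: both features must exceed their thresholds."""
    return (weighted_length(subject) > fourth_threshold
            and char_entropy(subject) > fifth_threshold)
```

A subject of 20 Chinese characters and 3 other characters gets a weighted length of 43 under these weights, while an empty subject scores 0 on both features and is rejected.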
Example five
Fig. 5 is a schematic diagram of the implementation process of performing time sequence analysis on the logs that access content class domain names. As shown in Fig. 5, the process mainly includes the following steps:
Step 501: classify the logs that access content class domain names by website name.
Here, the website name refers to the primary domain name, such as baidu or sina; it can be extracted from the URL according to the domain naming convention.
Step 502: sort the logs that access the same website by time.
Step 503: take the log with the earliest access time as the starting point, and form a set from it and all later logs whose interval does not exceed a preset threshold T.
Step 504: remove logs with repeated URLs from the set.
Step 505: determine whether the set contains a log that has already been determined to be an access behavior of the first user; if so, go to step 506; if not, go to step 507.
Step 506: clean the other logs in the set as noise.
Step 507: obtain the number of logs in the set and determine whether it exceeds a first threshold; if not, go to step 508; if so, go to step 509.
Step 508: clean the logs in the set as noise.
Step 509: determine the initial log in the set to be a web page access log representing the real access behavior of the first user.
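Steps 501 to 509 can be sketched as follows. Several details are assumptions: logs are dicts with hypothetical keys `url`, `ts` (seconds) and `is_user` (the result of the earlier subject analysis); the interval in step 503 is measured from the first log of each set; and the website name is extracted with a deliberately naive suffix list (`SUFFIXES`), where a real system would need a complete public-suffix table. All names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical, incomplete suffix list, longest suffix first.
SUFFIXES = (".com.cn", ".com", ".net", ".org", ".cn")


def website_name(url):
    """Step 501 helper: the label just before the suffix, e.g. 'sina'."""
    host = urlsplit(url).hostname or ""
    for suffix in SUFFIXES:
        if host.endswith(suffix):
            return host[:-len(suffix)].split(".")[-1]
    return host


def split_into_sets(logs, t):
    """Steps 502-503: sort by time; a set spans at most t seconds from its first log."""
    sets = []
    for log in sorted(logs, key=lambda item: item["ts"]):
        if sets and log["ts"] - sets[-1][0]["ts"] <= t:
            sets[-1].append(log)
        else:
            sets.append([log])
    return sets


def dedupe(logs):
    """Step 504: keep only the first log for each URL."""
    seen, unique = set(), []
    for log in logs:
        if log["url"] not in seen:
            seen.add(log["url"])
            unique.append(log)
    return unique


def real_accesses(logs, t, count_threshold):
    """Return the logs kept as real access behavior after steps 501-509."""
    by_site = defaultdict(list)
    for log in logs:
        by_site[website_name(log["url"])].append(log)
    kept = []
    for site_logs in by_site.values():
        for group in split_into_sets(site_logs, t):
            group = dedupe(group)
            flagged = [log for log in group if log.get("is_user")]
            if flagged:                          # steps 505-506
                kept.extend(flagged)
            elif len(group) > count_threshold:   # steps 507 and 509
                kept.append(group[0])
            # otherwise step 508: the whole set is dropped as noise
    return kept
```

The idea behind keeping only the first log of a large unflagged set is that a page click tends to trigger a burst of follow-up requests, so the earliest request in the burst is the most plausible user action.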
Example six
Fig. 6 is a schematic diagram of a composition structure of an information processing apparatus provided by the present invention, as shown in fig. 6, the apparatus includes a collecting module 61, a domain name classifying module 62, an obtaining module 63, and a log analyzing module 64; wherein,
the collecting module 61 is configured to collect a web access log from the internet access behavior auditing equipment of the N sampling points; wherein N is a positive integer;
the domain name classification module 62 is configured to classify and analyze domain names of the web page access logs according to a first predetermined period, and generate domain name classification information;
the obtaining module 63 is configured to obtain a web access log from the internet behavior auditing device of a first user;
the log analysis module 64 is configured to analyze, according to a second predetermined period and based on the domain name classification information, the web access log from the internet access behavior auditing device of the first user, so as to identify the web access log used for characterizing the real access behavior of the first user.
Preferably, the domain name classification module 62 is further configured to:
for all logs that access the same domain name,
checking whether the number of the logs exceeds a first threshold, and if not, exiting the analysis;
if the number of the logs exceeds the first threshold, checking whether the number of the users who initiate access in all the logs exceeds a second threshold, and if the number of the users does not exceed the second threshold, exiting the analysis;
if the number exceeds the second threshold, checking whether the subject field of each log contains an abnormal field, and excluding the logs containing the abnormal field in the subject field;
calculating the proportion of the logs containing the effective topics, if the proportion of the logs containing the effective topics exceeds a third threshold, calculating the distribution of topic length weights in all the logs containing the effective topics, if the weighted average of the topic length weights exceeds a fourth threshold, calculating the information quantity of the topics in all the logs containing the effective topics, and if the information quantity exceeds a fifth threshold, judging that the domain name is a content domain name;
otherwise, if the proportion of the logs containing the effective topics does not exceed the third threshold, or if the weighted average of the length weights of the topics does not exceed the fourth threshold, or if the information quantity does not exceed the fifth threshold, the domain name is judged as the resource domain name.
Preferably, the log analysis module 64 includes:
the domain name analysis sub-module 641 is configured to analyze a domain name of a web access log from the internet behavior audit device of the first user based on the domain name classification information, and divide the web access log into an access to a content domain name and an access to a resource domain name;
the topic analysis sub-module 642 is configured to perform topic information analysis on the log of the access content domain name, and find out the log of the access behavior belonging to the first user;
the time sequence analysis sub-module 643, configured to perform time sequence analysis on the log of the access content domain name, and find out the log of the access behavior belonging to the first user;
the period analysis sub-module 644 is configured to perform periodic analysis, based on the URL, on the logs belonging to the access behavior of the first user, determine whether a log has periodic characteristics, and if so, clean it as noise;
the frequency analysis sub-module 645 is configured to perform domain-name-based access frequency analysis on the logs that are still determined to be the access behavior of the first user after the periodic analysis, determine whether the access frequency exceeds a sixth threshold, and if so, clean them as noise;
the determining sub-module 646 is configured to determine, as the real access behavior of the first user, the log that is still determined as the access behavior of the first user after the frequency analysis.
Preferably, the topic analysis sub-module 642 is further configured to:
for a log that accesses a content class domain name,
calculating the weighted length of the subject;
calculating the information amount of the subject;
and determining a log whose weighted length and information amount are both larger than the corresponding thresholds to be a log belonging to the access behavior of the first user.
Preferably, the timing analysis submodule 643 is further configured to:
classifying the logs of the access content domain names according to website names;
sequencing all logs in each class according to time, and dividing the sequenced logs into sets according to a preset rule;
and selecting the log in the set according with the time sequence model according to the domain name type, the URL information, the subject information and the log quantity in the set as the webpage access log of the real access behavior of the first user.
Preferably, the timing analysis submodule 643 is further configured to:
performing duplicate removal processing on the logs in the set according to the URL;
judging whether a log which is judged as the access behavior of the first user exists or not;
if the log exists, cleaning other logs in the set as noise;
if the log does not exist, acquiring the number of the logs in the set; if the number of the logs does not exceed the first threshold, cleaning the logs in the set as noise; and if the first threshold value is exceeded, determining the log at the beginning in the set as a webpage access log representing the real access behavior of the first user.
Specifically, the collecting module 61 and the domain name classifying module 62 may constitute a domain name classifying subsystem; the acquisition module 63 and the log analysis module 64 may constitute a log analysis subsystem.
Those skilled in the art will understand that the functions implemented by the processing modules in the information processing apparatus shown in fig. 6 can be understood by referring to the related description of the aforementioned information processing method. Those skilled in the art will appreciate that each processing module in the information processing apparatus shown in fig. 6 may be implemented by a program running on a processor, or may be implemented by a specific logic circuit.
In practical applications, the collecting module 61, the domain name classifying module 62, the obtaining module 63, the log analyzing module 64 and their sub-modules described in the above embodiments may each be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP) or a Field-Programmable Gate Array (FPGA) in the information processing device or in the equipment where the information processing device is located.
Example seven
The technical solution of the present invention is explained in detail by a specific case.
Suppose that the web page access logs generated when a user clicks a Sina news page are as shown in Table 1:
TABLE 1
Step 701: Analyze the domain name.
According to the input of the domain name classification subsystem, for the logs with the IDs of 1-6, the domain name' news. For the log with ID 9, the domain name "sax.sina.com.cn" is a resource class domain name. For the logs with IDs 7 and 8, the domain name "1303.adsina.allyes.com" of log 7 and the domain name "1352.adsina.allyes.com" of log 8 do not appear in the domain name classification result, so both logs are still treated as accesses to content class domain names.
Step 702: Analyze the subject.
Specifically, in this embodiment, subject analysis is performed on the logs marked as accessing content class domain names, that is, the logs with IDs 1 to 8.
First, weighted length analysis is performed. The subject of the log with ID 1 contains 20 Chinese characters and 3 special characters. In one implementation, a Chinese character has a weight of 2 and English and special characters have a weight of 1, so the weighted length is 20 × 2 + 3 = 43. The weighted length of the logs with IDs 2 to 6 is 8, and that of the logs with IDs 7 and 8 is 0.
Then, information amount analysis is performed. The log with ID 1 has an entropy value of 1.09; the logs with IDs 2 to 6 have an entropy value of 0.6; the logs with IDs 7 and 8 have an entropy value of 0.
Here, how to perform the threshold calculation is not described in detail.
In one specific example, if the threshold of the selected entropy value is 0.5 and the threshold of the weighted length is 30, the log with ID 1 is determined as the user behavior.
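The decision in this example can be replayed with a few lines of code. The weighted lengths, entropy values and thresholds below are the ones given in this embodiment; the function name `user_behavior_ids` is hypothetical, and the strict "greater than" comparison is an assumption consistent with the wording of step 403.

```python
# (weighted length, entropy value) for each log ID, as given in this embodiment.
FEATURES = {
    1: (43, 1.09),
    2: (8, 0.6), 3: (8, 0.6), 4: (8, 0.6), 5: (8, 0.6), 6: (8, 0.6),
    7: (0, 0.0), 8: (0, 0.0),
}
LENGTH_THRESHOLD = 30    # threshold of the weighted length
ENTROPY_THRESHOLD = 0.5  # threshold of the entropy value


def user_behavior_ids(features, length_threshold, entropy_threshold):
    """IDs of logs whose weighted length AND entropy both exceed their thresholds."""
    return sorted(log_id for log_id, (length, entropy) in features.items()
                  if length > length_threshold and entropy > entropy_threshold)


print(user_behavior_ids(FEATURES, LENGTH_THRESHOLD, ENTROPY_THRESHOLD))  # [1]
```

The logs with IDs 2 to 6 show why both conditions are required: their entropy value of 0.6 exceeds 0.5, but their weighted length of 8 falls short of 30, so they are not marked as user behavior.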
Step 703: Analyze the time sequence.
(1) Classification
According to the website names, the logs with IDs 1 to 6 and 9 are placed in one class (sina), and the logs with IDs 7 and 8 are placed in another class (allyes).
(2) Sorting and dividing into sets
In one implementation, logs within 5 seconds are assumed to be related.
In the sina class, after sorting, the logs with IDs 1 to 6 and ID 9 are placed in set A, and the log with ID 10 is placed in another set B. The initial log of set A is the log with ID 1. Since set B contains no log that accesses a content class domain name, set B is treated as an invalid set.
In the allyes class, the logs with IDs 7 and 8 are placed in the same set C.
(3) Deduplication
Set A and set C are deduplicated by URL; no log is removed.
(4) Determination
In set A, the log with ID 1 was previously marked as user behavior, so the other logs are filtered out as noise.
In set C, the number of logs after deduplication is 2. In one implementation, the log number threshold is 5, so all logs in set C are determined to be noise.
Step 704: Perform periodic analysis.
In one implementation, periodic analysis is performed over a range of 8 hours. Within those 8 hours, the user accesses http://news.sina.com.cn/w/2015-06-29/062832030997.shtml only once, so there is no periodicity.
Step 705: Perform frequency analysis.
In one implementation, frequency analysis is performed over a range of 5 minutes, and the frequency threshold for accessing the same website is 50. Within 5 minutes, only 1 log in which the user accesses http://news.sina.com.cn is determined to be user behavior, so the frequency threshold is not exceeded.
Step 706: Make the determination.
In summary, the log with ID 1 is determined to be the real access behavior of the user.
The information processing method and device described above overcome the shortcomings of existing noise removal methods. By analyzing the resource types under each domain name and using the time sequence, periodicity and frequency characteristics of web page access, noise can be effectively cleaned from the web page access logs, so that the real access behavior of the user is accurately identified.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. An information processing method, characterized in that the method comprises:
collecting a webpage access log from the Internet access behavior auditing equipment of the N sampling points; wherein N is a positive integer;
classifying and analyzing the domain name of the webpage access log according to a first preset period to generate domain name classification information;
acquiring a webpage access log from the internet behavior auditing equipment of a first user;
analyzing the webpage access log from the internet behavior auditing equipment of the first user based on domain name classification information according to a second preset period to identify the webpage access log for representing the real access behavior of the first user;
and classifying and analyzing the domain name of the web page access log according to a first predetermined period to generate domain name classification information, including:
for all logs that access the same domain name,
checking whether the number of the logs exceeds a first threshold, and if not, exiting the analysis;
if the number of the logs exceeds the first threshold, checking whether the number of the users who initiate access in all the logs exceeds a second threshold, and if the number of the users does not exceed the second threshold, exiting the analysis;
if the number exceeds the second threshold, checking whether the subject field of each log contains an abnormal field, and excluding the logs containing the abnormal field in the subject field;
calculating the proportion of the logs containing the effective topics, if the proportion of the logs containing the effective topics exceeds a third threshold, calculating the distribution of topic length weights in all the logs containing the effective topics, if the weighted average of the topic length weights exceeds a fourth threshold, calculating the information quantity of the topics in all the logs containing the effective topics, and if the information quantity exceeds a fifth threshold, judging that the domain name is a content domain name;
otherwise, if the proportion of the logs containing the effective topics does not exceed a third threshold, or if the weighted average of the length weights of the topics does not exceed a fourth threshold, or if the information quantity does not exceed a fifth threshold, determining that the domain name is the resource domain name;
analyzing the webpage access log from the internet behavior auditing equipment of the first user based on the domain name classification information according to a second preset period, wherein the analyzing comprises the following steps:
analyzing the domain name of a webpage access log from the online behavior auditing equipment of the first user based on the domain name classification information, and dividing the webpage access log into access to a content domain name and access to a resource domain name;
analyzing the subject information of the log of the access content domain name, and finding out the log of the access behavior belonging to the first user;
performing time sequence analysis on the logs of the access content domain names, and finding out the logs of the access behaviors belonging to the first user;
performing periodic analysis on the logs belonging to the access behaviors of the first user based on a Uniform Resource Locator (URL), judging whether periodic characteristics exist or not, and if so, cleaning as noise;
for the log which is still determined as the access behavior of the first user after periodic analysis, performing access frequency analysis based on the domain name, judging whether the access frequency exceeds a sixth threshold, and if so, cleaning as noise;
and determining the log which is still determined as the access behavior of the first user after the frequency analysis as the real access behavior of the first user.
2. The method of claim 1, wherein the analyzing the subject information of the log of the access content class domain name to find out the log of the access behavior belonging to the first user comprises:
for a log of access to a domain name of a content class,
calculating the weighted length of the theme;
calculating the information amount of the subject;
and judging the log with the weighted length and the information quantity larger than the corresponding threshold value as the log belonging to the access behavior of the first user.
3. The method of claim 1, wherein the performing a time-series analysis on the log of access content class domain names comprises:
classifying the logs of the access content domain names according to website names;
sequencing all logs in each class according to time, and dividing the sequenced logs into sets according to a preset rule;
and selecting the log in the set according with the time sequence model according to the domain name type, the URL information, the subject information and the log quantity in the set as the webpage access log of the real access behavior of the first user.
4. The method of claim 3, wherein selecting the log in the set according to the domain name type, the URL information, the subject information and the log number in the set, and according to the time sequence model, as the webpage access log of the real access behavior of the first user comprises:
performing duplicate removal processing on the logs in the set according to the URL;
judging whether a log which is judged as the access behavior of the first user exists or not;
if the log exists, cleaning other logs in the set as noise;
if the log does not exist, acquiring the number of the logs in the set; if the number of the logs does not exceed the first threshold, cleaning the logs in the set as noise; and if the first threshold value is exceeded, determining the log at the beginning in the set as a webpage access log representing the real access behavior of the first user.
5. An information processing device is characterized by comprising a collecting module, a domain name classifying module, an acquiring module and a log analyzing module; wherein,
the collection module is used for collecting webpage access logs from the Internet surfing behavior auditing equipment of the N sampling points; wherein N is a positive integer;
the domain name classification module is used for classifying and analyzing the domain names of the webpage access logs according to a first preset period to generate domain name classification information;
the acquisition module is used for acquiring a webpage access log from the internet behavior auditing equipment of the first user;
the log analysis module is used for analyzing the webpage access log from the internet behavior auditing equipment of the first user according to a second preset period and based on domain name classification information so as to identify the webpage access log used for representing the real access behavior of the first user;
and the domain name classification module is further configured to:
for all logs that access the same domain name,
checking whether the number of the logs exceeds a first threshold, and if not, exiting the analysis;
if the number of the logs exceeds the first threshold, checking whether the number of the users who initiate access in all the logs exceeds a second threshold, and if the number of the users does not exceed the second threshold, exiting the analysis;
if the number exceeds the second threshold, checking whether the subject field of each log contains an abnormal field, and excluding the logs containing the abnormal field in the subject field;
calculating the proportion of the logs containing the effective topics, if the proportion of the logs containing the effective topics exceeds a third threshold, calculating the distribution of topic length weights in all the logs containing the effective topics, if the weighted average of the topic length weights exceeds a fourth threshold, calculating the information quantity of the topics in all the logs containing the effective topics, and if the information quantity exceeds a fifth threshold, judging that the domain name is a content domain name;
otherwise, if the proportion of the logs containing the effective topics does not exceed a third threshold, or if the weighted average of the length weights of the topics does not exceed a fourth threshold, or if the information quantity does not exceed a fifth threshold, determining that the domain name is the resource domain name;
the log analysis module comprises:
the domain name analysis sub-module is used for analyzing the domain name of a webpage access log from the internet behavior auditing equipment of the first user based on the domain name classification information, and dividing the webpage access log into access to a content domain name and access to a resource domain name;
the topic analysis submodule is used for analyzing the topic information of the logs of the access content domain names and finding out the logs of the access behaviors of the first user;
the time sequence analysis submodule is used for carrying out time sequence analysis on the logs of the access content domain names and finding out the logs of the access behaviors of the first user;
the periodic analysis submodule is used for carrying out periodic analysis on the logs belonging to the access behaviors of the first user based on the URL, judging whether periodic characteristics exist or not, and if so, cleaning the logs as noise;
the frequency analysis submodule is used for carrying out access frequency analysis on the log which is still judged as the access behavior of the first user after periodic analysis based on the domain name, judging whether the access frequency exceeds a sixth threshold, and if so, cleaning the log as noise;
and the determining submodule is used for determining the log which is still determined as the access behavior of the first user after the frequency analysis as the real access behavior of the first user.
6. The apparatus of claim 5, wherein the topic analysis sub-module is further configured to:
for a log of access to a domain name of a content class,
calculating the weighted length of the theme;
calculating the information amount of the subject;
and judging the log with the weighted length and the information quantity larger than the corresponding threshold value as the log belonging to the access behavior of the first user.
7. The apparatus of claim 5, wherein the timing analysis sub-module is further configured to:
classifying the logs of the access content domain names according to website names;
sequencing all logs in each class according to time, and dividing the sequenced logs into sets according to a preset rule;
and selecting the log in the set according with the time sequence model according to the domain name type, the URL information, the subject information and the log quantity in the set as the webpage access log of the real access behavior of the first user.
8. The apparatus of claim 7, wherein the timing analysis sub-module is further configured to:
performing duplicate removal processing on the logs in the set according to the URL;
judging whether a log which is judged as the access behavior of the first user exists or not;
if the log exists, cleaning other logs in the set as noise;
if the log does not exist, acquiring the number of the logs in the set; if the number of the logs does not exceed the first threshold, cleaning the logs in the set as noise; and if the first threshold value is exceeded, determining the log at the beginning in the set as a webpage access log representing the real access behavior of the first user.
CN201510729292.7A 2015-10-30 2015-10-30 A kind of information processing method and device Active CN105224691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510729292.7A CN105224691B (en) 2015-10-30 2015-10-30 A kind of information processing method and device

Publications (2)

Publication Number Publication Date
CN105224691A CN105224691A (en) 2016-01-06
CN105224691B true CN105224691B (en) 2019-03-26

Family

ID=54993659

Country Status (1)

Country Link
CN (1) CN105224691B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105931073A (en) * 2016-04-08 2016-09-07 久远谦长(北京)技术服务有限公司 Mobile Internet advertising platform analysis method and system
CN106682096A (en) * 2016-12-01 2017-05-17 北京奇虎科技有限公司 Method and device for log data management
CN107704478B (en) * 2017-01-16 2019-03-15 贵州白山云科技股份有限公司 A kind of method and system that log is written
CN108897804A (en) * 2018-06-15 2018-11-27 东北大学秦皇岛分校 A kind of search system and method for the Internet space data
CN109688094B (en) * 2018-09-07 2022-05-17 平安科技(深圳)有限公司 Suspicious IP configuration method, device, equipment and storage medium based on network security
CN110912860B (en) * 2018-09-18 2022-02-18 北京数安鑫云信息技术有限公司 Method and device for detecting pseudo periodic access behavior
CN109347688B (en) * 2018-11-26 2022-04-26 锐捷网络股份有限公司 Method and device for positioning fault in wireless local area network
CN110825873B (en) * 2019-10-11 2022-04-12 支付宝(杭州)信息技术有限公司 Method and device for expanding log exception classification rules
CN114915434B (en) * 2021-02-08 2025-04-04 腾讯科技(深圳)有限公司 A network proxy detection method, device, storage medium and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102855248A (en) * 2011-06-29 2013-01-02 中国移动通信集团广西有限公司 Determination method, apparatus and system for user characteristic information
CN103605738A (en) * 2013-11-19 2014-02-26 北京国双科技有限公司 Webpage access data statistical method and webpage access data statistical device
CN104298780A (en) * 2014-11-05 2015-01-21 百纳(武汉)信息技术有限公司 Method and system for pre-obtaining browser webpage information

Also Published As

Publication number Publication date
CN105224691A (en) 2016-01-06

Similar Documents

Publication Publication Date Title
CN105224691B (en) A kind of information processing method and device
US20210360022A1 (en) Clustering-based security monitoring of accessed domain names
CN103218431B (en) A system capable of automatically identifying and collecting web page information
US10404731B2 (en) Method and device for detecting website attack
CN107885777A (en) A control method and system for crawling web page data based on collaborative crawler
Bomhardt et al. Web robot detection-preprocessing web logfiles for robot detection
CN103455600B (en) A kind of video URL grasping means, device and server apparatus
CN111368227B (en) URL processing method and device
CN109583211B (en) Website clustering and vulnerability scanning method and device, electronic equipment and storage medium
CN113821754A (en) Sensitive data interface crawler identification method and device
CN106790025B (en) Method and device for detecting link maliciousness
US9756064B2 (en) Apparatus and method for collecting harmful website information
CN109064067B (en) Financial risk operation subject determination method and device based on Internet
CN107526748B (en) A method and device for identifying user click behavior
JP6823205B2 (en) Collection device, collection method and collection program
CN105989019B (en) A method and device for cleaning data
CN111611508B (en) Identification method and device for actual website access of user
CN112256889B (en) A method, device, equipment and medium for constructing a knowledge graph of a security entity
CN106528569A (en) Method and device for calculating validity of site search
CN107944001A (en) Hot news detection method and device and electronic equipment
CN111625700A (en) Anti-grabbing method, device, equipment and computer storage medium
CN111756679A (en) Log analysis method and device, storage medium and computer equipment
CN110825976B (en) Website page detection method and device, electronic equipment and medium
CN103258019B (en) Method and device for providing query result
EP3361405A1 (en) Enhancement of intrusion detection systems

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant