CN112511525B - Website malicious third-party content detection method and system

Info

Publication number: CN112511525B
Application number: CN202011332352.9A
Authority: CN
Inventors: 潘晓光; 马泽宇; 焦璐璐; 韩锋; 李娟�
Original assignee: Shanxi Sanyouhe Smart Information Technology Co Ltd
Current assignee: Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority date: 2020-11-24
Filing date: 2020-11-24
Publication date: 2022-07-22
Anticipated expiration: 2040-11-24
Also published as: CN112511525A

Abstract

The invention belongs to the technical field of website content detection, and particularly relates to a method and a system for detecting malicious third-party content of a website. The method comprises the following steps: first, webpage resources are checked against a Content Security Policy (CSP), and any resource that fails the CSP check is directly judged to be malicious content; content that the CSP does not flag enters an inclusion sequence construction module, which constructs an inclusion sequence; features are then extracted from the constructed sequence and used for training and classification; finally, the inclusion sequences are classified by an inclusion sequence classifier. The method detects malicious third-party content on page resource inclusion sequences obtained from fine-grained DOM tree analysis. Compared with methods based on traditional security policies, it is easier to deploy, the third party cannot find security holes with which to bypass it, and the security of the webpage is further improved. The method and system are used for detecting malicious third-party content of a website.

Description

Website malicious third-party content detection method and system

Technical Field

The invention belongs to the technical field of website content detection, and particularly relates to a method and a system for detecting malicious third-party content of a website.

Background

Under the same-origin policy, isolation is enforced between code and data from different origins. Current security mechanisms for protecting websites from malicious third parties include Content Security Policy (CSP), Cross-Origin Resource Sharing (CORS) and cross-domain communication based on postMessage. However, these policies are difficult to apply safely in practice and cannot solve the trust problem on dynamic networks, and third parties can use their capabilities to bypass them.

Problems or disadvantages of the prior art: existing security policies are difficult to deploy in practice, fail to address trust issues on dynamic networks, and third parties can also use their capabilities to bypass these security mechanisms.

Disclosure of Invention

Aiming at the technical problem that existing security policies cannot solve the trust problem on a dynamic network, the invention provides a method and a system for detecting malicious third-party content of a website that are easy to deploy, secure and efficient.

In order to solve the technical problems, the invention adopts the technical scheme that:

A method for detecting malicious third-party content of a website comprises the following steps:

S1, webpage resources are checked against the Content Security Policy (CSP), and any resource that fails the CSP check is directly judged to be malicious content;

S2, content that the CSP does not flag enters a sequence construction module, which constructs a sequence;

S3, feature extraction: features are extracted from the sequence constructed in S2 and used for training and classification;

S4, the sequence is classified by a sequence classifier.

In S1, the Content Security Policy (CSP) is set by using a META tag whose http-equiv attribute is Content-Security-Policy; if a resource fails the CSP check, it is directly judged to be malicious content.

The method for constructing the sequence in S2 is as follows: DOM tree construction and page rendering are performed by an HTML interpreter and a JavaScript engine, and a browser extension engine is added to record the inclusion relations among page resources, which form the sequence.

The sequence classifier in S4 comprises a malicious model and a legal model. The malicious model is trained on a malicious sample list obtained from currently available public blacklists and detection tools; the legal model is trained on a legal sample list obtained by collecting a large amount of legal (legitimate) data offline. The sequence is then classified by using a machine learning algorithm.

A website malicious third-party content detection system comprises a Content Security Policy (CSP) module, a sequence construction module, a feature extraction module and a sequence classifier module. The CSP module is used for judging webpage resources and is connected in sequence with the sequence construction module, the feature extraction module and the sequence classifier module; the sequence construction module builds a page resource sequence through an HTML interpreter, a JavaScript engine and a browser extension engine; the feature extraction module extracts features from the sequence built by the sequence construction module for training and classification; the sequence classifier module classifies the sequence by using machine learning.

The sequence classifier module comprises a malicious model and a legal model, and the feature extraction module is connected to both models in parallel. The malicious model is trained on a malicious sample list obtained from currently available public blacklists and detection tools, and the legal model is trained on a legal sample list obtained by collecting a large amount of legal data offline.

Compared with the prior art, the invention has the following beneficial effects:

According to the method, malicious third-party content is detected on page resource sequences obtained from fine-grained DOM tree analysis. Compared with methods based on traditional security policies, the method is easier to deploy, the third party cannot find security holes with which to bypass it, and the security of the webpage is further improved.

Drawings

FIG. 1 is a block diagram of the main steps of the present invention;

FIG. 2 is a sequence construction diagram of the present invention;

FIG. 3 is a diagram of a sequence classifier model according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

A method for detecting malicious third-party content of a website, as shown in FIG. 1, includes the following steps:

Step 1, webpage resources are checked against the Content Security Policy (CSP), and any resource that fails the CSP check is directly judged to be malicious content;

Step 2, content that the CSP does not flag enters a sequence construction module for sequence construction;

Step 3, feature extraction: features are extracted from the sequence constructed in Step 2 and used for training and classification;

Step 4, the sequences are classified by a sequence classifier.
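The four steps can be read as a simple pipeline. The following Python sketch is only an illustration of that control flow under stated assumptions: the function name `run_pipeline` and the four callables passed to it are hypothetical stand-ins and do not come from the patent, which does not prescribe an implementation.

```python
from typing import Callable, Dict, List, Sequence

Features = Dict[str, float]


def run_pipeline(
    resource_urls: Sequence[str],
    passes_csp: Callable[[str], bool],
    build_sequences: Callable[[List[str]], List[List[str]]],
    extract_features: Callable[[str], Features],
    classify: Callable[[List[Features]], str],
) -> Dict[str, object]:
    """Chain the four steps; every callable is a hypothetical stand-in."""
    # Step 1: resources that fail the CSP check are judged malicious at once.
    blocked = [u for u in resource_urls if not passes_csp(u)]
    remaining = [u for u in resource_urls if passes_csp(u)]
    # Step 2: build sequences from the resources the CSP did not flag.
    sequences = build_sequences(remaining)
    # Steps 3 and 4: per-resource features, then one label per sequence.
    labels = [classify([extract_features(u) for u in seq]) for seq in sequences]
    return {"blocked_by_csp": blocked, "sequences": sequences, "labels": labels}


if __name__ == "__main__":
    # Toy stand-ins, chosen only to make the sketch runnable.
    print(run_pipeline(
        ["https://cdn.example/a.js", "http://evil.example/x.js"],
        passes_csp=lambda u: u.startswith("https://"),
        build_sequences=lambda urls: [urls],
        extract_features=lambda u: {"length": float(len(u))},
        classify=lambda fs: "legal" if all(f["length"] < 40 for f in fs) else "malicious",
    ))
```

Passing the stages in as callables keeps the sketch neutral about how each module is realized, which matches the statement that the module division is only logical.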

Further, in Step 1 the Content Security Policy (CSP) is set by using a META tag whose http-equiv attribute is Content-Security-Policy; if a resource fails the CSP check, it is directly judged to be malicious content.
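As an illustration of the META-tag form of the policy, the Python sketch below builds such a tag from a directive table and applies a deliberately simplified source-list check. The helper names `csp_meta_tag` and `passes_script_src`, the example hosts, and the matching rule (only `'self'` and exact `scheme://host` entries) are assumptions made for this sketch; a real browser enforces CSP itself and supports a much richer source syntax.

```python
from typing import Dict, List
from urllib.parse import urlsplit


def csp_meta_tag(policy: Dict[str, List[str]]) -> str:
    """Render a directive table as a CSP <meta> tag."""
    content = "; ".join(f"{name} {' '.join(srcs)}" for name, srcs in policy.items())
    # Double quotes would end the attribute value, so encode them.
    content = content.replace('"', "&quot;")
    return f'<meta http-equiv="Content-Security-Policy" content="{content}">'


def passes_script_src(url: str, page_origin: str, sources: List[str]) -> bool:
    """Simplified check: only 'self' and exact scheme://host entries match."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    for src in sources:
        if src == "'self'" and origin == page_origin:
            return True
        if src == origin:
            return True
    return False


if __name__ == "__main__":
    policy = {
        "default-src": ["'self'"],
        "script-src": ["'self'", "https://cdn.example.com"],
    }
    print(csp_meta_tag(policy))
    for url in ("https://cdn.example.com/lib.js", "https://evil.example/x.js"):
        ok = passes_script_src(url, "https://site.example", policy["script-src"])
        print(url, "passes" if ok else "fails (judged malicious in Step 1)")
```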

Further, the method for constructing the sequence in Step 2 is as follows: DOM tree construction and page rendering are performed by an HTML interpreter and a JavaScript engine, and a browser extension engine is added to record the inclusion relations among page resources, which form the sequence.

Further, the sequence classifier in Step 4 comprises a malicious model and a legal model. The malicious model is trained on a malicious sample list obtained from currently available public blacklists and detection tools; the legal model is trained on a legal sample list obtained by collecting a large amount of legal data offline. The sequence is then classified by using a machine learning algorithm.

A website malicious third-party content detection system comprises a Content Security Policy (CSP) module, a sequence construction module, a feature extraction module and a sequence classifier module. The CSP module is used for judging webpage resources and is connected in sequence with the sequence construction module, the feature extraction module and the sequence classifier module; the sequence construction module builds a page resource sequence through an HTML interpreter, a JavaScript engine and a browser extension engine; the feature extraction module extracts features from the sequence built by the sequence construction module for training and classification; the sequence classifier module classifies the sequence by using machine learning.

Further, the sequence classifier module comprises a malicious model and a legal model, and the feature extraction module is connected to both models in parallel. The malicious model is trained on a malicious sample list obtained from currently available public blacklists and detection tools, and the legal model is trained on a legal sample list obtained by collecting a large amount of legal data offline.

As shown in FIG. 1, before rendering a page, the browser requests an HTML document from a remote server. After receiving the document, the browser parses it into a DOM tree through an HTML interpreter and uses a CSS interpreter to calculate the corresponding style information and page layout for the DOM tree. If a JS script is encountered in the process, the JavaScript engine is called to execute it. Finally, the entire page is drawn in the browser.

In the method, after the browser obtains the page resources, the Content Security Policy (CSP) is used to check them. The CSP directive rules are set by calling the contentSecurityPolicy(options) method of the Helmet module. If the CSP considers that a resource comes from a malicious third party, the resource is directly flagged as malicious; the remaining resources then enter the sequence classification stage. While the DOM tree is being constructed, the injection and execution of content scripts are tracked by enhancing Blink, the Chromium kernel, so that page inclusion relations that the DOM tree cannot record are captured. The constructed sequence is shown in FIG. 2.
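The patent does not specify the data format of the recorded inclusion relations. As a minimal sketch, assume they are available as (parent, child) pairs, where the parent resource caused the child to be loaded; each root-to-leaf path through those pairs can then be taken as one sequence. The function name `inclusion_sequences` and the example hosts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def inclusion_sequences(root: str, inclusions: List[Tuple[str, str]]) -> List[List[str]]:
    """Return every root-to-leaf path of the inclusion relation as a sequence."""
    children: Dict[str, List[str]] = defaultdict(list)
    for parent, child in inclusions:
        children[parent].append(child)

    sequences: List[List[str]] = []
    stack: List[List[str]] = [[root]]
    while stack:
        path = stack.pop()
        # Ignore children already on the path so a cyclic record cannot loop.
        nxt = [c for c in children.get(path[-1], []) if c not in path]
        if not nxt:
            sequences.append(path)
        # Push in reverse so the first recorded child is expanded first.
        for child in reversed(nxt):
            stack.append(path + [child])
    return sequences


if __name__ == "__main__":
    edges = [
        ("site.example/index.html", "cdn.example/lib.js"),
        ("site.example/index.html", "ads.example/tag.js"),
        ("ads.example/tag.js", "tracker.example/pixel.js"),
    ]
    for seq in inclusion_sequences("site.example/index.html", edges):
        print(" -> ".join(seq))
```

A script injected by another script appears here as a deeper element of the path, which is the kind of relation a plain DOM tree does not record.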

The sequence then enters the feature extraction module, which extracts features. For example, the DNS features include the top-level domain, host type, level, Alexa rank, etc.; the string features include the proportion of non-alphabetic characters, the proportion of unique characters, the frequency of each character in the domain name, the length of the domain name, the entropy of the domain name, etc.; and the role features describe the role of the resource in the sequence, such as an advertisement network, a CDN, a URL shortening service, etc.
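The string features listed above are straightforward to compute. The sketch below is one possible reading, not the patent's implementation: the function names are hypothetical, the character-proportion feature is interpreted as the share of non-letter characters, the entropy is Shannon entropy in bits over the characters of the domain name, and the top-level domain is taken naively as the last dot-separated label.

```python
import math
from collections import Counter
from typing import Dict


def domain_string_features(domain: str) -> Dict[str, float]:
    """Compute simple string features of a domain name (dots included)."""
    name = domain.lower().rstrip(".")
    if not name:
        raise ValueError("empty domain name")
    counts = Counter(name)
    n = len(name)
    # Shannon entropy of the character distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "length": float(n),
        "non_letter_ratio": sum(1 for ch in name if not ch.isalpha()) / n,
        "unique_char_ratio": len(counts) / n,
        "entropy": entropy,
    }


def top_level_domain(domain: str) -> str:
    """Naive TLD: the last dot-separated label (no public-suffix handling)."""
    return domain.lower().rstrip(".").rsplit(".", 1)[-1]


if __name__ == "__main__":
    for d in ("cdn.example.com", "x7f3k9q2z.example.net"):
        print(d, top_level_domain(d), domain_string_features(d))
```

Features such as Alexa rank or the role of a resource need external data and are therefore left out of this sketch.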

The sequence classifier uses a hidden Markov model, as shown in FIG. 3: the model parameters are estimated with the Baum-Welch algorithm, and the forward-backward algorithm is used to evaluate a given sequence against the model.
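To make the two-model classification concrete, the sketch below scores an observation sequence under a malicious and a legal hidden Markov model with the scaled forward pass (the forward half of the forward-backward algorithm) and picks the model with the higher log-likelihood. Everything specific here is an assumption for illustration: the observation symbols (0 = first party or CDN, 1 = advertisement network, 2 = URL shortening service), the two-state models, and the hand-set parameters, which in the described method would instead be estimated with Baum-Welch from the sample lists.

```python
import math
from typing import List, Sequence, Tuple

# (start probabilities, transition matrix, emission matrix)
Model = Tuple[List[float], List[List[float]], List[List[float]]]


def forward_log_likelihood(obs: Sequence[int], model: Model) -> float:
    """Log P(obs | model) via the forward pass with per-step scaling."""
    if not obs:
        raise ValueError("empty observation sequence")
    start, trans, emit = model
    states = range(len(start))
    alpha = [start[s] * emit[s][obs[0]] for s in states]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs[t]]
                for s in states
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        alpha = [a / scale for a in alpha]
        log_like += math.log(scale)
    return log_like


def classify(obs: Sequence[int], malicious: Model, legal: Model) -> str:
    """Label the sequence by the model that explains it better."""
    if forward_log_likelihood(obs, malicious) > forward_log_likelihood(obs, legal):
        return "malicious"
    return "legal"


# Hand-set toy parameters (assumed, not trained).
LEGAL: Model = ([0.8, 0.2], [[0.7, 0.3], [0.4, 0.6]],
                [[0.8, 0.15, 0.05], [0.5, 0.45, 0.05]])
MALICIOUS: Model = ([0.5, 0.5], [[0.5, 0.5], [0.2, 0.8]],
                    [[0.3, 0.5, 0.2], [0.1, 0.3, 0.6]])

if __name__ == "__main__":
    for seq in ([0, 0, 1], [1, 2, 2]):
        print(seq, classify(seq, MALICIOUS, LEGAL))
```

Scaling at each step keeps the forward variables from underflowing on long sequences while leaving the resulting log-likelihood unchanged.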

The division of the modules, units or flows in the present invention is only a division of logic functions, and other division manners may be available in actual implementation, for example, a plurality of modules and/or units may be combined or integrated in another system, and the modules and units described as separate components may be separated or not separated in form, so that part or all of the units may be selected according to actual needs to implement the scheme of the embodiment.

Although only the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art, and all changes are encompassed in the scope of the present invention.

Claims

1. A method for detecting malicious third-party content of a website, characterized by comprising the following steps:

S4, classifying the sequences through a sequence classifier, wherein the sequence classifier comprises a malicious model and a legal model, the malicious model is trained on a malicious sample list obtained from currently available public blacklists and detection tools, the legal model is trained on a legal sample list obtained by collecting a large amount of legal data offline, and the sequences are classified by using a machine learning algorithm;

the website malicious third-party content detection system for executing the method comprises a Content Security Policy (CSP) module, a sequence construction module, a feature extraction module and a sequence classifier module, wherein the CSP module is used for judging webpage resources and is connected in sequence with the sequence construction module, the feature extraction module and the sequence classifier module; the sequence construction module builds a page resource sequence through an HTML interpreter, a JavaScript engine and a browser extension engine; the feature extraction module extracts features from the sequence built by the sequence construction module for training and classification; the sequence classifier module classifies the sequence by using machine learning; the sequence classifier module comprises a malicious model and a legal model, the feature extraction module is connected to the malicious model and the legal model in parallel, the malicious model is trained on a malicious sample list obtained from currently available public blacklists and detection tools, and the legal model is trained on a legal sample list obtained by collecting a large amount of legal data offline.

2. The method for detecting malicious third-party content of a website as claimed in claim 1, wherein: in S1, the Content Security Policy (CSP) is set by using a META tag whose http-equiv attribute is Content-Security-Policy, and if a resource fails the CSP check, it is directly judged to be malicious content.

3. The method for detecting malicious third-party content of a website as claimed in claim 1, wherein: the method for constructing the sequence in S2 is as follows: DOM tree construction and page rendering are performed by an HTML interpreter and a JavaScript engine, and a browser extension engine is added to record the inclusion relations among page resources, which form the sequence.