CN112511525B - Website malicious third-party content detection method and system - Google Patents
Website malicious third-party content detection method and system
- Publication number
- CN112511525B
- Authority
- CN
- China
- Prior art keywords
- sequence
- malicious
- content
- module
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1433—Vulnerability analysis
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Signal Processing (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Computer Networks & Wireless Communication (AREA)
- Probability & Statistics with Applications (AREA)
- Computer And Data Communications (AREA)
Abstract
The invention belongs to the technical field of website content detection, and particularly relates to a method and a system for detecting malicious third-party content of a website. The method comprises the following steps: first, webpage resources are checked against a Content Security Policy (CSP), and any resource that fails the CSP check is directly judged to be malicious content; content that the CSP does not flag enters an inclusion sequence construction module, which constructs an inclusion sequence; features are extracted from the constructed sequence and used for training and classification; and the inclusion sequence is classified by an inclusion sequence classifier. The method performs third-party malicious content detection on page resource inclusion sequences obtained by fine-grained analysis of the DOM tree. Compared with methods based on traditional security policies, it is easier to deploy, and third parties cannot exploit security holes to bypass it, which further improves the security of the webpage. The method and system are used for detecting malicious third-party content of a website.
Description
Technical Field
The invention belongs to the technical field of website content detection, and particularly relates to a method and a system for detecting malicious third-party content of a website.
Background
Under the same-origin policy, code and data from different origins are isolated from one another. Current security mechanisms for protecting websites from malicious third parties include Content Security Policy (CSP), cross-origin resource sharing (CORS), and cross-domain communication based on postMessage. However, these policies are difficult to apply safely in practice and cannot solve the trust problem on dynamic networks, and third parties can use their capabilities to bypass them.
Problems or disadvantages of the prior art: existing security policies are difficult to deploy in practice, fail to address trust issues on dynamic networks, and third parties can also use their capabilities to bypass these security mechanisms.
Disclosure of Invention
To address the technical problem that existing security policies cannot solve the trust problem on dynamic networks, the invention provides a method and a system for detecting malicious third-party content of a website that are easy to deploy, secure, and efficient.
To solve the above technical problem, the invention adopts the following technical solution:
A method for detecting malicious third-party content of a website comprises the following steps:
S1, webpage resources are checked against the Content Security Policy (CSP), and any resource that fails the CSP check is directly judged to be malicious content;
S2, content that the CSP does not flag enters a sequence construction module, which constructs a sequence;
S3, feature extraction: features are extracted from the sequence constructed in S2 and used for training and classification;
S4, the sequence is classified by a sequence classifier.
In S1, the Content Security Policy (CSP) is delivered through a META tag whose http-equiv attribute is set to Content-Security-Policy; any resource that fails the CSP check is directly judged to be malicious content.
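The CSP gate in S1 can be illustrated with a minimal Python sketch. It is not the patented implementation and not a full CSP matcher: the function names (`meta_csp_tag`, `parse_policy`, `passes_csp`), the example origins, and the restriction to `'self'` and exact `scheme://host` sources are all assumptions made for illustration. Real browsers apply considerably richer matching rules.

```python
from urllib.parse import urlparse


def meta_csp_tag(policy: str) -> str:
    """Return a META tag that delivers the policy via the http-equiv attribute."""
    return f'<meta http-equiv="Content-Security-Policy" content="{policy}">'


def parse_policy(policy: str) -> dict[str, list[str]]:
    """Split a policy string into {directive name: [source expressions]}."""
    directives: dict[str, list[str]] = {}
    for part in policy.split(";"):
        tokens = part.split()
        if tokens:
            directives[tokens[0].lower()] = tokens[1:]
    return directives


def passes_csp(resource_url: str, page_origin: str, policy: str,
               directive: str = "script-src") -> bool:
    """Simplified check (assumption): only 'self' and exact origins are handled."""
    directives = parse_policy(policy)
    # Fall back to default-src when the specific directive is absent.
    sources = directives.get(directive, directives.get("default-src"))
    if sources is None:
        return True  # no applicable directive, so nothing restricts the resource
    parsed = urlparse(resource_url)
    origin = f"{parsed.scheme}://{parsed.netloc}"
    for source in sources:
        if source == "'self'" and origin == page_origin:
            return True
        if source == origin:
            return True
    return False


# Hypothetical policy and origins, used only to exercise the sketch.
POLICY = "default-src 'self'; script-src 'self' https://cdn.example.com"
PAGE = "https://site.example"
```

Under this sketch, a resource for which `passes_csp` returns `False` would be judged malicious immediately, and everything else would move on to sequence construction.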
In S2, the sequence is constructed as follows: an HTML interpreter and a JavaScript engine build the DOM tree and render the page, and an added browser extension engine records the inclusion relations among page resources to form the sequence.
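As a rough illustration of what an inclusion sequence is, the sketch below assumes the instrumented browser has already recorded, for each resource, which resource caused it to be loaded. The `included_by` mapping, the function name, and the example URLs are hypothetical; the source does not specify the recording format.

```python
def inclusion_sequence(resource: str, included_by: dict[str, str]) -> list[str]:
    """Walk the recorded parent links from a resource back to the page root.

    `included_by` maps each resource URL to the URL of the resource that
    included it (an assumed representation of the recorded relations).
    """
    chain = [resource]
    seen = {resource}
    while chain[-1] in included_by:
        parent = included_by[chain[-1]]
        if parent in seen:  # guard against cyclic records
            break
        chain.append(parent)
        seen.add(parent)
    chain.reverse()  # root page first, requested resource last
    return chain


# Hypothetical record: the page loads an ad tag, which loads a tracker script.
RECORDED = {
    "https://ads.example/tag.js": "https://site.example/",
    "https://tracker.example/p.js": "https://ads.example/tag.js",
}
```

Calling `inclusion_sequence("https://tracker.example/p.js", RECORDED)` yields the chain from the page, through the ad tag, to the tracker script, which is the kind of sequence that is later classified.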
The sequence classifier in S4 comprises a malicious model and a legitimate model. The malicious model is trained on a list of malicious samples obtained from currently available public blacklists and detection tools; the legitimate model is trained on a list of legitimate samples obtained by collecting a large amount of legitimate data offline. The sequence is then classified using a machine learning algorithm.
A system for detecting malicious third-party content of a website comprises a Content Security Policy (CSP) module, a sequence construction module, a feature extraction module, and a sequence classifier module, connected in that order. The CSP module judges webpage resources; the sequence construction module builds the page resource sequence through an HTML interpreter, a JavaScript engine, and a browser extension engine; the feature extraction module extracts features from the sequence built by the sequence construction module, for use in training and classification; and the sequence classifier module classifies the sequence using machine learning.
The sequence classifier module comprises a malicious model and a legitimate model, and the feature extraction module is connected to both models in parallel. The malicious model is trained on a list of malicious samples obtained from currently available public blacklists and detection tools, and the legitimate model is trained on a list of legitimate samples obtained by collecting a large amount of legitimate data offline.
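The way the four modules are chained can be sketched in a few lines of Python. The `detect` function and its callable parameters are hypothetical stand-ins for the modules described above, not names taken from the source; each stage is passed in so that the sketch stays independent of any particular implementation.

```python
from typing import Any, Callable, Sequence


def detect(resource: str,
           passes_csp: Callable[[str], bool],
           build_sequence: Callable[[str], Sequence[str]],
           extract_features: Callable[[str], Any],
           classify: Callable[[list], str]) -> str:
    """Run one resource through the CSP gate and then the sequence classifier."""
    # CSP module: a resource that fails the check is judged malicious at once.
    if not passes_csp(resource):
        return "malicious"
    # Sequence construction module.
    sequence = build_sequence(resource)
    # Feature extraction module: one feature record per element of the sequence.
    features = [extract_features(node) for node in sequence]
    # Sequence classifier module.
    return classify(features)
```

Only resources that pass the CSP check reach the later, more expensive stages.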
Compared with the prior art, the invention has the following beneficial effects:
The method performs third-party malicious content detection on page resource sequences obtained by fine-grained analysis of the DOM tree. Compared with methods based on traditional security policies, it is easier to deploy, and third parties cannot exploit security holes to bypass it, which further improves the security of the webpage.
Drawings
FIG. 1 is a block diagram of the main steps of the present invention;
FIG. 2 is a sequence construction diagram of the present invention;
FIG. 3 is a diagram of a sequence classifier model according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
A method for detecting malicious third-party content of a website, as shown in fig. 1, includes the following steps:
step 1, webpage resources are checked against the Content Security Policy (CSP), and any resource that fails the CSP check is directly judged to be malicious content;
step 2, content that the CSP does not flag enters a sequence construction module, which constructs a sequence;
step 3, feature extraction: features are extracted from the sequence constructed in step 2 and used for training and classification;
step 4, the sequence is classified by a sequence classifier.
Further, in step 1 the Content Security Policy (CSP) is delivered through a META tag whose http-equiv attribute is set to Content-Security-Policy; any resource that fails the CSP check is directly judged to be malicious content.
Further, in step 2 the sequence is constructed as follows: an HTML interpreter and a JavaScript engine build the DOM tree and render the page, and an added browser extension engine records the inclusion relations among page resources to form the sequence.
Further, the sequence classifier in step 4 comprises a malicious model and a legitimate model. The malicious model is trained on a list of malicious samples obtained from currently available public blacklists and detection tools; the legitimate model is trained on a list of legitimate samples obtained by collecting a large amount of legitimate data offline. The sequence is then classified using a machine learning algorithm.
A system for detecting malicious third-party content of a website comprises a Content Security Policy (CSP) module, a sequence construction module, a feature extraction module, and a sequence classifier module, connected in that order. The CSP module judges webpage resources; the sequence construction module builds the page resource sequence through an HTML interpreter, a JavaScript engine, and a browser extension engine; the feature extraction module extracts features from the sequence built by the sequence construction module, for use in training and classification; and the sequence classifier module classifies the sequence using machine learning.
Further, the sequence classifier module comprises a malicious model and a legitimate model, and the feature extraction module is connected to both models in parallel. The malicious model is trained on a list of malicious samples obtained from currently available public blacklists and detection tools, and the legitimate model is trained on a list of legitimate samples obtained by collecting a large amount of legitimate data offline.
As shown in fig. 1, before rendering a page, the browser requests an HTML document from a remote server. After receiving it, the browser parses the document into a DOM tree with an HTML interpreter and uses a CSS interpreter to compute the corresponding style information and page layout for the DOM tree. If a JS script is encountered during this process, the JavaScript engine is called to execute it. Finally, the entire page is drawn in the browser.
In this method, after the browser obtains the page resources, the Content Security Policy (CSP) is used to check them; the CSP directive rules are set by calling the contentSecurityPolicy(options) method of the Helmet module. If the CSP considers a resource to come from a malicious third party, the resource is directly judged to be malicious; the remaining resources proceed to sequence classification. While the DOM tree is being built, the injection and execution of content scripts are tracked by an enhanced version of Blink, the Chromium rendering engine, so that page inclusion relations the DOM tree cannot record are also captured. The constructed sequence is shown in fig. 2.
The sequence then enters the feature extraction module. The extracted features include, for example: DNS features, such as top-level domain, host type, level, and Alexa rank; string features, such as the proportion of non-alphabetic characters, the proportion of unique characters, the frequency of each character in the domain name, the length of the domain name, and the entropy of the domain name; and role features of the resource within the sequence, such as advertising network, CDN, or URL shortening service.
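The string features listed above are simple to compute. The following Python sketch shows one possible reading of them; the function names, the choice to count dots and digits as non-alphabetic, and the use of base-2 Shannon entropy are assumptions, since the source does not define the features precisely. DNS features such as Alexa rank and role features need external data and are not shown.

```python
import math
from collections import Counter


def domain_string_features(domain: str) -> dict[str, float]:
    """Compute length, character entropy, and character ratios of a domain name."""
    name = domain.lower().rstrip(".")
    if not name:
        raise ValueError("domain must not be empty")
    counts = Counter(name)  # frequency of each character in the domain name
    length = len(name)
    # Shannon entropy in bits over the character distribution.
    entropy = -sum((c / length) * math.log2(c / length) for c in counts.values())
    # Assumption: digits, dots, and hyphens all count as non-alphabetic.
    non_alpha = sum(1 for ch in name if not ch.isalpha())
    return {
        "length": float(length),
        "entropy": entropy,
        "unique_ratio": len(counts) / length,
        "non_alpha_ratio": non_alpha / length,
    }


def top_level_domain(domain: str) -> str:
    """Return the last dot-separated label, e.g. 'com' for 'cdn.example.com'."""
    return domain.lower().rstrip(".").rsplit(".", 1)[-1]
```

Algorithmically generated domain names tend to have higher entropy and a higher share of digits than hand-picked names, which is presumably why features of this kind are useful to the classifier.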
The sequence classifier uses a hidden Markov model, as shown in fig. 3: the model parameters are estimated with the Baum-Welch algorithm, and the forward-backward algorithm is used to evaluate a given sequence.
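How two hidden Markov models can score a sequence is sketched below in Python. The sketch covers only the evaluation step, using the forward pass of the forward-backward algorithm with per-step scaling; Baum-Welch training is not shown, and the model parameters are assumed to have been estimated already. The decision rule (pick the model with the higher log-likelihood), the function names, and the encoding of observations as integer symbol indices are assumptions not stated in the source.

```python
import math


def log_likelihood(observations, start, trans, emit):
    """Scaled forward algorithm: return log P(observations | model).

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits symbol index o
    """
    if not observations:
        raise ValueError("observations must not be empty")
    n_states = len(start)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(observations)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states))
                * emit[s][observations[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_prob


def classify(observations, malicious_model, legitimate_model):
    """Assumed decision rule: choose the model that explains the sequence better."""
    bad = log_likelihood(observations, *malicious_model)
    good = log_likelihood(observations, *legitimate_model)
    return "malicious" if bad > good else "legitimate"
```

Each model is passed as a `(start, trans, emit)` tuple. In practice the per-resource feature vectors would first have to be discretized into symbols, or the emission tables replaced by continuous densities; the source does not say which approach is taken.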
The division of the modules, units or flows in the present invention is only a division of logic functions, and other division manners may be available in actual implementation, for example, a plurality of modules and/or units may be combined or integrated in another system, and the modules and units described as separate components may be separated or not separated in form, so that part or all of the units may be selected according to actual needs to implement the scheme of the embodiment.
Although only the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art, and all changes are encompassed in the scope of the present invention.
Claims (3)
1. A method for detecting malicious third-party content of a website, characterized by comprising the following steps:
S1, webpage resources are checked against the Content Security Policy (CSP), and any resource that fails the CSP check is directly judged to be malicious content;
S2, content that the CSP does not flag enters a sequence construction module, which constructs a sequence;
S3, feature extraction: features are extracted from the sequence constructed in S2 and used for training and classification;
S4, the sequence is classified by a sequence classifier, wherein the sequence classifier comprises a malicious model and a legitimate model, the malicious model is trained on a list of malicious samples obtained from currently available public blacklists and detection tools, the legitimate model is trained on a list of legitimate samples obtained by collecting a large amount of legitimate data offline, and the sequence is classified using a machine learning algorithm;
the website malicious third-party content detection system for executing the method comprises a Content Security Policy (CSP) module, a sequence construction module, a feature extraction module, and a sequence classifier module, connected in that order; the CSP module judges webpage resources; the sequence construction module builds a page resource sequence through an HTML interpreter, a JavaScript engine, and a browser extension engine; the feature extraction module extracts features from the sequence built by the sequence construction module, for use in training and classification; the sequence classifier module classifies the sequence using machine learning; the sequence classifier module comprises a malicious model and a legitimate model, and the feature extraction module is connected to both models in parallel; the malicious model is trained on a list of malicious samples obtained from currently available public blacklists and detection tools, and the legitimate model is trained on a list of legitimate samples obtained by collecting a large amount of legitimate data offline.
2. The method for detecting malicious third-party content of a website as claimed in claim 1, wherein: in S1, the Content Security Policy (CSP) is delivered through a META tag whose http-equiv attribute is set to Content-Security-Policy, and any resource that fails the CSP check is directly judged to be malicious content.
3. The method for detecting malicious third-party content of a website as claimed in claim 1, wherein: in S2, the sequence is constructed as follows: an HTML interpreter and a JavaScript engine build the DOM tree and render the page, and an added browser extension engine records the inclusion relations among page resources to form the sequence.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011332352.9A CN112511525B (en) | 2020-11-24 | 2020-11-24 | Website malicious third-party content detection method and system |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011332352.9A CN112511525B (en) | 2020-11-24 | 2020-11-24 | Website malicious third-party content detection method and system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112511525A (en) | 2021-03-16 |
| CN112511525B (en) | 2022-07-22 |
Family
ID=74958316
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011332352.9A Active CN112511525B (en) | 2020-11-24 | 2020-11-24 | Website malicious third-party content detection method and system |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112511525B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20240020476A1 (en) * | 2022-07-15 | 2024-01-18 | Pinterest, Inc. | Determining linked spam content |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9521162B1 (en) * | 2014-11-21 | 2016-12-13 | Narus, Inc. | Application-level DDoS detection using service profiling |
| CN107679403A (en) * | 2017-10-11 | 2018-02-09 | 北京理工大学 | It is a kind of to extort software mutation detection method based on sequence alignment algorithms |
| CN110022311A (en) * | 2019-03-18 | 2019-07-16 | 北京工业大学 | A kind of cloud outsourcing service leaking data safety test use-case automatic generating method based on attack graph |
| US10397255B1 (en) * | 2015-09-23 | 2019-08-27 | StackRox, Inc. | System and method for providing security in a distributed computation system utilizing containers |
| CN111259440A (en) * | 2020-01-14 | 2020-06-09 | 中国人民解放军国防科技大学 | Privacy protection decision tree classification method for cloud outsourcing data |
| CN111368297A (en) * | 2020-02-02 | 2020-07-03 | 西安电子科技大学 | Privacy protection mobile malware detection method, system, storage medium and application |
Family Cites Families (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104468546B (en) * | 2014-11-27 | 2018-01-09 | 微梦创科网络科技(中国)有限公司 | A kind of web information processing method and firewall device, system |
| US10432662B2 (en) * | 2015-04-30 | 2019-10-01 | Oath, Inc. | Method and system for blocking malicious third party site tagging |
| US11503070B2 (en) * | 2016-11-02 | 2022-11-15 | Microsoft Technology Licensing, Llc | Techniques for classifying a web page based upon functions used to render the web page |
| CN107948168A (en) * | 2017-11-29 | 2018-04-20 | 四川无声信息技术有限公司 | Page detection method and device |
| CN108509794A (en) * | 2018-03-09 | 2018-09-07 | 中山大学 | A kind of malicious web pages defence detection method based on classification learning algorithm |
| CN109218296B (en) * | 2018-08-29 | 2021-03-23 | 天津大学 | XSS (XSS) defense system and method based on improved CSP (chip size service) strategy |
| US10972507B2 (en) * | 2018-09-16 | 2021-04-06 | Microsoft Technology Licensing, Llc | Content policy based notification of application users about malicious browser plugins |
| US10521583B1 (en) * | 2018-10-25 | 2019-12-31 | BitSight Technologies, Inc. | Systems and methods for remote detection of software through browser webinjects |
| US10599834B1 (en) * | 2019-05-10 | 2020-03-24 | Clean.io, Inc. | Detecting malicious code existing in internet advertisements |
| CN110336812A (en) * | 2019-07-03 | 2019-10-15 | 深圳市珍爱捷云信息技术有限公司 | Resource intercepting processing method, device, computer equipment and storage medium |
-
2020
- 2020-11-24 CN CN202011332352.9A patent/CN112511525B/en active Active
Also Published As
| Publication number | Publication date |
|---|---|
| CN112511525A (en) | 2021-03-16 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Khan et al. | Detecting malicious URLs using binary classification through ada boost algorithm. | |
| US10033757B2 (en) | Identifying malicious identifiers | |
| CN103888490B (en) | A kind of man-machine knowledge method for distinguishing of full automatic WEB client side | |
| CN104954372B (en) | A kind of evidence obtaining of fishing website and verification method and system | |
| US20220030029A1 (en) | Phishing Protection Methods and Systems | |
| CN101964025B (en) | XSS detection method and equipment | |
| CN104348803B (en) | Link kidnaps detection method, device, user equipment, Analysis server and system | |
| CN112989348B (en) | Attack detection method, model training method, device, server and storage medium | |
| CN104217160A (en) | Method and system for detecting Chinese phishing website | |
| CN107948168A (en) | Page detection method and device | |
| CN103544436A (en) | System and method for distinguishing phishing websites | |
| CN111245784A (en) | Method for multi-dimensional detection of malicious domain name | |
| CN110035075A (en) | Detection method, device, computer equipment and the storage medium of fishing website | |
| CN109756467B (en) | Method and device for identifying a phishing website | |
| CN107463844B (en) | WEB Trojan horse detection method and system | |
| Geng et al. | RRPhish: Anti-phishing via mining brand resources request | |
| Zaimi et al. | Survey paper: Taxonomy of website anti-phishing solutions | |
| Tanaka et al. | Phishing site detection using similarity of website structure | |
| CN114244564A (en) | Attack defense method, device, equipment and readable storage medium | |
| CN107508832A (en) | A kind of device-fingerprint recognition methods and system | |
| CN118337453A (en) | Automatic attack tracing method, terminal device and storage medium | |
| CN112511525B (en) | Website malicious third-party content detection method and system | |
| CN115459946A (en) | Abnormal webpage identification method, device, equipment and computer storage medium | |
| CN105653941A (en) | Heuristic detection method and system for phishing website | |
| CN118626982A (en) | A multi-modal anomaly detection method and system for big data network traffic |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |