
CN112511525B - Website malicious third-party content detection method and system - Google Patents

Publication number
CN112511525B
Authority
CN
China
Prior art keywords
sequence
malicious
content
module
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011332352.9A
Other languages
Chinese (zh)
Other versions
CN112511525A (en)
Inventor
潘晓光
马泽宇
焦璐璐
韩锋
李娟�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi Sanyouhe Smart Information Technology Co Ltd
Original Assignee
Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority to CN202011332352.9A
Publication of CN112511525A
Application granted
Publication of CN112511525B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1433 Vulnerability analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention belongs to the technical field of website content detection and relates to a method and a system for detecting malicious third-party content of a website. The method comprises the following steps: first, web page resources are checked against a Content Security Policy (CSP), and any resource that fails the CSP check is directly judged to be malicious content; content that the CSP check does not flag enters an inclusion sequence construction module, which constructs an inclusion sequence; features are extracted from the constructed sequence and used for training and classification; finally, the inclusion sequence is classified by an inclusion sequence classifier. The method performs malicious third-party content detection on the inclusion sequence of page resources obtained by fine-grained analysis of the DOM tree. Compared with methods based on traditional security policies, it is easier to deploy; moreover, according to the invention, a third party cannot find security holes through which to bypass it, which further improves the security of the web page. The method and the system are used for detecting malicious third-party content of websites.

Description

Website malicious third-party content detection method and system
Technical Field
The invention belongs to the technical field of website content detection, and particularly relates to a method and a system for detecting malicious third-party content of a website.
Background
Under the same-origin policy, code and data from different origins are isolated from one another. Current security mechanisms for protecting websites from malicious third parties include Content Security Policy (CSP), Cross-Origin Resource Sharing (CORS) and cross-domain communication based on postMessage. However, these policies are difficult to apply safely in practice and cannot solve the trust problem on dynamic networks, and third parties can use their own capabilities to bypass them.
Problems or disadvantages of the prior art: existing security policies are difficult to deploy in practice, fail to address trust issues on dynamic networks, and can be bypassed by third parties using their own capabilities.
Disclosure of Invention
Aiming at the technical problem that existing security policies cannot solve the trust problem on dynamic networks, the invention provides a method and a system for detecting malicious third-party content of a website that are easy to deploy, secure and efficient.
In order to solve the technical problems, the invention adopts the technical scheme that:
A method for detecting malicious third-party content of a website comprises the following steps:
S1, a web page resource is checked against the Content Security Policy (CSP), and if the resource fails the CSP check, it is directly judged to be malicious content;
S2, content not flagged by the CSP check enters a sequence construction module, which constructs a sequence;
S3, feature extraction: corresponding features are extracted from the sequence constructed in S2 and used for training and classification;
S4, the sequence is classified by a sequence classifier.
In S1, the Content Security Policy (CSP) is set by using a META tag whose http-equiv attribute is Content-Security-Policy; if a resource fails the CSP check, it is directly judged to be malicious content.
The sequence in S2 is constructed as follows: DOM tree construction and page rendering are performed by an HTML interpreter and a JavaScript engine, and a browser extension engine is added to record the inclusion relations among page resources, which form the sequence.
The sequence classifier in S4 comprises a malicious model and a legitimate model. The malicious model is trained on a malicious sample list obtained from currently available public blacklists and detection tools; the legitimate model is trained on a legitimate sample list obtained by collecting a large amount of legitimate data offline. The sequence is then classified using a machine learning algorithm.
A system for detecting malicious third-party content of a website comprises a Content Security Policy (CSP) module, a sequence construction module, a feature extraction module and a sequence classifier module, connected in that order. The CSP module judges web page resources; the sequence construction module constructs the page resource sequence by means of an HTML interpreter, a JavaScript engine and a browser extension engine; the feature extraction module extracts corresponding features from the sequence constructed by the sequence construction module, for use in training and classification; the sequence classifier module classifies the sequence using machine learning.
The sequence classifier module comprises a malicious model and a legitimate model, and the feature extraction module is connected to both models in parallel. The malicious model is trained on a malicious sample list obtained from currently available public blacklists and detection tools; the legitimate model is trained on a legitimate sample list obtained by collecting a large amount of legitimate data offline.
Compared with the prior art, the invention has the following beneficial effects:
The method performs malicious third-party content detection on the page resource sequence obtained by fine-grained analysis of the DOM tree. Compared with methods based on traditional security policies, it is easier to deploy; moreover, according to the invention, a third party cannot find security holes through which to bypass it, which further improves the security of the web page.
Drawings
FIG. 1 is a block diagram of the main steps of the present invention;
FIG. 2 is a sequence construction diagram of the present invention;
FIG. 3 is a diagram of a sequence classifier model according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
A method for detecting malicious third-party content of a website, as shown in fig. 1, includes the following steps:
Step 1, a web page resource is checked against the Content Security Policy (CSP), and if the resource fails the CSP check, it is directly judged to be malicious content;
Step 2, content not flagged by the CSP check enters a sequence construction module, which constructs a sequence;
Step 3, feature extraction: corresponding features are extracted from the sequence constructed in step 2 and used for training and classification;
Step 4, the sequence is classified by a sequence classifier.
Further, in step 1 the Content Security Policy (CSP) is set by using a META tag whose http-equiv attribute is Content-Security-Policy; if a resource fails the CSP check, it is directly judged to be malicious content.
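As an illustration of the check in step 1, the following minimal Python sketch evaluates a resource URL against a policy string such as the one carried by `<meta http-equiv="Content-Security-Policy" content="...">`. It is not the patent's implementation: the function names are hypothetical, only host-based source expressions, `'self'`, `'none'` and `*` are handled, and `'self'` is simplified to a host comparison (real CSP compares full origins and supports many more source types).

```python
from urllib.parse import urlparse

def parse_csp(policy):
    """Split a CSP string into {directive: [source expressions]}."""
    directives = {}
    for part in policy.split(";"):
        tokens = part.split()
        if tokens:
            directives[tokens[0].lower()] = tokens[1:]
    return directives

def host_matches(host, source):
    """Match a host against a host-source expression (scheme and path ignored)."""
    pattern = urlparse(source if "//" in source else "//" + source).hostname or ""
    if pattern.startswith("*."):
        return host.endswith(pattern[1:])  # "*.cdn.example" matches "a.cdn.example"
    return host == pattern

def csp_allows(policy, page_url, resource_url, directive="script-src"):
    """Return True if the simplified policy permits loading resource_url."""
    directives = parse_csp(policy)
    # Fall back to default-src when the specific directive is absent.
    sources = directives.get(directive, directives.get("default-src"))
    if sources is None:
        return True  # no applicable directive, so no restriction
    page_host = urlparse(page_url).hostname
    host = urlparse(resource_url).hostname
    if host is None:
        return False  # URLs without a host (e.g. data:) are rejected in this sketch
    for src in sources:
        if src == "'none'":
            return False
        if src == "*":
            return True
        if src == "'self'":
            if host == page_host:
                return True
        elif not src.startswith("'") and host_matches(host, src):
            return True
    return False

# Hypothetical policy and URLs, for illustration only.
policy = "default-src 'self'; script-src 'self' *.cdn.example"
page = "https://site.example/index.html"
print(csp_allows(policy, page, "https://a.cdn.example/lib.js"))
print(csp_allows(policy, page, "https://evil.example/x.js"))
```

In the flow described here, a resource for which `csp_allows` returns False would be judged malicious immediately, and all other resources would continue to sequence construction.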
Further, the sequence in step 2 is constructed as follows: DOM tree construction and page rendering are performed by an HTML interpreter and a JavaScript engine, and a browser extension engine is added to record the inclusion relations among page resources, which form the sequence.
Further, the sequence classifier in step 4 comprises a malicious model and a legitimate model. The malicious model is trained on a malicious sample list obtained from currently available public blacklists and detection tools; the legitimate model is trained on a legitimate sample list obtained by collecting a large amount of legitimate data offline. The sequence is then classified using a machine learning algorithm.
A system for detecting malicious third-party content of a website comprises a Content Security Policy (CSP) module, a sequence construction module, a feature extraction module and a sequence classifier module, connected in that order. The CSP module judges web page resources; the sequence construction module constructs the page resource sequence by means of an HTML interpreter, a JavaScript engine and a browser extension engine; the feature extraction module extracts corresponding features from the sequence constructed by the sequence construction module, for use in training and classification; the sequence classifier module classifies the sequence using machine learning.
Further, the sequence classifier module comprises a malicious model and a legitimate model, and the feature extraction module is connected to both models in parallel. The malicious model is trained on a malicious sample list obtained from currently available public blacklists and detection tools; the legitimate model is trained on a legitimate sample list obtained by collecting a large amount of legitimate data offline.
As shown in fig. 1, before rendering a page, the browser requests an HTML document from a remote server. After receiving the document, the browser parses it into a DOM tree with the HTML interpreter and uses the CSS interpreter to compute the corresponding style information and page layout for the DOM tree. If a JS script is encountered during this process, the JavaScript engine is called to execute it. Finally, the entire page is drawn in the browser.
In this method, after the browser obtains the page resources, the resources are checked against the Content Security Policy (CSP). The CSP directive rules are set by calling the contentSecurityPolicy(options) method of the Helmet module. If the CSP check indicates that a resource comes from a malicious third party, the resource is directly judged to be malicious; the remaining resources enter the sequence classification stage. During DOM tree construction, the injection and execution of content scripts are tracked by an enhanced Blink engine (the rendering engine of Chromium), so that page inclusion relations that the DOM tree cannot record are also captured. The constructed sequence is shown in fig. 2.
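The sequence construction can be pictured with a small Python sketch. It assumes, hypothetically, that the instrumented browser emits a log of (parent, child) inclusion events, where the parent is the resource that caused the child to load; the patent does not specify this log format. The function then enumerates every path from the page to a leaf resource, and each path is one inclusion sequence.

```python
def inclusion_sequences(root, edges):
    """Enumerate root-to-leaf inclusion paths from (parent, child) events.

    root  -- URL of the top-level page
    edges -- iterable of (parent_url, child_url) pairs (assumed log format)
    """
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    sequences = []
    stack = [[root]]
    while stack:
        path = stack.pop()
        # Skip resources already on the path so that cyclic inclusions terminate.
        kids = [k for k in children.get(path[-1], []) if k not in path]
        if not kids:
            sequences.append(path)
        # Push in reverse so the first recorded child is expanded first.
        for kid in reversed(kids):
            stack.append(path + [kid])
    return sequences

# Hypothetical inclusion log: the page loads two scripts, one of which
# pulls in a further script from an advertising host.
log = [
    ("https://site.example/", "https://cdn.example/lib.js"),
    ("https://site.example/", "https://site.example/app.js"),
    ("https://cdn.example/lib.js", "https://ads.example/tag.js"),
]
for seq in inclusion_sequences("https://site.example/", log):
    print(" -> ".join(seq))
```

Each resulting path would then be passed to feature extraction and classified as a whole, which is what allows a resource to be judged by the chain of hosts that led to it rather than in isolation.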
The sequence then enters the feature extraction module. For example, DNS features include the top-level domain, host type, domain level and Alexa rank; string features include the proportion of non-alphabetic characters, the proportion of unique characters, the frequency of each character in the domain name, the length of the domain name and the entropy of the domain name; role features describe the role a resource plays in the sequence, such as advertising network, CDN or URL shortening service.
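The string features listed above can be computed directly from a domain name. The following Python sketch is illustrative only: the feature names are hypothetical, the entropy is taken to be the Shannon entropy of the character distribution, and features that need external data (Alexa rank, host type, resource role) are left out.

```python
import math
from collections import Counter

def domain_features(domain):
    """Compute simple lexical features of a domain name."""
    name = domain.lower().rstrip(".")
    if not name:
        raise ValueError("empty domain name")
    labels = name.split(".")
    counts = Counter(name)
    n = len(name)
    # Shannon entropy (in bits) of the character distribution.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "tld": labels[-1],                      # top-level domain
        "level": len(labels),                   # number of dot-separated labels
        "length": n,                            # length of the domain name
        "non_alpha_ratio": sum(not ch.isalpha() for ch in name) / n,
        "unique_ratio": len(counts) / n,        # proportion of unique characters
        "char_freq": {ch: c / n for ch, c in counts.items()},
        "entropy": entropy,
    }

features = domain_features("x7k2-q9.cdn.example")
print(features["tld"], features["level"], features["length"])
```

Randomly generated domain names tend to show a higher proportion of non-alphabetic characters and higher entropy than human-chosen names, which is the usual reason such features are informative.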
The sequence classifier uses a hidden Markov model (HMM), as shown in fig. 3. The model parameters are estimated with the Baum-Welch algorithm, and the forward-backward algorithm is used to score a given sequence against the model.
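One plausible reading of the two-model classifier is a likelihood comparison: the sequence is scored under the malicious HMM and under the legitimate HMM, and the label of the better-fitting model is returned. The Python sketch below shows only the scoring step, using the scaled forward pass (the half of forward-backward that yields the sequence likelihood). Baum-Welch training is omitted, and the model parameters and the symbol encoding are invented for illustration; they do not come from the patent.

```python
import math

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    obs   -- non-empty list of observation symbol indices
    start -- start[i] = P(first state is i)
    trans -- trans[i][j] = P(next state is j | current state is i)
    emit  -- emit[i][k] = P(symbol k | state i)
    """
    if not obs:
        raise ValueError("observation sequence must be non-empty")
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        loglik += math.log(scale)
    return loglik

def classify(obs, malicious, legitimate):
    """Label a sequence by the model under which it is more likely."""
    if log_likelihood(obs, *malicious) > log_likelihood(obs, *legitimate):
        return "malicious"
    return "legitimate"

# Hypothetical symbols: 0 = first party, 1 = CDN, 2 = ad network, 3 = URL shortener.
# Hypothetical single-state models given as (start, trans, emit).
malicious_model = ([1.0], [[1.0]], [[0.1, 0.1, 0.3, 0.5]])
legitimate_model = ([1.0], [[1.0]], [[0.5, 0.3, 0.15, 0.05]])

print(classify([0, 2, 3, 3], malicious_model, legitimate_model))
print(classify([0, 1, 1], malicious_model, legitimate_model))
```

In practice each model would have several hidden states and would be fitted to the malicious and legitimate sample lists respectively; the comparison step stays the same.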
The division of the modules, units or flows in the present invention is only a division of logic functions, and other division manners may be available in actual implementation, for example, a plurality of modules and/or units may be combined or integrated in another system, and the modules and units described as separate components may be separated or not separated in form, so that part or all of the units may be selected according to actual needs to implement the scheme of the embodiment.
Although only the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art, and all changes are encompassed in the scope of the present invention.

Claims (3)

1. A method for detecting malicious third-party content of a website, characterized by comprising the following steps:
S1, a web page resource is checked against the Content Security Policy (CSP), and if the resource fails the CSP check, it is directly judged to be malicious content;
S2, content not flagged by the CSP check enters a sequence construction module, which constructs a sequence;
S3, feature extraction: corresponding features are extracted from the sequence constructed in S2 and used for training and classification;
S4, the sequence is classified by a sequence classifier, wherein the sequence classifier comprises a malicious model and a legitimate model; the malicious model is trained on a malicious sample list obtained from currently available public blacklists and detection tools; the legitimate model is trained on a legitimate sample list obtained by collecting a large amount of legitimate data offline; and the sequence is classified by a machine learning algorithm;
the website malicious third-party content detection system for executing the method comprises a Content Security Policy (CSP) module, a sequence construction module, a feature extraction module and a sequence classifier module, connected in that order; the CSP module judges web page resources; the sequence construction module constructs the page resource sequence by means of an HTML interpreter, a JavaScript engine and a browser extension engine; the feature extraction module extracts corresponding features from the sequence constructed by the sequence construction module, for use in training and classification; the sequence classifier module classifies the sequence using machine learning; the sequence classifier module comprises a malicious model and a legitimate model, and the feature extraction module is connected to both models in parallel; the malicious model is trained on a malicious sample list obtained from currently available public blacklists and detection tools, and the legitimate model is trained on a legitimate sample list obtained by collecting a large amount of legitimate data offline.
2. The method for detecting malicious third-party content of a website as claimed in claim 1, wherein: in S1, the Content Security Policy (CSP) is set by using a META tag whose http-equiv attribute is Content-Security-Policy, and if a resource fails the CSP check, it is directly judged to be malicious content.
3. The method for detecting malicious third-party content of a website as claimed in claim 1, wherein: the sequence in S2 is constructed as follows: DOM tree construction and page rendering are performed by an HTML interpreter and a JavaScript engine, and a browser extension engine is added to record the inclusion relations among page resources, which form the sequence.
CN202011332352.9A 2020-11-24 2020-11-24 Website malicious third-party content detection method and system Active CN112511525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011332352.9A CN112511525B (en) 2020-11-24 2020-11-24 Website malicious third-party content detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011332352.9A CN112511525B (en) 2020-11-24 2020-11-24 Website malicious third-party content detection method and system

Publications (2)

Publication Number Publication Date
CN112511525A CN112511525A (en) 2021-03-16
CN112511525B CN112511525B (en) 2022-07-22

Family

ID=74958316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011332352.9A Active CN112511525B (en) 2020-11-24 2020-11-24 Website malicious third-party content detection method and system

Country Status (1)

Country Link
CN (1) CN112511525B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240020476A1 (en) * 2022-07-15 2024-01-18 Pinterest, Inc. Determining linked spam content

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9521162B1 (en) * 2014-11-21 2016-12-13 Narus, Inc. Application-level DDoS detection using service profiling
CN107679403A (en) * 2017-10-11 2018-02-09 北京理工大学 A ransomware variant detection method based on sequence alignment algorithms
CN110022311A (en) * 2019-03-18 2019-07-16 北京工业大学 A kind of cloud outsourcing service leaking data safety test use-case automatic generating method based on attack graph
US10397255B1 (en) * 2015-09-23 2019-08-27 StackRox, Inc. System and method for providing security in a distributed computation system utilizing containers
CN111259440A (en) * 2020-01-14 2020-06-09 中国人民解放军国防科技大学 Privacy protection decision tree classification method for cloud outsourcing data
CN111368297A (en) * 2020-02-02 2020-07-03 西安电子科技大学 Privacy protection mobile malware detection method, system, storage medium and application

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104468546B (en) * 2014-11-27 2018-01-09 微梦创科网络科技(中国)有限公司 A kind of web information processing method and firewall device, system
US10432662B2 (en) * 2015-04-30 2019-10-01 Oath, Inc. Method and system for blocking malicious third party site tagging
US11503070B2 (en) * 2016-11-02 2022-11-15 Microsoft Technology Licensing, Llc Techniques for classifying a web page based upon functions used to render the web page
CN107948168A (en) * 2017-11-29 2018-04-20 四川无声信息技术有限公司 Page detection method and device
CN108509794A (en) * 2018-03-09 2018-09-07 中山大学 A kind of malicious web pages defence detection method based on classification learning algorithm
CN109218296B (en) * 2018-08-29 2021-03-23 天津大学 XSS (cross-site scripting) defense system and method based on improved CSP (Content Security Policy) strategy
US10972507B2 (en) * 2018-09-16 2021-04-06 Microsoft Technology Licensing, Llc Content policy based notification of application users about malicious browser plugins
US10521583B1 (en) * 2018-10-25 2019-12-31 BitSight Technologies, Inc. Systems and methods for remote detection of software through browser webinjects
US10599834B1 (en) * 2019-05-10 2020-03-24 Clean.io, Inc. Detecting malicious code existing in internet advertisements
CN110336812A (en) * 2019-07-03 2019-10-15 深圳市珍爱捷云信息技术有限公司 Resource intercepting processing method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112511525A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
Khan et al. Detecting malicious URLs using binary classification through ada boost algorithm.
US10033757B2 (en) Identifying malicious identifiers
CN103888490B A fully automatic human-machine identification method for WEB clients
CN104954372B (en) A kind of evidence obtaining of fishing website and verification method and system
US20220030029A1 (en) Phishing Protection Methods and Systems
CN101964025B (en) XSS detection method and equipment
CN104348803B (en) Link kidnaps detection method, device, user equipment, Analysis server and system
CN112989348B (en) Attack detection method, model training method, device, server and storage medium
CN104217160A (en) Method and system for detecting Chinese phishing website
CN107948168A (en) Page detection method and device
CN103544436A (en) System and method for distinguishing phishing websites
CN111245784A (en) Method for multi-dimensional detection of malicious domain name
CN110035075A (en) Detection method, device, computer equipment and the storage medium of fishing website
CN109756467B (en) Method and device for identifying a phishing website
CN107463844B (en) WEB Trojan horse detection method and system
Geng et al. RRPhish: Anti-phishing via mining brand resources request
Zaimi et al. Survey paper: Taxonomy of website anti-phishing solutions
Tanaka et al. Phishing site detection using similarity of website structure
CN114244564A (en) Attack defense method, device, equipment and readable storage medium
CN107508832A (en) A kind of device-fingerprint recognition methods and system
CN118337453A (en) Automatic attack tracing method, terminal device and storage medium
CN112511525B (en) Website malicious third-party content detection method and system
CN115459946A (en) Abnormal webpage identification method, device, equipment and computer storage medium
CN105653941A (en) Heuristic detection method and system for phishing website
CN118626982A (en) A multi-modal anomaly detection method and system for big data network traffic

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant