
CN116775974A - Information screening method - Google Patents

Publication number
CN116775974A
Authority
CN
China
Prior art keywords
information
keyword
website
representing
title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310781347.3A
Other languages
Chinese (zh)
Other versions
CN116775974B (en)
Inventor
侯天宇
石伟
闫文敏
卢漫天
姚凯义
王鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongzi High Tech Consulting Center Co ltd
Original Assignee
Zhongzi High Tech Consulting Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongzi High Tech Consulting Center Co ltd filed Critical Zhongzi High Tech Consulting Center Co ltd
Priority to CN202310781347.3A priority Critical patent/CN116775974B/en
Publication of CN116775974A publication Critical patent/CN116775974A/en
Application granted granted Critical
Publication of CN116775974B publication Critical patent/CN116775974B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an information screening method, which relates to the technical field of information screening and comprises the following steps: acquiring and analyzing search content input by a user to obtain search keywords; acquiring first information associated with the search keywords, analyzing the timeliness of each piece of first information, and screening to obtain second information; performing a weighted calculation of the reliability of each piece of second information, and screening out third information that meets a first preset condition; calculating the relevance between the content and the title of each piece of third information, and screening out fourth information that meets a second preset condition; and ranking the screened fourth information by relevance to obtain a final screening result. By judging the reliability, timeliness, and relevance of the information sources, irrelevant and outdated information is filtered out, so that the user obtains high-quality, highly credible information.

Description

Information screening method
Technical Field
The invention relates to the technical field of information screening, in particular to a method for screening information.
Background
With the rapid development of internet technology, the internet has become a global information data platform, and more and more users rely on it as their main source of information. Faced with the massive volume of web page resources on the internet, users typically turn to search engine services to find what they need. Although a search engine can help the user locate relevant web page resources to a certain extent, web content can be published with great freedom, and the internet is open and unbounded, so the quality of information resources is difficult to control and manage effectively. As a result, users receive a large amount of junk information: among the many results a search engine returns, the top-ranked items are not necessarily high in quality or credibility, and may even include false, erroneous, or outdated information.
Therefore, the invention provides a method for screening information.
Disclosure of Invention
The invention provides an information screening method that obtains search information by analyzing the search content input by a user, then judges and screens that information by reliability, timeliness, and relevance, so that the user obtains high-quality, highly credible information.
The invention provides an information screening method, which comprises the following steps:
step 1: acquiring and analyzing search content input by a user to obtain search keywords;
step 2: acquiring first information related to the search keywords from an information data platform, analyzing the timeliness of each first information, and screening to obtain second information;
step 3: weighting calculation is carried out on the reliability of each second information based on the source information corresponding to each second information, and third information meeting the first preset condition is screened;
step 4: calculating the relevance between the content and the title in each third information based on the page layout of each third information, and screening fourth information meeting a second preset condition;
step 5: ranking the screened fourth information by relevance to obtain the final screening result.
Preferably, acquiring and analyzing search content input by a user to obtain a search keyword includes:
acquiring search content input by a user, and marking the part of speech of the search content;
based on the part-of-speech tagging result, segmenting the search content at connective words, stop words, numerals, measure words, and punctuation marks, and generating the search keywords from the remaining words.
Preferably, acquiring first information associated with the search keywords from an information data platform, analyzing the timeliness of each first information, and screening to obtain second information includes:
acquiring first information associated with the search keywords from the information data platform, wherein the information in the information data platform comprises fixed-type information and non-fixed-type information;
determining the type of each first information, and if the first information is of the fixed type, treating it as second information, wherein the fixed type comprises: fixed information, information that remains unchanged over the long term, and information whose updating has ended or been explicitly designated;
if the first information is of the non-fixed type, acquiring the release time of each first information, classifying by year, month, and day, and determining the first quantity released in each time interval and a released-information popularity list, wherein the non-fixed type comprises: news events, periodically updated information, and continuously updated information;
judging the timeliness of each first information based on the determination result, and screening to obtain second information based on the judgment result.
Preferably, the weighting calculation is performed on the reliability of each piece of second information based on the source information corresponding to each piece of second information, and third information meeting the first preset condition is screened, including:
tracing the source information corresponding to each second information, and obtaining the website domain name, network security protection level, website crash frequency, and website backup information to perform a first evaluation;
acquiring user access information of the website corresponding to the source information to perform a second evaluation;
performing a third evaluation according to the historical number of advertisements accepted and the number of valid links of the website corresponding to the source information;
weighting calculation is carried out on the reliability of the corresponding second information based on the first evaluation result, the second evaluation result and the third evaluation result;
and screening the second information based on the calculation result and the first preset condition to obtain third information.
Preferably, the weighting calculation for the reliability of the corresponding second information based on the first evaluation result, the second evaluation result, and the third evaluation result includes:
$$Y_i = \alpha P_1 + \beta P_2 + \gamma P_3$$
where $Y_i$ represents the weighted reliability of the i-th second information; $\alpha$ represents the weight corresponding to the first evaluation result, and $P_1$ represents the first evaluation result of the i-th second information; $y_i$ represents the website domain name score of the website corresponding to the i-th second information, and $y_{max}$ represents the maximum website domain name score; $p_i$ represents the network security protection level of the website corresponding to the i-th second information, and $p_{max}$ represents the highest network security protection level; $e_i$ represents the website crash frequency of the website corresponding to the i-th second information, and $e_{max}$ represents the maximum allowable website crash frequency; $z_i$ represents the website backup information score of the website corresponding to the i-th second information, and $z_{max}$ represents the maximum website backup information score; $\beta$ represents the weight corresponding to the second evaluation result, and $P_2$ represents the second evaluation result of the i-th second information; $x_i$ represents the average daily user visits of the website corresponding to the i-th second information, and $t_i$ represents the average access time of that website; $x_{max}$ represents the maximum average daily user visits among the websites corresponding to all second information, and $t_{max}$ represents the maximum average access time among those websites; $\gamma$ represents the weight corresponding to the third evaluation result, and $P_3$ represents the third evaluation result of the i-th second information; $C_{max}$ represents the maximum historical number of advertisements accepted among the websites corresponding to all second information, $C_i$ represents the historical number of advertisements accepted by the website corresponding to the i-th second information, and $C_{ave}$ represents the average historical number of advertisements accepted across the websites corresponding to all second information; $l_{max}$ represents the maximum number of links set in the second information website, and $l_i$ represents the actual number of links in the website corresponding to the i-th second information; and $\alpha + \beta + \gamma = 1$.
Preferably, calculating a correlation degree between the content and the title in each third information based on the page layout condition of each third information, and screening fourth information meeting a second preset condition, including:
acquiring title content and text content in a page corresponding to each third information based on the page layout condition of each third information, and marking the parts of speech;
analyzing the title structure, and determining first weight values of words at different positions in the title;
segmenting the title content based on the part-of-speech tagging result, and determining a second weight value of each word in the segmentation result based on the segmentation result and a preset part-of-speech priority order;
determining the total weight value of each word in the segmentation result based on the first weight value and the second weight value;
based on the search keywords and the segmentation results, determining matching words which are associated with the search keywords in the segmentation results, acquiring the total weight value of each matching word, determining the lowest total weight value in the matching words, and taking the lowest total weight value as a title keyword acquisition standard;
determining a first keyword set corresponding to the title content based on an acquisition standard;
setting a corresponding number of second keyword sets to the text content based on the number of elements in the first keyword sets;
based on the part-of-speech tagging result, determining the occurrence frequency of the words of each part of speech in the text content, extracting all words belonging to the n1 most frequent parts of speech, and filling them into the corresponding second keyword sets, where n1 is also the number of elements in the first keyword set and each second keyword set holds all words of one part of speech;
acquiring the keyword intersection of the first keyword set with each filled second keyword set, and determining a first relevance between each keyword intersection and the first keyword set based on the total weight value of that keyword intersection;
determining a second relevance between each keyword intersection and the text content based on the word frequency, within the corresponding filled second keyword set, of each keyword in that keyword intersection;
and calculating the relevance of the title and the text content in each third information corresponding page based on all the first relevance and all the second relevance.
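The set-building steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the total weight of each title word is taken as already computed, a "matching word" is assumed to be an exact match with a search keyword, and a title word is assumed to enter the first keyword set when its total weight is at least the acquisition standard. The final relevance formula is not reproduced here.

```python
from collections import Counter, defaultdict

def first_keyword_set(title_weights, search_keywords):
    """Select title keywords using the lowest total weight among matching words.

    title_weights maps each title word to its total weight (assumed precomputed).
    """
    matched = [w for w in title_weights if w in search_keywords]
    if not matched:
        return set()
    standard = min(title_weights[w] for w in matched)  # acquisition standard
    return {w for w, weight in title_weights.items() if weight >= standard}

def second_keyword_sets(tagged_text, n1):
    """Group body words by part of speech; keep the n1 most frequent parts of speech."""
    by_pos = defaultdict(Counter)
    for word, pos in tagged_text:
        by_pos[pos][word] += 1
    ranked = sorted(by_pos, key=lambda pos: sum(by_pos[pos].values()), reverse=True)
    return {pos: by_pos[pos] for pos in ranked[:n1]}

def keyword_intersections(first_set, second_sets):
    """Intersect the first keyword set with every filled second keyword set."""
    return {pos: first_set & set(counter) for pos, counter in second_sets.items()}

# Made-up data for illustration only.
title_weights = {"dell": 0.9, "laptop": 0.8, "disassembly": 0.7, "guide": 0.5, "easy": 0.3}
first = first_keyword_set(title_weights, {"dell", "disassembly"})
tagged_text = [
    ("dell", "noun"), ("laptop", "noun"), ("laptop", "noun"), ("remove", "verb"),
    ("disassembly", "noun"), ("carefully", "adverb"), ("remove", "verb"),
    ("slowly", "adverb"), ("small", "adjective"),
]
second = second_keyword_sets(tagged_text, n1=len(first))
intersections = keyword_intersections(first, second)
```

With this data the acquisition standard is 0.7, so the first keyword set holds three words and three second keyword sets are created, one per part of speech.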
Preferably, calculating the relevance between the title and the text content in the page corresponding to each third information based on all the first relevance and all the second relevance includes:
where $P_d$ represents the relevance between the title and the text content in the d-th page; $qd_k$ represents the d-th […]; $Fd$ represents the total weight value of the first keyword set corresponding to the d-th page; $s$ represents the number of keyword intersections in the d-th page; $fd_n$ represents the total word frequency, in the text content, of all keywords in the n-th second keyword set of the d-th page; $dd_n$ represents the total word frequency, in the text content, of all keywords in the keyword intersection corresponding to the n-th second keyword set of the d-th page; $n1$ represents the number of second keyword sets in the d-th page and equals the number of keyword intersections; $a1$ represents the first proportion coefficient; and $a2$ represents the second proportion coefficient.
Preferably, ranking the screened fourth information by relevance to obtain the final screening result includes:
acquiring the relevance between the title content and the text content in each fourth information, and comparing these relevance values across all fourth information;
based on the comparison result, sorting in descending order to obtain a final ranking, which is displayed as the final screening result.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
fig. 1 is a flowchart of a method for screening information according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
Example 1
The invention provides an information screening method, as shown in figure 1, comprising the following steps:
step 1: acquiring and analyzing search content input by a user to obtain search keywords;
step 2: acquiring first information related to the search keywords from an information data platform, analyzing the timeliness of each first information, and screening to obtain second information;
step 3: weighting calculation is carried out on the reliability of each second information based on the source information corresponding to each second information, and third information meeting the first preset condition is screened;
step 4: calculating the relevance between the content and the title in each third information based on the page layout of each third information, and screening fourth information meeting a second preset condition;
step 5: ranking the screened fourth information by relevance to obtain the final screening result.
In this embodiment, the information data platform refers to a platform that aggregates basic information and serves users; the information data platform looks up the search keywords through a search engine such as Baidu or Google;
in this embodiment, the first information refers to network information whose title is associated with the search keywords; for example, if the search content is "Dell computer disassembly", the search keywords are "Dell", "computer", and "disassembly", and the first information is the information whose web page title contains "Dell", "computer", and "disassembly", together with its release time, release website, and release page;
in this embodiment, timeliness is illustrated as follows: if first information 1 and first information 2, both associated with event 1, are released on 2023.1.4 and 2023.5.6 respectively, then first information 2 is taken to be more comprehensive than first information 1, and the timeliness of first information 2 is stronger than that of first information 1;
in this embodiment, the source information corresponding to the second information refers to the website domain name, network security protection level, website crash frequency, website backup information, average daily user visits and average access time, historical number of advertisements accepted, and number of valid links of the corresponding website;
in this embodiment, the weighted calculation of the reliability of each second information is a weighted calculation of the reliability of the website from which that second information originates, and the result of the weighted calculation lies within [0,1];
in this embodiment, the first preset condition is a threshold; for example, if the weighted result of a second information is 0.85 and the first preset condition is 0.6, the second information passes the screening and can serve as third information;
in this embodiment, the page layout of each third information refers to how the title and text are distributed on the page where that third information is located, and the title information and corresponding text content of the page are obtained according to this distribution;
in this embodiment, the second preset condition is likewise a threshold; for example, if the relevance between the title content and the text content of a third information is 0.7 and the second preset condition is 0.6, the information passes the screening and can serve as fourth information;
in this embodiment, if the relevance between the title and the text content in fourth information 1 is 0.7, the relevance of fourth information 1 is 0.7.
The beneficial effects of the technical scheme are as follows: search information is obtained by analyzing the search content input by the user, and the reliability, timeliness, and relevance of the search information sources are judged, so that irrelevant and outdated information is filtered out and the user obtains high-quality, highly credible information.
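The five steps of Example 1 can be read as a chain of filters followed by a sort. The Python sketch below is only an illustration under assumptions: the `Item` class and `screen` function are hypothetical names, the timeliness verdict, reliability score, and relevance score are taken as already computed, and both preset conditions are assumed to be "greater than or equal to" thresholds of 0.6, as in the examples.

```python
from dataclasses import dataclass

@dataclass
class Item:
    title: str
    is_timely: bool      # outcome of the step 2 timeliness analysis (assumed given)
    reliability: float   # weighted reliability from step 3, in [0, 1]
    relevance: float     # title/text relevance from step 4, in [0, 1]

def screen(items, first_threshold=0.6, second_threshold=0.6):
    """Apply steps 2 to 5: timeliness, reliability, relevance, then ranking."""
    second = [it for it in items if it.is_timely]
    third = [it for it in second if it.reliability >= first_threshold]
    fourth = [it for it in third if it.relevance >= second_threshold]
    # Step 5: rank by relevance, largest first.
    return sorted(fourth, key=lambda it: it.relevance, reverse=True)

# Made-up candidates for illustration only.
demo = [
    Item("A", True, 0.85, 0.7),
    Item("B", True, 0.50, 0.9),   # fails the first preset condition
    Item("C", False, 0.90, 0.9),  # fails the timeliness screening
    Item("D", True, 0.70, 0.8),
]
ranked_titles = [it.title for it in screen(demo)]
```

Here only A and D survive all three screenings, and D is ranked first because its relevance is higher.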
Example 2
The invention provides an information screening method, in which acquiring and analyzing the search content input by a user to obtain search keywords comprises:
acquiring search content input by a user, and tagging the parts of speech of the search content;
based on the part-of-speech tagging result, segmenting the search content at connective words, stop words, numerals, measure words, and punctuation marks, and generating the search keywords from the remaining words.
In this embodiment, if the search content input by the user is "An intelligent control system and method for poultry breeding environment", the part-of-speech analysis result is: a/numeral and measure word, poultry/noun, breeding/verb, environment/noun, [particle]/stop word, intelligent/noun, control/verb, system/noun, and/stop word, method/noun; the search keywords obtained after segmentation are: poultry/noun, breeding/verb, environment/noun, intelligent/noun, control/verb, system/noun, method/noun.
The beneficial effects of the technical scheme are as follows: by analyzing the search content input by the user, the search keywords are determined and used to acquire search information that is relevant to the user's search content.
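The keyword extraction in Example 2 can be illustrated with a short Python sketch. It assumes the part-of-speech tagging has already been done by some tagger, so the input is a list of (word, tag) pairs; the tag names and the function name are hypothetical, and the English tokens "a", "of", and "and" are stand-ins for the function words of the original query.

```python
# Parts of speech treated as delimiters; the label names are hypothetical.
DELIMITER_TAGS = {"connective", "stop", "numeral", "measure", "punct"}

def extract_keywords(tagged_tokens):
    """Keep only the tokens whose part of speech is not a delimiter class."""
    return [(word, tag) for word, tag in tagged_tokens if tag not in DELIMITER_TAGS]

# Tagged form of the example query; the tagging itself is assumed as input.
tagged = [
    ("a", "numeral"), ("poultry", "noun"), ("breeding", "verb"),
    ("environment", "noun"), ("of", "stop"), ("intelligent", "noun"),
    ("control", "verb"), ("system", "noun"), ("and", "stop"), ("method", "noun"),
]
keywords = extract_keywords(tagged)
```

The seven remaining words keep their tags, which later steps can use when weighting words by part of speech.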
Example 3
The invention provides an information screening method, in which acquiring first information associated with the search keywords from an information data platform, analyzing the timeliness of each first information, and screening to obtain second information comprises:
acquiring first information associated with the search keywords from the information data platform, wherein the information in the information data platform comprises fixed-type information and non-fixed-type information;
determining the type of each first information, and if the first information is of the fixed type, treating it as second information, wherein the fixed type comprises: fixed information, information that remains unchanged over the long term, and information whose updating has ended or been explicitly designated;
if the first information is of the non-fixed type, acquiring the release time of each first information, classifying by year, month, and day, and determining the first quantity released in each time interval and a released-information popularity list, wherein the non-fixed type comprises: news events, periodically updated information, and continuously updated information;
judging the timeliness of each first information based on the determination result, and screening to obtain second information based on the judgment result.
In this embodiment, if the first information is of the fixed type, that is, fixed information, information that remains unchanged over the long term, or information whose updating has ended or been explicitly designated, its validity does not change over time, so its timeliness does not need to be judged;
in this embodiment, if the first information is of the non-fixed type, the judgment is based on the first quantity released in each time interval and the released-information popularity list. For example, if the first quantity and popularity of information 1 do not vary periodically over time, the popularity of information 1 is highest in January, the largest first quantity is released in the 1.25 time interval, and the first quantity changes by no more than 5% in the 1.26 and 1.27 time intervals, then the first information in the 1.27 time interval has the greatest timeliness, and the second information obtained by screening is the first information in the 1.27 time interval;
if the first quantity and popularity of information 2 vary periodically over time, and the first quantity and popularity in the three intervals 1.14, 1.21, and 1.28 are each the maximum of their respective period, the second information obtained by screening is the first information in the 1.28 time interval.
The beneficial effects of the technical scheme are as follows: by analyzing the timeliness of the search information, outdated information is filtered out, so that the user obtains more timely information and the quality of the information obtained is ensured.
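The non-periodic case of Example 3 can be sketched in Python. The text does not give an exact rule, so the reading used here is an assumption: count releases per day, find the day with the largest count, then move forward while the day-to-day change stays within 5%, and select the last such day. The function names, the year 2023, and the counts are all made up for illustration; fixed-type information would bypass this step entirely.

```python
from collections import Counter
from datetime import date

def count_by_day(release_dates):
    """First quantity: number of items released in each one-day interval."""
    return Counter(release_dates)

def latest_stable_day(counts, max_change=0.05):
    """Start at the peak day and advance while the relative change is small."""
    days = sorted(counts)
    start = days.index(max(days, key=lambda d: counts[d]))
    chosen = days[start]
    for prev, cur in zip(days[start:], days[start + 1:]):
        if abs(counts[cur] - counts[prev]) / counts[prev] > max_change:
            break
        chosen = cur
    return chosen

# Made-up release dates: a peak on January 25, then two stable days.
release_dates = (
    [date(2023, 1, 24)] * 60 + [date(2023, 1, 25)] * 100
    + [date(2023, 1, 26)] * 97 + [date(2023, 1, 27)] * 95
    + [date(2023, 1, 28)] * 40
)
chosen_day = latest_stable_day(count_by_day(release_dates))
```

With these counts the selected interval is January 27, matching the outcome described in the example; the periodic case would need a separate rule that picks the most recent per-period maximum.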
Example 4
The invention provides an information screening method, in which performing a weighted calculation of the reliability of each second information based on its corresponding source information, and screening third information meeting a first preset condition, comprises:
tracing the source information corresponding to each second information, and obtaining the website domain name, network security protection level, website crash frequency, and website backup information to perform a first evaluation;
acquiring user access information of the website corresponding to the source information to perform a second evaluation;
performing a third evaluation according to the historical number of advertisements accepted and the number of valid links of the website corresponding to the source information;
weighting calculation is carried out on the reliability of the corresponding second information based on the first evaluation result, the second evaluation result and the third evaluation result;
and screening the second information based on the calculation result and the first preset condition to obtain third information.
In this embodiment, the first evaluation of the website requires scoring the website domain name and the website backup information;
in this embodiment, website domain names are divided by institution into four types: government institutions/organizations (.gov), non-profit websites (.org), educational institutions (.edu), and commercial websites (.net, .com). Different institutions have different reliability, ranked as government institutions/organizations (.gov) > non-profit websites (.org) > educational institutions (.edu) > commercial websites (.net, .com), so they are scored differently: 100 points for government institutions/organizations (.gov), 80 points for non-profit websites (.org), 60 points for educational institutions (.edu), and 40 points for commercial websites (.net, .com);
in this embodiment, scoring the website backup information covers several aspects: whether the backup is automatic, whether it is performed daily, whether it is stored outside the server, and whether it is incremental. If the backup information of website 1 satisfies automatic backup, daily backup, and backup outside the server, website 1 scores 75 points; if the backup information of website 2 satisfies automatic backup and daily backup, website 2 scores 50 points;
in this embodiment, network security protection is divided into 10 levels; the higher the network security protection level, the more reliable the website;
in this embodiment, the user access information includes the website's average daily user visits and the average visit time per user; when computing the average daily user visits, multiple visits by the same user in one day are counted as one visit;
in this embodiment, the first, second, and third evaluation results share the same value range, namely the interval [0,1];
in this embodiment, the result of the weighted calculation lies within the interval [0,1], and the first preset condition is generally set to 0.6.
The beneficial effects of the technical scheme are as follows: the reliability of the information source website is calculated and the information is screened based on the result, which helps the user obtain correct information of higher reliability and avoids interference from erroneous and false information.
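The two scores used in the first evaluation can be illustrated with a small Python sketch. The domain scores come from the embodiment; the rule of 25 points per satisfied backup criterion is an inference from the 75-point and 50-point examples, and the function names, dictionary keys, and the score of 0 for unlisted suffixes are assumptions made for this illustration.

```python
# Domain-type scores as listed in the embodiment.
DOMAIN_SCORES = {".gov": 100, ".org": 80, ".edu": 60, ".net": 40, ".com": 40}

def domain_score(hostname):
    """Score a hostname by its suffix; unlisted suffixes get 0 (an assumption)."""
    name = hostname.lower()
    for suffix, score in DOMAIN_SCORES.items():
        if name.endswith(suffix):
            return score
    return 0

# The four backup aspects named in the embodiment (key names are hypothetical).
BACKUP_CRITERIA = ("automatic", "daily", "off_server", "incremental")

def backup_score(backup_info):
    """Award 25 points per satisfied criterion (inferred from the examples)."""
    return 25 * sum(1 for c in BACKUP_CRITERIA if backup_info.get(c, False))

site1 = backup_score({"automatic": True, "daily": True, "off_server": True})
site2 = backup_score({"automatic": True, "daily": True})
```

Both scores are on a 0 to 100 scale, so dividing by the maximum score would bring them into the [0,1] range that the evaluation results share.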
Example 5
The invention provides an information screening method, in which performing a weighted calculation of the reliability of the corresponding second information based on the first evaluation result, the second evaluation result, and the third evaluation result comprises:
$$Y_i = \alpha P_1 + \beta P_2 + \gamma P_3$$
where $Y_i$ represents the weighted reliability of the i-th second information; $\alpha$ represents the weight corresponding to the first evaluation result, and $P_1$ represents the first evaluation result of the i-th second information; $y_i$ represents the website domain name score of the website corresponding to the i-th second information, and $y_{max}$ represents the maximum website domain name score; $p_i$ represents the network security protection level of the website corresponding to the i-th second information, and $p_{max}$ represents the highest network security protection level; $e_i$ represents the website crash frequency of the website corresponding to the i-th second information, and $e_{max}$ represents the maximum allowable website crash frequency; $z_i$ represents the website backup information score of the website corresponding to the i-th second information, and $z_{max}$ represents the maximum website backup information score; $\beta$ represents the weight corresponding to the second evaluation result, and $P_2$ represents the second evaluation result of the i-th second information; $x_i$ represents the average daily user visits of the website corresponding to the i-th second information, and $t_i$ represents the average access time of that website; $x_{max}$ represents the maximum average daily user visits among the websites corresponding to all second information, and $t_{max}$ represents the maximum average access time among those websites; $\gamma$ represents the weight corresponding to the third evaluation result, and $P_3$ represents the third evaluation result of the i-th second information; $C_{max}$ represents the maximum historical number of advertisements accepted among the websites corresponding to all second information, $C_i$ represents the historical number of advertisements accepted by the website corresponding to the i-th second information, and $C_{ave}$ represents the average historical number of advertisements accepted across the websites corresponding to all second information; $l_{max}$ represents the maximum number of links set in the second information website, and $l_i$ represents the actual number of links in the website corresponding to the i-th second information; and $\alpha + \beta + \gamma = 1$.
In this embodiment, α is generally 0.4, β is generally 0.3, and γ is generally 0.3.
The beneficial effects of the technical scheme are as follows: calculating the reliability of the information source website helps the user acquire correct information with higher reliability and avoids interference from erroneous and false information.
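The weighted reliability calculation of this embodiment can be sketched as follows. This is a minimal illustration, not the patented implementation: the three evaluation results P1, P2 and P3 are taken as given inputs because their formulas are not reproduced in this text, and the function names, the site names, the sample values and the threshold standing in for the first preset condition are all hypothetical.

```python
from typing import Dict, List, Tuple


def reliability(p1: float, p2: float, p3: float,
                alpha: float = 0.4, beta: float = 0.3, gamma: float = 0.3) -> float:
    """Weighted reliability alpha*P1 + beta*P2 + gamma*P3 with alpha+beta+gamma = 1."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    return alpha * p1 + beta * p2 + gamma * p3


def screen_third_information(evaluations: Dict[str, Tuple[float, float, float]],
                             threshold: float) -> List[str]:
    """Keep the second information whose reliability reaches the threshold.

    The threshold is an assumed stand-in for the first preset condition.
    """
    return [name for name, (p1, p2, p3) in evaluations.items()
            if reliability(p1, p2, p3) >= threshold]


# Hypothetical evaluation results (P1, P2, P3) for two source websites.
evaluations = {"site_a": (0.8, 0.5, 0.6), "site_b": (0.3, 0.4, 0.2)}
third_information = screen_third_information(evaluations, threshold=0.5)
print(third_information)
```

With the default weights, site_a scores about 0.65 and site_b about 0.30, so only site_a passes the assumed threshold of 0.5.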
Example 6
The invention provides an information screening method, in which calculating the relevance between the content and the title in each third information based on the page layout condition of each third information, and screening fourth information meeting a second preset condition, comprises the following steps:
acquiring title content and text content in a page corresponding to each third information based on the page layout condition of each third information, and marking the parts of speech;
analyzing the title structure, and determining first weight values of words at different positions in the title;
segmenting the title content based on the part-of-speech tagging result, and determining a second weight value of each word in the segmentation result based on the segmentation result and a preset part-of-speech priority order;
determining the total weight value of each word in the segmentation result based on the first weight value and the second weight value;
based on the search keywords and the segmentation results, determining matching words which are associated with the search keywords in the segmentation results, acquiring the total weight value of each matching word, determining the lowest total weight value in the matching words, and taking the lowest total weight value as a title keyword acquisition standard;
determining a first keyword set corresponding to the title content based on an acquisition standard;
setting a corresponding number of second keyword sets to the text content based on the number of elements in the first keyword sets;
based on the part-of-speech tagging result, determining the occurrence frequency of words of the same part of speech in the text content, extracting, for each part of speech, the n1 words with the highest occurrence frequency, and filling them into the corresponding second keyword set, wherein n1 is also the number of elements in the first keyword set, and each second keyword set corresponds to the words of one part of speech;
acquiring keyword intersections of the first keyword set and each filled second keyword set, and determining a first correlation degree between the corresponding keyword intersection and the first keyword set based on a total weight value of the keyword intersections;
determining a second relevance between the corresponding keyword intersection and the text content based on the word frequency of the second keyword set after the corresponding filling of each keyword in the corresponding keyword intersection;
and calculating the relevance of the title and the text content in each third information corresponding page based on all the first relevance and all the second relevance.
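The title-side steps (total weight, acquisition standard, first keyword set) can be sketched as follows. This is an illustration under stated assumptions: the total weight is taken as the sum of the first and second weight values, "associated with the search keywords" is simplified to exact word matching, and the comparison is strict ("greater than" the lowest matching weight), following the wording of the embodiment. The function names and the weights given to the non-matching words are hypothetical.

```python
from typing import Dict, Iterable, Set


def total_weight(first_weight: float, second_weight: float) -> float:
    """Total weight of a word, assumed to be first weight + second weight."""
    return first_weight + second_weight


def first_keyword_set(totals: Dict[str, float],
                      search_keywords: Iterable[str]) -> Set[str]:
    """Select title words whose total weight exceeds the acquisition standard.

    The acquisition standard is the lowest total weight among the title words
    that match a search keyword (exact match is an assumption).
    """
    wanted = set(search_keywords)
    matching = [word for word in totals if word in wanted]
    if not matching:
        return set()
    standard = min(totals[word] for word in matching)
    # Strict comparison: the lowest-weight matching word itself is excluded.
    return {word for word, weight in totals.items() if weight > standard}


# Total weights of the segmented title words; values for environment,
# intelligence and method are hypothetical.
totals = {"poultry": 0.6, "breeding": 0.7, "environment": 0.9,
          "intelligence": 0.5, "control": 0.4, "system": 0.8, "method": 0.3}
search = ["poultry", "breeding", "gas", "temperature", "control", "system"]
keywords = first_keyword_set(totals, search)
print(sorted(keywords))
```

Under the strict comparison the word "control" (weight 0.4, the standard itself) is left out; an inclusive comparison would keep it, and the text does not settle which reading is intended.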
In this embodiment, the first weight values of words at different positions in the title are different; for example, in the title "An intelligent control system and method for poultry breeding environment", the first weight value of "breeding environment" is larger than the first weight value of "intelligent control", and the value range of the first weight value is [0,1];
in this embodiment, based on the part-of-speech tagging result, the title content is segmented by using a combination of related words, stop words, numerals and measure words, and punctuation marks; for example, for the title "An intelligent control system and method for poultry breeding environment", the segmentation result is: poultry/noun, breeding/verb, environment/noun, intelligence/noun, control/verb, system/noun, method/noun;
in this embodiment, the priorities of different parts of speech are different; if the noun priority is greater than the verb priority, the second weight value of a noun is greater than the second weight value of a verb, and the value range of the second weight value is [0,1];
in this embodiment, if the first weight value of the segmentation word 1 is 0.4 and the second weight value is 0.3, the total weight value of the segmentation word 1 is 0.7;
in this embodiment, if the segmentation result is poultry/noun, breeding/verb, environment/noun, intelligence/noun, control/verb, system/noun, method/noun, and the search keywords are [poultry, breeding, gas, temperature, control, system], the matching words are [poultry, breeding, control, system];
in this embodiment, if the matching words are poultry, breeding, control and system, and the corresponding total weight values are 0.6, 0.7, 0.4 and 0.8 respectively, the acquisition standard of the first keywords is a total weight value greater than 0.4; the words in the segmentation result with a total weight value greater than 0.4 are the first keywords, and all the first keywords are placed in one set, which is the first keyword set corresponding to the title content;
in this embodiment, if the words of one part of speech in text 1, ordered by occurrence frequency from high to low, are poultry, temperature, gas, environment, intelligence, method, and the number of elements in the first keyword set is 4, the second keyword set is {poultry, temperature, gas, environment};
in this embodiment, if the number of elements in the first keyword set in the page is 10, there are 10 second keyword sets in the page, and the number of keywords in each second keyword set is 10;
in this embodiment, if the words in the title matched with the search keywords are environment, intelligence and control, and control has the lowest total weight value, 0.6, the words in the segmentation result with a total weight value higher than 0.6 are the first keywords;
in this embodiment, each second keyword set has a corresponding keyword intersection; for example, if the first keyword set is {poultry, breeding, environment, intelligence, control, system, method} and second keyword set 1 is {poultry, temperature, gas, environment, intelligence, method}, the corresponding keyword intersection is {poultry, environment, intelligence, method};
in this embodiment, if the total weight values of the keywords in keyword intersection 1 are 0.8, 0.8 and 0.6, the total weight value of keyword intersection 1 is 2.2, and if the total weight value of the first keyword set is 10, the first correlation degree between keyword intersection 1 and the title is 0.22;
in this embodiment, calculating the second correlation degree requires the correlation degree between each keyword intersection and the corresponding second keyword set; for example, if the total word frequency of the keyword intersection is 30 and the total word frequency of the corresponding second keyword set is 200, the correlation degree between the intersection and the corresponding second keyword set is 0.15.
The beneficial effects of the technical scheme are as follows: analyzing and screening the relevance of the title and the text content in the page where the third information is located helps filter out irrelevant information and ensures that the user obtains retrieval information relevant to the retrieval content.
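The two correlation degrees of this embodiment can be sketched as follows, under the reading suggested by the worked numbers: the first correlation degree is the total weight of a keyword intersection divided by the total weight of the first keyword set, and the second is the total word frequency of the intersection divided by the total word frequency of the corresponding second keyword set. The function names and the individual weights and frequencies are hypothetical, chosen only so that the totals reproduce 2.2/10 and 30/200.

```python
from typing import Dict, Set


def first_correlation(intersection: Set[str],
                      first_set_weights: Dict[str, float]) -> float:
    """Total weight of the intersection over total weight of the first keyword set."""
    total = sum(first_set_weights.values())
    if total == 0:
        raise ValueError("first keyword set has zero total weight")
    return sum(first_set_weights[word] for word in intersection) / total


def second_correlation(intersection: Set[str],
                       second_set_freq: Dict[str, int]) -> float:
    """Word frequency of the intersection over word frequency of the second keyword set."""
    total = sum(second_set_freq.values())
    if total == 0:
        raise ValueError("second keyword set has zero total word frequency")
    return sum(second_set_freq[word] for word in intersection) / total


# Hypothetical total weights of the first keyword set (they sum to 10).
first_set_weights = {"poultry": 0.8, "environment": 0.8, "intelligence": 0.6,
                     "breeding": 2.0, "control": 2.0, "system": 1.9, "method": 1.9}
# Hypothetical word frequencies of one second keyword set (they sum to 200).
second_set_freq = {"poultry": 12, "temperature": 90, "gas": 80,
                   "environment": 10, "intelligence": 8}

intersection = set(first_set_weights) & set(second_set_freq)
r1 = first_correlation(intersection, first_set_weights)
r2 = second_correlation(intersection, second_set_freq)
print(sorted(intersection), round(r1, 2), round(r2, 2))
```

Here the intersection is {poultry, environment, intelligence}, giving a first correlation degree of about 0.22 and a second correlation degree of 0.15.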
Example 7
The invention provides an information screening method, in which calculating the relevance of the title and the text content in the page corresponding to each third information based on all the first correlation degrees and all the second correlation degrees comprises the following steps:
wherein P_d represents the relevance of the title to the text content in the d-th page; qd_k represents the total weight value corresponding to the k-th keyword intersection in the d-th page; fd represents the total weight value of the first keyword set corresponding to the d-th page; s represents the number of keyword intersections in the d-th page and is consistent with the number of second keyword sets; fd_n represents the total word frequency, in the text content, of all keywords in the n-th second keyword set in the d-th page; dd_n represents the total word frequency, in the text content, of all keywords in the keyword intersection corresponding to the n-th second keyword set in the d-th page; n1 represents the number of second keyword sets in the d-th page and is consistent with the number of keyword intersections; a1 represents a first proportion coefficient, and a2 represents a second proportion coefficient.
In this embodiment, the number of keyword intersections is equal to the number of second keyword sets, i.e., s=n1;
in this embodiment, a1+a2=1, and a1 generally takes a value of 0.5, and a2 generally takes a value of 0.5.
In this embodiment, qd_k/fd represents the corresponding first correlation degree, and dd_n/fd_n represents the corresponding second correlation degree.
The beneficial effects of the technical scheme are as follows: calculating the relevance of the title and the text content corresponding to each third information makes it possible to eliminate false and erroneous information, so that the user obtains relevant information of high quality and high credibility.
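One way to combine the correlation degrees into a page relevance P_d is sketched below. The exact formula is not reproduced in this text, so the aggregation is an assumption: the first correlation degrees and the second correlation degrees are each averaged over the s = n1 keyword intersections and then combined with the coefficients a1 and a2. The original formula may aggregate differently (for example by summing), and the function name is hypothetical.

```python
from typing import Sequence


def page_relevance(first_correlations: Sequence[float],
                   second_correlations: Sequence[float],
                   a1: float = 0.5, a2: float = 0.5) -> float:
    """Combine per-intersection correlation degrees into one page relevance.

    Assumption: each list is averaged, then weighted by a1 and a2 (a1 + a2 = 1).
    """
    if abs(a1 + a2 - 1.0) > 1e-9:
        raise ValueError("a1 + a2 must equal 1")
    if len(first_correlations) != len(second_correlations):
        raise ValueError("s and n1 must be equal")
    count = len(first_correlations)
    if count == 0:
        return 0.0
    mean_first = sum(first_correlations) / count
    mean_second = sum(second_correlations) / count
    return a1 * mean_first + a2 * mean_second


# One keyword intersection with first correlation 0.22 and second correlation 0.15.
relevance = page_relevance([0.22], [0.15])
print(round(relevance, 3))
```

With a single intersection and a1 = a2 = 0.5, this gives 0.5 × 0.22 + 0.5 × 0.15 = 0.185.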
Example 8
The invention provides an information screening method, in which ranking the screened fourth information by relevance to obtain the final screening result comprises the following steps:
acquiring the correlation degree of the title content and the text content in each piece of fourth information, and comparing the correlation degree of the title content and the text content corresponding to each piece of fourth information;
based on the comparison result, sorting is carried out according to the rule from big to small, and a final sorting result is obtained and displayed as a final screening result.
In this embodiment, if the correlation between the content of the title and the content of the text in the fourth information 1 is 0.6 and the correlation between the content of the title and the content of the text in the fourth information 2 is 0.7, the result of the ranking is that: fourth information 2, fourth information 1.
The beneficial effects of the technical scheme are as follows: sorting the fourth information yields the final screening result, so that the user can quickly obtain the high-quality, high-credibility information most relevant to the retrieval content.
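The ranking step can be sketched in a few lines. The function name and the dictionary layout (item name mapped to title and text relevance) are hypothetical; the values are those of the example in this embodiment.

```python
from typing import Dict, List


def rank_fourth_information(relevance: Dict[str, float]) -> List[str]:
    """Return item names sorted by relevance from largest to smallest."""
    return sorted(relevance, key=lambda name: relevance[name], reverse=True)


relevance = {"fourth information 1": 0.6, "fourth information 2": 0.7}
ranking = rank_fourth_information(relevance)
print(ranking)  # ['fourth information 2', 'fourth information 1']
```

Python's sort is stable, so items with equal relevance keep their original relative order; the text does not say how ties should be handled.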
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (8)

1. A method for screening information, comprising:
step 1: acquiring and analyzing search content input by a user to obtain search keywords;
step 2: acquiring first information related to the search keywords from an information data platform, analyzing the timeliness of each first information, and screening to obtain second information;
step 3: weighting calculation is carried out on the reliability of each second information based on the source information corresponding to each second information, and third information meeting the first preset condition is screened;
step 4: calculating the relevance between the content and the title in each third information based on the page layout condition of each third information, and screening fourth information meeting a second preset condition;
step 5: ranking the screened fourth information by relevance to serve as a final screening result.
2. The method of claim 1, wherein obtaining and analyzing the search content input by the user to obtain the search keyword comprises:
acquiring search content input by a user, and marking the part of speech of the search content;
based on the part-of-speech tagging result, segmenting the search content by using a combination of related words, stop words, numerals and measure words, and punctuation marks, and generating the search keywords.
3. The method of claim 1, wherein obtaining first information associated with the search keyword from the information data platform, analyzing the timeliness of each first information, and screening to obtain second information, comprises:
acquiring first information associated with the search keywords from an information data platform, wherein the information in the information data platform comprises information of a non-fixed type and information of a fixed type;
judging the first information type, and if the first information is a fixed type, regarding the first information as second information, wherein the fixed type comprises: fixed information category, long-term unchanged information category, updated or explicitly pointed information category;
if the first information is of a non-fixed type, acquiring the release time of each first information, classifying according to year, month and day rules, and determining the first quantity released in a time interval and a release information heat list, wherein the non-fixed type comprises: news event category, periodic update information category, and continuous update information category;
based on the determination result, judging the timeliness of each first information, and screening and obtaining second information based on the judgment result.
4. The method of claim 1, wherein weighting the reliability of each second information based on the source information corresponding to each second information, and screening the third information satisfying the first preset condition, comprises:
tracing source information corresponding to each piece of second information, and obtaining a website domain name, a network security protection grade, a website collapse frequency and website backup information to carry out first evaluation;
acquiring user access information of a website corresponding to the source information, and performing second evaluation;
performing third evaluation according to the historical advertisement receiving quantity and the effective link quantity of the website corresponding to the source information;
weighting calculation is carried out on the reliability of the corresponding second information based on the first evaluation result, the second evaluation result and the third evaluation result;
and screening the second information based on the calculation result and the first preset condition to obtain third information.
5. The method of claim 4, wherein weighting the reliability of the respective second information based on the first, second, and third evaluation results comprises:
y_i = αP1 + βP2 + γP3
wherein α represents the weight corresponding to the first evaluation result; P1 represents the first evaluation result of the i-th second information; y_i represents the website domain name score of the website corresponding to the i-th second information; y_max represents the maximum website domain name score; p_i represents the network security protection level of the website corresponding to the i-th second information; p_max represents the highest network security protection level; e_i represents the website collapse frequency of the website corresponding to the i-th second information; e_max represents the maximum allowable website collapse frequency; z_i represents the website backup information score of the website corresponding to the i-th second information; z_max represents the maximum website backup information score; β represents the weight corresponding to the second evaluation result; P2 represents the second evaluation result of the i-th second information; x_i represents the average daily user access amount of the website corresponding to the i-th second information; t_i represents the average access time of the website corresponding to the i-th second information; x_max represents the maximum average daily user access amount among the websites corresponding to all second information; t_max represents the maximum average access time among the websites corresponding to all second information; γ represents the weight corresponding to the third evaluation result; P3 represents the third evaluation result of the i-th second information; C_max represents the maximum historical advertisement accepting quantity among the websites corresponding to all second information; C_i represents the historical advertisement accepting quantity of the website corresponding to the i-th second information; l_max represents the maximum number of links set in the websites corresponding to the second information; l_i represents the actual number of links in the website corresponding to the i-th second information; C_ave represents the average historical advertisement accepting quantity of the websites corresponding to all second information; and α+β+γ=1.
6. The method of claim 1, wherein calculating a relevance between the content and the title in each third information based on the page layout condition of each third information, and screening fourth information satisfying the second preset condition, comprises:
acquiring title content and text content in a page corresponding to each third information based on the page layout condition of each third information, and marking the parts of speech;
analyzing the title structure, and determining first weight values of words at different positions in the title;
segmenting the title content based on the part-of-speech tagging result, and determining a second weight value of each word in the segmentation result based on the segmentation result and a preset part-of-speech priority order;
determining the total weight value of each word in the segmentation result based on the first weight value and the second weight value;
based on the search keywords and the segmentation results, determining matching words which are associated with the search keywords in the segmentation results, acquiring the total weight value of each matching word, determining the lowest total weight value in the matching words, and taking the lowest total weight value as a title keyword acquisition standard;
determining a first keyword set corresponding to the title content based on an acquisition standard;
setting a corresponding number of second keyword sets to the text content based on the number of elements in the first keyword sets;
based on the part-of-speech tagging result, determining the occurrence frequency of words of the same part of speech in the text content, extracting, for each part of speech, the n1 words with the highest occurrence frequency, and filling them into the corresponding second keyword set, wherein n1 is also the number of elements in the first keyword set, and each second keyword set corresponds to the words of one part of speech;
acquiring keyword intersections of the first keyword set and each filled second keyword set, and determining a first correlation degree between the corresponding keyword intersection and the first keyword set based on a total weight value of the keyword intersections;
determining a second relevance between the corresponding keyword intersection and the text content based on the word frequency of the second keyword set after the corresponding filling of each keyword in the corresponding keyword intersection;
and calculating the relevance of the title and the text content in each third information corresponding page based on all the first relevance and all the second relevance.
7. The method of claim 6, wherein calculating the relevance of the title to the text content in each third information corresponding page based on all the first relevance and all the second relevance comprises:
wherein P_d represents the relevance of the title to the text content in the d-th page; qd_k represents the total weight value corresponding to the k-th keyword intersection in the d-th page; fd represents the total weight value of the first keyword set corresponding to the d-th page; s represents the number of keyword intersections in the d-th page and is consistent with the number of second keyword sets; fd_n represents the total word frequency, in the text content, of all keywords in the n-th second keyword set in the d-th page; dd_n represents the total word frequency, in the text content, of all keywords in the keyword intersection corresponding to the n-th second keyword set in the d-th page; n1 represents the number of second keyword sets in the d-th page and is consistent with the number of keyword intersections; a1 represents a first proportion coefficient, and a2 represents a second proportion coefficient.
8. The method of claim 1, wherein the sorting of the correlations of the screened fourth information as a final screening result comprises:
acquiring the correlation degree of the title content and the text content in each piece of fourth information, and comparing the correlation degree of the title content and the text content corresponding to each piece of fourth information;
based on the comparison result, sorting is carried out according to the rule from big to small, and a final sorting result is obtained and displayed as a final screening result.
CN202310781347.3A 2023-06-29 2023-06-29 Information screening method Active CN116775974B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310781347.3A CN116775974B (en) 2023-06-29 2023-06-29 Information screening method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310781347.3A CN116775974B (en) 2023-06-29 2023-06-29 Information screening method

Publications (2)

Publication Number Publication Date
CN116775974A true CN116775974A (en) 2023-09-19
CN116775974B CN116775974B (en) 2024-02-23

Family

ID=88006093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310781347.3A Active CN116775974B (en) 2023-06-29 2023-06-29 Information screening method

Country Status (1)

Country Link
CN (1) CN116775974B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050114324A1 (en) * 2003-09-14 2005-05-26 Yaron Mayer System and method for improved searching on the internet or similar networks and especially improved MetaNews and/or improved automatically generated newspapers
CN101055587A (en) * 2007-05-25 2007-10-17 清华大学 Search engine retrieving result reordering method based on user behavior information
CN101477556A (en) * 2009-01-22 2009-07-08 苏州智讯科技有限公司 Method for discovering hot sport in internet mass information
CN104391955A (en) * 2014-11-27 2015-03-04 北京国双科技有限公司 Web page correlation detection method and device
CN105468652A (en) * 2014-09-12 2016-04-06 北大方正集团有限公司 Retrieval sorting method and system
CN105574176A (en) * 2015-12-21 2016-05-11 北京奇虎科技有限公司 Hot word recommending method and device with combination of multiple data sources
CN105631007A (en) * 2015-12-29 2016-06-01 云南电网有限责任公司电力科学研究院 Industry technical information collecting method and system
US20170049386A1 (en) * 2015-08-21 2017-02-23 Medtronic Minimed, Inc. Personalized event detection methods and related devices and systems
CN107016135A (en) * 2017-06-09 2017-08-04 海南大学 It is a kind of towards non-determined, infidelity, onlap the positive and negative two-way dynamic equilibrium search strategy of miscellaneous resource environment
CN107562722A (en) * 2017-08-14 2018-01-09 上海文军信息技术有限公司 Internet public feelings monitoring analysis system based on big data
CN111260197A (en) * 2020-01-10 2020-06-09 光明网传媒有限公司 Network article evaluation method, system, computer equipment and readable storage medium
CN112084452A (en) * 2020-09-22 2020-12-15 扆亮海 Webpage time efficiency obtaining method for temporal consistency constraint judgment
CN113065070A (en) * 2021-04-23 2021-07-02 武汉瑞通慧行电子商务有限公司 Intelligent sorting method, system, equipment and computer storage medium for mobile internet information search and retrieval
CN113468868A (en) * 2021-07-07 2021-10-01 西北大学 NLP-based real-time network hotspot content analysis method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117390144A (en) * 2023-12-13 2024-01-12 北京搜狐新媒体信息技术有限公司 A method and device for determining news timeliness
CN117390144B (en) * 2023-12-13 2024-03-08 北京搜狐新媒体信息技术有限公司 News timeliness determining method and device
CN119782585A (en) * 2025-03-13 2025-04-08 北京八月瓜科技有限公司 An information management method based on intelligent classification and efficient retrieval
CN119782585B (en) * 2025-03-13 2025-07-15 北京八月瓜科技有限公司 An information management method based on intelligent classification and efficient retrieval

Also Published As

Publication number Publication date
CN116775974B (en) 2024-02-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant