CN101984435B - Method and device for distributing texts - Google Patents
- Publication number
- CN101984435B (application CN201010549183A)
- Authority
- CN
- China
- Prior art keywords
- column
- text
- distributed
- columns
- texts
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method and device for distributing texts, applied to a column framework comprising at least two levels of columns, wherein the method comprises the following steps: A, for each captured text, respectively perform the following distribution step: match the keywords of the current text to be distributed against the center vector of each column for similarity, and according to the matching result, distribute the current text to the columns satisfying the distribution matching policy, wherein the center vector of a column is generated from seed words set for the column in advance; and B, according to the hierarchical relationship of the columns, distribute all or part of the texts under the set columns to the upper-level parent column or the lower-level child columns. The method and device of the invention can reduce the workload and cost of text distribution, shorten the text distribution time, and make it convenient to flexibly add and remove columns.
Description
[ technical field ]
The invention relates to the technical field of internet, in particular to a method and a device for distributing texts.
[ background of the invention ]
With the global popularization of the internet and the continuous development of internet applications, text information on web pages has grown explosively. How to fully and effectively use this text information, and how to organize it effectively and present it to users, has gradually become an important research direction in the field of data mining, with high industrial value. Currently, text classification has been applied in many fields, for example: recalling news pages for various columns, distributing emails, generating user interest profiles, and so on.
Text classification distributes a large number of texts under different columns, where the columns may belong to different categories or to different subclasses under the same category. The existing text distribution method is based on training samples: a manually classified and processed document set is prepared, and text distribution is realized by training on these samples. However, this training-sample-based approach has the following drawbacks:
First, establishing the training samples requires stages such as corpus collection and training-model building, which demand a large amount of work; in particular, corpus collection requires extensive manual labeling by domain experts, making the workload and cost of text distribution excessive.
Second, the training time is too long; building the training samples typically stretches the distribution cycle to weeks.
In addition, since the training samples correspond to the column architecture, once the column architecture changes, the training samples must be re-established. Because training samples are difficult and time-consuming to obtain, this further raises the cost of text distribution and lengthens the distribution time, so columns cannot be added or removed flexibly.
[ summary of the invention ]
The invention provides a method and a device for distributing texts, which can reduce the cost of text distribution, shorten the distribution time, and make it convenient to flexibly add and remove columns.
The specific technical scheme is as follows:
a method of distributing text for use in a column framework comprising at least two levels of columns, the method comprising:
A. for each captured text, respectively perform the following distribution step:
a distribution step: similarity matching is carried out on the keywords of the current text to be distributed and the central vectors of all columns, and according to the matching result, the current text to be distributed is distributed to the columns meeting the distribution matching strategy; the central vector of the column is generated based on a seed word preset for the column;
B. and distributing all or part of the text under the set column to the parent column at the upper stage or the child column at the lower stage according to the hierarchical relationship among the columns.
Wherein the distribution matching policy of a column at least comprises: the similarity between the keywords of the text to be distributed and the central vector of the column exceeds a similarity threshold value set for the column; or,
and subtracting the similarity between the keyword of the text to be distributed and the central vector of the column by the similarity between the keyword of the text to be distributed and the reverse vector of the same column to obtain a result which exceeds a similarity threshold value set for the column, wherein the reverse vector of the column is generated on the basis of reverse words set for the column in advance.
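As an illustrative sketch (not the patent's implementation), the center-vector and reverse-vector matching policy above can be expressed with sparse keyword vectors and cosine similarity; the function names, vector contents, and the 0.3 threshold are all assumptions:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def matches_column(keywords, center, reverse=None, threshold=0.3):
    """Distribution matching policy: the similarity to the center vector, minus
    (if a reverse vector is configured) the similarity to the reverse vector,
    must exceed the column's threshold."""
    score = cosine(keywords, center)
    if reverse is not None:
        score -= cosine(keywords, reverse)
    return score > threshold

# hypothetical column and text
center = Counter({"stock": 1.0, "market": 0.8})   # built from seed words
reverse = Counter({"recipe": 1.0})                # built from reverse words
kw = Counter({"stock": 0.9, "market": 0.5})       # keywords of the text to distribute
```

A text about "recipe" would score negatively against this column and be rejected, which is the noise-filtering role the reverse vector plays.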
Preferably, the step B specifically includes one or any combination of the following manners:
all the columns of the texts distributed according to the mode of the step A are child columns, and all the texts under the child columns of the texts distributed according to the mode of the step A or the texts sequenced in the first N1 are summarized to a superior parent column, wherein N1 is a preset positive integer; or,
b, the columns of the texts to be distributed in the mode of the step A are all father columns, and all texts under the father columns of the texts to be distributed in the mode of the step A are distributed to next-level child columns; or,
the column of the text distributed according to the step A comprises a parent column and a child column, and part of the text under the parent column of the text distributed according to the step A is distributed to the next-level child column of the text which is not distributed.
Still further, the columns may include: normal columns whose texts are displayed, and hidden columns whose texts are not displayed.
Preferably, the method further comprises: and extracting keywords of the distributed text from the column with the seed words, and combining the extracted keywords with the seed words of the column to form a new central vector of the column.
Further, after the step B, the following steps are respectively performed for each column:
c1, clustering the texts under the column to form more than one cluster under the column;
and C2, according to a preset top selection strategy, respectively selecting top texts in each cluster as the representation of each cluster.
After the step C2, the method further includes:
calculating the weight of each text under the column according to the text attribute, determining the weight of the cluster by using the weight of each text in the cluster, and sequencing each cluster under the column according to the weight of the cluster; or,
and respectively selecting the focus text from the texts under each column according to a preset focus text selection strategy and displaying the focus text under each column.
Wherein, the head-strip selecting strategy comprises one or any combination of the following strategies: selecting a text with the text release time within a set range, selecting a text with a title meeting set requirements, selecting a text with the cluster center vector similarity within a set range, and selecting a text with the text quality meeting preset requirements.
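The headline selection strategy above combines several filters; a minimal sketch, assuming dictionary fields (`published`, `title`, `sim_to_center`, `quality`) and threshold values that are not specified in the patent:

```python
from datetime import datetime, timedelta

def pick_headline(cluster_texts, now, max_age_hours=24, min_sim=0.4, min_quality=0.5):
    """Keep only texts satisfying the configured criteria, then pick the one
    closest to the cluster center. Thresholds and field names are assumptions."""
    candidates = [
        t for t in cluster_texts
        if now - t["published"] <= timedelta(hours=max_age_hours)  # release time in range
        and 8 <= len(t["title"]) <= 60                             # title meets requirements
        and t["sim_to_center"] >= min_sim                          # close to cluster center
        and t["quality"] >= min_quality                            # quality requirement
    ]
    return max(candidates, key=lambda t: t["sim_to_center"]) if candidates else None
```

Returning `None` when no candidate survives lets the caller fall back to a weaker strategy, one plausible way to combine the criteria "in any combination" as the text allows.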
Specifically, the weight W_page of each text is calculated by the formula:
W_page = α / (Δt + α) × δ(site) × φ(segcount);
wherein α is a preset inverse-proportional time-attenuation factor, Δt is the time elapsed since the text was released, δ(site) is the calculation function of the text quality factor, and φ(segcount) is the calculation function of the transfer rate factor.
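The weight formula can be computed directly; the factor values passed in below are illustrative, not from the patent:

```python
def page_weight(delta_t, alpha, site_factor, transfer_factor):
    """W_page = alpha / (delta_t + alpha) * delta(site) * phi(segcount):
    the time term decays toward 0 as delta_t grows, while site quality
    and transfer rate scale the weight multiplicatively."""
    return alpha / (delta_t + alpha) * site_factor * transfer_factor
```

A fresh text (Δt = 0) keeps the full δ(site) × φ(segcount) weight; when Δt equals α, the time term has decayed to one half.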
An apparatus for distributing text for use in a column framework comprising at least two levels of columns, the apparatus comprising: the system comprises a text acquisition unit, a first distribution unit and a second distribution unit;
the text acquisition unit is used for respectively taking the grabbed texts as texts to be distributed and sending the texts to the first distribution unit;
the first distribution unit is used for matching the similarity of the keywords of the current text to be distributed and the central vectors of all columns, and distributing the current text to be distributed to the columns meeting the distribution matching strategy according to the matching result; the central vector of the column is generated based on a seed word preset for the column;
and the second distribution unit is used for distributing, after the first distribution unit finishes distributing all the texts to be distributed, all or part of the texts under the set columns to the upper-level parent column or the lower-level child column according to the hierarchical relationship among the columns.
Wherein the distribution matching policy of a column at least comprises: the similarity between the keywords of the text to be distributed and the central vector of the column exceeds a similarity threshold value set for the column; or,
and subtracting the similarity between the keyword of the text to be distributed and the central vector of the column by the similarity between the keyword of the text to be distributed and the reverse vector of the same column to obtain a result which exceeds a similarity threshold value set for the column, wherein the reverse vector of the column is generated on the basis of reverse words set for the column in advance.
The columns distributed by the first distribution unit are all sub-columns, at this time, the second distribution unit summarizes all texts or texts sequenced in the top N1 under each sub-column distributed by the first distribution unit to the upper-level parent column, wherein N1 is a preset positive integer; or,
the columns distributed by the first distribution unit are all parent columns; in this case, the second distribution unit distributes all texts under each parent column distributed by the first distribution unit to the next-level child columns; or,
the columns distributed by the first distribution unit comprise a parent column and a child column, and at this time, the second distribution unit distributes part of texts under the parent column distributed by the first distribution unit to the next-level child column without distributed texts.
Specifically, the columns include: normal columns whose texts are displayed, and hidden columns whose texts are not displayed.
Preferably, the apparatus further comprises: and the keyword extraction unit is used for extracting keywords of the distributed text from the column provided with the seed words, combining the extracted keywords with the seed words of the column to form a new center vector of the column and providing the new center vector to the first distribution unit.
Still further, the apparatus further comprises: the system comprises a text clustering unit and a headline selecting unit;
the text clustering unit is used for clustering the texts under the columns according to the distribution results of the first distribution unit and the second distribution unit to form more than one cluster under each column;
and the top selection unit is used for respectively selecting the top texts in each cluster as the representation of each cluster according to a preset top selection strategy.
Preferably, the apparatus further comprises: one or all of the cluster sorting unit or the focus selecting unit;
the cluster sorting unit is used for calculating the weight of each text under the column according to the text attribute, determining the weight of the cluster by using the weight of each text in the cluster, and sorting each cluster under the column according to the weight of the cluster;
and the focus selecting unit is used for respectively selecting focus texts from the texts under each column according to the distribution results of the first distributing unit and the second distributing unit and a preset focus text selecting strategy and displaying the focus texts under each column.
Wherein, the head-strip selecting strategy comprises one or any combination of the following strategies: selecting a text with the text release time within a set range, selecting a text with a title meeting set requirements, selecting a text with the cluster center vector similarity within a set range, and selecting a text with the text quality meeting preset requirements.
Specifically, the weight W_page of each text is calculated by the formula:
W_page = α / (Δt + α) × δ(site) × φ(segcount);
wherein α is a preset inverse-proportional time-attenuation factor, Δt is the time elapsed since the text was released, δ(site) is the calculation function of the text quality factor, and φ(segcount) is the calculation function of the transfer rate factor.
According to the technical scheme above, the invention distributes texts to columns using center vectors generated from column seed words, combined with text distribution between column levels, so that the text distribution time is controlled to the order of seconds and text classification efficiency is greatly improved. In addition, the invention avoids the complex process of building training samples: when the column architecture changes, it is only necessary to set suitable seed words and inter-level distribution rules for added columns, or to modify the inter-level distribution rules for deleted columns.
[ description of the drawings ]
FIG. 1 is a flow chart of the main method provided by the present invention;
fig. 2 is a flow chart of news distribution of each column according to an embodiment of the present invention;
fig. 3a is a first news page distribution manner provided in the first embodiment of the present invention;
fig. 3b is a second news page distribution manner provided in the first embodiment of the present invention;
fig. 3c is a third news page distribution manner provided in the first embodiment of the present invention;
fig. 4 is a schematic diagram of a distribution method using a mixed news page according to an embodiment of the present invention;
fig. 5 is a flowchart of forming a news cluster according to the second embodiment of the present invention;
fig. 6 is a schematic structural diagram of the device provided by the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flow chart of a main method provided by the present invention, as shown in fig. 1, the method may mainly include the following steps:
step 101: and respectively executing the following distribution steps aiming at the captured texts:
a distribution step: similarity matching is carried out on the keywords of the current text to be distributed and the central vectors of all columns, and according to the matching result, the current text to be distributed is distributed to the columns meeting the distribution matching strategy; the central vector of the column is generated based on the seed words preset for the column.
In this step, the distribution matching policy of the column can be flexibly set, and at least includes: the similarity between the keywords of the text to be distributed and the central vector of the column exceeds the similarity threshold set for the column. In addition, the distribution matching policy of the column may further include, but is not limited to, one or any combination of the following policies: the similarity between the keywords of the text to be distributed and the central vector of the column is highest, or the site source of the text to be distributed meets the site requirement of the column, or the author of the text to be distributed meets the author requirement of the column, or the text to be distributed meets the requirement of the column for pictures or videos, or the title regular expression of the text to be distributed meets the requirement of the column for the title regular expression, or the Uniform Resource Locator (URL) type of the text to be distributed meets the URL type requirement of the column.
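The optional matching policies listed above (site source, author, picture or video requirement, title regular expression, URL type) amount to a conjunction of per-column predicates. A hedged sketch, with dictionary keys that are illustrative rather than from the patent:

```python
import re

def satisfies_policies(text, column):
    """All policies configured on the column must hold; policies the column
    does not configure are skipped. Keys are illustrative assumptions."""
    checks = {
        "sites":       lambda: text["site"] in column["sites"],
        "authors":     lambda: text["author"] in column["authors"],
        "needs_image": lambda: text["has_image"],
        "title_regex": lambda: re.search(column["title_regex"], text["title"]) is not None,
        "url_prefix":  lambda: text["url"].startswith(column["url_prefix"]),
    }
    return all(check() for key, check in checks.items() if key in column)
```

Because unset policies are simply skipped, the same function covers "one or any combination" of the policies, as the text requires.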
Step 102: distribute all or part of the texts under the set columns to the upper-level parent column or the next-level child columns according to the hierarchical relationship among the columns, so as to complete the text distribution for the columns in the column framework.
In the column framework, it may be preset that after certain columns have been distributed texts in the manner of step 101 or in other existing manners, the texts under those columns are further distributed to the upper-level parent column or the next-level child columns. Through this step, texts can also be distributed to columns for which no seed words are set; this is described in detail in Embodiment 1.
The above method provided by the present invention is described below by using specific embodiments, and the following embodiments all use the example of distributing the text of the news page. First, a news page distribution flow of each column will be described in detail by using an embodiment.
Embodiment 1
Fig. 2 is a news distribution flowchart of each column according to an embodiment of the present invention, and as shown in fig. 2, the method may specifically include the following steps:
step 201: seed words are set for columns in a column framework in advance, and a central vector is formed for the columns with the seed words.
In the column structure, the seed words are usually set manually, and the column in which the seed words are set may be a root column or a sub-column. One or more seed words can be set for one column to form a group of seed words.
Because the manually set seed words are limited and cannot exhaust all possible keywords of a column, a center vector built only from user-set seed words may fail to recall some news pages (here, "recall" means distributing a news page under a column). To compensate, keywords extracted from already-recalled news pages can be merged back into the center vector over several rounds; the number of rounds may be set empirically, for example to 3, corresponding to step 206 described below.
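The cyclic expansion of the center vector can be sketched with plain keyword sets; treating the center vector as a set rather than a weighted vector, and counting overlap for recall, are simplifying assumptions:

```python
def expand_center_vector(seed_words, corpus_keywords, rounds=3, min_overlap=1):
    """Each round, recall texts whose keyword sets overlap the current center
    vector, then merge their keywords back in. The 3-round default follows
    the empirical setting mentioned above."""
    center = set(seed_words)
    for _ in range(rounds):
        recalled = [kw for kw in corpus_keywords if len(kw & center) >= min_overlap]
        for kw in recalled:
            center |= kw
    return center
```

Note how a text sharing no keyword with the seed words in round 1 can still be recalled in round 2 once the vector has grown, which is exactly the effect the iteration is after.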
Step 202: step 203 to step 204 are executed one by one for each captured news page.
After the search engine captures a batch of news pages, the captured news pages can be distributed one by one.
Preferably, after the news pages are captured, they can first undergo feature selection, de-duplication and other processing, so that useless or duplicate news pages are filtered out first and the news recall efficiency is improved.
Step 203: extracting keywords of the current news page to be distributed, and performing similarity matching on the extracted keywords and the central vectors of the columns to be matched.
Step 204: and distributing the current news page to be distributed to the columns with the highest similarity and exceeding the column similarity threshold according to the matching result.
In this embodiment, the distribution matching policy takes the case that the similarity is the highest and exceeds the column similarity threshold, and any other policy described in step 101 may also be adopted, which is not repeated herein.
In addition, because the granularity of seed words is usually coarse, noise is often introduced when news pages are recalled under a column. Therefore, reverse words can further be set for each column and a reverse vector formed from them. During similarity matching, the similarity between the keywords of the news page to be distributed and the column's reverse vector is subtracted from the similarity between those keywords and the column's center vector, and it is then determined whether the result satisfies the distribution matching policy, that is, at least: whether the result exceeds the similarity threshold set for the column.
In the column framework, the news distribution mode of each column may be configured in the column attributes. Specifically, a column may be configured to obtain news pages via the seed-word center-vector mode (in which case its candidate set may be the global news page resources captured by a web crawler), to obtain news pages from its parent column or child columns (in which case its candidate set is the set of news pages already obtained by the parent or child column), or to obtain news pages in other ways. For example, a column configured with seed words can recall news pages in the manner of steps 203 to 204, while a column without seed words can obtain news pages from other columns. The ways of obtaining news pages from a parent or child column are as follows.
Step 205: and distributing all or part of news under the columns to the previous parent columns or the next child columns according to the hierarchical relationship among the columns.
Generally, there is a certain hierarchical relationship between the columns, and the following three ways of recalling news pages can be adopted here:
the first mode is as follows: and (4) recalling news pages of all the sub-columns through the modes from step 203 to step 204 by all the sub-columns, and then summarizing and distributing the news pages under all the sub-columns to the parent columns at the upper level. As shown in fig. 3a, the shaded nodes in fig. 3a indicate columns with seed words set, and the arrows point to the distribution direction of the news page. The method is generally suitable for the conditions that the differences of all sub-columns are large and the mutual overlap ratio of the seed words among the columns is not high. For example, the parent column is "entertainment", the child columns are respectively "domestic entertainment", "harbor and australian entertainment", "japanese korean entertainment", and "europe and america entertainment", etc., the seed words of the child columns are set as the artist names of the corresponding regions, and since the seed words between the child columns have a low degree of overlap with each other, the child columns recall the news pages in the manner of steps 203 to 204, and then summarize to the parent column "entertainment".
All news pages under each child column can be aggregated and distributed to the upper-level parent column, or only the several top-ranked news pages of each child column can be aggregated to the parent column. The news pages within a child column can be ranked by the similarity between their keywords and the column's center vector, or by the weight and relevance of their news cluster; the specific ranking criterion can be set flexibly. The formation of news clusters under a column is described in Embodiment 2.
For example, the total news page amount distributed to the parent column may be set to be N, the number of the child columns may be m, and the number of the news pages distributed to the parent column by each child column may be set not to exceed 2 × N/m.
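The 2×N/m per-child quota in this example can be sketched as follows; each child list is assumed to be ranked already, and the tie-breaking behavior is an assumption:

```python
def quota_merge(child_page_lists, total_n):
    """Merge already-ranked child-column page lists into the parent column,
    capping each child's contribution at 2*N/m pages as in the example."""
    m = len(child_page_lists)
    cap = (2 * total_n) // m
    merged = []
    for pages in child_page_lists:
        merged.extend(pages[:cap])   # each child contributes at most 2*N/m
    return merged[:total_n]          # parent keeps at most N pages overall
```

The 2× slack lets a strong child column contribute more than its proportional N/m share when other children recall few pages.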
The second mode is as follows: the parent column realizes the recall of the news page of the parent column through the modes from step 203 to step 204, and then the news page under the parent column is distributed to the next-level child column. As shown in fig. 3b, the shaded nodes in fig. 3b indicate that the father node recalls the news page in a similarity matching manner based on a central vector formed by the seed words, and the arrow points to the distribution direction of the news page. This approach is generally suitable for the case where the difference between the sub-fields is small and the overlap ratio between the seed words in the fields is high. For example, the parent node is an "electronic product", the child columns are a "new product" and a "product guide", and because the difference between the "new product" and the "product guide" is relatively small, the mutual overlapping degree of the seed words between the columns is relatively high, for example, seed words such as "new money" and "electronic" may exist, a mode of configuring the seed words on the parent column and then distributing the seed words to the next level of child columns may be adopted.
The next-level sub-column may also recall a part of the news pages from the news pages distributed from the parent column according to the similarity matching method based on the central vector formed by the seed words shown in steps 203 to 204, and at this time, other matching methods may also be adopted, for example, matching may be performed according to the site source, the author, the picture or video requirement, or the URL type of the news page.
If a news page delivered by the parent column does not belong to any existing child column, it can be distributed to a separate child column; if there are m child columns, m + 1 child columns are then formed, and if the number of news pages distributed by the parent column is N, the number of news pages entering each child column can be limited to at most 2×N/(m+1).
The third mode is as follows: the parent column and some of the child columns recall their own news pages in the manner of steps 203 to 204, while the remaining child columns obtain matching news pages from the parent column. As shown in fig. 3c, the shaded nodes represent columns with seed words set, and the arrows point in the distribution direction of the news pages. This approach is generally suitable for cases where some child columns under the parent column are easy to delimit while others are not. For example, the parent column is "social" and the child columns are "social and legal" and "social everything"; since "social and legal" is highly distinctive while "social everything" is not, seed words can be configured for the parent column "social" and the child column "social and legal", which recall news pages in the center-vector manner, while the child column "social everything" acquires part of its news pages from the parent column.
It should be noted that, since the column framework may include multiple levels of columns, more than one of the above distribution modes may be used in combination, and a distribution mode may even be combined with an existing recall manner within one column framework. As an example, as shown in fig. 4, the arrows point in the direction of news distribution, the dashed boxes are hidden columns (hidden columns are described below), and the solid boxes are normal (non-hidden) columns. In this example, the first-level columns 2, 3, 4 and 5 and the second-level columns a, b and e are all configured with seed words and recall news pages in the center-vector manner.
The column a and the column b converge the distributed news page to a parent column at the upper level, namely, the column 1, corresponding to the first mode; column 2 further distributes the distributed news page to the next-level sub-columns, namely column c and column d, corresponding to the second mode; column 3 distributes part of the distributed news pages to other next-level sub-columns except for column e, namely column f and column g, corresponding to the third mode.
Hidden columns can also be set in the column framework. For example, if news pages about Hong Kong stocks and US stocks are unwanted, a hidden "Hong Kong and US stocks" column can be configured so that the related news pages are recalled into it and thereby filtered out without being displayed. For another example, a hidden column for pornographic or politically sensitive content can be set in the column structure, so that such news pages recalled from the captured news are hidden and not displayed. Hidden columns likewise recall news pages in the seed-word center-vector manner of steps 203 to 204. Similarly, for a hidden column, keywords can be extracted from the recalled news pages to expand its seed words, which achieves a better filtering effect than configuring reverse words.
Step 206: extract keywords from the news pages under each column, combine the extracted keywords with the column's seed words to form a new central vector, and use the new central vector when distributing the next round of captured news pages.
When extracting keywords, they can be extracted from a news page according to word frequency, word sense weight, part-of-speech weight, and the like; the specific keyword extraction techniques belong to the prior art and are not described in detail herein.
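Steps 205 to 206 can be sketched as follows: extract keywords by term frequency and merge them with the seed words into a new center vector. The whitespace tokenization and the 0.5 weight for extracted keywords are simplifying assumptions; the patent does not fix a weighting scheme:

```python
from collections import Counter

def extract_keywords(pages, top_k=5):
    # Word-frequency extraction; a real system would also weight by word
    # sense and part of speech, as the text notes.
    counts = Counter(word for page in pages for word in page.lower().split())
    return [word for word, _ in counts.most_common(top_k)]

def updated_center_vector(seed_words, pages, top_k=5):
    # Seed words keep full weight; newly extracted keywords are added with a
    # smaller weight (0.5 is an arbitrary illustrative choice).
    vector = dict.fromkeys(seed_words, 1.0)
    for word in extract_keywords(pages, top_k):
        vector.setdefault(word, 0.5)
    return vector

vec = updated_center_vector(["stock", "market"],
                            ["stock prices rise", "market prices fall"])
```

The resulting vector contains the original seeds at full weight plus the most frequent page terms, which is what lets the next distribution round match pages the seed words alone would miss.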
As can be seen from the above flow, for each column node in the column framework, the following can be configured: the distribution matching policy of the column, the node structure of the column (i.e., information on its upper-level parent node and lower-level child nodes), the display attribute (whether the column is hidden), and the like.
This concludes the flow of the first embodiment. Since a large number of news pages are recalled under each column and not all of them can be displayed there, focus news needs to be selected for display; this process is described in detail in the second embodiment.
Embodiment II
Fig. 5 is a flowchart of forming news clusters according to the second embodiment of the present invention. As shown in fig. 5, the following steps are performed for the news pages under each column:
step 501: and clustering the news pages under the columns to form more than one news cluster.
Because a large number of news pages are recalled under each column, the column itself is too coarse a classification granularity for news pages; the news pages under each column can therefore be divided into a plurality of news clusters by clustering, so that news pages within the same cluster have high similarity.
The embodiment of the invention can adopt, but is not limited to, hierarchical clustering, agglomerative clustering, partitioning clustering, density-based clustering, grid-based clustering, and the like. Specifically, if hierarchical clustering is adopted in this embodiment, the termination condition may be set as: the inter-cluster similarity falls below a preset similarity threshold, or the number of news clusters falls below a preset threshold.
Directly clustering the news pages under each column tends to give poor results: because the news pages in the same column are all documents highly similar to the same central vector, a large number of news pages would be grouped into one cluster while the rest form many small subclasses. Therefore, when clustering the news pages under a column, the weight of the column's center vector terms in the clustering calculation can first be reduced, so that the content of each news page beyond the center vector is highlighted and the pages are clustered according to differences in that content.
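The down-weighting step can be sketched as follows; the sparse-dict term vectors and the 0.1 shrink factor are illustrative assumptions:

```python
def downweight_center_terms(page_vector, center_vector, factor=0.1):
    # Shrink the weight of terms shared with the column's center vector so
    # that page-specific content drives the clustering distance.
    return {term: weight * factor if term in center_vector else weight
            for term, weight in page_vector.items()}

page = {"stock": 3.0, "earnings": 2.0, "merger": 1.0}
center = {"stock": 1.0, "market": 1.0}
adjusted = downweight_center_terms(page, center)  # "stock" shrinks, rest kept
```

After this adjustment, two pages about the same column topic are no longer near-duplicates in vector space merely because they share the column's seed terms.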
Preferably, before step 501 is executed, news pages under each column may be first filtered, for example, only the top M news pages with the greatest similarity to the center vector of the column are retained, where M is a preset positive integer.
Step 502: and selecting headline news from the news cluster as the representation of the news cluster according to a preset headline selection strategy.
The headline selection policy of a news cluster can be flexibly set, and may include, but is not limited to, one or any combination of the following policies: selecting news pages whose release time is within a set range, selecting news pages whose titles meet set requirements, selecting news pages whose similarity to the news cluster's center vector is within a set range, and selecting news pages whose news quality meets preset requirements. For example, a news page with a more recent release time, a longer headline, and a higher similarity to the cluster's center vector may be selected as the headline. News quality may depend on, among other things: site weight, news page traffic, news page response speed, advertisement amount, and the like. It should be noted that, since this embodiment takes news pages as the example text, other types of text may adopt a quality measure suited to that specific text.
Take headline selection in a news cluster as an example: obtain the 3 news pages in the cluster with the highest similarity to the cluster's center vector, and select from them one news page with a readable title as the headline; if none is readable, take the next 3 news pages by similarity to the cluster's center vector and again select one with a readable title, repeating until a page with a readable title is selected.
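This batched selection loop can be sketched as follows; `is_readable` is a hypothetical stand-in for the title-readability check the text leaves unspecified, and the fallback to the most similar page when nothing qualifies is an added assumption:

```python
def pick_headline(pages, is_readable, batch=3):
    # pages: (similarity_to_cluster_center, title) pairs
    ordered = sorted(pages, key=lambda p: p[0], reverse=True)
    for start in range(0, len(ordered), batch):
        for _, title in ordered[start:start + batch]:
            if is_readable(title):
                return title
    return ordered[0][1] if ordered else None  # fallback: most similar page

pages = [(0.9, "x1"), (0.8, "x2"), (0.7, "a readable headline"), (0.95, "x0")]
headline = pick_headline(pages, lambda title: len(title) > 5)
```

Here the first batch of three titles fails the (toy) readability check, so the loop advances to the next batch, exactly as the text describes.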
Step 503: and calculating the weight of each news page under the column according to the attribute of the news page, determining the weight of each news cluster by using the weight of each news page in each news cluster, and sequencing each news cluster under the column according to the weight of each news cluster.
The attributes of the news page mentioned in this step may include, but are not limited to, one or any combination of the following: news release time, news quality, and reprint rate. As an example, the weight W_page of a news page can be calculated using formula (1):

W_page = α / (Δt + α) × δ(site) × φ(segcount)    (1)

wherein α is a preset inverse attenuation time factor, Δt is the time difference between the news release time and the current time, δ(site) is the calculation function of the news quality factor, and φ(segcount) is the calculation function of the reprint rate factor.
The weight of a news cluster may be determined in various ways, for example, directly taking the highest weight among the news pages in the cluster as the cluster's weight, or taking the average weight of the news pages in the cluster as the cluster's weight, and the like.
Step 504: and selecting the focus news from the news pages under the columns according to a preset focus news selection strategy to display the focus news under the columns.
The focus news selection policy can be flexibly set. For example, several news pages can be selected from each news cluster as the column's focus news; or, according to the ranking of the news clusters, K2 news pages can be selected from each of the top K1 news clusters as the column's focus news, where K1 and K2 are positive integers; other policies are not exhaustively listed here.
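Steps 503 to 504 can be sketched together: rank clusters by weight (here, the average of their page weights, one of the options the text mentions) and take K2 pages from each of the top K1 clusters. The K1/K2 values and toy weights are arbitrary examples:

```python
def focus_news(clusters, k1=2, k2=1):
    # clusters: lists of (page_weight, title) pairs
    ranked = sorted(clusters,
                    key=lambda c: sum(w for w, _ in c) / len(c),
                    reverse=True)
    picks = []
    for cluster in ranked[:k1]:
        by_weight = sorted(cluster, key=lambda p: p[0], reverse=True)
        picks.extend(title for _, title in by_weight[:k2])
    return picks

clusters = [[(0.2, "a1"), (0.4, "a2")],   # average weight 0.3
            [(0.9, "b1")],                # average weight 0.9
            [(0.6, "c1"), (0.5, "c2")]]   # average weight 0.55
focus = focus_news(clusters)
```

Taking focus news across the top clusters rather than from a single cluster keeps the column's focus area from being dominated by one story.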
Step 502, step 503 and step 504 have no fixed sequence, and this flow is only one of the embodiments.
It should be noted that whether focus news is displayed under each column and whether each news cluster displays its headline news are configurable. That is, the display attribute of a column can specify the text content to be displayed and the specific display manner.
The process flow shown in this second embodiment is completed.
The method provided by the present invention is described above, and the apparatus provided by the present invention is described in detail below. As shown in fig. 6, the apparatus may include: a text acquisition unit 601, a first distribution unit 602, and a second distribution unit 603.
A text acquiring unit 601, configured to send each captured text to the first distributing unit 602 as a text to be distributed.
The first distribution unit 602 is configured to perform similarity matching between the keywords of the current text to be distributed and the central vectors of the columns, and distribute the current text to be distributed to the columns meeting the distribution matching policy according to the matching result; wherein the central vector of a column is generated based on seed words preset for the column.
The second distribution unit 603 is configured to, after the first distribution unit 602 completes distribution of all texts to be distributed, distribute all or part of the text under a set column to its upper-level parent column or lower-level child columns according to the hierarchical relationship between the columns.
Wherein the distribution matching policy of a column at least includes: the similarity between the keywords of the text to be distributed and the central vector of the column exceeds a similarity threshold set for the column; or, the result of subtracting the similarity between the keywords of the text to be distributed and the reverse vector of the column from the similarity between those keywords and the central vector of the same column exceeds a similarity threshold set for the column, wherein the reverse vector of a column is generated based on reverse words preset for the column.
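The two similarity-based policies can be sketched as follows. Cosine similarity over sparse term-weight dicts is a standard choice assumed here; the patent does not name a specific similarity measure:

```python
import math

def cosine(a, b):
    # Cosine similarity between two sparse {term: weight} vectors.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def matches(text_vec, center_vec, threshold, reverse_vec=None):
    score = cosine(text_vec, center_vec)
    if reverse_vec is not None:          # reverse-vector policy
        score -= cosine(text_vec, reverse_vec)
    return score > threshold

text = {"stock": 1.0, "scandal": 1.0}
plain = matches(text, {"stock": 1.0}, 0.5)                           # passes
with_reverse = matches(text, {"stock": 1.0}, 0.5, {"scandal": 1.0})  # filtered
```

The same text that clears the plain threshold is rejected once its similarity to the reverse vector is subtracted, which is exactly the filtering role reverse words play.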
In addition, the distribution matching policy may further include, but is not limited to, one or any combination of the following policies: the similarity between the keywords of the text to be distributed and the central vector of the column is highest, or the site source of the text to be distributed meets the site requirement of the column, or the author of the text to be distributed meets the author requirement of the column, or the text to be distributed meets the requirement of the column for pictures or videos, or the title regular expression of the text to be distributed meets the requirement of the column for the title regular expression, or the URL type of the text to be distributed meets the URL type requirement of the column.
Specifically, if the columns distributed to by the first distribution unit 602 are all child columns, the second distribution unit 603 may summarize all texts under each child column, or the texts ranked in the top N1, to the upper-level parent column, where N1 is a preset positive integer.
If the columns distributed to by the first distribution unit 602 are all parent columns, the second distribution unit 603 may distribute all the texts under those parent columns to the next-level child columns.
If the columns distributed to by the first distribution unit 602 include both parent columns and child columns, the second distribution unit 603 may distribute part of the texts under a parent column to the next-level child columns to which no text has been distributed.
The columns related to the invention include: normal columns having a text-displaying attribute and hidden columns having a text-hiding attribute. Hidden columns can be used to implement a text filtering function.
The apparatus may further include: a keyword extracting unit 604, configured to extract keywords from the texts distributed to a column for which seed words are set, combine the extracted keywords with the column's seed words to form a new center vector of the column, and provide the new center vector to the first distribution unit 602. By updating the column center vector through the keyword extracting unit 604, the updated center vector describes the content orientation of the column more accurately, improving the accuracy of the texts distributed to the column.
Still further, the apparatus may further comprise: a text clustering unit 605 and a top bar selection unit 606.
The text clustering unit 605 is configured to cluster the texts under the columns according to the distribution results of the first distribution unit 602 and the second distribution unit 603, so as to form more than one cluster under each column.
A top selection unit 606, configured to select top texts from the clusters formed by the text clustering unit 605 as representations of the clusters according to a preset top selection policy.
More preferably, the apparatus may further comprise one or both of a cluster sorting unit 607 and a focus selecting unit 608 (fig. 6 shows both units as an example).
The cluster sorting unit 607 is configured to, after the text clustering unit 605 forms the clusters under each column, calculate the weight of each text under each column according to the text attribute, determine the weight of each cluster by using the weight of each text in the cluster, and sort each cluster under each column according to the weight of the cluster.
A focus selecting unit 608, configured to select focus texts from the texts under each column according to a preset focus text selection policy, based on the distribution results of the first distribution unit 602 and the second distribution unit 603, and display the focus texts under the respective columns.
Wherein, the above-mentioned head-bar selecting strategy may include one or any combination of the following strategies: selecting a text with the text release time within a set range, selecting a text with a title meeting set requirements, selecting a text with the cluster center vector similarity within a set range, and selecting a text with the text quality meeting preset requirements.
Preferably, the weight W_page of each text can be calculated using the aforementioned formula (1): W_page = α / (Δt + α) × δ(site) × φ(segcount).
According to the technical scheme, the method and the device provided by the invention have the following advantages:
1) The method and device distribute texts to columns using central vectors generated from column seed words, combined with inter-level text distribution, keeping the text distribution time on the order of seconds and greatly improving text distribution efficiency. In addition, the method and device avoid the complex process of building training samples: when the column framework changes, only appropriate seed words and inter-level text distribution rules need to be set for added columns, and the inter-level text distribution rules adjusted for deleted columns.
2) In the invention, text filtering can be realized by setting reverse words or hidden columns in the columns, and the accuracy of text display in the columns is improved.
3) In the invention, keywords can be extracted from the distributed texts in the columns, and the extracted keywords are combined with the seed words of the columns to form a new central vector, so that the central vector of the columns can more accurately describe the content guidance of the columns, and the accuracy and the recall rate of the distributed texts of the columns are improved.
4) The invention provides a plurality of distribution matching strategies, and the text recall rate of the columns can be flexibly controlled according to requirements.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (16)
1. A method for distributing text for use in a column framework comprising at least two levels of columns, the method comprising:
A. and respectively executing the following distribution steps aiming at the captured texts:
a distribution step: performing similarity matching between keywords of the current text to be distributed and the central vectors of the columns, and distributing the current text to be distributed to columns meeting a distribution matching policy according to the matching result; wherein the central vector of a column is generated based on seed words preset for the column; and the columns comprise: normal columns having a text-displaying attribute and hidden columns having a text-hiding attribute;
B. and distributing all or part of the text under the set column to the parent column at the upper stage or the child column at the lower stage according to the hierarchical relationship among the columns.
2. The method of claim 1, wherein the distribution matching policy for a hurdle comprises at least: the similarity between the keywords of the text to be distributed and the central vector of the column exceeds a similarity threshold value set for the column; or,
the result of subtracting the similarity between the keywords of the text to be distributed and the reverse vector of the column from the similarity between the keywords of the text to be distributed and the central vector of the same column exceeds a similarity threshold set for the column, wherein the reverse vector of a column is generated based on reverse words preset for the column.
3. The method according to claim 1, wherein the step B specifically comprises one or any combination of the following manners:
the columns to which texts are distributed in the manner of step A are all child columns, and all the texts under each child column, or the texts ranked in the top N1, are summarized to the upper-level parent column, wherein N1 is a preset positive integer; or,
the columns to which texts are distributed in the manner of step A are all parent columns, and all texts under each parent column are distributed to the next-level child columns; or,
the columns to which texts are distributed in the manner of step A include both parent columns and child columns, and part of the texts under a parent column are distributed to the next-level child columns to which no text has been distributed.
4. A method according to any one of claims 1 to 3, characterized in that the method further comprises: and extracting keywords of the distributed text from the column with the seed words, and combining the extracted keywords with the seed words of the column to form a new central vector of the column.
5. A method according to any one of claims 1 to 3, wherein after step B, the following steps are performed separately for each column:
c1, clustering the texts under the column to form more than one cluster under the column;
and C2, according to a preset top selection strategy, respectively selecting top texts in each cluster as the representation of each cluster.
6. The method according to claim 5, further comprising, after step C2:
calculating the weight of each text under the column according to the text attribute, determining the weight of the cluster by using the weight of each text in the cluster, and sequencing each cluster under the column according to the weight of the cluster; or,
and respectively selecting the focus text from the texts under each column according to a preset focus text selection strategy and displaying the focus text under each column.
7. The method of claim 5, wherein the top bar selection policy comprises one or any combination of the following policies: selecting a text with the text release time within a set range, selecting a text with a title meeting set requirements, selecting a text with the cluster center vector similarity within a set range, and selecting a text with the text quality meeting preset requirements.
8. The method of claim 6, wherein the weight W_page of each text is calculated by the following formula:

W_page = α / (Δt + α) × δ(site) × φ(segcount);

wherein α is a preset inverse attenuation time factor, Δt is the time difference between the text release time and the current time, δ(site) is the calculation function of the text quality factor, and φ(segcount) is the calculation function of the reprint rate factor.
9. An apparatus for distributing text for use in a column framework comprising at least two levels of columns, the apparatus comprising: the system comprises a text acquisition unit, a first distribution unit and a second distribution unit;
the text acquisition unit is used for respectively taking the grabbed texts as texts to be distributed and sending the texts to the first distribution unit;
the first distribution unit is configured to perform similarity matching between keywords of the current text to be distributed and the central vectors of the columns, and distribute the current text to be distributed to columns meeting a distribution matching policy according to the matching result; wherein the central vector of a column is generated based on seed words preset for the column; and the columns comprise: normal columns having a text-displaying attribute and hidden columns having a text-hiding attribute;
and the second distribution unit is configured to, after the first distribution unit completes distribution of all texts to be distributed, distribute all or part of the text under a set column to its upper-level parent column or lower-level child columns according to the hierarchical relationship among the columns.
10. The apparatus of claim 9, wherein the distribution matching policy for a hurdle comprises at least: the similarity between the keywords of the text to be distributed and the central vector of the column exceeds a similarity threshold value set for the column; or,
the result of subtracting the similarity between the keywords of the text to be distributed and the reverse vector of the column from the similarity between the keywords of the text to be distributed and the central vector of the same column exceeds a similarity threshold set for the column, wherein the reverse vector of a column is generated based on reverse words preset for the column.
11. The apparatus according to claim 9, wherein the columns distributed by the first distribution unit are all child columns, and at this time, the second distribution unit summarizes all texts under each child column distributed by the first distribution unit or texts ordered in the top N1 to the upper level parent column, where N1 is a preset positive integer; or,
the columns distributed to by the first distribution unit are all parent columns, and at this time, the second distribution unit distributes all texts under each parent column distributed to by the first distribution unit to the next-level child columns; or,
the columns distributed to by the first distribution unit include both parent columns and child columns, and at this time, the second distribution unit distributes part of the texts under a parent column distributed to by the first distribution unit to the next-level child columns to which no text has been distributed.
12. The apparatus of any one of claims 9 to 11, further comprising: and the keyword extraction unit is used for extracting keywords of the distributed text from the column provided with the seed words, combining the extracted keywords with the seed words of the column to form a new center vector of the column and providing the new center vector to the first distribution unit.
13. The apparatus of any one of claims 9 to 11, further comprising: the system comprises a text clustering unit and a headline selecting unit;
the text clustering unit is used for clustering the texts under the columns according to the distribution results of the first distribution unit and the second distribution unit to form more than one cluster under each column;
and the top selection unit is used for respectively selecting the top texts in each cluster as the representation of each cluster according to a preset top selection strategy.
14. The apparatus of claim 13, further comprising: one or both of a cluster sorting unit and a focus selecting unit;
the cluster sorting unit is used for calculating the weight of each text under the column according to the text attribute, determining the weight of the cluster by using the weight of each text in the cluster, and sorting each cluster under the column according to the weight of the cluster;
and the focus selecting unit is used for respectively selecting focus texts from the texts under each column according to the distribution results of the first distributing unit and the second distributing unit and a preset focus text selecting strategy and displaying the focus texts under each column.
15. The apparatus of claim 13, wherein the top bar selection policy comprises one or any combination of the following policies: selecting a text with the text release time within a set range, selecting a text with a title meeting set requirements, selecting a text with the cluster center vector similarity within a set range, and selecting a text with the text quality meeting preset requirements.
16. The apparatus of claim 14, wherein the weight W_page of each text is calculated by the following formula:

W_page = α / (Δt + α) × δ(site) × φ(segcount);

wherein α is a preset inverse attenuation time factor, Δt is the time difference between the text release time and the current time, δ(site) is the calculation function of the text quality factor, and φ(segcount) is the calculation function of the reprint rate factor.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201010549183A CN101984435B (en) | 2010-11-17 | 2010-11-17 | Method and device for distributing texts |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN101984435A CN101984435A (en) | 2011-03-09 |
| CN101984435B true CN101984435B (en) | 2012-10-10 |
Family
ID=43641604
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201010549183A Active CN101984435B (en) | 2010-11-17 | 2010-11-17 | Method and device for distributing texts |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN101984435B (en) |
Families Citing this family (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102629272A (en) * | 2012-03-14 | 2012-08-08 | 北京邮电大学 | Clustering based optimization method for examination system database |
| CN103324628B (en) * | 2012-03-21 | 2016-06-08 | 腾讯科技(深圳)有限公司 | A kind of trade classification method and system for issuing text |
| CN102760156B (en) * | 2012-06-05 | 2016-01-13 | 百度在线网络技术(北京)有限公司 | A kind of for generating the method that release news, device and the equipment corresponding with keyword |
| CN106407210B (en) * | 2015-07-29 | 2019-11-26 | 阿里巴巴集团控股有限公司 | A kind of methods of exhibiting and device of business object |
| CN106776652B (en) * | 2015-11-24 | 2020-09-25 | 北京国双科技有限公司 | Data processing method and device |
| CN108809919B (en) * | 2017-05-04 | 2020-09-04 | 北京大学 | Covert communication method and device for text carrier |
| CN109522414B (en) * | 2018-11-26 | 2021-06-04 | 吉林大学 | A Document Delivery Object Selection System |
| CN109992583A (en) * | 2019-03-15 | 2019-07-09 | 上海益普索信息技术有限公司 | A kind of management platform and method based on DMP label |
| CN113111216B (en) * | 2020-01-13 | 2023-11-03 | 百度在线网络技术(北京)有限公司 | Advertisement recommendation method, device, equipment and storage medium |
| US11907678B2 (en) * | 2020-11-10 | 2024-02-20 | International Business Machines Corporation | Context-aware machine language identification |
| CN112800083B (en) * | 2021-02-24 | 2022-03-18 | 山东省住房和城乡建设发展研究院 | Government decision-oriented government affair big data analysis method and equipment |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1536483A (en) * | 2003-04-04 | 2004-10-13 | 陈文中 | Method and system for extracting and processing network information |
| CN101609450A (en) * | 2009-04-10 | 2009-12-23 | 南京邮电大学 | Web page classification method based on training set |
| CN101727463A (en) * | 2008-10-24 | 2010-06-09 | 中国科学院计算技术研究所 | Text training method and text classifying method |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| AUPR033800A0 (en) * | 2000-09-25 | 2000-10-19 | Telstra R & D Management Pty Ltd | A document categorisation system |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| C14 | Grant of patent or utility model | ||
| GR01 | Patent grant |