CN118227790A - Text classification method, system, device and medium based on multi-label association
- Publication number: CN118227790A (application CN202410335568.2A)
- Authority: CN (China)
- Prior art keywords: text, label, model, training, text classification
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
 
Classifications
- G06F16/35: Information retrieval of unstructured textual data; Clustering; Classification
- G06F40/30: Handling natural language data; Semantic analysis
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
 
Description
Technical Field
The present invention relates to the technical field of machine learning and natural language processing, and in particular to a text classification method, system, device and medium based on multi-label association.
Background
The rapid development of the Internet and social media has produced a large volume of non-standard text data. Such text often contains informal language, newly coined internet words, special symbols and emoticons, and reflects rich social information and personal emotion. Its non-standard nature shows not only in word usage but also in structural and semantic complexity, which makes understanding and processing this data a difficult task.
One of the main challenges of non-standard text is the informality of its language: while informality allows richer and more varied expression, it also poses unprecedented difficulties for traditional text classification methods. Such text makes heavy use of internet neologisms and abbreviations that rarely have counterparts in conventional dictionaries or language models, and its frequent use of emoticons and special characters further complicates interpretation.
Another notable feature of non-standard text data is that a single text may involve several topics or emotions at once; that is, one text may be associated with multiple labels, and these labels are correlated with one another. Label correlation is both a challenge and an opportunity: if it can be identified and exploited effectively, it can improve classification accuracy and deepen the understanding of text content and structure. Traditional text classification methods, however, tend to ignore this complex relationship between labels and therefore fail to use it to optimize classification results.
Summary of the Invention
To address the shortcomings of the prior art, the present invention provides a text classification method, system, device and medium based on multi-label association. It aims to improve multi-label classification of non-standard text on platforms such as social media; its core is to mine and exploit the complex relationships between labels, integrating deep learning techniques with label association analysis to classify text data accurately.
In one aspect, a text classification method based on multi-label association is provided.
The text classification method based on multi-label association includes:
acquiring multiple texts with known label categories and constructing a training set and a test set, where the training set and the test set each include multiple texts and multiple label categories and every text has several known label categories, and dividing the training set into two parts, a first training subset and a second training subset;
inputting each text and the multiple label categories of the first training subset into a text classification model and training the model to obtain a preliminarily trained text classification model;
inputting each text and the multiple label categories of the second training subset into the preliminarily trained text classification model and training it; during training, computing the conditional probabilities between labels from the predicted label-category probabilities of each text, constructing a label association matrix from these conditional probabilities, constructing an association loss function from the matrix, and combining the association loss function with the cross-entropy loss function into a second loss function of the model; while the value of the second loss function is not less than a preset threshold, updating the parameters of the preliminarily trained model according to that loss value and re-predicting the label-category probabilities of each text with the updated model, until the second loss value falls below the threshold and the final trained text classification model is obtained;
testing the final trained text classification model on the test set, and classifying the text to be classified with the model that passes the test.
In another aspect, a text classification system based on multi-label association is provided.
The text classification system based on multi-label association includes:
an acquisition module configured to acquire multiple texts with known label categories, construct a training set and a test set (each including multiple texts and multiple label categories, every text having several known label categories), and divide the training set into two parts, a first training subset and a second training subset;
a primary training module configured to input each text and the multiple label categories of the first training subset into the text classification model and train it, obtaining a preliminarily trained text classification model;
a secondary training module configured to input each text and the multiple label categories of the second training subset into the preliminarily trained model and train it, computing during training the conditional probabilities between labels from the predicted label-category probabilities of each text, constructing a label association matrix from the conditional probabilities, an association loss function from the matrix, and a second loss function from the association loss and the cross-entropy loss; while the second loss value is not less than a preset threshold, the module updates the parameters of the preliminarily trained model according to that value and re-predicts the label-category probabilities with the updated model, until the second loss value falls below the threshold and the final trained text classification model is obtained;
a testing module configured to test the final trained model on the test set and classify the text to be classified with the model that passes the test.
In yet another aspect, an electronic device is provided, comprising:
a memory for non-transitory storage of computer-readable instructions; and
a processor for executing the computer-readable instructions,
wherein, when the computer-readable instructions are executed by the processor, the method of the first aspect is performed.
In yet another aspect, a storage medium is provided that non-transitorily stores computer-readable instructions, wherein when the non-transitory computer-readable instructions are executed by a computer, they carry out the method of the first aspect.
In yet another aspect, a computer program product is provided, comprising a computer program that, when run on one or more processors, implements the method of the first aspect.
The above technical solution has the following advantages or beneficial effects:
The invention uses multi-label association analysis together with the model algorithm to capture the semantic information of non-standard text data more accurately, effectively improving classification efficiency and accuracy for such text.
The invention combines the association loss function with regularization, which effectively prevents overfitting, lets the model keep good classification performance on new, unseen data, and strengthens its generalization ability.
The invention also collects evaluation feedback on model performance and feeds it back into further training iterations, continuously adjusting and optimizing the model parameters so that the model adapts better to real application scenarios and user needs.
In short, the invention achieves a breakthrough in technical performance and shows great potential in the breadth and depth of practical application, promoting the advancement and wide adoption of multi-label classification technology. Implementing this solution yields an efficient, accurate and user-friendly multi-label classification system that provides precise label prediction, can process and understand the non-standard text common on platforms such as social media, and meets users' needs for complex text classification across different fields. Overall, it greatly broadens the application range of multi-label classification technology and improves its performance and user experience in practice.
Brief Description of the Drawings
The accompanying drawings, which form a part of the present invention, provide further understanding of the invention; the exemplary embodiments and their descriptions explain the invention and do not unduly limit it.
FIG. 1 is a flow chart of the method of Embodiment 1.
Detailed Description
It should be noted that the following detailed descriptions are exemplary and intended to provide further explanation of the invention. Unless otherwise specified, all technical and scientific terms used herein have the meanings commonly understood by those of ordinary skill in the art to which the invention belongs.
Embodiment 1
This embodiment provides a text classification method based on multi-label association.
As shown in FIG. 1, the text classification method based on multi-label association includes:
S101: acquire multiple texts with known label categories and construct a training set and a test set, each including multiple texts and multiple label categories, every text having several known label categories; divide the training set into two parts, a first training subset and a second training subset;
S102: input each text and the multiple label categories of the first training subset into a text classification model and train it, obtaining a preliminarily trained text classification model;
S103: input each text and the multiple label categories of the second training subset into the preliminarily trained model and train it; during training, compute the conditional probabilities between labels from the predicted label-category probabilities of each text, build a label association matrix from these conditional probabilities, build an association loss function from the matrix, and combine it with the cross-entropy loss function into a second loss function of the model; while the second loss value is not less than a preset threshold, update the parameters of the preliminarily trained model according to that value and re-predict the label-category probabilities with the updated model, until the second loss value falls below the threshold and the final trained text classification model is obtained;
S104: test the final trained text classification model on the test set and classify the text to be classified with the model that passes the test.
Further, S101 (acquiring multiple texts with known label categories and constructing the training and test sets) specifically includes:
performing text cleaning, text normalization, special-symbol handling and emoticon handling on the acquired texts.
Text cleaning includes removing extraneous spaces and punctuation marks.
Text normalization includes expanding abbreviations in the text into complete words.
Special-symbol handling includes deleting special symbols such as & and ¥.
Emoticon handling includes replacing emoticons with words.
It should be understood that this series of preprocessing steps (text cleaning, normalization, special-symbol and emoticon handling) converts the raw text into a more standardized, cleaner form so that the model can understand and process the data better.
For example, the text is first given a basic cleaning that removes irrelevant information such as extra spaces and misused punctuation. Informal expressions and abbreviations are then expanded and standardized; the internet term "lol", for instance, is converted to its full form "laughing out loud". For the special symbols and emoticons common in such text, the invention uses a mapping strategy that maps them to predefined tags or words with clear semantics; the emoticon ":)", for example, is mapped to the word "happy".
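A minimal sketch of this preprocessing pipeline follows. The abbreviation and emoticon dictionaries are illustrative placeholders (the disclosure does not enumerate them), and the order of steps is one reasonable choice:

```python
import re

# Illustrative lookup tables; the disclosure does not enumerate them.
ABBREVIATIONS = {"lol": "laughing out loud", "btw": "by the way"}
EMOTICONS = {":)": "happy", ":(": "sad"}

def preprocess(text: str) -> str:
    # Emoticon handling: replace emoticons with semantically clear words.
    for emo, word in EMOTICONS.items():
        text = text.replace(emo, f" {word} ")
    # Special-symbol handling: delete symbols such as & and ¥.
    text = re.sub(r"[&¥]", "", text)
    # Text normalization: expand abbreviations into complete words.
    tokens = [ABBREVIATIONS.get(t.lower(), t) for t in text.split()]
    text = " ".join(tokens)
    # Text cleaning: collapse extra whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("lol that was great :) &"))
# -> "laughing out loud that was great happy"
```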
Further, in S101 the training set is divided into the first training subset and the second training subset according to a set ratio.
Further, in S102 the text classification model includes a transformer model, a fully connected layer and an activation function layer connected in sequence; the transformer model is BERT or RoBERTa.
The transformer model extracts semantic features from the input text.
The fully connected layer maps the extracted features to a mapping vector.
The activation function layer outputs the existence probability of each label.
It should be understood that a pre-trained Transformer encoder such as BERT or RoBERTa encodes the text sequence to extract its deep semantic features. This encoding ensures that the model can handle the new words, unknown words and non-standard expressions found in non-standard text, understand the text's context fully, and provide a solid foundation for subsequent label prediction.
It should be understood that the fully connected layer and the sigmoid activation function map the semantic representation to a vector whose length equals the total number of labels and output the existence probability of each label. This step provides an initial judgment of whether each label belongs to the current text.
For example, a pre-trained transformer model such as BERT or RoBERTa encodes the input text sequence. This step uses subword-based encoding together with a dynamic subword-splitting strategy, which splits words that are not in the model vocabulary (such as new words or proper nouns) into smaller units (subwords) for processing.
The subword-based encoding works as follows. First, vocabulary construction: a vocabulary is learned from a large amount of text data and includes subword units such as common words, roots, prefixes, suffixes and single characters. Before a text enters the classification model, each word in it is decomposed into one or more subword units using this vocabulary: a word that appears directly in the vocabulary is used as-is, while a word that does not is broken into smaller subword units until every word can be represented by units from the vocabulary. Each subword unit corresponds to a unique numeric ID, so after splitting, the words of the text are converted into a sequence of numeric IDs for model input.
The dynamic subword-splitting strategy means that words of the text to be classified that are not in the vocabulary are split into smaller units (subwords). The splitting process is dynamic: based on the information in the vocabulary, each word is decomposed into meaningful subunits as far as possible.
Suppose a vocabulary contains the word "happy" and subword units such as "un" and "ness". When the word "unhappiness" is encountered, it is first checked against the vocabulary; since it is absent, it can be decomposed into the subword-unit sequence "un", "happy", "ness", and these subword units are then converted into their corresponding numeric IDs.
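A greedy longest-match splitter illustrates this check-then-split behavior. The toy vocabulary and IDs below are assumptions made for the example; a production system would use a trained WordPiece or BPE tokenizer instead:

```python
# Toy vocabulary mapping subword units to numeric IDs (illustrative only).
VOCAB = {"un": 1, "happy": 2, "ness": 3, "sad": 4}

def split_word(word: str) -> list[str]:
    """Greedy longest-match decomposition into vocabulary subword units."""
    if word in VOCAB:                    # word exists directly in the vocabulary
        return [word]
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in VOCAB:
            end -= 1                     # shrink the span until it is a known subword
        if end == start:                 # no subword matches: give up on this word
            raise ValueError(f"cannot split {word!r} with this vocabulary")
        pieces.append(word[start:end])
        start = end
    return pieces

pieces = split_word("unhappiness")       # ['un', 'happy', 'ness']
ids = [VOCAB[p] for p in pieces]         # [1, 2, 3]
```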
In addition, the model adjusts the subword embedding representations and enhances the semantic representation according to context, and outputs hidden-state vectors of dimension 768 or 1024. These vectors capture rich subword-level semantic information, provide rich contextual representations for subsequent label prediction, and lay the foundation for the model's understanding of the text's semantics. In pre-trained Transformer models such as BERT or RoBERTa, the self-attention mechanism inside the self-attention layers is what lets the model learn how to adjust word embeddings according to context and how to enhance the semantic representation.
For example, the text's semantic representation is passed through a fully connected layer that maps it to a vector whose dimension equals the total number of labels; a sigmoid activation function then outputs the existence probability of each label. If there are 10 labels in total, the network outputs a 10-dimensional probability vector, each dimension being the predicted probability of one label.
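A sketch of this encoder-plus-head architecture in PyTorch with the Hugging Face transformers library might look as follows. The model name and the pooling choice (taking the [CLS] hidden state) are assumptions; the disclosure only specifies a pre-trained BERT or RoBERTa encoder followed by a fully connected layer and a sigmoid:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiLabelClassifier(nn.Module):
    def __init__(self, num_labels: int = 10, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)   # 768-dim hidden states
        self.fc = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = hidden.last_hidden_state[:, 0]       # [CLS] representation (assumed pooling)
        return torch.sigmoid(self.fc(cls))         # per-label existence probabilities

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultiLabelClassifier(num_labels=10)
batch = tokenizer(["great trip, great food :)"], return_tensors="pt",
                  padding=True, truncation=True, max_length=512)
probs = model(batch["input_ids"], batch["attention_mask"])    # shape (1, 10)
```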
Further, in S102 the loss value of the first loss function of the text classification model is computed from the predicted label categories of each text and its known label categories; the first loss function is implemented as a cross-entropy loss.
It should be understood that a threshold is set and a preliminary classification is made from the predicted probabilities, providing an initial label assignment for the subsequent association analysis and model training.
For example, threshold determination and preliminary classification: a threshold (for example 0.5) is set and the probability output by the sigmoid function is compared against it to decide whether each label should be assigned to the current text. This step yields a preliminary classification decision but does not yet consider the correlations between labels.
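Continuing the sketch above, the preliminary decision is an elementwise comparison against the threshold (0.5 here, as in the example):

```python
threshold = 0.5
preliminary_labels = (probs >= threshold).int()   # 1 = label assigned, 0 = not assigned
```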
Further, S102 (inputting each text and the multiple label categories of the first training subset into the text classification model and training it to obtain the preliminarily trained model) specifically includes:
during training, computing the loss value of the first loss function from the predicted label categories of each text and its known label categories; while that loss value is not less than a preset threshold, updating the model parameters according to it and re-predicting the label categories with the updated model, until the first loss value falls below the threshold and the preliminarily trained text classification model is obtained.
Further, in S103, each text and the multiple label categories of the second training subset are input into the preliminarily trained text classification model and the model is trained; during training, the conditional probability between labels is computed from the predicted label-category probabilities of each text. The conditional probability is
P(j∣i) = P(i∩j) / P(i),
where P(i) is the probability that the model predicts text X to belong to label i, and P(i∩j) is the probability that text X belongs to label i and label j simultaneously.
It should be understood that the conditional probability between labels, i.e. the probability that one label appears given that another label has already appeared, is computed, and these conditional probabilities are then used to build the label association matrix, giving a deep analysis of co-occurrence and dependency between labels. This step is the core of the invention: it reveals the complex connections between labels and supplies important association information for model training.
Label association analysis: statistical methods compute the conditional probabilities between label pairs and build the label association matrix. For labels i and j, P(j∣i) and P(i∣j) are computed, i.e. the probability that label j is present given that label i is present, and the probability that label i is present given that label j is present.
These conditional probabilities reflect the dependency and co-occurrence relationships between labels; assembled into a matrix, each element represents the degree of association between the corresponding pair of labels. The association matrix is used in computing the association loss function during subsequent model training.
Further, in S103 the label association matrix is built from the conditional probabilities. The label association matrix A is an n×n matrix whose element Aij is the conditional probability that label j appears given that label i appears:
Aij = P(j∣i);
that is, the element in row i, column j of the matrix is the probability that label j appears given label i.
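On a labeled training set, A can be estimated from label co-occurrence counts. The sketch below builds A from a binary label matrix Y (texts × labels); the epsilon guard against division by zero is an added assumption:

```python
import numpy as np

def label_association_matrix(Y: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Y: (num_texts, n) binary label matrix. Returns A with A[i, j] = P(j | i)."""
    co = Y.T @ Y                        # co[i, j] = number of texts with both i and j
    counts = np.diag(co)                # counts[i] = number of texts with label i
    return co / (counts[:, None] + eps)

Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1]])
A = label_association_matrix(Y)         # e.g. A[0, 1] = P(label 1 | label 0) = 2/3
```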
Further, in S103 the association loss function is built from the label association matrix; it measures the difference between the label associations predicted by the model and the actual label association matrix. The association loss function can be written as
Lassoc = Σi=1..n Σj=1..n |pi·pj − Aij|,
where n is the total number of labels, Aij is the element of the label association matrix giving the probability that label j appears given that label i appears, pi is the model's predicted probability that the instance belongs to label i, and pj is its predicted probability that the instance belongs to label j. The absolute-error term |pi·pj − Aij| measures the difference between the model's predicted probability of labels i and j appearing together and the corresponding element Aij of the label association matrix.
Further, in S103 the second loss function of the text classification model is built from the association loss function and the cross-entropy loss function; the second loss function L is
L = LCE + γ·Lassoc,
where LCE is the cross-entropy loss function, Lassoc is the association loss function, and γ is a balance parameter, a positive real number that adjusts the relative importance of the cross-entropy loss and the association loss in the total loss.
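Putting the two terms together, the second loss might be implemented as in the following sketch. Treating A as a fixed target tensor computed from the training labels is an assumption consistent with the construction above, and the value of γ is a tunable hyperparameter:

```python
import torch
import torch.nn.functional as F

def association_loss(p: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """p: (batch, n) predicted label probabilities; A: (n, n) association matrix."""
    pair = p.unsqueeze(2) * p.unsqueeze(1)        # pair[b, i, j] = p_i * p_j
    return (pair - A).abs().sum(dim=(1, 2)).mean()

def second_loss(p, targets, A, gamma: float = 0.1):
    # targets: (batch, n) float tensor of 0/1 known labels.
    ce = F.binary_cross_entropy(p, targets)       # multi-label cross-entropy term
    return ce + gamma * association_loss(p, A)    # L = L_CE + gamma * L_assoc
```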
It should be understood that in the model training stage, besides the traditional cross-entropy loss that measures classification accuracy, an additional association loss function is introduced. Based on the label association matrix, it penalizes predictions that violate the known associations between labels. The cross-entropy loss and the association loss together form the final second loss function, strengthening the model's ability to capture label associations; gradient descent and regularization techniques are applied to prevent overfitting and ensure good generalization.
Meanwhile, the Adam optimizer is used for gradient descent. Adam adjusts the learning rate adaptively according to the gradients, which helps improve training efficiency and convergence speed.
In addition, a learning-rate decay schedule gradually lowers the learning rate during training, making the model more stable as it approaches the optimal solution. To further prevent overfitting, an L2 regularization term is added to the loss function. Finally, the hyperparameters are chosen by grid search combined with cross-validation.
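The optimization setup described here corresponds to something like the sketch below; the decay factor, weight-decay strength and epoch count are illustrative values, and train_loader is an assumed data loader, none of them taken from the disclosure:

```python
import torch

model = MultiLabelClassifier(num_labels=10)          # classifier sketched earlier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5,
                             weight_decay=1e-4)      # weight_decay acts as L2 regularization
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)

for epoch in range(3):                               # epoch count is illustrative
    for input_ids, attention_mask, targets in train_loader:   # assumed DataLoader
        probs = model(input_ids, attention_mask)
        loss = second_loss(probs, targets.float(), A, gamma=0.1)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                 # decay the learning rate per epoch
```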
Further, in S104 the final trained text classification model is tested on the test set and the model that passes the test is used to classify the text to be classified; the criterion for passing is that the computed precision, recall and F1-score are all greater than set thresholds.
It should be understood that the evaluation stage uses multiple metrics to measure model performance, tests the model's generalization ability on datasets from different domains, and tests its classification of different types of labels. After training is complete, predictions are made on the test set and metrics such as precision, recall and F1-score are computed to evaluate performance. To assess generalization, the model is further tested on an independent dataset. In addition, the classification performance on different types of labels (such as high-, medium- and low-frequency labels) is evaluated separately to understand the model's behavior comprehensively.
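With scikit-learn these metrics can be computed from the thresholded predictions; micro-averaging is an assumed aggregation choice, since the text does not specify how per-label scores are combined:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [[1, 0, 1], [0, 1, 0]]                   # known binary label matrix
y_pred = [[1, 0, 0], [0, 1, 0]]                   # thresholded model predictions

precision = precision_score(y_true, y_pred, average="micro")
recall = recall_score(y_true, y_pred, average="micro")
f1 = f1_score(y_true, y_pred, average="micro")
passed = all(m > 0.8 for m in (precision, recall, f1))   # 0.8 is an example threshold
```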
Application scope and extended research: the application potential of label association analysis in multiple fields is continuously explored; in addition, more complex label structures, such as hierarchical label systems, are addressed, improving the handling of complex label relations through algorithm optimization and model iteration. Iterative improvement and application practice: the model is iteratively optimized according to the evaluation results, and the model parameters and structure are further adjusted for real application scenarios to improve user experience and meet specific needs.
Extended research and cross-domain application: the potential of label association in other fields, such as bioinformatics and scene recognition, is explored; new algorithms are developed to handle more complex label structures; and the method is extended to other data types, such as the classification and annotation of image and audio data.
Scenario 1: in a news classification system, each news article must be assigned one or more labels, such as "international", "politics", "economy" and "sports". These labels are potentially correlated; for example, "international" and "politics" often appear together.
Implementation steps:
1. Data preparation: a dataset of 10,000 news articles is collected, each article having 3 relevant labels on average. Text cleaning and preprocessing are performed, including removal of stop words and special characters.
2. Model training:
BERT is used as the encoder to encode the articles and obtain the semantic representation of each one. The maximum sequence length of BERT is set to 512 and the batch size to 16.
A fully connected layer and a sigmoid function output the existence probability of each label; the output dimension of the fully connected layer is 50 (the total number of unique labels in this example).
The conditional probabilities between all label pairs are computed and the label association matrix is built.
A loss term based on the label association matrix is added to model training, and the Adam optimizer performs gradient descent. The initial learning rate is 1e-5 and is multiplied by 0.96 every 10,000 steps. L2 regularization is applied to prevent overfitting, and the hyperparameters are selected by grid search and cross-validation.
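This step-based schedule differs from the per-epoch decay sketched earlier; in PyTorch it could be expressed with a LambdaLR (a sketch of the stated schedule, not code from the disclosure):

```python
import torch

# Assumes `model` from the earlier sketch; lr(step) = 1e-5 * 0.96 ** (step // 10000).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: 0.96 ** (step // 10000))
# Call scheduler.step() once per training step so the rate drops every 10,000 steps.
```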
3. Model evaluation:
On an independent test set of 2,000 news articles, the model reaches 85% precision, 80% recall and an F1-score of 82%.
Analysis of the classification performance across label types shows that the model's precision reaches 90% on high-frequency labels such as "sports" and 75% on low-frequency labels such as "politics".
Tested on a completely new news dataset of 500 articles, the model maintains comparable performance figures.
4. Iterative improvement: evaluation feedback shows that the model has difficulty distinguishing the "international" and "politics" labels. The weights of these two labels in the association matrix are therefore adjusted and the model is retrained.
5. Deployment and application: the trained model is deployed in the actual news classification system.
Through this implementation case, the news classification system labels articles more accurately, and the analysis of label association improves the classification quality.
Scenario 2: a social media platform needs to classify user-generated content so that it can be recommended to other interested users or reviewed. Common content labels include "travel", "food", "technology" and "entertainment". These labels are usually correlated to some degree; for example, "food" and "travel" are often applied together.
Implementation steps:
1. Data preparation: 50,000 labeled social media posts are collected as the training dataset, each post carrying 2 relevant labels on average. Text cleaning, text normalization and special-symbol and emoticon handling are performed on them.
2. Model training:
The BERT model is used as the encoder with a batch size of 32 to encode the post content.
A fully connected output layer maps to the 20 possible labels (the total number of unique labels in this example), and a sigmoid function predicts the probability of each label.
The label association matrix is computed from the training data and used to guide the model in learning label co-occurrence patterns.
The model is trained with the combined cross-entropy and association losses using the Adam optimizer; the initial learning rate is 3e-5 and decays by a factor of 0.95 every 5,000 iterations.
To prevent overfitting, weight decay (L2 regularization) is applied during model training.
3. Model evaluation:
On an independent test set of 10,000 posts, the model achieves 82% precision, 79% recall and an F1-score of 80%.
On specific labels such as "travel" and "entertainment", the model shows above-average performance, with precision of 85% and 84% respectively.
Running the model on a new dataset shows that it maintains stable classification on different types of content, with precision between 81% and 83%.
4. Iterative improvement: feedback shows some confusion between "technology" and "entertainment". The label association matrix is adjusted and the model is retrained to improve discrimination; after adjustment, precision on the "technology" and "entertainment" labels rises to 80% and 86% respectively.
5. Deployment and application: the optimized model is deployed in the social media platform's system, where it labels posts with relevant tags and helps the content operations team manage and review content more effectively.
In this implementation case, by deeply mining and applying label associations, the platform's content classification system improves classification accuracy and strengthens the operations team's content management efficiency.
This method handles informal language, special symbols and emoticons effectively while identifying and exploiting the associations between labels to improve classification accuracy and efficiency. Moreover, given how quickly social media content and language use evolve, the system is adaptable and flexible, able to keep optimizing itself over time to cope with changing text characteristics and classification needs.
By introducing and deepening label association analysis, the invention significantly improves the accuracy and efficiency of multi-label classification of non-standard text data. The method and system not only perform well in current social media content classification; their flexible and adaptable design also lays a solid foundation for future development and application, showing broad potential for multi-label text classification tasks with complex label relations.
The invention is designed to improve the accuracy and efficiency of multi-label classification of text data. The method first applies a series of preprocessing steps to the input text, then encodes it with a deep learning model to capture deep semantic information, and converts the encoded text, through a mapping mechanism, into prediction probabilities for all possible labels. It then introduces an innovative label association analysis step that builds a label association matrix by quantitatively computing the conditional probabilities between labels, revealing their dependency and co-occurrence relations. In the model training stage, this association information is used to adjust and optimize the model parameters so that the model handles and predicts correlated label sets better. The invention further includes a complete evaluation process for measuring the optimized model's performance and iterating on it according to the results. To enhance practicality and flexibility, the invention also explores in depth how to apply the multi-label association classification method and system for non-standard text data to concrete scenarios, and surveys its broad application prospects in different fields. Overall, by combining advanced deep learning techniques with careful label association analysis, the invention is innovative in theory, proves its practical value in practice, and effectively advances natural language processing and machine learning.
Embodiment 2
This embodiment provides a text classification system based on multi-label association.
The text classification system based on multi-label association includes:
an acquisition module configured to acquire multiple texts with known label categories, construct a training set and a test set (each including multiple texts and multiple label categories, every text having several known label categories), and divide the training set into two parts, a first training subset and a second training subset;
a primary training module configured to input each text and the multiple label categories of the first training subset into the text classification model and train it, obtaining a preliminarily trained text classification model;
a secondary training module configured to input each text and the multiple label categories of the second training subset into the preliminarily trained model and train it, computing during training the conditional probabilities between labels from the predicted label-category probabilities of each text, constructing a label association matrix from the conditional probabilities, an association loss function from the matrix, and a second loss function from the association loss and the cross-entropy loss; while the second loss value is not less than a preset threshold, the module updates the parameters of the preliminarily trained model according to that value and re-predicts the label-category probabilities with the updated model, until the second loss value falls below the threshold and the final trained text classification model is obtained;
a testing module configured to test the final trained model on the test set and classify the text to be classified with the model that passes the test.
It should be noted that the acquisition module, primary training module, secondary training module and testing module correspond to steps S101 to S104 of Embodiment 1; the examples and application scenarios they implement are the same as those of the corresponding steps, but they are not limited to what Embodiment 1 discloses. As part of the system, these modules may be executed in a computer system, for example as a set of computer-executable instructions.
Each of the above embodiments is described with its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of the other embodiments.
The proposed system may also be implemented in other ways. The system embodiment described above is merely illustrative; the division into modules is only a division of logical functions, and other divisions are possible in practice: multiple modules may be combined or integrated into another system, and some features may be omitted or not executed.
Embodiment 3
This embodiment also provides an electronic device comprising one or more processors, one or more memories, and one or more computer programs, where the processor is connected to the memory and the computer programs are stored in the memory. When the electronic device runs, the processor executes the computer programs stored in the memory so that the device performs the method of Embodiment 1.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor or any conventional processor.
The memory may include read-only memory and random-access memory and provides instructions and data to the processor; part of the memory may also include non-volatile random-access memory. For example, the memory may also store information about the device type.
In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software.
The method of Embodiment 1 may be embodied directly as execution by a hardware processor, or as execution by a combination of hardware and software modules in the processor. The software module may reside in a storage medium mature in the art, such as random-access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, this is not described in detail here.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in this embodiment can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the invention.
Embodiment 4
This embodiment also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, complete the method of Embodiment 1.
The above is only a preferred embodiment of the present invention and is not intended to limit it; various modifications and variations are possible for those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410335568.2A | 2024-03-22 | 2024-03-22 | Text classification method, system, device and medium based on multi-label association |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118227790A (en) | 2024-06-21 |
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |