CN114490935A - Abnormal text detection method, device, computer readable medium and electronic device
- Publication number
- CN114490935A (application CN202210073277.1A)
- Authority
- CN
- China
- Prior art keywords
- feature
- abnormal
- text
- segment
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
  - G06—COMPUTING OR CALCULATING; COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
        - G06F16/30—Information retrieval of unstructured textual data
          - G06F16/33—Querying
            - G06F16/3331—Query processing
              - G06F16/334—Query execution
                - G06F16/3344—Query execution using natural language analysis
      - G06F40/00—Handling natural language data
        - G06F40/20—Natural language analysis
          - G06F40/253—Grammatical analysis; Style critique
        - G06F40/30—Semantic analysis
    - G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
      - G06N3/00—Computing arrangements based on biological models
        - G06N3/02—Neural networks
          - G06N3/04—Architecture, e.g. interconnection topology
            - G06N3/045—Combinations of networks
          - G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
Description
Technical Field
This application belongs to the technical field of computers and artificial intelligence, and in particular relates to a method and apparatus for detecting abnormal text, a computer-readable medium, and an electronic device.
Background Art
In many cases, text data needs to be proofread to resolve anomalies such as typos and semantic or grammatical errors. At present, a common detection approach in natural language processing is to locate abnormal positions in text with a sequence labeling model: given a text sequence, each element in the sequence is analyzed to determine its abnormality probability, and the elements with higher abnormality probabilities are identified as abnormal elements. However, sequence labeling usually suffers from a positioning-offset problem, that is, there is a gap between the identified abnormal elements and the true abnormal elements, so the detection accuracy of this approach is limited and needs improvement.
It should be noted that the information disclosed in the Background Art section above is only intended to enhance understanding of the background of this application, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.
Summary of the Invention
The purpose of this application is to provide a method and apparatus for detecting abnormal text, a computer-readable medium, and an electronic device, so as to solve the problem in the related art that abnormal segments in text are located with low accuracy.
Other features and advantages of this application will become apparent from the following detailed description, or may be learned in part through practice of this application.
According to one aspect of the embodiments of this application, a method for detecting abnormal text is provided, including:
obtaining a text to be detected consisting of multiple characters;
performing feature extraction on the text to be detected to obtain a feature sequence of the text to be detected, where the feature sequence includes context features corresponding to the multiple characters in the text to be detected;
mapping the feature sequence separately through a plurality of preset models to obtain a processing result corresponding to each preset model, where the processing result of a preset model includes abnormality probabilities of feature segments in the feature sequence, a feature segment includes the context features of at least one character, and the feature segments corresponding to the processing results of different preset models have different lengths; and
determining an abnormal segment in the text to be detected according to the abnormality probabilities of the feature segments indicated by the processing results of the preset models.
According to one aspect of the embodiments of this application, an apparatus for detecting abnormal text is provided, including:
a text acquisition module, configured to obtain a text to be detected consisting of multiple characters;
a feature extraction module, configured to perform feature extraction on the text to be detected to obtain a feature sequence of the text to be detected, where the feature sequence includes context features corresponding to the multiple characters in the text to be detected;
a mapping processing module, configured to map the feature sequence separately through a plurality of preset models to obtain a processing result corresponding to each preset model, where the processing result of a preset model includes abnormality probabilities of different feature segments in the feature sequence, a feature segment includes the context features of at least one character, and the feature segments corresponding to the processing results of different preset models have different lengths; and
an abnormal segment determination module, configured to determine an abnormal segment in the text to be detected according to the abnormality probabilities of the feature segments indicated by the processing results of the preset models.
In an embodiment of this application, the apparatus further includes:
a sample data acquisition module, configured to obtain sample data consisting of multiple characters, where each character in the sample data carries a first label indicating an abnormal state;
a second label generation module, configured to determine, based on a plurality of preset segment lengths, multiple sample segments in the sample data for each preset segment length, and to assign to each sample segment a second label indicating an abnormal state according to the first labels corresponding to the sample segment; and
a model training module, configured to take the sample data with second labels corresponding to each preset segment length as training samples, and to train a neural network model with the training samples to obtain the preset model corresponding to each preset segment length.
In an embodiment of this application, the second label generation module is specifically configured to:
set a window whose width is the preset segment length, and take all characters of the sample data contained in the window as a sample segment, where the window slides from the start of the sample data to its end according to a set step size.
In an embodiment of this application, the first labels include normal labels and abnormal labels, and the second label generation module is further configured to:
generate the second label of a sample segment according to the total number of abnormal labels within the window and the window width.
In an embodiment of this application, during training of the neural network model, the cross entropy between the prediction of the neural network model for a training sample and the second label of the training sample is used as the loss function, and the model parameters of the neural network model are updated based on the loss function.
In an embodiment of this application, the feature extraction module includes:
a character segmentation unit, configured to split the text to be detected into multiple characters arranged in order, and to convert each of the ordered characters into a corresponding character label according to a preset dictionary to obtain the character sequence of the text to be detected; and
a feature extraction unit, configured to perform context feature extraction on the character sequence to obtain the feature sequence of the text to be detected.
In an embodiment of this application, the feature extraction unit is specifically configured to:
determine, according to each character label in the character sequence, the semantic vector and the position vector corresponding to the character label;
generate a vector to be feature-extracted according to the character labels in the character sequence and the semantic vectors and position vectors corresponding to the character labels; and
perform context feature extraction on the vector to be feature-extracted to obtain the feature sequence of the text to be detected.
In an embodiment of this application, the mapping processing module includes:
a feature segment determination unit, configured to determine the feature segments in the feature sequence by a sliding-window method according to the preset segment length corresponding to the preset model;
an abnormal feature representation acquisition unit, configured to obtain an abnormal feature representation of each feature segment through a convolutional layer of the preset model; and
an abnormality probability acquisition unit, configured to map the abnormal feature representation through a fully connected layer of the preset model to obtain the abnormality probability of the feature segment.
In an embodiment of this application, the abnormal feature representation acquisition unit is specifically configured to:
fuse all context features in the feature segment according to the model parameters of the convolutional layer of the preset model to obtain the abnormal feature representation of the feature segment.
In an embodiment of this application, the model parameters of the convolutional layer include a first weight parameter and a first bias parameter, and the abnormal feature representation acquisition unit is further configured to:
perform a weighted summation of all context features in the feature segment using the first weight parameter to obtain a weighted feature; and
superimpose the weighted feature and the first bias parameter to obtain the abnormal feature representation of the feature segment.
In an embodiment of this application, the fully connected layer of the preset model includes a second weight parameter and a second bias parameter, and the abnormality probability acquisition unit is specifically configured to:
multiply the abnormal feature representation by the second weight parameter and then add the second bias parameter to obtain a feature to be activated; and
process the feature to be activated with a preset activation function to obtain the abnormality probability of the feature segment.
In an embodiment of this application, the abnormal segment determination module is specifically configured to:
determine the maximum abnormality probability among the abnormality probabilities of the feature segments indicated by the processing results of the preset models; and
take the multiple characters indicated by the feature segment corresponding to the maximum abnormality probability as the abnormal segment in the text to be detected.
According to one aspect of the embodiments of this application, a computer-readable medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the method for detecting abnormal text in the above technical solutions is implemented.
According to one aspect of the embodiments of this application, an electronic device is provided, including a processor and a memory for storing executable instructions of the processor, where the processor is configured to execute the executable instructions to perform the method for detecting abnormal text in the above technical solutions.
According to one aspect of the embodiments of this application, a computer program product or computer program is provided, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the method for detecting abnormal text in the above technical solutions.
In the technical solutions provided by the embodiments of this application, the feature sequence is processed separately by a plurality of preset models to obtain processing results that include the abnormality probabilities of feature segments. In other words, the text to be detected is divided into multiple segments for anomaly detection instead of detecting the whole sentence directly, so that the local features of the text to be detected are fully taken into account during anomaly detection, which improves detection accuracy. At the same time, because the feature segments corresponding to the processing results of different preset models have different lengths, this amounts to detecting the text to be detected with models of multiple granularities, which further improves the accuracy and precision of the detection results.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit this application.
Description of the Drawings
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with this application and, together with the specification, serve to explain the principles of this application. Obviously, the drawings described below are only some embodiments of this application; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of this application is applied.
FIG. 2 schematically shows a flowchart of a method for detecting abnormal text provided by an embodiment of this application.
FIG. 3 schematically shows a flowchart of feature extraction on the text to be detected provided by an embodiment of this application.
FIG. 4 schematically shows a schematic diagram of determining a preset segment length provided by an embodiment of this application.
FIG. 5 schematically shows a flowchart of a method for constructing a preset model provided by an embodiment of this application.
FIG. 6 schematically shows a model structure diagram to which the technical solution of this application is applied.
FIG. 7 schematically shows a flowchart of applying the technical solution of this application in one scenario.
FIG. 8 schematically shows a structural block diagram of an apparatus for detecting abnormal text provided by an embodiment of this application.
FIG. 9 schematically shows a structural block diagram of a computer system suitable for implementing the electronic device of an embodiment of this application.
Detailed Description of Embodiments
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this application will be more thorough and complete, and will fully convey the concepts of the example embodiments to those skilled in the art.
In addition, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of this application. However, those skilled in the art will realize that the technical solutions of this application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so on. In other cases, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of this application.
The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flowcharts shown in the drawings are only illustrative and do not necessarily include all contents and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed while others may be combined or partially combined, so the actual execution order may change according to the actual situation.
FIG. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of this application is applied.
As shown in FIG. 1, the system architecture 100 may include a terminal device 110, a network 120, and a server 130. The terminal device 110 may include a smartphone, a tablet computer, a notebook computer, a smart voice interaction device, a smart home appliance, an in-vehicle terminal, and the like. The server 130 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services. The network 120 may be a communication medium of various connection types capable of providing a communication link between the terminal device 110 and the server 130, for example a wired or wireless communication link.
Depending on implementation needs, the system architecture in the embodiments of this application may have any number of terminal devices, networks, and servers. For example, the server 130 may be a server group composed of multiple server devices. In addition, the technical solution provided by the embodiments of this application may be applied to the terminal device 110 or to the server 130, or may be implemented jointly by the terminal device 110 and the server 130, which is not specially limited in this application.
In the following, the method for detecting abnormal text provided by the embodiments of this application is executed by the server 130, and accordingly the apparatus for detecting abnormal text is arranged in the server 130. However, as those skilled in the art will readily understand, the method may also be executed by the terminal device 110, and accordingly the apparatus may also be arranged in the terminal device 110, which is not specially limited in this exemplary embodiment.
For example, the server 130 obtains a text to be detected consisting of multiple characters, and then performs feature extraction on the text to be detected to obtain its feature sequence, which includes the context features corresponding to the multiple characters in the text. Next, the server 130 maps the feature sequence separately through a plurality of preset models to obtain the processing result corresponding to each preset model, where the feature segments corresponding to the processing results of different preset models have different lengths. Moreover, the processing result of one preset model includes the abnormality probabilities of multiple feature segments in the feature sequence, and a feature segment includes the context features of at least one character. Finally, the server 130 determines the abnormal segment in the text to be detected according to the abnormality probabilities of the feature segments indicated by the processing results of the preset models.
In an embodiment of this application, after determining the abnormal segment in the text to be detected, the server 130 may return the abnormal segment to the terminal device 110 through the network 120, and the terminal device 110 may mark the abnormal segment in its display interface, for example by highlighting it, so that the abnormal segment in the text to be detected can be identified quickly and conveniently through the display interface of the terminal device 110.
The technical solutions provided by the embodiments of this application can be implemented by artificial intelligence technology, for example by generating the preset models through artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and so on. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, intelligent transportation, and other major directions.
The method for detecting abnormal text provided by this application is described in detail below with reference to specific embodiments.
FIG. 2 schematically shows a flowchart of a method for detecting abnormal text provided by an embodiment of this application. The method may be implemented by a terminal device, such as the terminal device 110 shown in FIG. 1, or by a server, such as the server 130 shown in FIG. 1. As shown in FIG. 2, the method includes steps 210 to 240, as follows:
Step 210: obtain a text to be detected consisting of multiple characters.
Specifically, the text to be detected consists of multiple characters and may be one sentence or multiple sentences. The text to be detected is text data that has been determined to be abnormal but whose abnormal position is unknown; abnormal situations of text data include typos, grammatical errors, semantic errors, and the like. The text to be detected may be content obtained from text data, for example a title or body sentence of an article that has been confirmed as abnormal. It may also be abnormal text data obtained by performing speech recognition on voice data, or abnormal text data obtained by performing character recognition on image data, which is not limited in the embodiments of this application. Confirming that text data is abnormal text may be realized by a trained text detection model.
Step 220: perform feature extraction on the text to be detected to obtain a feature sequence of the text to be detected, where the feature sequence includes the context features corresponding to the multiple characters in the text to be detected.
Specifically, performing feature extraction on the text to be detected means extracting contextual semantic features from the text to be detected and converting it from characters into feature vectors, so as to obtain the context features corresponding to the multiple characters in the text and form the feature sequence. The context feature of a character contains the semantic information of that character in the text to be detected, and therefore also contains the abnormality information of that character in the text to be detected.
In an embodiment of this application, as shown in FIG. 3, the process of performing feature extraction on the text to be detected includes steps 310 to 320, as follows:
Step 310: split the text to be detected into multiple characters arranged in order, and convert each of the ordered characters into a corresponding character label according to a preset dictionary to obtain the character sequence of the text to be detected.
Specifically, character segmentation means splitting the text to be detected into multiple characters arranged in order, where the order of these characters is the order in which they appear in the text to be detected. Since characters cannot be processed directly, after segmentation each character needs to be converted into a corresponding character label, so that an ordered sequence of character labels, namely the character sequence of the text to be detected, is obtained. A character label is an identifier of a character, equivalent to the ID of the character, and is denoted as a Token.
In an embodiment of this application, characters may be converted into corresponding character labels according to a preset dictionary. The preset dictionary records a large number of characters and their corresponding character labels. The ordered characters are traversed, and for each character, the same character is looked up in the preset dictionary, and the character label corresponding to that entry is used as the character label of the character.
In an embodiment of this application, when the text to be detected includes multiple sentences, in order to recognize the sentences, a sentence start identifier [CLS] may be placed at the head of the text to be detected and a sentence end identifier [SEP] may be placed at the end of each sentence. Generally, the head of the text to be detected is before its first character. Since the end of a sentence carries punctuation, the punctuation marks in the text can be recognized first, and the sentence end identifier is then placed at the punctuation. In one case, when a character is followed by neither punctuation nor another character, that character can be regarded as the sentence end, and the sentence end identifier is placed after it. Of course, when the text to be detected contains only one sentence, the sentence end identifier is simply placed after the last character label. The resulting character sequence thus consists of the sentence start identifier, the character labels, and the sentence end identifiers.
For example, suppose the text to be detected is "我是中国人，我爱中国" ("I am Chinese; I love China"). The head of the text is before the first "我", so the sentence start identifier [CLS] is placed before the character label of "我". The "，" in the text is regarded as a sentence end, and the "国" in "我爱中国" is regarded as a sentence end, so two sentence end identifiers [SEP] are needed. After conversion, the character sequence is: [CLS] Token1 Token2 Token3 Token4 Token5 [SEP] Token6 Token7 Token8 Token9 [SEP], where the number of each Token only represents the order of the corresponding character in the text to be detected.
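To make the character-to-label conversion and the placement of [CLS]/[SEP] concrete, the following is a minimal Python sketch. The dictionary contents and label ids are purely illustrative assumptions; the patent only requires that each character map to some label in a preset dictionary.

```python
# Illustrative preset dictionary; the ids are made up, and a real system would
# load a full vocabulary covering every character it may encounter.
vocab = {"[CLS]": 101, "[SEP]": 102, "我": 10, "是": 11, "中": 12,
         "国": 13, "人": 14, "爱": 15}

def to_character_sequence(text):
    """Split the text into characters, place [CLS] at the head and [SEP] at each
    sentence end (punctuation or end of text), and map characters to labels."""
    tokens = ["[CLS]"]
    for ch in text:
        if ch in "，。！？,.!?":
            tokens.append("[SEP]")          # punctuation marks a sentence end
        else:
            tokens.append(ch)
    if tokens[-1] != "[SEP]":
        tokens.append("[SEP]")              # close the final sentence
    return [vocab[t] for t in tokens]

print(to_character_sequence("我是中国人，我爱中国"))
# [CLS] 我 是 中 国 人 [SEP] 我 爱 中 国 [SEP], expressed as label ids
```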
Step 320: perform context feature extraction on the character sequence to obtain the feature sequence of the text to be detected.
Specifically, context features are features that represent the context, semantics, and other information of the position where a character occurs, so they naturally also reflect the abnormality information of that position.
In an embodiment of this application, context feature extraction on the character sequence may be performed by a pretrained language model, which may be a BERT (Bidirectional Encoder Representations from Transformers) model.
A pretrained language model is one type of model in natural language processing. Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics, so research in this field involves natural language, the language people use every day, and is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, question answering, knowledge graphs, and other technologies.
In an embodiment of this application, the process of performing context feature extraction on the character sequence specifically includes: determining, according to each character label in the character sequence, the semantic vector and position vector corresponding to the character label; generating a vector to be feature-extracted according to the character labels in the character sequence and their corresponding semantic vectors and position vectors; and performing context feature extraction on the vector to be feature-extracted to obtain the feature sequence of the text to be detected.
Specifically, the semantic vector of a character represents the fusion of the global semantic information of the text to be detected and the semantic information of the character. For example, the semantic vector of a character may indicate the sentence in which the character is located (for example, when the text to be detected includes sentence A and sentence B, the semantic vector may indicate whether the character belongs to sentence A or sentence B) and the type of that sentence (such as title or body). The semantic vector is determined by the pretrained language model according to the character label and the text to be detected. The position vector of a character represents the position information of the character in the text to be detected; because characters at different positions in the text carry different semantic information, the position vector is added so that the position information of each character is taken into account during context feature extraction, making the extraction more accurate. The position vector is also determined by the pretrained language model according to the character label and the text to be detected.
After the semantic vector and position vector corresponding to each character label in the character sequence are determined, the character label is superimposed with its corresponding semantic vector and position vector to obtain the vector to be feature-extracted. For example, for the character sequence [CLS] Token1 Token2 Token3 Token4 Token5 [SEP], the semantic vectors corresponding to the character labels (in sequence order) are EC E1 E2 E3 E4 E5 ES, and the position vectors corresponding to the character labels (in sequence order) are PC P1 P2 P3 P4 P5 PS; the generated vector to be feature-extracted is then: [CLS]+EC+PC, Token1+E1+P1, Token2+E2+P2, Token3+E3+P3, Token4+E4+P4, Token5+E5+P5, [SEP]+ES+PS.
Finally, context feature extraction is performed on the vector to be feature-extracted to obtain the feature sequence. The feature sequence includes the context feature of each character; denoting the context feature of the i-th character as h_i, the feature sequence corresponding to a text to be detected with n characters is h_1 h_2 h_3 ... h_i ... h_n.
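The summation of character label, semantic (segment) vector, and position vector followed by contextual encoding is what a BERT encoder performs internally. As one possible realization (the patent names BERT but prescribes neither a specific checkpoint nor a specific library, so both are assumptions here), the feature sequence h_1 ... h_n could be obtained as in the following sketch:

```python
import torch
from transformers import BertTokenizer, BertModel  # assumed tooling, not mandated by the patent

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

text = "我是中国人，我爱中国"
inputs = tokenizer(text, return_tensors="pt")   # adds [CLS]/[SEP] and builds token/segment ids
with torch.no_grad():
    outputs = encoder(**inputs)                 # token, segment, and position embeddings are summed inside
h = outputs.last_hidden_state.squeeze(0)        # (seq_len, hidden): the context features h_1 ... h_n
```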
Continuing to refer to FIG. 2, Step 230: map the feature sequence separately through a plurality of preset models to obtain the processing result corresponding to each preset model, where the processing result of a preset model includes the abnormality probabilities of feature segments in the feature sequence, a feature segment includes the context features of at least one character, and the feature segments corresponding to the processing results of different preset models have different lengths.
Specifically, a preset model is used to predict the abnormality of the feature sequence. The processing result obtained by mapping the feature sequence with a preset model includes the abnormality probabilities of multiple feature segments; that is, the preset model divides the feature sequence into multiple feature segments and then predicts the abnormality probability of each feature segment. A feature segment corresponds to a part of the feature sequence, so a feature segment includes the context features of at least one character.
In the embodiments of this application, the feature sequence is mapped separately by a plurality of preset models, and the lengths of the feature segments in the processing results obtained by different preset models are all different. The length of a feature segment is the number of context features it contains (equivalent to the number of characters the feature segment corresponds to). For example, five preset models may be used to map the feature sequence separately: the feature segment length corresponding to the first preset model is 1, that of the second preset model is 2, that of the third is 3, that of the fourth is 4, and that of the fifth is 5.
In an embodiment of this application, the processing of the feature sequence by a preset model is as follows: according to the preset segment length corresponding to the preset model, the feature segments in the feature sequence are determined by a sliding-window method, and the feature segments are mapped to obtain their abnormality probabilities.
Specifically, the length of the feature segments into which a preset model divides the feature sequence is set in advance, namely the preset segment length. The preset segment length is taken as the width of a window, and the window is slid along the feature sequence; during the sliding, the part of the feature sequence inside the window is a feature segment. Thus, by the sliding-window method, the feature sequence is divided into multiple feature segments. After the feature segments are obtained, they are mapped to obtain their abnormality probabilities.
During sliding, the window moves by a set step size, with the window head as the reference point, so the distance between the head of the current window and the head of the previous window is the set step size. Generally, the step size is set to 1, that is, the window moves backward by the distance of one character each time, and the window head slides from the start of the feature sequence to its end. When the feature sequence is divided with a preset segment length k (k > 1), let the feature sequence be h_1 h_2 h_3 ... h_i ... h_n; a feature segment is then [h_i : h_{i+k}], meaning the feature segment runs from the context feature h_i of the i-th character to the context feature h_{i+k} of the (i+k)-th character, where i ranges from 1 to n. It can be seen that when i = n-k, h_{i+k} is h_n; when i increases further, i+k exceeds n, and h_{i+k} can then be replaced by 0. When the preset segment length is 1, the context features in the feature sequence are simply split one by one, that is, the text to be detected is divided into single characters, and each resulting feature segment is the context feature corresponding to one character of the text. For example, as shown in FIG. 4, taking a preset segment length of 2, a feature sequence h_1 h_2 h_3 h_4 h_5, and a window step size of 1, the feature segments are: [h_1:h_2], [h_2:h_3], [h_3:h_4], [h_4:h_5], [h_5:0].
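A minimal sketch of the sliding-window extraction of feature segments described above, with step size 1 and zero padding once the window runs past the end of the feature sequence (tensor shapes and the use of PyTorch are assumptions, not requirements of the patent):

```python
import torch

def sliding_window_segments(features, k, step=1):
    """features: (n, d) tensor of context features h_1 ... h_n.
    Returns a (num_segments, k, d) tensor; windows that run past the end of the
    sequence are padded with zeros, mirroring the replacement of h_{i+k} by 0."""
    n, d = features.shape
    segments = []
    for i in range(0, n, step):
        window = features[i:i + k]
        if window.shape[0] < k:                          # tail window: pad with zeros
            pad = torch.zeros(k - window.shape[0], d)
            window = torch.cat([window, pad], dim=0)
        segments.append(window)
    return torch.stack(segments)

# With k = 2 and a 5-feature sequence this yields [h1:h2], [h2:h3], ..., [h5:0].
```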
In an embodiment of this application, the process of mapping a feature segment is as follows: the abnormal feature representation of the feature segment is obtained through the convolutional layer of the preset model, and the abnormal feature representation is mapped through the fully connected layer of the preset model to obtain the abnormality probability of the feature segment.
Specifically, the preset model has a convolutional layer and a fully connected layer; the convolutional layer is used to extract the abnormal feature representation of a feature segment, and the fully connected layer is used to compute the abnormality probability from that representation. Concretely, the convolutional layer fuses all context features in the feature segment through its model parameters to obtain the abnormal feature representation. The model parameters of the convolutional layer include a first weight parameter and a first bias parameter, which are obtained during model training; the convolutional-layer parameters of different preset models are different. During fusion, all context features in the feature segment are first weighted and summed with the first weight parameter to obtain a weighted feature, and the weighted feature is then superimposed with the first bias parameter to obtain the abnormal feature representation of the feature segment. Denoting the first weight parameter of the convolutional layer of the preset model corresponding to preset segment length k as W_1k, the first bias parameter as b_1k, and all context features in the i-th feature segment as [h_i : h_{i+k}], the abnormal feature representation r_ki of the i-th feature segment is given by:
r_ki = W_1k [h_i : h_{i+k}] + b_1k
where k is the preset segment length, r_ki is the abnormal feature representation of the i-th feature segment extracted under preset segment length k, [h_i : h_{i+k}] is the i-th feature segment, and W_1k and b_1k are the model parameters of the convolutional layer in the preset model corresponding to preset segment length k. The preset segment length is also equivalent to the convolution kernel width of the convolutional layer; in effect, this application processes the feature sequence separately with convolution models of multiple granularities.
After the abnormal feature representation is obtained, it is mapped through the fully connected layer to obtain the abnormality probability of the feature segment. Concretely, the fully connected layer has a second weight parameter and a second bias parameter: the abnormal feature representation is first multiplied by the second weight parameter and then added to the second bias parameter to obtain the feature to be activated; finally, the feature to be activated is processed by the preset activation function of the fully connected layer to obtain the abnormality probability of the feature segment. The preset activation function may be a ReLU function, a Sigmoid function, a Softmax function, a Linear function, and so on, and may be selected according to actual needs. For example, denoting the second weight parameter of the fully connected layer of the preset model corresponding to preset segment length k as W_2k and the second bias parameter as b_2k, and using the Softmax function as the preset activation function, the abnormality probability of a feature segment is given by:
p_ki = softmax(W_2k r_ki + b_2k)
where p_ki is the abnormality probability of the i-th feature segment extracted under preset segment length k, r_ki is the abnormal feature representation of the i-th feature segment extracted under preset segment length k, and W_2k and b_2k are the model parameters of the fully connected layer in the preset model corresponding to preset segment length k.
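The convolutional layer with kernel width k followed by a fully connected layer and softmax can be sketched as below; the layer sizes, the padding strategy, and the framework are assumptions, and only the overall computation of r_ki and p_ki follows the formulas above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PresetModel(nn.Module):
    """One preset model for a fixed segment length k: a 1-D convolution whose
    kernel width is k fuses the k context features of each segment into the
    abnormal feature representation r_ki (weights W_1k, bias b_1k), and a fully
    connected layer with softmax maps r_ki to (normal, abnormal) probabilities
    (weights W_2k, bias b_2k)."""
    def __init__(self, hidden_size, k, repr_size=128):
        super().__init__()
        self.k = k
        self.conv = nn.Conv1d(hidden_size, repr_size, kernel_size=k)  # W_1k, b_1k
        self.fc = nn.Linear(repr_size, 2)                             # W_2k, b_2k

    def forward(self, feature_sequence):
        # feature_sequence: (batch, n, hidden_size) context features h_1 ... h_n
        x = feature_sequence.transpose(1, 2)          # (batch, hidden, n)
        x = F.pad(x, (0, self.k - 1))                 # zero-pad so every start position i gets a window
        r = self.conv(x).transpose(1, 2)              # (batch, n, repr_size): r_ki for every i
        logits = self.fc(r)                           # (batch, n, 2)
        return torch.softmax(logits, dim=-1)[..., 1]  # abnormality probability p_ki for every i
```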
In an embodiment of this application, before the feature sequence is mapped by the preset models, a construction process of the preset models is also included. As shown in FIG. 5, this process includes steps 510 to 530, as follows:
Step 510: obtain sample data consisting of multiple characters, where each character in the sample data carries a first label indicating an abnormal state.
Specifically, the sample data is abnormal text data annotated with abnormality information; it also consists of multiple characters, and each character carries a first label indicating an abnormal state. The abnormal state of a character refers to whether the character is abnormal, and it can be reflected by different values of the first label. For example, a first label of 0 indicates that the character is not abnormal (i.e., the character is normal), so this type of first label can be recorded as a normal label; a first label of 1 indicates that the character is abnormal, so this type of first label can be recorded as an abnormal label.
In an embodiment of this application, for the same sample data, different annotation methods may lead to different first labels for the characters in that sample data. For example, for the sample data "宠物的地位都人高" (an ill-formed sentence roughly meaning "the status of pets is higher than people's"), annotator 1 may consider "都人" to be ungrammatical (i.e., abnormal) and mark both characters of "都人" as 1 and the rest as 0; annotator 2 may consider the character "人" to be redundant and mark "人" as 1 and the rest as 0; annotator 3 may consider the character "都" to be ungrammatical and mark "都" as 1 and the rest as 0. Thus, for the sample data "宠物的地位都人高", the three annotation cases shown in the following table can be obtained:
Table 1
Sample data:  宠  物  的  地  位  都  人  高
Annotator 1:  0   0   0   0   0   1   1   0
Annotator 2:  0   0   0   0   0   0   1   0
Annotator 3:  0   0   0   0   0   1   0   0
Step 520: based on a plurality of preset segment lengths, determine multiple sample segments in the sample data for each preset segment length, and assign to each sample segment a second label indicating an abnormal state according to the first labels corresponding to the sample segment.
Specifically, a plurality of preset segment lengths is set; for each preset segment length, the sample data is divided to obtain the multiple sample segments corresponding to the sample data, and a second label is assigned to each sample segment according to the first label of each character in that sample segment. The second label is computed from the first labels and indicates the abnormal condition of the sample segment.
The process of obtaining the multiple sample segments of the sample data from a preset segment length is as follows: a window whose width is the preset segment length is set, and all characters of the sample data contained in the window form a sample segment, where the window slides from the start of the sample data to its end according to a set step size. That is, a window with the preset segment length as its width is set, and the window head is slid from the start of the sample data to its end according to the set step size; each time the window slides, all characters of the sample data contained in the window form one sample segment, so that multiple sample segments are obtained. When there are not enough characters in the window, it is padded with 0. Generally, the step size is set to 1. The process of obtaining sample segments is similar to the process of obtaining feature segments described above. For example, with a preset segment length of 2 and the sample data in Table 1 above, the sample segments are: 宠物, 物的, 的地, 地位, 位都, 都人, 人高, 高0.
Since the first label of each character in the sample data takes only the two values 0 and 1, directly using the first labels for model training would introduce a relatively large error into the loss function computed from the model's prediction and the first labels. For example, taking the annotation of annotator 2 in Table 1 above, the first label of "人" is 1, but the model would also predict a relatively high probability for "都", which leads to a large loss. Then, when the model parameters are updated by gradient back-propagation and the model is retrained, the model is misled into predicting a very small probability for "都". This confuses the model and reduces its prediction accuracy.
Considering the above, this application re-assigns a second label to each sample segment: during window sliding, the second label of a sample segment is generated according to the total number of abnormal labels within the window and the window width. The second label of a sample segment includes two parts, a normal indicator and an abnormal indicator, where the normal indicator represents the probability that the sample segment is in a normal state, the abnormal indicator represents the probability that the sample segment is in an abnormal state, and the sum of the two is 1; hence, once the abnormal indicator is determined, the normal indicator is determined as well.
In the embodiments of this application, during window sliding, the ratio of the total number of abnormal labels within the window to the window width is taken as the abnormal indicator in the second label of the sample segment, and the normal indicator of the sample segment is 1 minus the abnormal indicator. When the window width is 1 (i.e., the preset segment length is 1), a sample segment is a single character of the sample data, so the second label of the sample segment is the same as the first label of that character. When the window width is 2, taking the annotation of annotator 2 in Table 1 above as an example, the sample segments and their corresponding second labels (expressed in the form (abnormal indicator, normal indicator)) are: 宠物: (0, 1), 物的: (0, 1), 的地: (0, 1), 地位: (0, 1), 位都: (0, 1), 都人: (0.5, 0.5), 人高: (0.5, 0.5), 高0: (0, 1), where 宠物: (0, 1) means the abnormal indicator of the sample segment 宠物 is 0 and its normal indicator is 1. When the window width is 3, again taking annotator 2's annotation in Table 1 as an example, the sample segments and their corresponding second labels are: 宠物的: (0, 1), 物的地: (0, 1), 的地位: (0, 1), 地位都: (0, 1), 位都人: (0.33, 0.67), 都人高: (0.33, 0.67), 人高0: (0.33, 0.67), 高00: (0, 1). It can be seen that the values in the second labels are no longer limited to two values; the second labels are smoother, which effectively alleviates the influence of errors in the first labels on the model.
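A small sketch of the second-label construction, using annotator 2's first labels for "宠物的地位都人高" from Table 1 (plain Python lists are used here purely for illustration):

```python
def second_labels(first_labels, k, step=1):
    """first_labels: per-character 0/1 first labels of one sample.
    Returns an (abnormal indicator, normal indicator) pair for every window of
    width k; windows that run past the end are padded with 0-labelled slots."""
    n = len(first_labels)
    labels = []
    for i in range(0, n, step):
        window = first_labels[i:i + k] + [0] * max(0, k - (n - i))
        abnormal = sum(window) / k          # share of abnormal labels in the window
        labels.append((abnormal, 1.0 - abnormal))
    return labels

# Annotator 2: only "人" (the 7th character) is marked abnormal.
print(second_labels([0, 0, 0, 0, 0, 0, 1, 0], k=2))
# (0, 1) everywhere except (0.5, 0.5) for the windows 都人 and 人高
```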
Step 530: take the sample data with second labels corresponding to each preset segment length as training samples, and train a neural network model with the training samples to obtain the preset model corresponding to each preset segment length.
Specifically, after the sample segments in the sample data are assigned second labels, the sample data can be used as training samples to train the neural network model. Through the processing of the preceding steps, a set of sample data with second labels is obtained for each preset segment length, that is, each preset segment length corresponds to one set of training samples. During training, the training samples corresponding to a preset segment length are used to train the neural network model of that preset segment length, yielding the preset model corresponding to that preset segment length.
In an embodiment of this application, during training of the neural network model, the cross entropy between the prediction of the neural network model for a training sample and the second label of the training sample is used as the loss function, and the model parameters of the neural network model are updated based on the loss function; the model parameters are the parameters W_1k, b_1k, W_2k, b_2k, and so on, from the preceding steps.
Specifically, the loss function Lossk is computed as follows:
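(The formula itself is rendered only as an image in the original. The following LaTeX is a reconstruction in the standard cross-entropy form implied by the surrounding definitions; the summation over segments i and the absence of averaging are assumptions.)

```latex
\mathrm{Loss}_k = -\sum_{i}\left( y_{0ki}\,\log p_{0ki} + y_{1ki}\,\log p_{1ki} \right)
```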
where y0ki denotes the normal label of the i-th sample segment at preset segment length k, y1ki denotes the abnormal label of the i-th sample segment at preset segment length k, and y0ki + y1ki = 1; p0ki denotes the probability, predicted by the neural network model for preset segment length k, that the i-th sample segment is normal, and p1ki denotes the probability, predicted by that model, that the i-th sample segment is abnormal.
在本申请的一个实施例中,在对各个预设片段长度对应的神经网络模型进行训练时,可以采用迭代训练法,即按照预设片段长度的从小到大的顺序依次对各模型进行训练。In an embodiment of the present application, when training the neural network models corresponding to each preset segment length, an iterative training method may be used, that is, each model is sequentially trained in order of the preset segment lengths from small to large.
In an embodiment of the present application, the preset models may also be obtained by training other suitable machine learning models. Machine learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other subjects. It studies how computers simulate or implement human learning behaviour in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
Continuing to refer to FIG. 2, in step 240, the abnormal segment in the text to be detected is determined according to the abnormal probabilities of the feature segments indicated by the processing results of the respective preset models.
Specifically, the processing result of one preset model includes the abnormal probabilities of multiple feature segments of one length, and the processing results of the multiple preset models together include the abnormal probabilities of feature segments of multiple lengths. The maximum among these abnormal probabilities is determined, and the words indicated by the feature segment corresponding to this maximum abnormal probability are taken as the abnormal segment in the text to be detected, which fixes both the length and the position of the abnormal segment. For example, if the feature segment with the maximum abnormal probability is the i-th feature segment at preset segment length k, the abnormal segment in the text to be detected is determined to be the segment of k words starting at the i-th word.
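A small Python sketch of this selection step follows; the helper and variable names are hypothetical, and the per-model probability lists are assumed to be indexed by the segment's starting word.

```python
# Hypothetical sketch of step 240: pick the feature segment with the largest abnormal probability.
def locate_abnormal_segment(results):
    """results: {preset segment length k: [abnormal probability per start index i]}."""
    best_k, best_i, best_p = None, None, -1.0
    for k, probs in results.items():
        for i, p in enumerate(probs):
            if p > best_p:
                best_k, best_i, best_p = k, i, p
    return best_i, best_k, best_p     # start index, segment length, probability

start, length, score = locate_abnormal_segment({1: [0.10, 0.20, 0.15], 2: [0.30, 0.80]})
print(start, length, score)          # 1 2 0.8 -> a two-word segment starting at word index 1
```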
In the technical solution provided by the embodiments of the present application, the feature sequence is processed by multiple preset models to obtain processing results that include the abnormal probabilities of feature segments. In other words, the text to be detected is divided into multiple segments for anomaly detection instead of being detected as one whole sentence, so the local features of the text to be detected are fully taken into account during detection and the detection precision is improved. At the same time, since the feature segments corresponding to the processing results of different preset models have different lengths, the text is in effect examined by models of several granularities, which further improves the accuracy and precision of the detection results.
图6示意性地示出了应用本申请技术方案的模型结构图。如图6所示,该模型结构包括:FIG. 6 schematically shows a model structure diagram to which the technical solution of the present application is applied. As shown in Figure 6, the model structure includes:
A text embedding module (TokenEMBEDDING) 610, which performs word segmentation on the text to be detected so as to convert the text to be detected 611 into a word sequence composed of word labels (tokens); for details, refer to the description of step 310 above, which is not repeated here.
A vector superposition module (TASKEMBEDDING) 620, which superimposes the word labels in the word sequence with their corresponding semantic vectors and position vectors to generate the to-be-feature-extracted vector 621; for details, refer to the description of step 320 above, which is not repeated here.
A BERT model (BERT MODEL) 630; the BERT model is a pre-trained language model used to perform contextual feature extraction on the to-be-feature-extracted vector and to output the feature sequence 631.
A multi-granularity convolution module 640, which includes convolution models of five granularities (grams), where the granularity is the size of the convolution kernel, i.e. the preset segment length. In the embodiment of the present application the granularities of the five convolution models are 1, 2, 3, 4 and 5 respectively. The convolution model of each granularity separately maps the output feature sequence 631 to obtain the abnormal probabilities of the feature segments in the feature sequence, the length of a feature segment being equal to the granularity of the corresponding convolution model. The abnormal probability of a feature segment is equivalent to the prediction score (score) of the convolution model for that feature segment; finally, the maximum value (MAXscore) is selected from all the scores, which determines the abnormal segment in the text to be detected 611.
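A minimal PyTorch-style sketch of the multi-granularity head in FIG. 6 is given below. It is an illustration rather than the patent's implementation: the hidden size of 768, the sigmoid scoring and the layer names are assumptions, and the BERT encoder that produces the input features is omitted.

```python
import torch
import torch.nn as nn

class MultiGranularityHead(nn.Module):
    """Sketch of the 5-gram convolution module 640; sizes and names are assumptions."""
    def __init__(self, hidden_size=768, grams=(1, 2, 3, 4, 5)):
        super().__init__()
        self.grams = grams
        # One Conv1d per granularity: kernel size = preset segment length
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden_size, hidden_size, kernel_size=k) for k in grams])
        # Fully connected layer mapping each segment representation to an abnormal score
        self.scorers = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in grams])

    def forward(self, features):                 # features: (batch, seq_len, hidden)
        x = features.transpose(1, 2)             # (batch, hidden, seq_len) for Conv1d
        scores = []
        for conv, scorer in zip(self.convs, self.scorers):
            seg = conv(x).transpose(1, 2)        # (batch, seq_len - k + 1, hidden)
            scores.append(torch.sigmoid(scorer(seg)).squeeze(-1))
        return scores                            # one score tensor per granularity

head = MultiGranularityHead()
feats = torch.randn(1, 20, 768)                  # e.g. BERT output for a 20-word text
per_gram_scores = head(feats)
overall_max = max(s.max().item() for s in per_gram_scores)   # MAXscore
```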
图7示意性地示出了在一种场景下,本申请技术方案的应用流程图。如图7所示,该流程包括:FIG. 7 schematically shows an application flow chart of the technical solution of the present application in a scenario. As shown in Figure 7, the process includes:
S710: Obtain the non-fluent text. The non-fluent text is the abnormal text, and the abnormal segment in it can be determined by the technical solution of the present application.
S720: Input the non-fluent text into the non-fluent-segment detection model. The non-fluent-segment detection model is a model implementing the technical solution of the present application: it performs feature extraction on the non-fluent text to obtain a feature sequence, and then maps the feature sequence through multiple preset models to obtain the processing result of each preset model, where the processing result of a preset model includes the abnormal probabilities of feature segments in the feature sequence, a feature segment includes the contextual feature of at least one word, and the feature segments corresponding to the processing results of different preset models have different lengths. Finally, the abnormal segment in the text to be detected is determined from the maximum of the abnormal probabilities of the feature segments indicated by the processing results of the preset models.
S730: Locate the non-fluent segment according to the model prediction result. The specific position of the non-fluent segment is determined from the output of the non-fluent-segment detection model.
S740: Machine review system. The localization result is fed into a machine review system, which also supports manual review.
S750: Highlight the non-fluent segment. Highlighting enables the reviewer to identify the non-fluent segments in the non-fluent text more quickly.
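As a toy illustration of S750, the span returned by the detection model could be marked before being shown to a reviewer; the bracket markers and the example indices below are assumptions, not part of the patent.

```python
# Hypothetical highlighting helper for the located non-fluent span (end index exclusive).
def highlight(text, start, end):
    return text[:start] + "[[" + text[start:end] + "]]" + text[end:]

print(highlight("宠物的地位都人高", 6, 7))   # 宠物的地位都[[人]]高
```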
It should be noted that although the steps of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all of the illustrated steps must be performed, in order to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be split into multiple steps, and so on.
以下介绍本申请的装置实施例,可以用于执行本申请上述实施例中的异常文本的检测方法。图8示意性地示出了本申请实施例提供的异常文本的检测装置的结构框图。如图8所示,本申请实施例提供的异常文本的检测装置包括:The following describes the apparatus embodiments of the present application, which can be used to execute the abnormal text detection method in the above-mentioned embodiments of the present application. FIG. 8 schematically shows a structural block diagram of an abnormal text detection apparatus provided by an embodiment of the present application. As shown in FIG. 8 , the apparatus for detecting abnormal text provided by the embodiment of the present application includes:
a text acquisition module 810, configured to acquire a text to be detected that consists of multiple words;
a feature extraction module 820, configured to perform feature extraction on the text to be detected to obtain a feature sequence of the text to be detected, the feature sequence including contextual features corresponding to the multiple words of the text to be detected;
a mapping processing module 830, configured to separately map the feature sequence through multiple preset models to obtain the processing result of each preset model, where the processing result of a preset model includes the abnormal probabilities of different feature segments in the feature sequence, a feature segment includes the contextual feature of at least one word, and the feature segments corresponding to the processing results of different preset models have different lengths; and
an abnormal segment determination module 840, configured to determine the abnormal segment in the text to be detected according to the abnormal probabilities of the feature segments indicated by the processing results of the preset models.
在本申请的一个实施例中,所述装置还包括:In an embodiment of the present application, the device further includes:
样本数据获取模块,用于获取由多个字组成的样本数据,所述样本数据中的字具有指示异常状态的第一标签;a sample data acquisition module for acquiring sample data consisting of a plurality of words, the words in the sample data having a first label indicating an abnormal state;
a second label generation module, configured to determine, for each of multiple preset segment lengths, multiple sample segments in the sample data according to that preset segment length, and to assign to each sample segment a second label indicating an abnormal state according to the first labels corresponding to that sample segment;
模型训练模块,用于将各个预设片段长度对应的具有第二标签的样本数据作为训练样本,通过所述训练样本对神经网络模型进行训练,得到各个预设片段长度对应的预设模型。The model training module is configured to use the sample data with the second label corresponding to each preset segment length as a training sample, and train the neural network model through the training samples to obtain preset models corresponding to each preset segment length.
在本申请的一个实施例中,所述第二标签生成模块具体用于:In an embodiment of the present application, the second label generation module is specifically used for:
setting a window whose width is the preset segment length, and taking all the words of the sample data contained in the window as a sample segment, wherein the window slides from the start position to the end position of the sample data according to a set step size.
In an embodiment of the present application, the first label includes a normal label and an abnormal label, and the second label generation module is further configured to:
根据所述窗口内的异常标签总量和所述窗口宽度生成所述样本片段的第二标签。A second label for the sample segment is generated based on the total amount of abnormal labels within the window and the window width.
在本申请的一个实施例中,在所述神经网络模型的训练过程中,将所述神经网络模型针对所述训练样本的预测值与所述训练样本的第二标签之间的交叉熵作为损失函数,基于所述损失函数更新所述神经网络模型的模型参数。In an embodiment of the present application, in the training process of the neural network model, the cross-entropy between the predicted value of the neural network model for the training sample and the second label of the training sample is used as the loss function to update model parameters of the neural network model based on the loss function.
在本申请的一个实施例中,特征提取模块820包括:In one embodiment of the present application, the
a word segmentation unit, configured to perform word segmentation on the text to be detected to obtain multiple words arranged in order, and to convert each of the words arranged in order into a corresponding word label according to a preset dictionary, so as to obtain the word sequence of the text to be detected;
特征提取单元,用于对所述字序列进行上下文特征提取,得到所述待检测文本的特征序列。A feature extraction unit, configured to perform context feature extraction on the word sequence to obtain the feature sequence of the text to be detected.
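For illustration, a minimal Python sketch of the word segmentation unit's conversion against a preset dictionary is given below; the dictionary contents and the id reserved for unknown words are assumptions.

```python
# Hypothetical preset dictionary mapping each word to a word label (token id).
PRESET_DICT = {"宠": 1, "物": 2, "的": 3, "地": 4, "位": 5, "都": 6, "人": 7, "高": 8}
UNK_ID = 0   # assumed id for words not in the preset dictionary

def to_word_sequence(text):
    """Split the text to be detected into words and convert each word to its word label."""
    return [PRESET_DICT.get(word, UNK_ID) for word in text]

print(to_word_sequence("宠物的地位"))   # [1, 2, 3, 4, 5]
```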
在本申请的一个实施例中,所述特征提取单元具体用于:In an embodiment of the present application, the feature extraction unit is specifically used for:
根据所述字序列中的字标签确定所述字标签对应的语义向量和位置向量;Determine the semantic vector and the position vector corresponding to the word label according to the word label in the word sequence;
generate a to-be-feature-extracted vector according to the word labels in the word sequence and the semantic vectors and position vectors corresponding to the word labels; and
perform contextual feature extraction on the to-be-feature-extracted vector to obtain the feature sequence of the text to be detected.
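The vector construction in the first two items can be sketched as an element-wise sum of a semantic (token) vector and a position vector per word label; the dimensions and random initialisation below are illustrative assumptions only.

```python
import numpy as np

DIM, VOCAB_SIZE, MAX_LEN = 8, 100, 32
rng = np.random.default_rng(0)
semantic_table = rng.normal(size=(VOCAB_SIZE, DIM))   # semantic vector per word label
position_table = rng.normal(size=(MAX_LEN, DIM))      # position vector per position

def to_feature_extraction_input(word_labels):
    """Superimpose each word label's semantic vector with its position vector."""
    return np.stack([semantic_table[t] + position_table[i]
                     for i, t in enumerate(word_labels)])

print(to_feature_extraction_input([1, 2, 3, 4, 5]).shape)   # (5, 8)
```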
在本申请的一个实施例中,映射处理模块830包括:In an embodiment of the present application, the
特征片段确定单元,用于根据所述预设模型对应的预设片段长度,通过滑动窗口法确定所述特征序列中的特征片段;a feature segment determining unit, configured to determine the feature segment in the feature sequence by a sliding window method according to the preset segment length corresponding to the preset model;
异常特征表示获取单元,用于通过所述预设模型的卷积层获取所述特征片段的异常特征表示;an abnormal feature representation acquisition unit, configured to acquire the abnormal feature representation of the feature segment through the convolution layer of the preset model;
异常概率获取单元,用于通过所述预设模型的全连接层对所述异常特征表示进行映射处理,得到所述特征片段的异常概率。An abnormal probability acquisition unit, configured to perform mapping processing on the abnormal feature representation through the fully connected layer of the preset model to obtain the abnormal probability of the feature segment.
在本申请的一个实施例中,所述异常特征表示获取单元具体用于:In an embodiment of the present application, the abnormal feature representation acquisition unit is specifically configured to:
根据所述预设模型的卷积层的模型参数对所述特征片段中的所有上下文特征进行融合,得到所述特征片段的异常特征表示。All context features in the feature segment are fused according to the model parameters of the convolutional layer of the preset model to obtain an abnormal feature representation of the feature segment.
在本申请的一个实施例中,所述卷积层的模型参数包括第一权重参数和第一基值参数;所述异常特征表示获取单元进一步用于:In an embodiment of the present application, the model parameters of the convolution layer include a first weight parameter and a first base value parameter; the abnormal feature representation acquisition unit is further configured to:
通过所述第一权重参数对所述特征片段中的所有上下文特征进行加权求和,得到权值特征;Weighted summation is performed on all context features in the feature segment by using the first weight parameter to obtain a weight feature;
将所述权值特征与所述第一基值参数叠加,得到所述特征片段的异常特征表示。The weight feature and the first base value parameter are superimposed to obtain an abnormal feature representation of the feature segment.
在本申请的一个实施例中,所述预设模型的全连接层包括第二权重参数和第二基值参数;所述异常概率获取单元具体用于:In an embodiment of the present application, the fully connected layer of the preset model includes a second weight parameter and a second base value parameter; the abnormal probability acquisition unit is specifically used for:
将所述异常特征表示与所述第二权重参数相乘后再与所述第二基值参数相加,得到待激活特征;Multiplying the abnormal feature representation by the second weight parameter and then adding the second base value parameter to obtain the feature to be activated;
通过预设激活函数对所述待激活特征进行处理,得到所述特征片段的异常概率。The feature to be activated is processed through a preset activation function to obtain the abnormal probability of the feature segment.
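Numerically, the computation performed by these units can be sketched as follows, under the assumption that the k contextual features of a segment are concatenated before the weighted combination; all shapes and the sigmoid activation are illustrative choices, not values fixed by the patent.

```python
import numpy as np

d, k = 4, 2                                                  # assumed feature size and segment length
rng = np.random.default_rng(1)
W1k, b1k = rng.normal(size=(d, k * d)), rng.normal(size=d)   # first weight / first base value
W2k, b2k = rng.normal(size=(1, d)), rng.normal(size=1)       # second weight / second base value

def segment_abnormal_probability(context_features):
    """context_features: (k, d) contextual features of one feature segment."""
    fused = W1k @ context_features.reshape(-1) + b1k    # abnormal feature representation
    logit = W2k @ fused + b2k                           # feature to be activated
    return 1.0 / (1.0 + np.exp(-logit))                 # preset activation (sigmoid) -> probability

print(segment_abnormal_probability(rng.normal(size=(k, d))))
```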
在本申请的一个实施例中,异常片段确定模块840具体用于:In an embodiment of the present application, the abnormal
确定各个预设模型处理结果所指示的特征片段的异常概率中的最大异常概率;determining the maximum abnormal probability among the abnormal probabilities of the feature segments indicated by the processing results of each preset model;
将所述最大异常概率对应特征片段所指示的多个字作为所述待检测文本中的异常片段。A plurality of words indicated by the feature segments corresponding to the maximum abnormal probability are used as abnormal segments in the text to be detected.
本申请各实施例中提供的异常文本的检测装置的具体细节已经在对应的方法实施例中进行了详细的描述,此处不再赘述。The specific details of the abnormal text detection apparatus provided in each embodiment of the present application have been described in detail in the corresponding method embodiments, and are not repeated here.
图9示意性地示出了用于实现本申请实施例的电子设备的计算机系统结构框图。FIG. 9 schematically shows a structural block diagram of a computer system for implementing an electronic device according to an embodiment of the present application.
需要说明的是,图9示出的电子设备的计算机系统900仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。It should be noted that the
As shown in FIG. 9, the computer system 900 includes a central processing unit 901 (Central Processing Unit, CPU), which can perform various appropriate actions and processes according to a program stored in a read-only memory 902 (Read-Only Memory, ROM) or a program loaded from a storage portion 908 into a random access memory 903 (Random Access Memory, RAM). The random access memory 903 also stores various programs and data required for system operation. The central processing unit 901, the read-only memory 902 and the random access memory 903 are connected to one another through a bus 904. An input/output interface 905 (Input/Output interface, i.e. an I/O interface) is also connected to the bus 904.
以下部件连接至输入/输出接口905:包括键盘、鼠标等的输入部分906;包括诸如阴极射线管(Cathode Ray Tube,CRT)、液晶显示器(Liquid Crystal Display,LCD)等以及扬声器等的输出部分907;包括硬盘等的存储部分908;以及包括诸如局域网卡、调制解调器等的网络接口卡的通信部分909。通信部分909经由诸如因特网的网络执行通信处理。驱动器910也根据需要连接至输入/输出接口905。可拆卸介质911,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器910上,以便于从其上读出的计算机程序根据需要被安装入存储部分908。The following components are connected to the input/output interface 905: an
特别地,根据本申请的实施例,各个方法流程图中所描述的过程可以被实现为计算机软件程序。例如,本申请的实施例包括一种计算机程序产品,其包括承载在计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信部分909从网络上被下载和安装,和/或从可拆卸介质911被安装。在该计算机程序被中央处理器901执行时,执行本申请的系统中限定的各种功能。In particular, according to the embodiments of the present application, the processes described in the flowcharts of the respective methods may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network via the
It should be noted that the computer-readable medium shown in the embodiments of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program which can be used by, or in combination with, an instruction execution system, apparatus or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and it can send, propagate or transmit a program for use by, or in combination with, an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wired, and the like, or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the architecture, functionality and operation of possible implementations of the systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, and the module, program segment, or part of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of a block diagram or flowchart, and any combination of blocks in a block diagram or flowchart, can be implemented by a dedicated hardware-based system that performs the specified function or operation, or by a combination of dedicated hardware and computer instructions.
应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元,但是这种划分并非强制性的。实际上,根据本申请的实施方式,上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之,上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体化。It should be noted that although several modules or units of the apparatus for action performance are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the present application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into multiple modules or units to be embodied.
From the description of the above embodiments, those skilled in the art can readily understand that the exemplary embodiments described here may be implemented by software, or by software in combination with the necessary hardware. Therefore, the technical solutions according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiments of the present application.
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。Other embodiments of the present application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of this application that follow the general principles of this application and include common knowledge or conventional techniques in the technical field not disclosed in this application .
应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本申请的范围仅由所附的权利要求来限制。It is to be understood that the present application is not limited to the precise structures described above and illustrated in the accompanying drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
Claims (15)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210073277.1A CN114490935B (en) | 2022-01-21 | 2022-01-21 | Abnormal text detection method, device, computer readable medium and electronic device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210073277.1A CN114490935B (en) | 2022-01-21 | 2022-01-21 | Abnormal text detection method, device, computer readable medium and electronic device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114490935A true CN114490935A (en) | 2022-05-13 |
| CN114490935B CN114490935B (en) | 2025-06-20 |
Family
ID=81473081
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210073277.1A Active CN114490935B (en) | 2022-01-21 | 2022-01-21 | Abnormal text detection method, device, computer readable medium and electronic device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114490935B (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117093853A (en) * | 2023-10-18 | 2023-11-21 | 腾讯科技(深圳)有限公司 | Time sequence data processing method and device, computer readable medium and electronic equipment |
| CN117151074A (en) * | 2023-08-29 | 2023-12-01 | 同方知网数字出版技术股份有限公司 | A detection method, device, medium and equipment for AI-generated text |
Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20180365560A1 (en) * | 2017-06-19 | 2018-12-20 | International Business Machines Corporation | Context aware sensitive information detection |
| CN110457428A (en) * | 2019-06-26 | 2019-11-15 | 北京印刷学院 | Sensitive word detection and filtering method, device and electronic equipment |
| US20200192983A1 (en) * | 2018-12-17 | 2020-06-18 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and device for correcting error in text |
| US20210056173A1 (en) * | 2019-08-21 | 2021-02-25 | International Business Machines Corporation | Extracting meaning representation from text |
| CN112434131A (en) * | 2020-11-24 | 2021-03-02 | 平安科技(深圳)有限公司 | Text error detection method and device based on artificial intelligence, and computer equipment |
| CN112464641A (en) * | 2020-10-29 | 2021-03-09 | 平安科技(深圳)有限公司 | BERT-based machine reading understanding method, device, equipment and storage medium |
| CN112732912A (en) * | 2020-12-30 | 2021-04-30 | 平安科技(深圳)有限公司 | Sensitive tendency expression detection method, device, equipment and storage medium |
| CN113221906A (en) * | 2021-05-27 | 2021-08-06 | 江苏奥易克斯汽车电子科技股份有限公司 | Image sensitive character detection method and device based on deep learning |
| CN113705234A (en) * | 2021-03-19 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Named entity recognition method and device, computer readable medium and electronic equipment |
| US20220013111A1 (en) * | 2019-11-14 | 2022-01-13 | Tencent Technology (Shenzhen) Company Limited | Artificial intelligence-based wakeup word detection method and apparatus, device, and medium |
-
2022
- 2022-01-21 CN CN202210073277.1A patent/CN114490935B/en active Active
Patent Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20180365560A1 (en) * | 2017-06-19 | 2018-12-20 | International Business Machines Corporation | Context aware sensitive information detection |
| US20200192983A1 (en) * | 2018-12-17 | 2020-06-18 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and device for correcting error in text |
| CN110457428A (en) * | 2019-06-26 | 2019-11-15 | 北京印刷学院 | Sensitive word detection and filtering method, device and electronic equipment |
| US20210056173A1 (en) * | 2019-08-21 | 2021-02-25 | International Business Machines Corporation | Extracting meaning representation from text |
| US20220013111A1 (en) * | 2019-11-14 | 2022-01-13 | Tencent Technology (Shenzhen) Company Limited | Artificial intelligence-based wakeup word detection method and apparatus, device, and medium |
| CN112464641A (en) * | 2020-10-29 | 2021-03-09 | 平安科技(深圳)有限公司 | BERT-based machine reading understanding method, device, equipment and storage medium |
| CN112434131A (en) * | 2020-11-24 | 2021-03-02 | 平安科技(深圳)有限公司 | Text error detection method and device based on artificial intelligence, and computer equipment |
| CN112732912A (en) * | 2020-12-30 | 2021-04-30 | 平安科技(深圳)有限公司 | Sensitive tendency expression detection method, device, equipment and storage medium |
| CN113705234A (en) * | 2021-03-19 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Named entity recognition method and device, computer readable medium and electronic equipment |
| CN113221906A (en) * | 2021-05-27 | 2021-08-06 | 江苏奥易克斯汽车电子科技股份有限公司 | Image sensitive character detection method and device based on deep learning |
Non-Patent Citations (3)
| Title |
|---|
| XINYU ZHOU ET AL.: "An Efficient and Accurate Scene Text Detector", ARXIV, 10 July 2017 (2017-07-10), pages 1 - 10 * |
| 王汀等: "一种面向中文网络百科非结构化信息的知识获取方法", 图书情报工作, vol. 60, no. 13, 5 July 2016 (2016-07-05), pages 126 - 132 * |
| 钟辉等: "一种基于数据分析的字符切分方法", 沈阳建筑大学学报(自然科学版), vol. 22, no. 1, 25 February 2006 (2006-02-25), pages 158 - 162 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117151074A (en) * | 2023-08-29 | 2023-12-01 | 同方知网数字出版技术股份有限公司 | A detection method, device, medium and equipment for AI-generated text |
| CN117151074B (en) * | 2023-08-29 | 2025-08-01 | 同方知网数字科技有限公司 | AI generated text detection method, device, medium and equipment |
| CN117093853A (en) * | 2023-10-18 | 2023-11-21 | 腾讯科技(深圳)有限公司 | Time sequence data processing method and device, computer readable medium and electronic equipment |
Also Published As
| Publication number | Publication date |
|---|---|
| CN114490935B (en) | 2025-06-20 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |