CN117251685B

CN117251685B - Knowledge graph-based standardized government affair data construction method and device

Info

Publication number: CN117251685B
Application number: CN202311544685.1A
Authority: CN
Inventors: 谢真强; 董厚泽; 冯璐萍; 杨永鑫; 邹佳; 洒科进
Original assignee: CETC Big Data Research Institute Co Ltd
Current assignee: CETC Big Data Research Institute Co Ltd
Priority date: 2023-11-20
Filing date: 2023-11-20
Publication date: 2024-01-26
Anticipated expiration: 2043-11-20
Also published as: WO2025108197A1; LU600011B1; CN117251685A

Abstract

The present invention provides a method and device for constructing standardized government data based on a knowledge graph, which belongs to the field of data processing technology. The method of constructing standardized government data based on a knowledge graph utilizes a feature extraction model through seed words based on government scenarios. , identify multiple initial entities in the government affairs scenario, then use the mutual information value between adjacent initial entities to obtain the first phrase entity, and obtain the second phrase entity by calculating the left and right entropy based on the mutual information value, The scope of phrase entities is further expanded, and the target entity is finally obtained, which realizes the extraction of phrase entities that are nested combinations of multiple words, and then obtains richer and more accurate entities to build the knowledge graph in government affairs scenarios.

Description

A standardized government data construction method and device based on knowledge graph

技术领域Technical field

本发明涉及数据处理技术领域，尤其涉及一种基于知识图谱的标准化政务数据构建方法和装置。The present invention relates to the field of data processing technology, and in particular to a method and device for constructing standardized government data based on knowledge graphs.

背景技术Background technique

知识图谱可以形式化地描述现实世界中的事物及其相关关系。知识图谱采用实体、关系、实体的三元组形式存储知识，并以实体为节点、关系为边来构建知识网络。目前，很多知名的知识图谱项目组织了大量的数据，从中可以提取知识来进行组织和管理，为用户提供高质量的智能服务，如理解搜索的语义，提供更精准的搜索答案等。Knowledge graphs can formally describe things in the real world and their relationships. The knowledge graph uses entities, relationships, and entity triples to store knowledge, and uses entities as nodes and relationships as edges to build a knowledge network. At present, many well-known knowledge graph projects organize a large amount of data, from which knowledge can be extracted for organization and management, and provide users with high-quality intelligent services, such as understanding the semantics of search and providing more accurate search answers.

在政务场景下，知识图谱可以起到很重要的作用。政务领域涉及大量的实体和关系，例如政府机构、部门、官员、法规、政策等，这些实体和关系之间存在复杂的联系和依赖关系。知识图谱可以将这些实体和关系以图的形式进行建模，从而实现对政务信息的有效管理和应用。在构建政务场景的知识图谱时，需要聚焦政务服务知识库构建过程中数据(知识)的标准化表示与处理。其中知识图谱的构建过程包括政务场景中数据的获取、清洗处理以及知识(实体、关系、属性)抽取，进而实现知识表示、本体构建、知识存储以及知识融合。In government affairs scenarios, knowledge graphs can play a very important role. The field of government affairs involves a large number of entities and relationships, such as government agencies, departments, officials, regulations, policies, etc. There are complex connections and dependencies between these entities and relationships. Knowledge graphs can model these entities and relationships in the form of graphs to achieve effective management and application of government information. When building a knowledge graph for government affairs scenarios, it is necessary to focus on the standardized representation and processing of data (knowledge) in the process of building a government service knowledge base. The construction process of the knowledge graph includes the acquisition, cleaning and extraction of knowledge (entities, relationships, attributes) of data in government affairs scenarios, thereby achieving knowledge representation, ontology construction, knowledge storage and knowledge fusion.

但是，政务场景的文本数据通常较为复杂，包含大量专业术语和上下文相关信息，实体和关系的抽取可能面临挑战。现有的实体提取方法需要先进行分词，进而得到实体。分词方法大多采用基于词典的分词方法，使用的词典为跨领域通用词典，多为常用词汇，缺乏政务场景下的专有词汇。而且政务场景中的文本实体很多都是多个单词的嵌套组合，对于多个单词的嵌套组合，例如，针对“房屋租赁当事人”一词，按照常规的分词方式提取的实体会被分为“房屋租赁”以及“当事人”，这种方法的效果并不理想。However, the text data of government affairs scenarios are usually complex and contain a large amount of professional terms and context-related information, and the extraction of entities and relationships may face challenges. Existing entity extraction methods need to segment words first and then obtain entities. Most word segmentation methods use dictionary-based word segmentation methods. The dictionaries used are cross-domain general dictionaries, mostly commonly used vocabulary, and lack of proprietary vocabulary in government affairs scenarios. Moreover, many text entities in government affairs scenarios are nested combinations of multiple words. For nested combinations of multiple words, for example, for the word "house rental party", the entities extracted according to the conventional word segmentation method will be divided into "House rental" and "parties", the effect of this method is not ideal.

发明内容Contents of the invention

本发明提供一种基于知识图谱的标准化政务数据构建方法和装置，用以解决现有技术中实体提取效果不够理想的缺陷，实现精准提取政务场景下实体的效果。The present invention provides a standardized government data construction method and device based on knowledge graphs to solve the shortcomings of the existing technology that the entity extraction effect is not ideal, and achieve the effect of accurately extracting entities in government affairs scenarios.

本发明提供一种基于知识图谱的标准化政务数据构建方法，包括：The present invention provides a standardized government data construction method based on knowledge graph, including:

对政务场景的语料库进行数据预处理；Perform data preprocessing on the corpus of government affairs scenarios;

基于政务场景的种子词，利用特征提取模型识别出预处理后的语料库中的初始实体；Based on the seed words of the government affairs scenario, the feature extraction model is used to identify the initial entities in the preprocessed corpus;

确定相邻的初始实体之间的互信息值，并将互信息值大于第一阈值的相邻的初始实体组合成第一短语实体；Determine mutual information values between adjacent initial entities, and combine adjacent initial entities with mutual information values greater than a first threshold into a first phrase entity;

确定相邻的初始实体的互信息值大于第二阈值的初始实体在预处理后的语料中的左右熵，并基于左右熵小于第三阈值的初始实体，确定出第二短语实体；所述第二阈值小于所述第一阈值；Determine the left and right entropies of the initial entities whose mutual information values of adjacent initial entities are greater than the second threshold in the preprocessed corpus, and determine the second phrase entity based on the initial entities whose left and right entropies are less than the third threshold; the third The second threshold is smaller than the first threshold;

将与相邻的初始实体之间的互信息值小于或者等于所述第一阈值的初始实体、所述第一短语实体以及所述第二短语实体确定为目标实体；Determine the initial entity, the first phrase entity and the second phrase entity whose mutual information value with the adjacent initial entity is less than or equal to the first threshold as the target entity;

基于政务场景的语料库以及所述目标实体，抽取所述目标实体之间的关系；Based on the corpus of government affairs scenarios and the target entities, extract the relationship between the target entities;

将确定出的目标实体之间的关系存储在关系数据库中，以构建政务知识图谱。Store the determined relationships between target entities in a relational database to build a government knowledge graph.

根据本发明提供的一种基于知识图谱的标准化政务数据构建方法，所述确定相邻的初始实体之间的互信息值，包括：According to a knowledge graph-based standardized government data construction method provided by the present invention, determining the mutual information value between adjacent initial entities includes:

遍历政务场景的语料库，分别确定两个相邻的初始实体以及两个相邻的初始实体组成的短语在政务场景的语料库中的出现次数；Traverse the corpus of government affairs scenes and determine the number of occurrences of two adjacent initial entities and phrases composed of two adjacent initial entities in the corpus of government affairs scenes;

将两个相邻的初始实体的出现次数分别除以语料库中所有词汇的总数，分别得到两个相邻的初始实体词汇的出现概率，并将两个相邻的初始实体组成的短语在政务场景的语料库中的出现次数除以语料库中所有词汇的总数，得到两个相邻的初始实体组成的短语的出现概率；Divide the number of occurrences of two adjacent initial entities by the total number of all words in the corpus to obtain the occurrence probabilities of the two adjacent initial entity words, and use the phrases composed of two adjacent initial entities in the government affairs scene The number of occurrences in the corpus is divided by the total number of all words in the corpus to obtain the occurrence probability of a phrase composed of two adjacent initial entities;

基于两个相邻的初始实体词汇的出现概率以及两个相邻的初始实体组成的短语的出现概率，确定出相邻的初始实体之间的互信息值。Based on the occurrence probability of two adjacent initial entity words and the occurrence probability of phrases composed of two adjacent initial entities, the mutual information value between adjacent initial entities is determined.

根据本发明提供的一种基于知识图谱的标准化政务数据构建方法，所述确定相邻的初始实体的互信息值大于第二阈值的初始实体在预处理后的语料中的左右熵，包括：According to a knowledge graph-based standardized government data construction method provided by the present invention, determining the left and right entropy of an initial entity whose mutual information value of adjacent initial entities is greater than the second threshold in the preprocessed corpus includes:

确定选词窗口大小，所述选词窗口大小用于限定所述初始实体在预处理后的语料中的左右邻近词的范围；Determine the size of the word selection window, which is used to limit the range of left and right adjacent words of the initial entity in the preprocessed corpus;

基于选词窗口大小，获取相邻的初始实体的互信息值大于第二阈值的初始实体在政务场景的语料库中左右范围内的邻近词；Based on the word selection window size, obtain adjacent words in the left and right range of the initial entity whose mutual information value of adjacent initial entities is greater than the second threshold in the corpus of the government affairs scenario;

确定每个邻近词在所有选词窗口内出现的次数，分别得到每个邻近词在所有左侧选词窗口以及所有右侧选词窗口内出现的频率；Determine the number of times each adjacent word appears in all word selection windows, and obtain the frequency of each adjacent word appearing in all left word selection windows and all right word selection windows;

基于每个邻近词在各选词窗口内出现的频率，确定每个邻近词对应的初始实体在预处理后的语料中的左右熵。Based on the frequency of occurrence of each adjacent word in each word selection window, the left and right entropy of the initial entity corresponding to each adjacent word in the preprocessed corpus is determined.

根据本发明提供的一种基于知识图谱的标准化政务数据构建方法，所述基于左右熵小于第三阈值的初始实体，确定出第二短语实体，包括：According to a knowledge graph-based standardized government data construction method provided by the present invention, the second phrase entity is determined based on the initial entity whose left and right entropy is less than the third threshold, including:

确定左右熵小于第三阈值的初始实体在预处理后的语料中的左右熵中左熵值和右熵值；Determine the left entropy value and right entropy value in the left and right entropy of the initial entity whose left and right entropy is less than the third threshold in the preprocessed corpus;

将左熵值和右熵值中较小的一个所对应的邻近词与所述初始实体组成所述第二短语实体。The second phrase entity is composed of the adjacent words corresponding to the smaller one of the left entropy value and the right entropy value and the initial entity.

根据本发明提供的一种基于知识图谱的标准化政务数据构建方法，利用特征提取模型识别出预处理后的语料库中的初始实体，包括：According to a knowledge graph-based standardized government data construction method provided by the present invention, the feature extraction model is used to identify the initial entities in the preprocessed corpus, including:

基于政务场景的种子词，构建文本特征模版；所述特征模版包括单词的词向量表示、词性标签、种子词距离以及上下文信息；Based on the seed words of the government affairs scenario, a text feature template is constructed; the feature template includes the word vector representation, part-of-speech tag, seed word distance and contextual information;

利用初始训练语料和所述文本特征模版训练所述特征提取模型；Using initial training corpus and the text feature template to train the feature extraction model;

将预处理后的语料库中的语料输入至训练好的所述特征提取模型，得到所述特征提取模型输出的所述初始实体对应的特征，确定所述初始实体。The corpus in the preprocessed corpus is input into the trained feature extraction model, the features corresponding to the initial entity output by the feature extraction model are obtained, and the initial entity is determined.

根据本发明提供的一种基于知识图谱的标准化政务数据构建方法，所述特征提取模型为BiLSTM-CRF模型，所述特征提取模型包括输入层、编码层和解码层；According to a knowledge graph-based standardized government data construction method provided by the present invention, the feature extraction model is a BiLSTM-CRF model, and the feature extraction model includes an input layer, a coding layer and a decoding layer;

所述输入层用于将输入的语料文本中的单词或字符转换为连续的向量表示；The input layer is used to convert words or characters in the input corpus text into continuous vector representations;

所述编码层用于提取输入的语料文本的向量序列的上下文特征，并生成包含语义信息的编码向量序列；The encoding layer is used to extract contextual features of the vector sequence of the input corpus text, and generate an encoding vector sequence containing semantic information;

所述解码层用于使用条件随机场对所述编码层输出的所述编码向量序列进行标注，并对标注的每个位置的标签进行预测，生成输出所述初始实体对应的特征。The decoding layer is configured to use a conditional random field to annotate the encoding vector sequence output by the encoding layer, predict the label of each annotated position, and generate and output features corresponding to the initial entity.

根据本发明提供的一种基于知识图谱的标准化政务数据构建方法，政务场景的种子词包括政务场景专有词典中的词语以及在政务场景人工标注的词语。According to a knowledge graph-based standardized government data construction method provided by the present invention, the seed words of the government scene include words in the government scene-specific dictionary and words manually annotated in the government scene.

本发明还提供一种基于知识图谱的标准化政务数据构建装置，包括：The present invention also provides a standardized government data construction device based on knowledge graph, including:

第一处理模块，用于对政务场景的语料库进行数据预处理；The first processing module is used for data preprocessing of the corpus of government affairs scenarios;

第二处理模块，用于基于政务场景的种子词，利用特征提取模型识别出预处理后的语料库中的初始实体；The second processing module is used to use the feature extraction model to identify the initial entities in the preprocessed corpus based on the seed words of the government affairs scenario;

第三处理模块，用于确定相邻的初始实体之间的互信息值，并将互信息值大于第一阈值的相邻的初始实体组合成第一短语实体；A third processing module, configured to determine mutual information values between adjacent initial entities, and combine adjacent initial entities with mutual information values greater than the first threshold into the first phrase entity;

第四处理模块，用于确定相邻的初始实体的互信息值大于第二阈值的初始实体在预处理后的语料中的左右熵，并基于左右熵小于第三阈值的初始实体，确定出第二短语实体；所述第二阈值小于所述第一阈值；The fourth processing module is used to determine the left and right entropy of the initial entity in the preprocessed corpus for which the mutual information value of the adjacent initial entities is greater than the second threshold, and determine the first entity based on the initial entity whose left and right entropy is less than the third threshold. Two phrase entities; the second threshold is smaller than the first threshold;

第五处理模块，用于将与相邻的初始实体之间的互信息值小于或者等于所述第一阈值的初始实体、所述第一短语实体以及所述第二短语实体确定为目标实体；A fifth processing module, configured to determine the initial entity, the first phrase entity and the second phrase entity whose mutual information value with the adjacent initial entity is less than or equal to the first threshold as the target entity;

第六处理模块，用于基于政务场景的语料库以及所述目标实体，抽取所述目标实体之间的关系；The sixth processing module is used to extract the relationship between the target entities based on the corpus of the government affairs scenario and the target entities;

第七处理模块，用于将确定出的目标实体之间的关系存储在关系数据库中，以构建政务知识图谱。The seventh processing module is used to store the determined relationships between target entities in a relational database to build a government knowledge graph.

本发明还提供一种电子设备，包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，所述处理器执行所述程序时实现如上述任一种所述基于知识图谱的标准化政务数据构建方法。The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements any one of the above-mentioned methods based on the knowledge graph. Standardized government data construction method.

本发明还提供一种非暂态计算机可读存储介质，其上存储有计算机程序，该计算机程序被处理器执行时实现如上述任一种所述基于知识图谱的标准化政务数据构建方法。The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the standardized government data construction method based on knowledge graphs as described in any of the above is implemented.

本发明还提供一种计算机程序产品，包括计算机程序，所述计算机程序被处理器执行时实现如上述任一种所述基于知识图谱的标准化政务数据构建方法。The present invention also provides a computer program product, which includes a computer program. When the computer program is executed by a processor, the computer program implements any one of the above knowledge graph-based standardized government data construction methods.

本发明提供的一种基于知识图谱的标准化政务数据构建方法和装置，通过基于政务场景的种子词来利用特征提取模型，识别得到政务场景下的多个初始实体，再利用相邻的初始实体之间的互信息值得到第一短语实体，并在互信息值的基础上通过计算左右熵来得到第二短语实体，进一步扩大短语实体的范围，最终得到目标实体，实现了对多个单词嵌套组合的短语实体的提取，进而得到了更为丰富而又准确的实体来构建政务场景下的知识图谱。The present invention provides a method and device for constructing standardized government data based on a knowledge graph, using a feature extraction model based on seed words of a government scenario to identify multiple initial entities in a government scenario, and then using the adjacent initial entities to The first phrase entity is obtained from the mutual information value, and the second phrase entity is obtained by calculating the left and right entropy based on the mutual information value, further expanding the scope of the phrase entity, and finally obtaining the target entity, realizing the nesting of multiple words The combined phrase entities are extracted to obtain richer and more accurate entities to construct the knowledge graph in government affairs scenarios.

附图说明Description of the drawings

为了更清楚地说明本发明或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍，显而易见地，下面描述中的附图是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to explain the present invention or the technical solutions in the prior art more clearly, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings in the following description are of the present invention. For some embodiments of the invention, those of ordinary skill in the art can also obtain other drawings based on these drawings without exerting creative efforts.

图1是本发明提供的基于知识图谱的标准化政务数据构建方法的流程示意图；Figure 1 is a schematic flow chart of the standardized government data construction method based on knowledge graph provided by the present invention;

图2是本发明提供的基于知识图谱的标准化政务数据构建装置的结构示意图；Figure 2 is a schematic structural diagram of a standardized government data construction device based on knowledge graphs provided by the present invention;

图3是本发明提供的电子设备的结构示意图。Figure 3 is a schematic structural diagram of the electronic device provided by the present invention.

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚，下面将结合本发明中的附图，对本发明中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。In order to make the purpose, technical solutions and advantages of the present invention more clear, the technical solutions in the present invention will be clearly and completely described below in conjunction with the accompanying drawings of the present invention. Obviously, the described embodiments are part of the embodiments of the present invention. , not all examples. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without making creative efforts fall within the scope of protection of the present invention.

下面结合图1-图3描述本发明的一种基于知识图谱的标准化政务数据构建方法和装置。The following describes a knowledge graph-based standardized government data construction method and device of the present invention in conjunction with Figures 1-3.

如图1所示，本发明实施例的一种基于知识图谱的标准化政务数据构建方法主要包括步骤110、步骤120、步骤130、步骤140、步骤150、步骤160以及步骤170。As shown in Figure 1, a standardized government data construction method based on a knowledge graph according to the embodiment of the present invention mainly includes step 110, step 120, step 130, step 140, step 150, step 160 and step 170.

步骤110，对政务场景的语料库进行数据预处理。Step 110: Perform data preprocessing on the corpus of government affairs scenarios.

政务场景的语料库是指在政府、行政机关、公共服务等领域中使用的文本数据集。这些数据集通常包含政策法规、公共服务以及行政办公等相关文本。The corpus of government affairs scenarios refers to text data sets used in the fields of government, administrative agencies, public services, etc. These data sets usually contain texts related to policies and regulations, public services, and administrative offices.

政务场景的语料库中通常包含各类政策法规的文本，如宪法、法律、行政法规、规章、规范性文件等。政府机构提供的公共服务覆盖了社会生活的各个领域，政务场景的语料库中通常包含公共服务相关的文本，如招投标公告、预算报告、财务报表、人事档案、社保福利等。政务场景的语料库中还可能包含各种行政办公相关的文本，如公文、函件、报告、备忘录、会议纪要、工作计划等。The corpus of government affairs scenarios usually contains texts of various policies and regulations, such as constitutions, laws, administrative regulations, rules, normative documents, etc. Public services provided by government agencies cover all areas of social life. The corpus of government affairs scenarios usually contains texts related to public services, such as bidding announcements, budget reports, financial statements, personnel files, social security benefits, etc. The corpus of government affairs scenes may also contain various administrative office-related texts, such as official documents, letters, reports, memos, meeting minutes, work plans, etc.

对政务场景的语料库进行数据预处理可以包括清洗文本数据。政务场景的数据来源通常比较固定，可能会出现重复的数据。因此，在进行数据清洗时，需要识别和去除重复的数据，避免对后续分析造成干扰。政务场景的数据来源也可能包括一些噪音数据，如无关信息、错误标注等。在进行数据清洗时，需要识别并去除这些噪音数据，以避免对后续分析造成影响。Data preprocessing of the corpus of government affairs scenarios can include cleaning text data. The data sources in government affairs scenarios are usually relatively fixed, and duplicate data may appear. Therefore, when performing data cleaning, it is necessary to identify and remove duplicate data to avoid interference with subsequent analysis. Data sources in government affairs scenarios may also include some noisy data, such as irrelevant information, mislabeling, etc. When performing data cleaning, these noisy data need to be identified and removed to avoid affecting subsequent analysis.

由于政务场景的数据来源可能来自不同的机构或部门，其格式和命名方式可能存在差异。在清洗完成后，可以对数据进行规范化处理，确保数据的格式和命名方式统一。Since the data sources of government affairs scenarios may come from different agencies or departments, their formats and naming methods may be different. After cleaning is completed, the data can be standardized to ensure that the format and naming of the data are unified.

在政务场景的数据存在数据缺失的情况下，如某些字段没有填写或记录，可以识别并处理这些空缺数据，如填充默认值、删除记录等。When there is missing data in government affairs scenarios, such as some fields are not filled in or recorded, these missing data can be identified and processed, such as filling in default values, deleting records, etc.

在此基础上，可以得到便于进行实体提取的语料文本。On this basis, corpus text that is convenient for entity extraction can be obtained.

步骤120，基于政务场景的种子词，利用特征提取模型识别出预处理后的语料库中的初始实体。Step 120: Based on the seed words of the government affairs scenario, use the feature extraction model to identify the initial entities in the preprocessed corpus.

在政务场景中，种子词可以是与政府、行政管理、公共事务等相关的词汇。In government affairs scenarios, seed words can be words related to government, administrative management, public affairs, etc.

例如种子词可以是政策文件、法律法规（例如宪法、刑法）、政策措施等对应的词汇。For example, seed words can be words corresponding to policy documents, laws and regulations (such as the Constitution, criminal law), policy measures, etc.

种子词还可以是公共服务对应的词汇，如教育、医疗保健、社会保障、环境保护、基础设施建设等。Seed words can also be words corresponding to public services, such as education, medical care, social security, environmental protection, infrastructure construction, etc.

种子词还可以是政务信息系统、政府活动与会议等对应的词汇，此处不作限制。Seed words can also be words corresponding to government information systems, government activities and meetings, etc. There are no restrictions here.

政务场景的种子词可以作为政务场景中相关信息的起点，用于搜索、分类、组织和提取政务场景下的实体。当然，具体的种子词列表还需要根据实际应用场景和需求进行进一步的定制和扩展。The seed words of government affairs scenarios can be used as the starting point for relevant information in government affairs scenarios, and are used to search, classify, organize, and extract entities in government affairs scenarios. Of course, the specific seed word list needs to be further customized and expanded based on actual application scenarios and needs.

可以理解的是，政务场景的种子词可以包括政务场景专有词典中的词语以及在政务场景人工标注的词语。政务场景专有词典是一种用于存储政务领域特定术语和名词的工具，可以帮助对政务文本进行实体识别、信息提取等任务。即政务场景的种子词可以通过获取现有的政务场景专有词或者人工标注的方式来得到。It can be understood that the seed words of the government affairs scene may include words in the government affairs scene-specific dictionary and words manually annotated in the government affairs scene. The government affairs scenario-specific dictionary is a tool used to store specific terms and nouns in the government affairs field. It can help with tasks such as entity recognition and information extraction from government affairs texts. That is, the seed words of government affairs scenes can be obtained by obtaining existing government affairs scene-specific words or manual annotation.

根据提取实体的任务需求，可以选择适当的特征提取模型，例如基于机器学习的模型（如条件随机场、最大熵模型）或基于深度学习的模型（如循环神经网络、Transformer），来提取政务场景的语料库中的实体。According to the task requirements of extracting entities, you can choose an appropriate feature extraction model, such as a machine learning-based model (such as conditional random field, maximum entropy model) or a deep learning-based model (such as recurrent neural network, Transformer), to extract government affairs scenes entities in the corpus.

在进行特征提取时，可以根据政务场景和种子词，设计合适的特征。在此基础上，通过使用标注好的样本数据，训练特征提取模型。样本数据可以由人工标注的实体数据和政务场景的语料库构成，确保覆盖政务场景中的各种实体类型。When performing feature extraction, appropriate features can be designed based on the government affairs scenario and seed words. On this basis, the feature extraction model is trained by using the labeled sample data. The sample data can be composed of manually annotated entity data and a corpus of government affairs scenarios to ensure coverage of various entity types in government affairs scenarios.

可以理解的是，使用训练好的特征提取模型对预处理后的语料库进行实体识别，特征提取模型可以根据特征提取得到的特征对每个词进行分类，确定其是否为实体，并得到各个初始实体。It can be understood that the trained feature extraction model is used to perform entity recognition on the preprocessed corpus. The feature extraction model can classify each word according to the features obtained by feature extraction, determine whether it is an entity, and obtain each initial entity. .

在一些实施例中，基于政务场景的种子词，利用特征提取模型识别出预处理后的语料库中的初始实体，包括以下过程。In some embodiments, based on the seed words of the government affairs scenario, a feature extraction model is used to identify the initial entities in the preprocessed corpus, including the following process.

可以先基于政务场景的种子词，构建文本特征模版。特征模版包括单词的词向量表示、词性标签、种子词距离以及上下文信息。You can first build a text feature template based on the seed words of the government affairs scene. The feature template includes the word vector representation, part-of-speech tag, seed word distance and contextual information.

在此基础上，再利用初始训练语料和文本特征模版训练特征提取模型，再将预处理后的语料库中的语料输入至训练好的特征提取模型，得到特征提取模型输出的初始实体对应的特征，确定初始实体。On this basis, the initial training corpus and text feature template are used to train the feature extraction model, and then the preprocessed corpus is input to the trained feature extraction model to obtain the features corresponding to the initial entities output by the feature extraction model. Determine the initial entity.

可以理解的是，可以针对政务场景的特点，例如政策法规、公共服务等，可以通过专家知识或者领域分析来确定一些关键词汇作为种子词，然后结合词向量表示、词性标签、种子词距离以及上下文信息等多个特征构建文本特征模板。It is understandable that according to the characteristics of government affairs scenarios, such as policies, regulations, public services, etc., some key words can be determined as seed words through expert knowledge or domain analysis, and then combined with word vector representation, part-of-speech tags, seed word distance and context Information and other features are used to construct a text feature template.

在此基础上，将语料和文本特征模板输入到特征提取模型中进行训练，目的是学习如何利用不同的特征来提取实体信息。将经过预处理的语料库输入到特征提取模型中，可以利用训练好的模型提取出语料中的初始实体，并得到实体对应的文本特征。根据特征提取模型输出的实体对应的特征，可以根据一定的规则和阈值确定哪些实体是有效的，即确定初始实体。On this basis, corpus and text feature templates are input into the feature extraction model for training, with the purpose of learning how to use different features to extract entity information. Input the preprocessed corpus into the feature extraction model, and use the trained model to extract the initial entities in the corpus and obtain the text features corresponding to the entities. According to the features corresponding to the entities output by the feature extraction model, which entities are valid can be determined according to certain rules and thresholds, that is, the initial entities are determined.

在本实施方式中，利用种子词构建文本特征模板，可以提高抽取算法的召回率和精度。种子词能够引导算法在文本中寻找相关实体和关系，而文本特征模板则能够捕捉关键信息，帮助算法准确地识别和提取实体和关系。In this implementation, seed words are used to construct text feature templates, which can improve the recall rate and precision of the extraction algorithm. Seed words can guide the algorithm to find relevant entities and relationships in the text, while text feature templates can capture key information and help the algorithm accurately identify and extract entities and relationships.

政务领域的知识是不断变化和扩展的，因此需要不断地更新和扩展知识图谱。在后续过程中，文本特征模板可以通过添加和修改种子词，以支持知识图谱的动态更新和扩展。Knowledge in the field of government affairs is constantly changing and expanding, so the knowledge graph needs to be constantly updated and expanded. In the subsequent process, the text feature template can support the dynamic update and expansion of the knowledge graph by adding and modifying seed words.

在一些实施例中，特征提取模型为BiLSTM-CRF模型，BiLSTM-CRF模型是一种常用于序列标注任务的深度学习模型，它结合了BiLSTM和CRF两种模型的优点。In some embodiments, the feature extraction model is a BiLSTM-CRF model. The BiLSTM-CRF model is a deep learning model commonly used in sequence labeling tasks. It combines the advantages of BiLSTM and CRF models.

BiLSTM是一种双向循环神经网络，它通过前向LSTM和后向LSTM分别对输入序列进行处理，从而获得上下文信息，可以有效地处理长序列依赖关系。在政务场景下的实体识别等序列标注任务中，BiLSTM能够将词语和上下文信息结合起来，生成每个词语对应的隐层状态表示。BiLSTM is a bidirectional recurrent neural network that processes input sequences through forward LSTM and backward LSTM respectively to obtain contextual information and can effectively handle long sequence dependencies. In sequence annotation tasks such as entity recognition in government affairs scenarios, BiLSTM can combine words and contextual information to generate a hidden layer state representation corresponding to each word.

CRF是一种条件随机场模型，它考虑到相邻标签之间的依赖关系，可以对整个序列进行全局最优的标注，从而提高模型的准确性。在实体识别任务中，CRF可以利用BiLSTM生成的隐层状态表示，对序列中的实体进行精确的标注。CRF is a conditional random field model that takes into account the dependence between adjacent labels and can perform globally optimal annotation of the entire sequence, thereby improving the accuracy of the model. In the entity recognition task, CRF can use the hidden layer state representation generated by BiLSTM to accurately label entities in the sequence.

特征提取模型包括输入层、编码层和解码层。The feature extraction model includes input layer, encoding layer and decoding layer.

输入层用于将输入的语料文本中的单词或字符转换为连续的向量表示。输入层的目的是将离散的文本输入转化为稠密的向量表示，以便于后续的处理。例如，可以使用预训练的词嵌入模型（如Word2Vec、GloVe）来获取单词的向量表示。The input layer is used to convert words or characters in the input corpus text into continuous vector representations. The purpose of the input layer is to convert discrete text input into dense vector representation for subsequent processing. For example, pre-trained word embedding models (such as Word2Vec, GloVe) can be used to obtain vector representations of words.

编码层用于提取输入的语料文本的向量序列的上下文特征，并生成包含语义信息的编码向量序列。编码层使用双向长短期记忆网络对输入序列进行建模。BiLSTM通过同时考虑前向和后向的上下文信息，能够有效地捕捉文本中的长距离依赖关系。编码层的作用是提取输入序列的上下文特征，并生成一个包含了丰富语义信息的编码向量序列。The encoding layer is used to extract contextual features of the vector sequence of the input corpus text and generate an encoding vector sequence containing semantic information. The encoding layer models the input sequence using a bidirectional long short-term memory network. BiLSTM can effectively capture long-distance dependencies in text by considering both forward and backward contextual information. The function of the encoding layer is to extract contextual features of the input sequence and generate a sequence of encoding vectors containing rich semantic information.

解码层用于使用条件随机场对编码层输出的编码向量序列进行标注，并对标注的每个位置的标签进行预测，生成输出初始实体对应的特征。解码层考虑了相邻标签之间的依赖关系，能够根据全局概率对整个序列进行标注，而不仅仅是独立地对每个位置进行分类。解码层的作用是根据编码层输出的特征序列，利用CRF模型对每个位置的标签进行预测，并生成最终的标注序列，进而得到初始实体。The decoding layer is used to use conditional random fields to annotate the encoding vector sequence output by the encoding layer, predict the label of each annotated position, and generate features corresponding to the output initial entities. The decoding layer takes into account the dependencies between adjacent labels and is able to label the entire sequence based on global probabilities rather than just classifying each position independently. The function of the decoding layer is to use the CRF model to predict the label of each position based on the feature sequence output by the encoding layer, and generate the final labeling sequence to obtain the initial entity.

在本实施方式中，BiLSTM-CRF相比于传统的基于规则或特征工程的方法，能够自动学习特征并能够捕捉上下文信息。传统的基于规则或特征工程的方法需要手动设计特征，而BiLSTM-CRF模型可以自动学习输入序列的特征表示，包括上下文信息、词向量表示等。这使得模型更加灵活和通用。BiLSTM-CRF模型是基于序列的模型，能够利用前后文的信息来进行标注。利用前后文的信息来进行标注对于实体提取任务来说非常重要，因为政务场景中的嵌套单词构成的实体通常与其上下文相关联，只有考虑上下文信息才能更准确地识别实体。In this implementation, BiLSTM-CRF can automatically learn features and capture contextual information compared to traditional rule-based or feature engineering methods. Traditional rule-based or feature engineering methods require manual design of features, while the BiLSTM-CRF model can automatically learn the feature representation of the input sequence, including contextual information, word vector representation, etc. This makes the model more flexible and versatile. The BiLSTM-CRF model is a sequence-based model that can use contextual information for annotation. Using context information for annotation is very important for entity extraction tasks, because entities composed of nested words in government affairs scenarios are usually associated with their context, and only by considering context information can entities be identified more accurately.

此外，BiLSTM-CRF模型能够统一处理多种实体类型，包括人名、地名、组织机构名等。这使得模型具有更好的通用性和泛化能力。In addition, the BiLSTM-CRF model can uniformly handle multiple entity types, including person names, place names, organization names, etc. This makes the model more versatile and generalizable.

步骤130，确定相邻的初始实体之间的互信息值，并将互信息值大于第一阈值的相邻的初始实体组合成第一短语实体。Step 130: Determine mutual information values between adjacent initial entities, and combine adjacent initial entities with mutual information values greater than the first threshold into the first phrase entity.

可以理解的是，在识别出初始实体后，可以对初始实体之间的位置进行识别，进而识别出相邻的初始实体。例如，可以识别到“房屋租赁”与“当事人”相邻。It can be understood that after the initial entities are identified, the positions between the initial entities can be identified, and then adjacent initial entities can be identified. For example, "house rental" can be identified as being adjacent to "party".

互信息值是一种用来衡量两个事件之间相关性的指标。确定相邻的初始实体之间的互信息值，可以包括以下过程。Mutual information value is a metric used to measure the correlation between two events. Determining the mutual information value between adjacent initial entities may include the following process.

可以先遍历政务场景的语料库，分别确定两个相邻的初始实体以及两个相邻的初始实体组成的短语在政务场景的语料库中的出现次数。You can first traverse the corpus of government affairs scenes and determine the number of occurrences of two adjacent initial entities and phrases composed of two adjacent initial entities in the corpus of government affairs scenes.

将两个相邻的初始实体的出现次数分别除以语料库中所有词汇的总数，分别得到两个相邻的初始实体词汇的出现概率，并将两个相邻的初始实体组成的短语在政务场景的语料库中的出现次数除以语料库中所有词汇的总数，得到两个相邻的初始实体组成的短语的出现概率。Divide the number of occurrences of two adjacent initial entities by the total number of all words in the corpus to obtain the occurrence probabilities of the two adjacent initial entity words, and use the phrases composed of two adjacent initial entities in the government affairs scene The number of occurrences in the corpus is divided by the total number of all words in the corpus to obtain the occurrence probability of a phrase composed of two adjacent initial entities.

例如，假设预处理后的语料库中包含相邻的两个词语A和B，则词语A和B的互信息值MI(A, B)可以表示为：For example, assuming that the preprocessed corpus contains two adjacent words A and B, the mutual information value MI(A, B) of words A and B can be expressed as:

MI(A, B) = log₂(P(A, B) / (P(A) * P(B)))；MI(A, B) = log ₂ (P(A, B) / (P(A) * P(B)));

其中，P(A, B)表示词语A和词语B组成的短语的出现概率，P(A)和P(B)分别表示词语A和词语B单独的出现概率。Among them, P(A, B) represents the occurrence probability of the phrase composed of word A and word B, and P(A) and P(B) represent the occurrence probabilities of word A and word B respectively.

可以理解的是，第一阈值可以根据实际情况来进行设置。在此基础上，可以将互信息值大于第一阈值的相邻的初始实体组合成第一短语实体。It can be understood that the first threshold can be set according to actual conditions. On this basis, adjacent initial entities with mutual information values greater than the first threshold can be combined into the first phrase entity.

在本实施方式中，通过计算相邻的初始实体的互信息值，可以确定两个相邻的初始实体之间是否具有相关性，进而可以组成第一短语实体，从而构建多个单词嵌套构成的实体，可以得到更为丰富和准确的初始实体。In this implementation, by calculating the mutual information value of adjacent initial entities, it can be determined whether there is a correlation between two adjacent initial entities, and then the first phrase entity can be formed, thereby constructing a nested composition of multiple words entities, a richer and more accurate initial entity can be obtained.

步骤140，确定相邻的初始实体的互信息值大于第二阈值的初始实体在预处理后的语料中的左右熵，并基于左右熵小于第三阈值的初始实体，确定出第二短语实体。Step 140: Determine the left and right entropies of the initial entities in the preprocessed corpus of adjacent initial entities whose mutual information values are greater than the second threshold, and determine the second phrase entity based on the initial entities whose left and right entropies are less than the third threshold.

需要说明的是，第二阈值小于第一阈值。It should be noted that the second threshold is smaller than the first threshold.

在此种情况下，相邻的初始实体的互信息值大于第二阈值的初始实体中包含可以构成第一短语实体的初始实体，同时也包含一些不可以构成第一短语实体的初始实体。In this case, the initial entities whose mutual information values of adjacent initial entities are greater than the second threshold include initial entities that can constitute the first phrase entity, and also include some initial entities that cannot constitute the first phrase entity.

在一些实施例中，确定相邻的初始实体的互信息值大于第二阈值的初始实体在预处理后的语料中的左右熵，具体可以按照以下方式确定。In some embodiments, the left and right entropies of the initial entities whose mutual information values of adjacent initial entities are greater than the second threshold in the preprocessed corpus are determined, and may be specifically determined in the following manner.

首先，确定选词窗口大小，选词窗口大小用于限定初始实体在预处理后的语料中的左右邻近词的范围。First, determine the size of the word selection window. The size of the word selection window is used to limit the range of left and right neighboring words of the initial entity in the preprocessed corpus.

例如，可以将窗口大小设定为2，表示考虑目标词左边和右边各2个邻近词。For example, the window size can be set to 2, which means that two adjacent words to the left and two to the right of the target word are considered.

在此基础上，基于选词窗口大小，获取相邻的初始实体的互信息值大于第二阈值的初始实体在政务场景的语料库中左右范围内的邻近词，例如可以基于固定的窗口大小进行滑动来获取邻近词。On this basis, based on the word selection window size, obtain adjacent words in the left and right range of the initial entity whose mutual information value is greater than the second threshold in the corpus of the government scene. For example, sliding can be based on a fixed window size. to get nearby words.

确定每个邻近词在所有选词窗口内出现的次数，分别得到每个邻近词在所有左侧选词窗口以及所有右侧选词窗口内出现的频率；基于每个邻近词在各选词窗口内出现的频率，确定每个邻近词对应的初始实体在预处理后的语料中的左右熵。Determine the number of times each adjacent word appears in all word selection windows, and obtain the frequency of each adjacent word in all left word selection windows and all right word selection windows; based on the frequency of each adjacent word in each word selection window The frequency of occurrence within the word determines the left and right entropy of the initial entity corresponding to each adjacent word in the preprocessed corpus.

要实现确定每个初始实体的邻近词在所有选词窗口内出现的次数，并得到每个邻近词在所有左侧选词窗口以及所有右侧选词窗口内出现的频率，可以按照以下过程进行。To determine the number of times each initial entity's neighboring words appear in all word selection windows, and to obtain the frequency of each neighboring word appearing in all left word selection windows and all right word selection windows, you can follow the following process .

遍历整个语料库，对于每个选词窗口，统计窗口内每个邻近词的出现次数，并累加到相应的邻近词计数器中。在此基础上，统计所有左侧选词窗口内邻近词的频率，即将每个邻近词的计数除以左侧选词窗口的总词数得到频率。类似地，统计所有右侧选词窗口内邻近词的频率，即将每个邻近词的计数除以右侧选词窗口的总词数得到频率。Traverse the entire corpus, and for each word selection window, count the number of occurrences of each adjacent word in the window, and add them to the corresponding adjacent word counter. On this basis, count the frequency of all adjacent words in the left word selection window, that is, divide the count of each adjacent word by the total number of words in the left word selection window to obtain the frequency. Similarly, count the frequency of all adjacent words in the right word selection window, that is, divide the count of each adjacent word by the total number of words in the right word selection window to get the frequency.

对于每个邻近词，计算其在所有左侧选词窗口内的频率分布。即将该邻近词在每个左侧选词窗口的频率除以该窗口内的总词数得到频率分布。可以使用信息熵来衡量左侧选词窗口中邻近词的不确定性。根据频率分布计算每个邻近词的概率，然后计算其左熵值。类似地，对于每个邻近词，重复上述过程，计算其在所有右侧选词窗口内的频率分布和右熵值。For each neighboring word, calculate its frequency distribution within all left word selection windows. That is, the frequency of the adjacent word in each left word selection window is divided by the total number of words in the window to obtain the frequency distribution. Information entropy can be used to measure the uncertainty of neighboring words in the word selection window on the left. Calculate the probability of each neighboring word based on the frequency distribution, and then calculate its left entropy value. Similarly, for each neighboring word, the above process is repeated to calculate its frequency distribution and right entropy value in all right word selection windows.

在得到左熵值和右熵值后，可以计算二者的平均值作为每个邻近词对应的初始实体在预处理后的语料中的左右熵，或者将二者的值取较小的一个作为每个邻近词对应的初始实体在预处理后的语料中的左右熵，此处对左右熵的具体计算方式不作限制。After obtaining the left entropy value and the right entropy value, the average of the two can be calculated as the left and right entropy of the initial entity corresponding to each adjacent word in the preprocessed corpus, or the smaller of the two values can be used as The left and right entropy of the initial entity corresponding to each adjacent word in the preprocessed corpus. There is no restriction on the specific calculation method of left and right entropy here.

在此种情况下，可以基于左右熵小于第三阈值的初始实体，确定出第二短语实体。In this case, the second phrase entity may be determined based on the initial entity whose left and right entropies are less than the third threshold.

在本实施方式中，通过计算左右熵，可以进一步地识别出与一些初始实体构成短语后边界感强的第二短语实体，进而可以进一步识别出可能的多个单词嵌套构成的实体。In this embodiment, by calculating the left and right entropy, the second phrase entity with a strong sense of boundary after forming a phrase with some initial entities can be further identified, and further entities formed by possible nesting of multiple words can be further identified.

在一些实施例中，基于左右熵小于第三阈值的初始实体，确定出第二短语实体，具体包括：确定左右熵小于第三阈值的初始实体在预处理后的语料中的左右熵中左熵值和右熵值；将左熵值和右熵值中较小的一个所对应的邻近词与初始实体组成第二短语实体。In some embodiments, determining the second phrase entity based on the initial entity whose left and right entropy is less than the third threshold value specifically includes: determining the left and right entropy of the initial entity whose left and right entropy is less than the third threshold value in the preprocessed corpus. value and the right entropy value; the adjacent words corresponding to the smaller one of the left entropy value and the right entropy value and the initial entity form the second phrase entity.

在本实施方式中，可以明确边界强的邻近词，进而与初始实体组成第二短语实体，提高实体提取的准确性。In this implementation, adjacent words with strong boundaries can be clarified, and then combined with the initial entity to form a second phrase entity, thereby improving the accuracy of entity extraction.

步骤150，将与相邻的初始实体之间的互信息值小于或者等于第一阈值的初始实体、第一短语实体以及第二短语实体确定为目标实体。Step 150: Determine the initial entity, the first phrase entity and the second phrase entity whose mutual information value with the adjacent initial entity is less than or equal to the first threshold as the target entity.

可以理解的是，第二短语实体中也可能包括与第一短语实体相重复的短语实体，在识别目标实体时，可以将第二短语实体中重复的短语实体筛除，进而得到目标实体。It can be understood that the second phrase entity may also include phrase entities that are repeated with the first phrase entity. When identifying the target entity, the repeated phrase entities in the second phrase entity may be filtered out to obtain the target entity.

步骤160，基于政务场景的语料库以及目标实体，抽取目标实体之间的关系。Step 160: Extract the relationship between the target entities based on the corpus of the government affairs scenario and the target entities.

在识别出政务场景语料库中的目标实体后，可以基于政务场景的语料库以及目标实体，抽取目标实体之间的关系。After identifying the target entities in the government scene corpus, the relationship between the target entities can be extracted based on the government scene corpus and the target entities.

在一些实施例中，可以基于预定义的规则或模式，寻找目标实体之间的关系。例如，在一句话中，如果两个目标实体之间存在特定的关键词或词组，可以根据这些关键词或词组定义相应的关系。In some embodiments, relationships between target entities may be found based on predefined rules or patterns. For example, in a sentence, if there are specific keywords or phrases between two target entities, the corresponding relationship can be defined based on these keywords or phrases.

在一些实施例中，还可以利用机器学习或深度学习技术，建立一个关系分类模型，并将目标实体及其上下文作为输入，预测它们之间的关系类别。在此种情况下，需要准备一个标注了关系类别的训练数据集，并使用监督学习算法进行模型训练。In some embodiments, machine learning or deep learning technology can also be used to build a relationship classification model, and take the target entity and its context as input to predict the relationship category between them. In this case, you need to prepare a training data set labeled with relationship categories and use a supervised learning algorithm for model training.

可以理解的是，可以根据具体的任务和数据特点选择适合的方法。在实际应用中，还可以采用多种技术和方法的组合，以提高关系抽取的准确性和效果。It is understood that the appropriate method can be selected based on the specific task and data characteristics. In practical applications, a combination of multiple technologies and methods can also be used to improve the accuracy and effect of relationship extraction.

步骤170，将确定出的目标实体之间的关系存储在关系数据库中，以构建政务知识图谱。Step 170: Store the determined relationships between target entities in a relational database to build a government knowledge graph.

关系数据库用于存储目标实体之间的关系。关系数据库中还存在使用其他实体抽取的实体以及实体关系，此处不作限制。可以理解的是，在抽取得到目标实体之间的关系后，将确定出的目标实体之间的关系存储在关系数据库中，以构建政务知识图谱。Relational databases are used to store relationships between target entities. There are also entities and entity relationships extracted using other entities in the relational database, which are not limited here. It can be understood that after extracting the relationship between the target entities, the determined relationship between the target entities is stored in the relational database to construct the government knowledge graph.

根据本发明实施例提供的一种基于知识图谱的标准化政务数据构建方法，通过基于政务场景的种子词来利用特征提取模型，识别得到政务场景下的多个初始实体，再利用相邻的初始实体之间的互信息值得到第一短语实体，并在互信息值的基础上通过计算左右熵来得到第二短语实体，进一步扩大短语实体的范围，最终得到目标实体，实现了对多个单词嵌套组合的短语实体的提取，进而得到了更为丰富而又准确的实体来构建政务场景下的知识图谱。According to a knowledge graph-based standardized government data construction method provided by an embodiment of the present invention, a feature extraction model is used to identify multiple initial entities in the government scenario by using seed words based on the government scenario, and then adjacent initial entities are used The mutual information value between them is used to obtain the first phrase entity, and based on the mutual information value, the second phrase entity is obtained by calculating the left and right entropy, further expanding the scope of the phrase entity, and finally obtaining the target entity, realizing the embedding of multiple words. By extracting a set of combined phrase entities, we can obtain richer and more accurate entities to build a knowledge graph in government affairs scenarios.

下面对本发明提供的一种基于知识图谱的标准化政务数据构建装置进行描述，下文描述的一种基于知识图谱的标准化政务数据构建装置与上文描述的一种基于知识图谱的标准化政务数据构建方法可相互对应参照。The following is a description of a standardized government data construction device based on a knowledge graph provided by the present invention. The standardized government data construction device based on a knowledge graph described below can be used with the standardized government data construction method based on a knowledge graph described above. mutual reference.

如图2所示，本发明实施例的一种基于知识图谱的标准化政务数据构建装置主要包括第一处理模块210、第二处理模块220、第三处理模块230、第四处理模块240、第五处理模块250、第六处理模块260以及第七处理模块270。As shown in Figure 2, a standardized government data construction device based on a knowledge graph according to the embodiment of the present invention mainly includes a first processing module 210, a second processing module 220, a third processing module 230, a fourth processing module 240, a fifth The processing module 250, the sixth processing module 260 and the seventh processing module 270.

第一处理模块210用于对政务场景的语料库进行数据预处理；The first processing module 210 is used to perform data preprocessing on the corpus of government affairs scenarios;

第二处理模块220用于基于政务场景的种子词，利用特征提取模型识别出预处理后的语料库中的初始实体；The second processing module 220 is used to use the feature extraction model to identify the initial entities in the preprocessed corpus based on the seed words of the government affairs scenario;

第三处理模块230用于确定相邻的初始实体之间的互信息值，并将互信息值大于第一阈值的相邻的初始实体组合成第一短语实体；The third processing module 230 is configured to determine mutual information values between adjacent initial entities, and combine adjacent initial entities with mutual information values greater than the first threshold into the first phrase entity;

第四处理模块240用于确定相邻的初始实体的互信息值大于第二阈值的初始实体在预处理后的语料中的左右熵，并基于左右熵小于第三阈值的初始实体，确定出第二短语实体；第二阈值小于第一阈值；The fourth processing module 240 is used to determine the left and right entropies of the initial entities whose mutual information values of adjacent initial entities are greater than the second threshold in the preprocessed corpus, and determine the first entity based on the initial entities whose left and right entropies are less than the third threshold. Two phrase entities; the second threshold is smaller than the first threshold;

第五处理模块250用于将与相邻的初始实体之间的互信息值小于或者等于第一阈值的初始实体、第一短语实体以及第二短语实体确定为目标实体；The fifth processing module 250 is configured to determine the initial entity, the first phrase entity and the second phrase entity whose mutual information value with the adjacent initial entity is less than or equal to the first threshold as the target entity;

第六处理模块260用于基于政务场景的语料库以及目标实体，抽取目标实体之间的关系；The sixth processing module 260 is used to extract relationships between target entities based on the corpus of government affairs scenarios and target entities;

第七处理模块270用于将确定出的目标实体之间的关系存储在关系数据库中，以构建政务知识图谱。The seventh processing module 270 is used to store the determined relationships between target entities in a relational database to build a government knowledge graph.

根据本发明实施例提供的一种基于知识图谱的标准化政务数据构建装置，通过基于政务场景的种子词来利用特征提取模型，识别得到政务场景下的多个初始实体，再利用相邻的初始实体之间的互信息值得到第一短语实体，并在互信息值的基础上通过计算左右熵来得到第二短语实体，进一步扩大短语实体的范围，最终得到目标实体，实现了对多个单词嵌套组合的短语实体的提取，进而得到了更为丰富而又准确的实体来构建政务场景下的知识图谱。According to a standardized government data construction device based on a knowledge graph provided by an embodiment of the present invention, a feature extraction model is used through seed words based on government scenarios to identify multiple initial entities in the government scenario, and then adjacent initial entities are used The mutual information value between them is used to obtain the first phrase entity, and based on the mutual information value, the second phrase entity is obtained by calculating the left and right entropy, further expanding the scope of the phrase entity, and finally obtaining the target entity, realizing the embedding of multiple words. By extracting a set of combined phrase entities, we can obtain richer and more accurate entities to build a knowledge graph in government affairs scenarios.

在一些实施例中，第三处理模块230还用于遍历政务场景的语料库，分别确定两个相邻的初始实体以及两个相邻的初始实体组成的短语在政务场景的语料库中的出现次数；将两个相邻的初始实体的出现次数分别除以语料库中所有词汇的总数，分别得到两个相邻的初始实体词汇的出现概率，并将两个相邻的初始实体组成的短语在政务场景的语料库中的出现次数除以语料库中所有词汇的总数，得到两个相邻的初始实体组成的短语的出现概率；基于两个相邻的初始实体词汇的出现概率以及两个相邻的初始实体组成的短语的出现概率，确定出相邻的初始实体之间的互信息值。In some embodiments, the third processing module 230 is also used to traverse the corpus of the government affairs scene, and respectively determine the number of occurrences of two adjacent initial entities and a phrase composed of two adjacent initial entities in the corpus of the government affairs scene; Divide the number of occurrences of two adjacent initial entities by the total number of all words in the corpus to obtain the occurrence probabilities of the two adjacent initial entity words, and use the phrases composed of two adjacent initial entities in the government affairs scene The number of occurrences in the corpus is divided by the total number of all words in the corpus to obtain the occurrence probability of a phrase composed of two adjacent initial entities; based on the occurrence probability of two adjacent initial entity words and two adjacent initial entities The occurrence probability of the composed phrases determines the mutual information value between adjacent initial entities.

在一些实施例中，第四处理模块240还用于确定选词窗口大小，选词窗口大小用于限定初始实体在预处理后的语料中的左右邻近词的范围；基于选词窗口大小，获取相邻的初始实体的互信息值大于第二阈值的初始实体在政务场景的语料库中左右范围内的邻近词；确定每个邻近词在所有选词窗口内出现的次数，分别得到每个邻近词在所有左侧选词窗口以及所有右侧选词窗口内出现的频率；基于每个邻近词在各选词窗口内出现的频率，确定每个邻近词对应的初始实体在预处理后的语料中的左右熵。In some embodiments, the fourth processing module 240 is also used to determine the size of the word selection window. The size of the word selection window is used to limit the range of left and right adjacent words of the initial entity in the preprocessed corpus; based on the size of the word selection window, obtain The initial entity whose mutual information value of adjacent initial entities is greater than the second threshold has adjacent words in the left and right range in the corpus of the government scene; determine the number of times each adjacent word appears in all word selection windows, and obtain each adjacent word respectively. The frequency of occurrence in all left word selection windows and all right word selection windows; based on the frequency of each neighboring word appearing in each word selection window, determine the initial entity corresponding to each neighboring word in the preprocessed corpus The left and right entropy of .

在一些实施例中，第四处理模块240还用于确定左右熵小于第三阈值的初始实体在预处理后的语料中的左右熵中左熵值和右熵值；将左熵值和右熵值中较小的一个所对应的邻近词与初始实体组成第二短语实体。In some embodiments, the fourth processing module 240 is also used to determine the left entropy value and right entropy value in the left and right entropy of the initial entity whose left and right entropy is less than the third threshold in the preprocessed corpus; convert the left entropy value and right entropy value The adjacent word corresponding to the smaller one of the values and the initial entity form the second phrase entity.

在一些实施例中，第二处理模块220还用于基于政务场景的种子词，构建文本特征模版；特征模版包括单词的词向量表示、词性标签、种子词距离以及上下文信息；利用初始训练语料和文本特征模版训练特征提取模型；将预处理后的语料库中的语料输入至训练好的特征提取模型，得到特征提取模型输出的初始实体对应的特征，确定初始实体。In some embodiments, the second processing module 220 is also used to construct a text feature template based on the seed words of the government affairs scenario; the feature template includes the word vector representation of the word, part-of-speech tag, seed word distance and contextual information; using the initial training corpus and The text feature template trains the feature extraction model; input the preprocessed corpus into the trained feature extraction model, obtain the features corresponding to the initial entities output by the feature extraction model, and determine the initial entities.

在一些实施例中，特征提取模型为BiLSTM-CRF模型，特征提取模型包括输入层、编码层和解码层；输入层用于将输入的语料文本中的单词或字符转换为连续的向量表示；编码层用于提取输入的语料文本的向量序列的上下文特征，并生成包含语义信息的编码向量序列；解码层用于使用条件随机场对编码层输出的编码向量序列进行标注，并对标注的每个位置的标签进行预测，生成输出初始实体对应的特征。In some embodiments, the feature extraction model is a BiLSTM-CRF model. The feature extraction model includes an input layer, an encoding layer and a decoding layer; the input layer is used to convert words or characters in the input corpus text into continuous vector representations; encoding The decoding layer is used to extract the contextual features of the vector sequence of the input corpus text and generate a coding vector sequence containing semantic information; the decoding layer is used to use conditional random fields to annotate the coding vector sequence output by the coding layer, and to label each annotated vector sequence. The label of the location is predicted and the features corresponding to the initial entity are generated and output.

在一些实施例中，政务场景的种子词包括政务场景专有词典中的词语以及在政务场景人工标注的词语。In some embodiments, the seed words of the government scene include words in the government scene-specific dictionary and words manually annotated in the government scene.

图3示例了一种电子设备的实体结构示意图，如图3所示，该电子设备可以包括：处理器（processor）310、通信接口（Communications Interface）320、存储器（memory）330和通信总线340，其中，处理器310，通信接口320，存储器330通过通信总线340完成相互间的通信。处理器310可以调用存储器330中的逻辑指令，以执行基于知识图谱的标准化政务数据构建方法，该方法包括：对政务场景的语料库进行数据预处理；基于政务场景的种子词，利用特征提取模型识别出预处理后的语料库中的初始实体；确定相邻的初始实体之间的互信息值，并将互信息值大于第一阈值的相邻的初始实体组合成第一短语实体；确定相邻的初始实体的互信息值大于第二阈值的初始实体在预处理后的语料中的左右熵，并基于左右熵小于第三阈值的初始实体，确定出第二短语实体；第二阈值小于第一阈值；将与相邻的初始实体之间的互信息值小于或者等于第一阈值的初始实体、第一短语实体以及第二短语实体确定为目标实体；基于政务场景的语料库以及目标实体，抽取目标实体之间的关系；将确定出的目标实体之间的关系存储在关系数据库中，以构建政务知识图谱。Figure 3 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Figure 3, the electronic device may include: a processor (processor) 310, a communications interface (Communications Interface) 320, a memory (memory) 330, and a communication bus 340. Among them, the processor 310, the communication interface 320, and the memory 330 complete communication with each other through the communication bus 340. The processor 310 can call the logical instructions in the memory 330 to execute the standardized government data construction method based on the knowledge graph. The method includes: performing data preprocessing on the corpus of the government scene; using the feature extraction model to identify the seed words based on the government scene. Obtain the initial entities in the preprocessed corpus; determine the mutual information value between adjacent initial entities, and combine the adjacent initial entities with mutual information values greater than the first threshold into the first phrase entity; determine the adjacent initial entities The mutual information value of the initial entity is greater than the left and right entropy of the initial entity in the preprocessed corpus of the second threshold, and the second phrase entity is determined based on the initial entity whose left and right entropy is less than the third threshold; the second threshold is less than the first threshold ; Determine the initial entity, the first phrase entity and the second phrase entity whose mutual information value with the adjacent initial entity is less than or equal to the first threshold as the target entity; based on the corpus of the government affairs scenario and the target entity, extract the target entity The relationship between them; store the determined relationship between the target entities in the relational database to build a government knowledge graph.

此外，上述的存储器330中的逻辑指令可以通过软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备（可以是个人计算机，服务器，或者网络设备等）执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器（ROM，Read-Only Memory）、随机存取存储器（RAM，Random Access Memory）、磁碟或者光盘等各种可以存储程序代码的介质。In addition, the above-mentioned logical instructions in the memory 330 can be implemented in the form of software functional units and can be stored in a computer-readable storage medium when sold or used as an independent product. Based on this understanding, the technical solution of the present invention essentially or the part that contributes to the existing technology or the part of the technical solution can be embodied in the form of a software product. The computer software product is stored in a storage medium, including Several instructions are used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in various embodiments of the present invention. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program code. .

另一方面，本发明还提供一种计算机程序产品，所述计算机程序产品包括计算机程序，计算机程序可存储在非暂态计算机可读存储介质上，所述计算机程序被处理器执行时，计算机能够执行上述各方法所提供的基于知识图谱的标准化政务数据构建方法，该方法包括：对政务场景的语料库进行数据预处理；基于政务场景的种子词，利用特征提取模型识别出预处理后的语料库中的初始实体；确定相邻的初始实体之间的互信息值，并将互信息值大于第一阈值的相邻的初始实体组合成第一短语实体；确定相邻的初始实体的互信息值大于第二阈值的初始实体在预处理后的语料中的左右熵，并基于左右熵小于第三阈值的初始实体，确定出第二短语实体；第二阈值小于第一阈值；将与相邻的初始实体之间的互信息值小于或者等于第一阈值的初始实体、第一短语实体以及第二短语实体确定为目标实体；基于政务场景的语料库以及目标实体，抽取目标实体之间的关系；将确定出的目标实体之间的关系存储在关系数据库中，以构建政务知识图谱。On the other hand, the present invention also provides a computer program product. The computer program product includes a computer program. The computer program can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can Execute the standardized government data construction method based on the knowledge graph provided by the above methods. The method includes: performing data preprocessing on the corpus of government scenes; based on the seed words of the government scene, using the feature extraction model to identify the preprocessed corpus initial entities; determine the mutual information values between adjacent initial entities, and combine adjacent initial entities with mutual information values greater than the first threshold into the first phrase entity; determine the mutual information values of adjacent initial entities greater than The left and right entropy of the initial entity of the second threshold in the preprocessed corpus is determined, and the second phrase entity is determined based on the initial entity whose left and right entropy is less than the third threshold; the second threshold is less than the first threshold; and the adjacent initial entity is The initial entity, the first phrase entity and the second phrase entity whose mutual information value between entities is less than or equal to the first threshold are determined as target entities; based on the corpus of the government affairs scenario and the target entity, the relationship between the target entities is extracted; the determined The relationships between the target entities are stored in the relational database to build a government knowledge graph.

又一方面，本发明还提供一种非暂态计算机可读存储介质，其上存储有计算机程序，该计算机程序被处理器执行时实现以执行上述各方法提供的基于知识图谱的标准化政务数据构建方法，该方法包括：对政务场景的语料库进行数据预处理；基于政务场景的种子词，利用特征提取模型识别出预处理后的语料库中的初始实体；确定相邻的初始实体之间的互信息值，并将互信息值大于第一阈值的相邻的初始实体组合成第一短语实体；确定相邻的初始实体的互信息值大于第二阈值的初始实体在预处理后的语料中的左右熵，并基于左右熵小于第三阈值的初始实体，确定出第二短语实体；第二阈值小于第一阈值；将与相邻的初始实体之间的互信息值小于或者等于第一阈值的初始实体、第一短语实体以及第二短语实体确定为目标实体；基于政务场景的语料库以及目标实体，抽取目标实体之间的关系；将确定出的目标实体之间的关系存储在关系数据库中，以构建政务知识图谱。In another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored. The computer program is implemented when executed by the processor to perform the standardized government data construction based on the knowledge graph provided by the above methods. Method, the method includes: performing data preprocessing on a corpus of government affairs scenarios; using a feature extraction model to identify initial entities in the preprocessed corpus based on the seed words of the government affairs scenario; determining mutual information between adjacent initial entities value, and combine the adjacent initial entities whose mutual information value is greater than the first threshold into the first phrase entity; determine the left and right of the initial entities whose mutual information value is greater than the second threshold of the adjacent initial entities in the preprocessed corpus entropy, and determine the second phrase entity based on the initial entity whose left and right entropy is less than the third threshold; the second threshold is less than the first threshold; the initial entity with the mutual information value between the adjacent initial entity is less than or equal to the first threshold The entity, the first phrase entity and the second phrase entity are determined as the target entity; based on the corpus of the government affairs scenario and the target entity, the relationship between the target entities is extracted; the determined relationship between the target entities is stored in the relational database to Build a government knowledge graph.

以上所描述的装置实施例仅仅是示意性的，其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下，即可以理解并实施。The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated. The components shown as units may or may not be physical units, that is, they may be located in One location, or it can be distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement the method without any creative effort.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件。基于这样的理解，上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品可以存储在计算机可读存储介质中，如ROM/RAM、磁碟、光盘等，包括若干指令用以使得一台计算机设备（可以是个人计算机，服务器，或者网络设备等）执行各个实施例或者实施例的某些部分所述的方法。Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and of course, it can also be implemented by hardware. Based on this understanding, the part of the above technical solution that essentially contributes to the existing technology can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disc, optical disk, etc., including a number of instructions to cause a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in various embodiments or certain parts of the embodiments.

最后应说明的是：以上实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that it can still be used Modifications are made to the technical solutions described in the foregoing embodiments, or equivalent substitutions are made to some of the technical features; however, these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The standardized government affair data construction method based on the knowledge graph is characterized by comprising the following steps of:

carrying out data preprocessing on a corpus of government affair scenes;

based on the seed words of the government affair scene, identifying initial entities in the preprocessed corpus by utilizing a feature extraction model;

determining mutual information values between adjacent initial entities, and combining the adjacent initial entities with the mutual information values larger than a first threshold value into a first phrase entity;

determining left and right entropy of an initial entity with mutual information value larger than a second threshold value in the preprocessed corpus of adjacent initial entities, and determining a second phrase entity based on the initial entity with left and right entropy smaller than a third threshold value; the second threshold is less than the first threshold;

Determining an initial entity, the first phrase entity and the second phrase entity, of which the mutual information values with adjacent initial entities are smaller than or equal to the first threshold value, as target entities;

based on a corpus of government scenes and the target entities, extracting the relation between the target entities;

and storing the determined relation between the target entities in a relation database to construct a government knowledge graph.

2. The knowledge-graph-based standardized government data construction method according to claim 1, wherein the determining the mutual information value between adjacent initial entities includes:

traversing a corpus of the government affair scene, and respectively determining two adjacent initial entities and the occurrence times of phrases formed by the two adjacent initial entities in the corpus of the government affair scene;

dividing the occurrence times of two adjacent initial entities by the total number of all words in the corpus respectively to obtain the occurrence probability of the words of the two adjacent initial entities, and dividing the occurrence times of phrases formed by the two adjacent initial entities in the corpus of the government scene by the total number of all words in the corpus to obtain the occurrence probability of phrases formed by the two adjacent initial entities;

And determining a mutual information value between the adjacent initial entities based on the occurrence probabilities of the two adjacent initial entity words and the occurrence probabilities of phrases formed by the two adjacent initial entities.

3. The knowledge-graph-based standardized government data construction method according to claim 1, wherein the determining the left and right entropy of the initial entity with the mutual information value of the adjacent initial entities greater than the second threshold in the preprocessed corpus comprises:

determining a word selecting window size, wherein the word selecting window size is used for limiting the range of left and right adjacent words of the initial entity in the preprocessed corpus;

based on the size of the word selecting window, acquiring adjacent words of the initial entities, of which the mutual information values are larger than a second threshold, in the left-right range in the corpus of the government affair scene;

determining the frequency of occurrence of each adjacent word in all word selecting windows, and respectively obtaining the occurrence frequency of each adjacent word in all left word selecting windows and all right word selecting windows;

based on the frequency of each adjacent word in each word selecting window, the left and right entropy of the initial entity corresponding to each adjacent word in the preprocessed corpus is determined.

4. The knowledge-graph-based standardized government data construction method of claim 3, wherein the determining the second phrase entity based on the initial entity having the left-right entropy less than the third threshold includes:

determining left entropy values and right entropy values in left entropy and right entropy values of an initial entity with left entropy and right entropy smaller than a third threshold in the preprocessed corpus;

and forming the second phrase entity by the adjacent word corresponding to the smaller one of the left entropy value and the right entropy value and the initial entity.

5. The knowledge-graph-based standardized government affair data construction method according to claim 1, wherein the identifying the initial entity in the preprocessed corpus by using the feature extraction model based on the seed words of the government affair scene comprises:

constructing a text feature template based on the seed words of the government affair scene; the feature templates comprise word vector representations of words, part-of-speech tags, seed word distances and context information;

training the feature extraction model by using an initial training corpus and the text feature template;

inputting the corpus in the preprocessed corpus into the trained feature extraction model to obtain the features corresponding to the initial entity output by the feature extraction model, and determining the initial entity.

6. The knowledge-graph-based standardized government data construction method of claim 5 wherein the feature extraction model is a BiLSTM-CRF model, the feature extraction model including an input layer, an encoding layer and a decoding layer;

the input layer is used for converting words or characters in the input corpus text into continuous vector representations;

the coding layer is used for extracting the context characteristics of the vector sequence of the input corpus text and generating a coding vector sequence containing semantic information;

the decoding layer is used for marking the coded vector sequence output by the coding layer by using a conditional random field, predicting the label of each marked position and generating and outputting the characteristics corresponding to the initial entity.

7. The knowledge-graph-based standardized government affair data construction method according to claim 1, wherein the seed words of the government affair scene include words in a dictionary specific to the government affair scene and words artificially marked in the government affair scene.

8. A standardized government affair data construction device based on a knowledge graph is characterized in that,

the first processing module is used for preprocessing data of a corpus of government affair scenes;

The second processing module is used for identifying an initial entity in the preprocessed corpus by utilizing the feature extraction model based on the seed words of the government affair scene;

the third processing module is used for determining mutual information values between adjacent initial entities and combining the adjacent initial entities with the mutual information values larger than a first threshold value into a first phrase entity;

the fourth processing module is used for determining left and right entropy of the initial entity with the mutual information value larger than a second threshold value in the preprocessed corpus of the adjacent initial entities and determining a second phrase entity based on the initial entity with the left and right entropy smaller than a third threshold value; the second threshold is less than the first threshold;

a fifth processing module, configured to determine an initial entity, the first phrase entity, and the second phrase entity, where a mutual information value between the first phrase entity and an adjacent initial entity is less than or equal to the first threshold, as a target entity;

the sixth processing module is used for extracting the relation between the target entities based on the corpus of government affair scenes and the target entities;

and the seventh processing module is used for storing the determined relation between the target entities in a relation database so as to construct a government knowledge graph.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the knowledge-graph-based standardized government data construction method of any of claims 1 to 7 when the program is executed by the processor.

10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the knowledge-graph-based standardized government data construction method of any of claims 1 to 7.