Disclosure of Invention
To remedy the deficiencies of the prior art, the invention provides a graph database-based method and system for automatically classifying enterprises by high and new technology field. A technical field knowledge graph is constructed, relations are established between the high and new technology fields and the knowledge graph and between enterprises and the knowledge graph, and the high and new technology field of an enterprise that is not yet a high and new technology enterprise is determined from these relations. This resolves the field ambiguity that technology enterprises face when they are cultivated as, and apply for recognition as, high and new technology enterprises.
To achieve this purpose, the invention adopts the following technical solution:
A first aspect of the invention provides a graph database-based automatic classification method for enterprises in the high and new technology field.
A graph database-based automatic classification method for enterprises in the high and new technology field comprises the following steps:
acquiring the label names of a non-high and new technology enterprise;
searching the graph database for entities matching the label names of the non-high and new technology enterprise, establishing relations between the entities found and the enterprise, and adding the enterprise to the graph database;
determining the high and new technology field category of the enterprise according to the knowledge graph relations in the graph database.
Further, acquiring the label names of a non-high and new technology enterprise includes: performing word segmentation on the business scope of the enterprise, the trademark information of the enterprise and the intellectual property information of the enterprise to obtain the label names of the enterprise.
Furthermore, the trademark category of each enterprise trademark is translated according to the trademark dictionary: the category is looked up in the dictionary and its subsets to obtain the trademark category name and the trademark category description, and the label names of the enterprise are obtained after word segmentation.
Furthermore, each patent classification number is translated according to the IPC patent classification dictionary: the classification number is looked up in the dictionary and its subsets to obtain the patent classification number names, and the label names of the enterprise are obtained after word segmentation.
Further, knowledge graph construction comprises the following steps:
acquiring scientific and technical literature data;
taking the existing disciplines and the association relations among them as the basis for automatically constructing the knowledge graph;
according to the title and abstract of scientific and technical literature data, entity identification is carried out by using a voting mechanism;
defining the co-occurrence relation of different entities appearing in the abstract of the same scientific and technical literature as the relation between the entities;
storing the entities and relationships in a graph database.
Furthermore, processing the high and new technology field information and adding it to the graph database comprises the following steps:
performing word segmentation on the field name and the field description of each high and new technology field, the resulting words serving as initial labels of the field;
performing word frequency statistics on the initial labels over the high and new technology enterprises of each field, and taking an initial label as a final label of the field if its word frequency exceeds a threshold;
searching the graph database for entities matching the final label names of each high and new technology field, establishing relations between the entities found and the field, and adding the field to the graph database.
Furthermore, before the technical field knowledge graph is constructed, the acquired high and new technical field information, trademark classification information and IPC international patent classification information are subjected to structuring processing.
A second aspect of the invention provides a graph database-based automatic classification system for enterprises in the high and new technology field.
A graph database-based automatic classification system for enterprises in the high and new technology field comprises:
a data acquisition module configured to acquire the label names of a non-high and new technology enterprise;
an entity lookup module configured to search the graph database for entities matching the label names of the non-high and new technology enterprise, establish relations between the entities found and the enterprise, and add the enterprise to the graph database;
a domain identification module configured to determine the high and new technology field category of the enterprise according to the knowledge graph relations in the graph database.
A third aspect of the present invention provides a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the steps of the graph database-based automatic classification method for enterprises in the high and new technology field according to the first aspect of the present invention.
A fourth aspect of the present invention provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for automatically classifying a high and new technology field enterprise based on a graph database according to the first aspect of the present invention.
Compared with the prior art, the invention has the beneficial effects that:
1. In the graph database-based automatic classification method and system for enterprises in the high and new technology field according to the invention, high and new technology field information, trademark classification information, IPC international patent classification information and other data are collected over the Internet from government open data websites; the collected information is preprocessed and structured; a plurality of relational databases are established; a technical field knowledge graph is constructed; and on this basis the automatic classification method for the national high and new technology fields is built.
2. With the method and system of the invention, non-high and new technology enterprises are analyzed and their technical fields are classified automatically. This assists the staged cultivation of high and new technology enterprises, allows the list of enterprises to be cultivated in a region to be identified clearly, and provides technical support for cultivating regional enterprises as high and new technology enterprises and for their applications.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example 1:
As shown in Fig. 1, Embodiment 1 of the present invention provides a graph database-based automatic classification method for enterprises in the high and new technology field. A classification dictionary relational database is formed by structuring the high and new technology field information, the trademark classification information and the patent classification information. A technical field knowledge graph is constructed from scientific and technical literature and the national standard "Subject Classification and Code", and a graph database is established. The main intellectual property information of an enterprise is processed and labeled using the classification dictionary relational database; the processed enterprises and the high and new technology fields are added to the graph database; the association between the two is built through the knowledge graph; and finally the high and new technology field to which the main products (services) of the enterprise belong is computed.
Specifically, the method comprises the following steps:
S1: Collect high and new technology field information, trademark classification information and IPC international patent classification information, structure it, and store it in a database.
S1.1: Collect the high and new technology field information, trademark classification information and IPC international patent classification information from government public data websites.
S1.2: Structure the high and new technology field information into field code, field name, field description, parent field code and field level. The hierarchy has 3 levels: major class, middle class and minor class.
S1.3: Structure the trademark classification information into trademark code, trademark name, trademark description, parent trademark code and trademark level. The hierarchy has 3 levels: major class, middle class and minor class.
S1.4: Structure the international patent classification information into patent code, patent name, parent patent code and patent level. The hierarchy has 5 levels: section, class, subclass, main group and subgroup.
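As a non-limiting illustration, the structured records of S1.2 to S1.4 can be sketched as a single record type with a parent code. The class name `ClassificationNode`, the helper `ancestors` and the three sample IPC rows (with abbreviated names) are assumptions made for this sketch and are not prescribed by the method.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ClassificationNode:
    """One row of a classification dictionary (field, trademark, or IPC)."""
    code: str
    name: str
    parent_code: Optional[str]  # None for a top-level entry
    level: int                  # 1 = top level of the hierarchy
    description: str = ""       # S1.4 lists no description for IPC records


def ancestors(code: str, table: Dict[str, ClassificationNode]) -> List[ClassificationNode]:
    """Return the chain from the given code up to its top-level ancestor."""
    chain = []
    current = table.get(code)
    while current is not None:
        chain.append(current)
        current = table.get(current.parent_code) if current.parent_code else None
    return chain


# Hypothetical sample rows covering the first three IPC levels of S1.4.
ipc_rows = [
    ClassificationNode("G", "Physics", None, 1),
    ClassificationNode("G06", "Computing; calculating or counting", "G", 2),
    ClassificationNode("G06F", "Electric digital data processing", "G06", 3),
]
ipc_table = {row.code: row for row in ipc_rows}
```

Storing the parent code on each row, rather than nesting rows, keeps the three dictionaries in the same flat relational shape described in S1.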
S2: Technical field knowledge graph construction
S2.1: Scientific and technical literature, including scientific and technical reports, scientific and technical books, periodicals, conference papers, patent documents and standards documents, is collected over the Internet.
S2.2: The national standard "Subject Classification and Code" divides disciplines into first-level, second-level and third-level disciplines and establishes the association relations among them. These existing disciplines and their association relations serve as the basis for automatically constructing the knowledge graph.
S2.3: The titles and abstracts of the scientific and technical literature are taken as the data source for constructing the technical field knowledge graph, and entity recognition is carried out using a voting mechanism.
Specifically, several models take part in the vote, including a BERT-BiLSTM-CRF model that introduces Len_loss and several groups of expert-defined regular-expression models of entity descriptions. An entity identified by more than half of the models is accepted as a voting result. Once a new entity is identified, the similarity between it and each existing entity is calculated. If the similarity exceeds a threshold, the entity with the longer text is used as the unified entity; otherwise, the entity is registered as a new entity.
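As a non-limiting illustration, the majority vote and the merging of similar entities can be sketched as follows. The sketch assumes that each model's output is already available as a set of entity strings; the function names, the 0.8 threshold and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions, since the description does not fix a particular similarity measure.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List, Set


def vote_entities(model_outputs: List[Set[str]]) -> Set[str]:
    """Keep the entities recognized by more than half of the participating models."""
    counts = Counter(entity for output in model_outputs for entity in output)
    return {entity for entity, n in counts.items() if n * 2 > len(model_outputs)}


def merge_entity(new_entity: str, known: Set[str], threshold: float = 0.8) -> str:
    """Unify a new entity with a similar known one, preferring the longer text.

    The similarity measure (character-level ratio) is an assumption of this sketch.
    """
    for existing in sorted(known):
        if SequenceMatcher(None, new_entity, existing).ratio() > threshold:
            unified = max(new_entity, existing, key=len)
            known.discard(existing)
            known.add(unified)
            return unified
    # No sufficiently similar entity exists: register the new one.
    known.add(new_entity)
    return new_entity
```

For example, with three models, an entity must appear in at least two of the three output sets to be kept.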
S2.4: The co-occurrence of different entities in the abstract of the same item of scientific and technical literature is defined as the relation between those entities. Each relation comprises a relation name and a relation weight; the weight records how often the two entities occur together across the whole corpus and measures the closeness of the relation.
S2.5: storing the entities and relationships in a graph database.
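As a non-limiting illustration, the co-occurrence relations of S2.4 can be computed as weighted edges before being stored as in S2.5. The sketch assumes that the entities of each abstract have already been recognized and that co-occurrence is counted once per abstract; the function name and the relation name `CO_OCCURS` are hypothetical, and a plain list of dictionaries stands in for the graph database.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List


def cooccurrence_edges(abstract_entities: Iterable[Iterable[str]]) -> List[Dict]:
    """Build one weighted co-occurrence edge per unordered entity pair."""
    counts: Counter = Counter()
    for entities in abstract_entities:
        # Deduplicate within an abstract; sorting makes (a, b) and (b, a) identical.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return [
        {"source": a, "target": b, "name": "CO_OCCURS", "weight": n}
        for (a, b), n in sorted(counts.items())
    ]
```

Each returned dictionary carries the relation name and relation weight required by S2.4 and could be written to a graph database as one edge.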
S3: The entity text of the graph database identified in step S2.3 is added to the word segmentation dictionary to optimize the word segmenter.
S4: Perform word segmentation on the business scope of the enterprise, the enterprise trademark information and the enterprise intellectual property information; the resulting words serve as the labels of the enterprise.
S4.1: Translate the trademark category of each enterprise trademark according to the trademark dictionary, look the category up in the dictionary and its subsets, output the trademark category name and trademark category description, and perform word segmentation to obtain enterprise labels.
S4.2: Translate the patent classification number of each item of intellectual property according to the IPC patent classification dictionary, look the classification number up in the dictionary and its subsets, output the patent classification number names, and perform word segmentation to obtain enterprise labels.
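As a non-limiting illustration, the translation of a classification code into enterprise labels can be sketched as a dictionary lookup followed by word segmentation. Several points are assumptions of this sketch: the "dictionary subset" is read as the direct sub-entries of the matched entry, a simple regular-expression word split stands in for the optimized word segmenter of S3, and the sample dictionary, stop-word list and function names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical IPC dictionary: code -> (name, parent code).
IPC_DICT: Dict[str, Tuple[str, Optional[str]]] = {
    "G06": ("computing calculating counting", None),
    "G06F": ("electric digital data processing", "G06"),
    "G06N": ("computing arrangements based on specific computational models", "G06"),
}

# Hypothetical stop words removed after segmentation.
STOPWORDS = {"on", "based", "of", "and"}


def segment(text: str) -> List[str]:
    """Stand-in segmenter: lowercase word split with stop-word removal."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def labels_for_code(code: str, dictionary: Dict[str, Tuple[str, Optional[str]]]) -> Set[str]:
    """Translate a code into labels using its own entry and its direct sub-entries."""
    names = []
    if code in dictionary:
        names.append(dictionary[code][0])
    # Assumed reading of "dictionary subset": entries whose parent is this code.
    names.extend(name for name, parent in dictionary.values() if parent == code)
    return {token for name in names for token in segment(name)}
```

The same lookup applies to trademark categories in S4.1, with the category description segmented alongside the category name.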
S5: Process the high and new technology field information and add it to the graph database
S5.1: Perform word segmentation on the field name and field description of each high and new technology field; the resulting words serve as initial labels of the field.
S5.2: Perform word frequency statistics on the labels over the high and new technology enterprises of each field; a label whose word frequency exceeds a threshold is taken as a label of the field.
S5.3: Search the graph database for entities matching the label names of each high and new technology field, establish relations between the entities found and the field, and add the field to the graph database.
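As a non-limiting illustration, steps S5.2 and S5.3 can be sketched as a frequency filter followed by an entity lookup. The sketch adopts one possible reading of S5.2, namely that the frequency of each initial field label is counted over the label sets of the high and new technology enterprises already belonging to that field; the function names, the exact-match lookup and the representation of the graph database as a set of entity names are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def final_field_labels(
    initial_labels: Iterable[str],
    enterprise_labels: Iterable[Iterable[str]],
    threshold: int,
) -> Set[str]:
    """Keep the initial labels whose frequency among enterprises exceeds the threshold."""
    # Each enterprise contributes at most one count per label.
    counts = Counter(label for labels in enterprise_labels for label in set(labels))
    return {label for label in initial_labels if counts[label] > threshold}


def link_to_entities(owner: str, labels: Iterable[str], entities: Set[str]) -> List[Tuple[str, str]]:
    """Return (owner, entity) relations for every label that matches a graph entity."""
    return [(owner, label) for label in sorted(set(labels)) if label in entities]
```

The same `link_to_entities` lookup can serve S6.1, with the enterprise as the owner instead of the field.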
S6: Determine the high and new technology field from the labels of the non-high and new technology enterprise and the associated graph database
S6.1: Search the graph database for entities matching the label names of the non-high and new technology enterprise, establish relations between the entities found and the enterprise, and add the enterprise to the graph database.
S6.2: Determine the high and new technology field of the enterprise according to the relations in the knowledge graph.
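As a non-limiting illustration, step S6.2 can be sketched as a scoring of each field against the entities linked to the enterprise. The description does not specify how the relations are evaluated, so the scoring rule used here (number of shared entities plus the sum of co-occurrence weights between enterprise entities and field entities) is purely an assumption, as are the function names and the in-memory data structures.

```python
from typing import Dict, Optional, Set, Tuple


def score_fields(
    enterprise_entities: Set[str],
    field_entities: Dict[str, Set[str]],
    weights: Dict[Tuple[str, str], int],
) -> Dict[str, int]:
    """Score each field by shared entities plus co-occurrence links (assumed rule)."""
    scores = {}
    for field, entities in field_entities.items():
        shared = len(enterprise_entities & entities)
        # Edge keys are stored as alphabetically sorted pairs.
        linked = sum(
            weights.get(tuple(sorted((e, f))), 0)
            for e in enterprise_entities
            for f in entities
            if e != f
        )
        scores[field] = shared + linked
    return scores


def classify(
    enterprise_entities: Set[str],
    field_entities: Dict[str, Set[str]],
    weights: Dict[Tuple[str, str], int],
) -> Optional[str]:
    """Return the best-scoring field, or None when no field has any connection."""
    scores = score_fields(enterprise_entities, field_entities, weights)
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None
```

In a graph database the same computation would typically be expressed as a path query from the enterprise node to the field nodes; the in-memory form is used here only to keep the sketch self-contained.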
Example 2:
Embodiment 2 of the invention provides a graph database-based automatic classification system for enterprises in the high and new technology field, comprising:
a data acquisition module configured to acquire the label names of a non-high and new technology enterprise;
an entity lookup module configured to search the graph database for entities matching the label names of the non-high and new technology enterprise, establish relations between the entities found and the enterprise, and add the enterprise to the graph database;
a domain identification module configured to determine the high and new technology field category of the enterprise according to the knowledge graph relations in the graph database.
The system works in the same way as the graph database-based automatic classification method for enterprises in the high and new technology field provided in Embodiment 1, and the details are not repeated here.
Example 3:
Embodiment 3 of the present invention provides a computer-readable storage medium on which a program is stored, the program, when executed by a processor, implementing the steps of the graph database-based automatic classification method for enterprises in the high and new technology field according to Embodiment 1 of the present invention.
Example 4:
Embodiment 4 of the present invention provides an electronic device, which includes a memory, a processor, and a program stored in the memory and executable on the processor, where the processor executes the program to implement the steps of the graph database-based automatic classification method for enterprises in the high and new technology field according to Embodiment 1 of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.