CN112487819A

CN112487819A - Method, system, electronic device and storage medium for identifying homonyms among enterprises

Info

Publication number: CN112487819A
Application number: CN202011502898.4A
Authority: CN
Inventors: 罗镇权; 刘世林; 张发展; 祝凯
Original assignee: Chengdu Business Big Data Technology Co Ltd
Current assignee: Chengdu Business Big Data Technology Co Ltd
Priority date: 2020-12-18
Filing date: 2020-12-18
Publication date: 2021-03-12

Abstract

The present application relates to the field of natural language processing, and in particular to a method, system, electronic device and storage medium for identifying same-name persons among enterprises. The method comprises the following steps: Step 1, obtain the features of same-name persons to be used as input. Step 2, obtain feature-labeled data. Step 3, organize the labeled data into a prepared training data set, in which a representative is selected from the data of the same person across different companies. Step 4, prepare a twin (Siamese) network structure whose sub-network includes a recurrent neural network. Step 5, input the training data set into the twin network for training to obtain a trained model. Step 6, use the trained model to make predictions. Compared with the prior art, the present invention greatly reduces the amount of calculation and brings the time complexity well below the original N².

Description

Method, system, electronic device and storage medium for identifying homonyms among enterprises

Technical Field

The application relates to the field of natural language processing, and in particular to a method, a system, an electronic device and a storage medium for identifying same-name persons among enterprises.

Background

With the rapid development of internet technology, more and more public data is available, and increasing attention is paid to how to organize such unstructured data quickly. In big data technology, information on enterprises, shareholders, senior executives and the like is extracted to build a knowledge graph, which plays an important role in fields such as market research, investment analysis and financial supervision. When an association graph is drawn, if it cannot be determined whether natural persons appearing in the enterprise information are the same person, several same-name natural-person entities that are actually one and the same natural person will appear in the graph, which affects inference and analysis on the graph. Aligning same-name entities is therefore important in knowledge graph construction. If identity card data were available, aligning same-name persons would be simple, but identity card information is private personal data and is difficult to obtain. There is therefore a need for a technical method that generates a "unique ID" for natural persons in public data, so as to tell apart the same-name persons of different companies.

A popular approach at present is machine learning: the features of same-name persons are input, a machine learning model judges whether they are the same person, and persons identified as the same person are given the same number as their "unique ID".

Distinguishing multiple natural-person entities that share a name is of great significance: it solves a problem that knowledge graph construction has to face at the very outset, and it is expected to find wider application in the future. At present, a machine learning method is generally used to judge same-name persons: each pair of persons is put through a binary classification that decides whether or not they are the same person. The problem is that the amount of calculation is extremely large, with a time complexity of N², so actual requirements cannot be met well.

For example, the existing invention patent application No. CN 201910256769.2, filed on 2019.04.01 and entitled "A method for disambiguating the names of industrial and commercial senior executives based on enterprise association relationships", discloses the following technical scheme. The method relates to the field of entity disambiguation and comprises the following steps: dividing a data set U to be disambiguated into n executive name groups A according to executive names; constructing, for each group A, an association network G of executives and enterprises within N layers according to the name group division obtained in step S1; calculating, for each name group A, the association closeness f between the executive nodes in the group according to a closeness calculation rule; and constructing a clustering function CL from the association closeness and obtaining the disambiguation result with a hierarchical clustering algorithm. That method uses a multilayer relationship network and builds a clustering function from the association closeness for disambiguation. It is an unsupervised learning method, and since an unsupervised method offers no control over or check on its results, the clustering may well produce unwanted results, so the identification accuracy is not high.

Disclosure of Invention

In order to overcome the defects in the prior art, the application provides a method, a system, an electronic device and a storage medium for identifying the same-name person among enterprises, which belong to a supervised learning method and can reduce the calculation amount of a judgment process and shorten the calculation time of the judgment process.

In order to achieve the technical effects, the technical scheme of the application is as follows:

A method for identifying same-name persons among enterprises comprises the following steps:

Step 1, obtaining the features of same-name persons to be used as input.

Step 2, obtaining feature-labeled data, wherein the labeled data at least comprises a person name and the features corresponding to the person name.

Step 3, organizing the labeled data into a prepared training data set, wherein a representative is selected from the data of the same person across different companies, and other same-name persons are compared with the representative only, without being compared with all the data of other same-name persons.

The input form of the training data is a sample $(x_1, x_2, y)$, wherein $y$ takes the value 0 or 1 and $x_1$, $x_2$ are input feature vectors of same-name persons. A representative is selected for the same person across different companies; the selection rule may be derived from the existing features, for example choosing the record of the company with the largest registered capital. When a record of the same person and the representative form $(x_1, x_2)$, the label $y$ is 1; when a record of a different person with the same name and the representative form $(x_1, x_2)$, the label $y$ is 0.
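By way of illustration only, the representative-selection rule mentioned above (the company with the largest registered capital) can be sketched as follows. The record layout, the field names `registered_capital` and `person_id`, and the function name are assumptions introduced for this sketch and do not appear in the application.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Record:
    """One 'person name + company' record (all field names are hypothetical)."""
    person_name: str
    company: str
    registered_capital: float  # assumed to be among the available features
    person_id: int             # labeled identity: equal ids mean the same natural person


def select_representatives(records: List[Record]) -> Dict[int, Record]:
    """For each labeled person, keep the record whose company has the largest registered capital."""
    representatives: Dict[int, Record] = {}
    for record in records:
        current = representatives.get(record.person_id)
        if current is None or record.registered_capital > current.registered_capital:
            representatives[record.person_id] = record
    return representatives


records = [
    Record("Zhang San", "A", 500.0, person_id=0),
    Record("Zhang San", "B", 120.0, person_id=0),
    Record("Zhang San", "alpha", 80.0, person_id=1),
    Record("Zhang San", "beta", 300.0, person_id=1),
]
representatives = select_representatives(records)
```

Any other deterministic rule over the existing features could be substituted; the only requirement the text places on the rule is that it yields one representative per natural person.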

Step 4, preparing a twin network structure, wherein a sub-network of the twin network comprises a recurrent neural network. Given a sample $(x_1, x_2, y)$ with $y$ equal to 0 or 1, the cosine similarity is expressed as

$$E_W(x_1, x_2) = \frac{\langle f_W(x_1), f_W(x_2) \rangle}{\lVert f_W(x_1) \rVert_2 \, \lVert f_W(x_2) \rVert_2}.$$

The loss function may be expressed as

$$L_W = \sum_{i} L_W^{(i)}\left(x_1^{(i)}, x_2^{(i)}, y^{(i)}\right),$$

wherein

$$L_W^{(i)} = y^{(i)} L_+\left(x_1^{(i)}, x_2^{(i)}\right) + \left(1 - y^{(i)}\right) L_-\left(x_1^{(i)}, x_2^{(i)}\right).$$

Wherein:

$x_1$: a vector composed of the features of a same-name person, such as the feature vector of Zhang San + A;

$x_2$: another vector composed of the features of a same-name person, such as Zhang San + α;

$y$: whether the two are the same person; $y = 1$ means the same person, and $y = 0$ means two different persons with the same name;

$E_W(x_1, x_2)$: the cosine similarity of the two converted name vectors, computed with the ordinary cosine similarity formula;

$f_W(x_1)$: the new vector obtained after $x_1$ is input into the twin network, and likewise $f_W(x_2)$ for $x_2$;

$\langle \cdot, \cdot \rangle$: the inner product of two vectors;

$\lVert f_W(x_1) \rVert_2 \, \lVert f_W(x_2) \rVert_2$: the product of the 2-norms of the vectors $f_W(x_1)$ and $f_W(x_2)$;

$x_1^{(i)}$, $x_2^{(i)}$ and $y^{(i)}$: since there are many same-name person pairs, each pair is distinguished by an index $(i)$, where $i$ runs from 0 to the total number of same-name person pairs minus 1;

$m$: the similarity strictness, used to adjust how similar two vectors are required to be; the larger $m$ is set, the higher the cosine similarity required of the two vectors, and $m$ takes values in (0, 1);

$L_+$, $L_-$: during training, $L_+$ is adopted if $y$ is 1 and $L_-$ is adopted if $y$ is 0.
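As a minimal sketch under stated assumptions, the cosine similarity and a label-dependent loss can be written in plain Python as below. The cosine similarity follows the description in the text (inner product divided by the product of the 2-norms). The concrete positive and negative terms are NOT taken from the application: they are one common contrastive choice, assumed here only to make the roles of $y$ and $m$ concrete. Input vectors are assumed to be non-zero.

```python
import math
from typing import Iterable, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Inner product of u and v divided by the product of their 2-norms (non-zero vectors assumed)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_loss(u: Vector, v: Vector, y: int, m: float = 0.5) -> float:
    """Loss for one encoded pair; the two branches are assumed forms, not the application's."""
    e = cosine_similarity(u, v)
    if y == 1:
        # Assumed positive term: pull same-person pairs toward similarity 1.
        return 0.25 * (1.0 - e) ** 2
    # Assumed negative term: penalize different-person pairs only above the margin m.
    return e * e if e > m else 0.0


def total_loss(samples: Iterable[Tuple[Vector, Vector, int]], m: float = 0.5) -> float:
    """Sum of the per-pair losses over all labeled pairs."""
    return sum(pair_loss(u, v, y, m) for u, v, y in samples)


samples = [
    ([1.0, 0.0], [1.0, 0.0], 1),  # same person, identical vectors
    ([1.0, 0.0], [0.0, 1.0], 0),  # different persons, orthogonal vectors
    ([1.0, 0.0], [1.0, 0.0], 0),  # different persons, identical vectors
]
loss = total_loss(samples)
```

In this sketch the vectors passed in stand for the outputs of the twin network; only the third pair contributes to the loss, because it is labeled as two different persons yet is maximally similar.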

Step 5, inputting the training data set into the twin network for training to obtain a trained model.

Step 6, using the trained model for prediction: newly input data is compared with the representatives of the same-name persons. If it is the same as a representative (that is, it reaches the same-person judgment threshold, which may be set manually), it is added to the corresponding group and the current round of calculation ends; if it differs from all of them, it is regarded as a new same-name person and is designated as a new representative.
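The prediction step can be illustrated with the following sketch, which assumes that every record has already been encoded into a vector by the trained model and that "the same" means a cosine similarity at or above a manually chosen threshold. The function name `assign_to_group` and the threshold value 0.8 are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Inner product divided by the product of the 2-norms (non-zero vectors assumed)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def assign_to_group(
    new_vector: Vector,
    representatives: List[Vector],
    groups: List[List[Vector]],
    threshold: float = 0.8,  # manually set same-person threshold (assumed value)
) -> int:
    """Compare the new record with representatives only and return the index of its group."""
    for index, representative in enumerate(representatives):
        if cosine_similarity(new_vector, representative) >= threshold:
            groups[index].append(new_vector)  # same person: join the group, round finished
            return index
    # No representative matched: the record is a new person and becomes a new representative.
    representatives.append(new_vector)
    groups.append([new_vector])
    return len(representatives) - 1


representatives: List[Vector] = []
groups: List[List[Vector]] = []
first = assign_to_group([1.0, 0.0], representatives, groups)
second = assign_to_group([0.9, 0.1], representatives, groups)
third = assign_to_group([0.0, 1.0], representatives, groups)
```

Each new record is compared with at most one vector per known person, never with the other members of a group.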

Preferably, the sub-network of the twin network employed in step 4 is a Bi-LSTM network structure.

Further, the application provides a system for identifying same-name persons among enterprises, which comprises a data acquisition module, a data storage module and a data processing module, wherein the data acquisition module is in signal connection with the data storage module, and the data storage module is in signal connection with the data processing module;

the data acquisition module is used for acquiring the features of same-name persons, the feature-labeled data, the training data set and the twin network;

the data storage module is used for storing the data output by the data acquisition module and the data processing module;

and the data processing module is used for inputting the training data set into the twin network for training to obtain a trained model, predicting with the trained model, comparing newly input data with the representatives of the same-name persons, adding the newly input data to the data of the same person if it is the same as a representative, and regarding it as a new same-name person if it differs from the representatives.

Further, the present application provides an electronic device for identifying same-name persons among enterprises, comprising a processor and a memory, the processor being connected to the memory, the memory storing program code which, when executed by the processor, causes the processor to execute the method of the present application to complete the identification of same-name persons among enterprises.

Further, the present application provides a computer readable storage medium comprising program code for causing an electronic device to perform the steps of the method of the present application, when said program code is run on the electronic device.

The beneficial effects of the application are as follows:

1. The invention provides a method for identifying same-name persons among enterprises that uses a twin network. Owing to the characteristics of the twin network, this is equivalent to taking each representative's name vector as a center in the space: the name vectors within a group gather as close as possible around their representative, while different representatives lie as far apart as possible. This improves the sensitivity of the judgment, ensures identification accuracy, and overcomes the defect that other machine learning methods cannot resolve the representative and chain-similarity problems. Meanwhile, the sub-network of the twin network comprises a recurrent neural network, which gives higher calculation accuracy than a convolutional neural network and wider applicability; the method is also suitable for ordinary same-name nodes in addition to large nodes (a large node is one with more than 100 same-name persons).

2. The invention adopts comparison against a selected representative, which greatly reduces the amount of calculation compared with the pairwise comparison of the prior art: identifying a new same-name person only requires comparison with the representatives selected earlier, not with every member of each group, so calculation efficiency is greatly improved.
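The reduction in calculation can be made concrete with a simple count. The sketch below assumes N records sharing one name that belong to K distinct natural persons; the function names and the figures 1000 and 5 are illustrative only.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of comparisons when every record is compared with every other record."""
    return n * (n - 1) // 2


def representative_comparisons(n: int, k: int) -> int:
    """Upper bound when each record is compared with at most k representatives."""
    return n * k


n_records, n_persons = 1000, 5
pairwise = pairwise_comparisons(n_records)                           # 499500
by_representative = representative_comparisons(n_records, n_persons)  # 5000
```

The pairwise count grows with N², whereas the representative-based count grows with N·K, which stays far smaller whenever the number of distinct persons is much smaller than the number of records.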

Drawings

FIG. 1 is a flow chart of a method for identifying homonyms between enterprises.

FIG. 2 is a diagram illustrating a twin network architecture according to the present invention, which is formed by BI-LSTM.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Example 1

As shown in fig. 1, a method for identifying same-name persons among enterprises includes the following steps:

Step 1, acquiring the features of same-name persons to be used as input. The features may include, but are not limited to: the company name, keywords in the company name, the industry of the company, the company address, the number of companies associated with the same name, whether the companies are directly related, whether two companies are sibling companies, whether two companies are parent and subsidiary, whether two companies are in a grandparent relationship, whether the companies have another second-degree relationship, the number of senior executives of the company, the street number of the company, the number of enterprises with a name change among the nationwide enterprise relationships, the number of provinces in which each name appears, and the like. The main purpose of these features is to tell same-name persons apart. They are features that have proven effective in practice, and better results can be obtained with them.

Step 2, obtaining feature-labeled data, wherein the labeled data at least comprises a person name and the features corresponding to the person name. Specifically, the feature-labeled data can be obtained by manual labeling or from existing data. Feature labeling means specifying, for same-name persons with given features, which records belong to the same person and which do not.

Step 3, organizing the labeled data into a prepared training data set, wherein a representative is selected from the data of the same person across different companies, and other same-name persons are compared with the representative only, without being compared with all the data of other same-name persons. The input form of the training data is a sample $(x_1, x_2, y)$, wherein $y$ takes the value 0 or 1 and $x_1$, $x_2$ are input feature vectors of same-name persons: when a record of the same person and the representative form $(x_1, x_2)$, the label $y$ is 1, and when a record of a different person with the same name and the representative form $(x_1, x_2)$, the label $y$ is 0. To improve computational efficiency, the representative may be selected by machine according to set rules. An example follows:

The feature-labeled data is as follows: [Zhang San + A, Zhang San + B, Zhang San + C, Zhang San + D, Zhang San + E] and [Zhang San + α, Zhang San + β, Zhang San + γ, Zhang San + θ] are two different persons named Zhang San. A and α represent different company names, and B and β, C and γ, D and θ respectively represent different contents within the same feature class. Zhang San + A and Zhang San + α are selected as the respective representatives, which gives the samples [Zhang San + A, Zhang San + B, 1], [Zhang San + A, Zhang San + C, 1], [Zhang San + α, Zhang San + β, 1], [Zhang San + α, Zhang San + γ, 1] and [Zhang San + A, Zhang San + α, 0].
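The construction of training samples in the Zhang San example can be sketched as follows. The function name `build_training_pairs` and the dictionary layout (representative mapped to the other records of the same person) are assumptions made for illustration; negative samples here are formed between representatives only, as in the example, although a non-representative record of a different person could equally be paired with a representative and labeled 0.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str, int]


def build_training_pairs(groups: Dict[str, List[str]]) -> List[Pair]:
    """groups maps each representative to the other records of the same natural person."""
    pairs: List[Pair] = []
    representatives = list(groups)
    # Positive samples: representative paired with each other record of the same person (y = 1).
    for representative in representatives:
        for member in groups[representative]:
            pairs.append((representative, member, 1))
    # Negative samples: representatives of different persons who share the name (y = 0).
    for i in range(len(representatives)):
        for j in range(i + 1, len(representatives)):
            pairs.append((representatives[i], representatives[j], 0))
    return pairs


groups = {
    "Zhang San + A": ["Zhang San + B", "Zhang San + C", "Zhang San + D", "Zhang San + E"],
    "Zhang San + alpha": ["Zhang San + beta", "Zhang San + gamma", "Zhang San + theta"],
}
pairs = build_training_pairs(groups)
```

No sample is ever built between two non-representative records, which is what keeps the number of comparisons small.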

Step 4, preparing a twin network structure, wherein a sub-network of the twin network comprises a recurrent neural network. The twin network structure, together with the way the input data is designed, ensures that the selected representative is effective and reliable. The word "Siamese" in "Siamese network" alludes to Siamese (conjoined) twins, hence the name "twin network". A twin network is one in which the two networks, Network_1 and Network_2, generally have the same structure and share parameters, i.e. their parameters are identical. In the supervised learning paradigm, a twin neural network maximizes the distance between the representations of different labels and minimizes the distance between the representations of the same label. Because the sub-networks of the twin network comprise a recurrent neural network, the calculation accuracy is improved considerably. A recurrent neural network has memory and is particularly suited to sequence problems. Each feature of a same-name person can be regarded as a sequence, and thanks to the memory of the recurrent neural network, the vector it produces can reflect slight differences between different feature data, so the accuracy of subsequent calculation is better assured than with a vector produced by a convolutional neural network. Experiments show that, for sequence problems, a recurrent neural network computes more accurately than a convolutional neural network, although it is not as fast.

In the present application, the inputs $x_1$ and $x_2$ are each re-encoded as a vector by one of the two sub-networks of the twin network, that is, the twin network converts $x_1$ and $x_2$ into new vectors. Owing to the characteristics of the twin network, once training is finished the vectors within the same group are as close as possible and the vectors of different groups are as far apart as possible. In other words, the constructed vectors take each representative's name vector as a center in the space: the name vectors within a group gather as close as possible around their representative, while different representatives are as far apart as possible. This improves the sensitivity of the judgment, ensures identification accuracy, and overcomes the defect that other machine learning methods cannot resolve the representative and chain-similarity problems. The chain-similarity problem refers to a chain A->B->C->D->E: if A is not compared with E, A may be regarded as different from E and discarded. In addition, the representative-comparison approach greatly reduces the amount of computation required to complete the whole identification.
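Parameter sharing between the two sub-networks can be illustrated with the toy sketch below. It is a forward pass only and is not the network of the application: a small untrained Elman-style recurrent encoder written in plain Python stands in for the recurrent sub-network, and all class names, sizes and the random initialization are assumptions.

```python
import math
import random
from typing import List, Sequence

FeatureSequence = Sequence[Sequence[float]]


class TinyRecurrentEncoder:
    """Toy Elman-style recurrent encoder, a stand-in for the recurrent sub-network."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
                     for _ in range(hidden_size)]
        self.w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                      for _ in range(hidden_size)]

    def encode(self, sequence: FeatureSequence) -> List[float]:
        """Read the sequence step by step; the final hidden state is the output vector."""
        hidden = [0.0] * self.hidden_size
        for step in sequence:
            hidden = [
                math.tanh(
                    sum(w * x for w, x in zip(self.w_in[j], step))
                    + sum(w * h for w, h in zip(self.w_rec[j], hidden))
                )
                for j in range(self.hidden_size)
            ]
        return hidden


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


class TwinNetwork:
    """Both inputs pass through the SAME encoder object, so all parameters are shared."""

    def __init__(self, encoder: TinyRecurrentEncoder) -> None:
        self.encoder = encoder

    def similarity(self, x1: FeatureSequence, x2: FeatureSequence) -> float:
        return cosine_similarity(self.encoder.encode(x1), self.encoder.encode(x2))


twin = TwinNetwork(TinyRecurrentEncoder(input_size=3, hidden_size=4))
seq_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical encoded features of one record
seq_b = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical encoded features of another record
same = twin.similarity(seq_a, seq_a)
different = twin.similarity(seq_a, seq_b)
```

Because a single encoder instance serves both branches, identical inputs always map to identical vectors; training (not shown) would then adjust the shared weights so that records of the same person move together and records of different persons move apart.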

Example 2

Step 2, obtaining feature-labeled data, wherein the labeled data at least comprises a person name and the features corresponding to the person name. Specifically, the feature-labeled data can be obtained by manual labeling or from existing data. Feature labeling means specifying, for same-name persons with given features, which records belong to the same person and which do not. The method is general: only part of the same-name persons need to be selected for labeling. For example, even when the total number of same-name person plus company records reaches hundreds of millions, 5000 or 10000 pairs of data can be selected for labeling, and after model training is finished even same-name persons that were not labeled can be identified well.

Step 3, organizing the labeled data into a prepared training data set, wherein a representative is selected from the data of the same person across different companies, and other same-name persons are compared with the representative only, without being compared with all the data of other same-name persons. The input form of the training data is a sample $(x_1, x_2, y)$, wherein $y$ takes the value 0 or 1 and $x_1$, $x_2$ are input feature vectors of same-name persons: when a record of the same person and the representative form $(x_1, x_2)$, the label $y$ is 1, and when a record of a different person with the same name and the representative form $(x_1, x_2)$, the label $y$ is 0.

Examples are as follows:

Step 4, preparing a twin network structure. The twin network may adopt a Bi-LSTM network structure or the like. Given a sample $(x_1, x_2, y)$ with $y$ equal to 0 or 1, the cosine similarity is expressed as

$$E_W(x_1, x_2) = \frac{\langle f_W(x_1), f_W(x_2) \rangle}{\lVert f_W(x_1) \rVert_2 \, \lVert f_W(x_2) \rVert_2}.$$

The loss function may be expressed as

$$L_W = \sum_{i} L_W^{(i)}\left(x_1^{(i)}, x_2^{(i)}, y^{(i)}\right),$$

wherein

$$L_W^{(i)} = y^{(i)} L_+\left(x_1^{(i)}, x_2^{(i)}\right) + \left(1 - y^{(i)}\right) L_-\left(x_1^{(i)}, x_2^{(i)}\right).$$

This process is realized by the twin network. The twin network structure in this embodiment combines BI-LSTM with cosine similarity: the network generates the vectors, the vectors are compared through the cosine similarity used in the loss function expression, and the remaining parts that are not further described can be realized with the prior art in the field.

Wherein:

$x_1$: a vector composed of the features of a same-name person;

$x_2$: another vector composed of the features of a same-name person, such as Zhang San + α;

$y$: whether the two are the same person;

$E_W(x_1, x_2)$: the cosine similarity of the two converted name vectors, computed with the ordinary cosine similarity formula;

$f_W(x_1)$: the new vector obtained after $x_1$ is input into the twin network, and likewise $f_W(x_2)$ for $x_2$;

$\langle \cdot, \cdot \rangle$: the inner product of two vectors;

$\lVert f_W(x_1) \rVert_2 \, \lVert f_W(x_2) \rVert_2$: the product of the 2-norms of the vectors $f_W(x_1)$ and $f_W(x_2)$;

$x_1^{(i)}$, $x_2^{(i)}$ and $y^{(i)}$: since there are many same-name person pairs, each pair is distinguished by an index $(i)$, where $i$ runs from 0 to the total number of same-name person pairs minus 1;

$L_+$, $L_-$: during training, $L_+$ is adopted if $y$ is 1 and $L_-$ is adopted if $y$ is 0.

As shown in FIG. 2, which is a twin network architecture diagram built from BI-LSTM, the BI-LSTM is a two-layer LSTM network in which each small square represents one LSTM unit and the two layers are connected as shown by the arrows. From bottom to top, the features $x_1$ and $x_2$ of same-name persons are taken as input, the twin network computes and outputs the vectors $f_W(x_1)$ and $f_W(x_2)$ respectively, and the cosine similarity of the two vectors, denoted $E_W(x_1, x_2)$, is then calculated, where $\lVert \cdot \rVert$ denotes the length of a vector. The LSTM is an excellent variant of the recurrent neural network (RNN): it inherits most of the characteristics of RNN models and solves the vanishing gradient problem caused by the gradual shrinking of gradients during back-propagation. Stacking two LSTMs captures the characteristics of the task better and achieves a better calculation effect.

Step 6, predicting with the trained model: the new input data is compared with the representatives of the same-name persons; if it is the same as a representative, it is added to the data of that same person, and if it differs from the representatives, it is regarded as a new same-name person.

In particular, before step 6, representatives can be selected by machine, according to set rules, for same-name persons that have not been labeled.

Specifically, for a same-name person to be predicted, it is first judged whether the name belongs to the labeled names. If so, the method proceeds to step 6; if not, the machine selects a representative according to the set rules, and the method then proceeds to step 6.

Example 3

On the basis of embodiment 1 and embodiment 2, the application provides a system for identifying same-name persons among enterprises, which comprises a data acquisition module, a data storage module and a data processing module, wherein the data acquisition module is in signal connection with the data storage module, and the data storage module is in signal connection with the data processing module;

the data acquisition module is used for acquiring the features of same-name persons, the feature-labeled data, the training data set and the twin network;

Example 4

On the basis of embodiments 1-3, the present application provides an electronic device for identifying same-name persons among enterprises, which includes a processor and a memory, wherein the processor is connected to the memory and the memory stores program code; when the program code is executed by the processor, the processor executes the method of the present application to complete the identification of same-name persons among different enterprises.

Example 5

The present application provides a computer readable storage medium comprising program code means for causing an electronic device to carry out the steps of the method of the present application, when said program code means are run on said electronic device.

The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Those of ordinary skill in the art will appreciate that the various illustrative modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided in the present application, it should be understood that the disclosed system may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for identifying same-name persons among enterprises, characterized in that the method comprises the following steps:

step 1, acquiring features for input;

step 2, obtaining data after characteristic marking, wherein the marked data at least comprises a person name and a characteristic corresponding to the person name;

step 3, arranging the marked data into a prepared training data set, wherein a representative is selected from the data aiming at the same person of different companies, and the representative is used for comparing other persons with the same name;

step 4, preparing a twin network structure, wherein a sub-network of the twin network comprises a recurrent neural network;

step 5, inputting the training data set into a twin network for training to obtain a trained model;

step 6, predicting with the trained model, comparing the new input data with the representatives of the same-name persons, adding the new input data to the data of the same person if it is the same as a representative, and determining that it is a new same-name person if it differs from the representatives.

2. The method for identifying same-name persons among enterprises according to claim 1, wherein: in step 2, the feature-labeled data is obtained by manual labeling or by purchasing third-party data.

3. The method for identifying same-name persons among enterprises according to claim 1, wherein: the input form of the training data set in step 3 is a sample $(x_1, x_2, y)$, wherein $y$ takes the value 0 or 1 and $x_1$, $x_2$ are input feature vectors of same-name persons; a representative is selected for the same person across different companies; when a record of the same person and the representative form $(x_1, x_2)$, the label $y$ is 1, and when a record of a different person with the same name and the representative form $(x_1, x_2)$, the label $y$ is 0.

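4. The method for identifying same-name persons among enterprises as claimed in claim 3, wherein: in step 4, given a sample $(x_1, x_2, y)$ wherein $y$ takes the value 0 or 1, the cosine similarity is expressed as

$$E_W(x_1, x_2) = \frac{\langle f_W(x_1), f_W(x_2) \rangle}{\lVert f_W(x_1) \rVert_2 \, \lVert f_W(x_2) \rVert_2};$$

the loss function may be expressed as

$$L_W = \sum_{i} L_W^{(i)}\left(x_1^{(i)}, x_2^{(i)}, y^{(i)}\right),$$

wherein

$$L_W^{(i)} = y^{(i)} L_+\left(x_1^{(i)}, x_2^{(i)}\right) + \left(1 - y^{(i)}\right) L_-\left(x_1^{(i)}, x_2^{(i)}\right);$$

wherein $f_W(x_1)$ denotes the new vector obtained after $x_1$ is input into the twin network, and $f_W(x_2)$ denotes the new vector obtained after $x_2$ is input into the twin network; $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors; $\lVert f_W(x_1) \rVert_2 \, \lVert f_W(x_2) \rVert_2$ denotes the product of the 2-norms of the vectors $f_W(x_1)$ and $f_W(x_2)$;

$x_1^{(i)}$, $x_2^{(i)}$ and $y^{(i)}$ express that there are many same-name person pairs, each pair being distinguished by an index $(i)$, where $i$ runs from 0 to the total number of same-name person pairs minus 1;

during training, $L_+$ is adopted if $y$ is 1, and $L_-$ is adopted if $y$ is 0.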

5. The method for identifying the same-name person among the enterprises according to claim 1, wherein: the sub-network of the twin network adopted in the step 4 is a Bi-LSTM network structure.

6. The method for identifying the same-name person among the enterprises according to claim 1, wherein: before the step 6, for the same-name people which are not marked, the representative people are selected by the machine through the set rules.

7. The method for identifying same-name persons among enterprises as claimed in claim 6, wherein: for a same-name person to be predicted, it is first judged whether the name belongs to the labeled names, and if so, the method proceeds to step 6; if not, the machine selects a representative according to the set rules, and the method then proceeds to step 6.

8. A system for identifying same-name persons among enterprises, characterized in that: the system comprises a data acquisition module, a data storage module and a data processing module, wherein the data acquisition module is in signal connection with the data storage module, and the data storage module is in signal connection with the data processing module;

9. An electronic device for identifying same-name persons among enterprises, characterized in that: the device comprises a processor and a memory, said processor being coupled to said memory, said memory storing program code which, when executed by said processor, causes said processor to perform the method of any of claims 1 to 7 to complete the identification of same-name persons among different enterprises.

10. A computer-readable storage medium characterized by: stored with program code for causing an electronic device to carry out the steps of the method as claimed in any of claims 1-5, when said program code is run on said electronic device.