Disclosure of Invention
To overcome the defects of the prior art, the present application provides a method, a system, an electronic device and a storage medium for identifying same-name persons across enterprises. The method is a supervised learning method that reduces the amount of computation in the judgment process and shortens its computation time.
To achieve the above technical effects, the technical solution of the present application is as follows:
A method for identifying same-name persons across enterprises comprises the following steps:
Step 1, obtaining features for input.
Step 2, obtaining feature-labeled data, wherein the labeled data comprises at least a person name and the features corresponding to that person name.
Step 3, arranging the labeled data into a training data set, wherein, for the same person appearing in different companies, a representative is selected from the data, and other same-name records are compared with the representative only, rather than with all data of the other same-name records.
The input form of the training data is a sample $(x_1, x_2, y)$, where $y \in \{0, 1\}$ and $x_1$, $x_2$ are the input feature vectors of two same-name records. A representative is selected for the same person across different companies; the selection rule can be derived from the existing features, for example choosing the record of the company with the largest registered capital. When a record of the same person is paired with the representative to form $(x_1, x_2)$, the label is $y = 1$; when a record of a different person with the same name is paired with the representative to form $(x_1, x_2)$, the label is $y = 0$.
Step 4, preparing a twin network structure, wherein a sub-network of the twin network comprises a recurrent neural network. Given a sample $(x_1, x_2, y)$ with $y \in \{0, 1\}$, the cosine similarity is expressed as follows:
$$E_W(x_1, x_2) = \frac{\langle f_W(x_1), f_W(x_2) \rangle}{\lVert f_W(x_1) \rVert_2 \, \lVert f_W(x_2) \rVert_2}$$
The following loss function may be used, summed over all same-name pairs:
$$L_W = \sum_{i} L_W^{(i)}\left(x_1^{(i)}, x_2^{(i)}, y^{(i)}\right)$$
Wherein:
$x_1$ is a vector composed of the features of a same-name record, for example the feature vector of Zhang San + A;
$x_2$ is another vector composed of same-name features, for example Zhang San + α;
$y$ indicates whether the two records are the same person: $y = 1$ means the same person, and $y = 0$ means two different persons with the same name;
$E_W(x_1, x_2)$ is the standard cosine similarity of the two transformed name vectors;
$f_W(x_1)$ is the new vector obtained after $x_1$ is input into the twin network, and likewise $f_W(x_2)$ for $x_2$;
$\langle f_W(x_1), f_W(x_2) \rangle$ is the inner product of the two vectors;
$\lVert f_W(x_1) \rVert_2 \, \lVert f_W(x_2) \rVert_2$ is the product of the 2-norms of the vectors $f_W(x_1)$ and $f_W(x_2)$;
$x_1^{(i)}$, $x_2^{(i)}$ and $y^{(i)}$ denote the $i$-th same-name pair and its label: there are many same-name pairs, each distinguished by the index $i$, which takes values from 0 to the total number of same-name pairs minus 1;
$m$ is the similarity strictness, used to adjust how strictly the two vectors must be similar; the larger $m$ is set, the higher the cosine similarity required of the two vectors, and $m$ takes values in $(0, 1)$;
$L_+$ and $L_-$ are the two branches of the per-pair loss $L_W^{(i)}$: if $y^{(i)}$ is 1 during training, $L_+\left(x_1^{(i)}, x_2^{(i)}\right)$ is adopted; if $y^{(i)}$ is 0, $L_-\left(x_1^{(i)}, x_2^{(i)}\right)$ is adopted [the explicit expressions of $L_+$ and $L_-$ are missing from the source text].
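The cosine similarity defined above, and a loss that switches between a same-person branch and a different-person branch, can be sketched in plain Python. The two branch expressions used here (1 minus the similarity for y = 1, and the similarity in excess of m for y = 0) are an assumption borrowed from a common cosine-margin form; they are not taken from the application, whose own expressions may differ. The function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their 2-norms.
    Assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_loss(u, v, y, m=0.5):
    """Loss for one pair of network output vectors.
    ASSUMED branch forms, not the application's own expressions:
      y == 1: penalize similarity below 1
      y == 0: penalize similarity above the margin m
    """
    e = cosine_similarity(u, v)
    if y == 1:
        return 1.0 - e
    return max(0.0, e - m)

def total_loss(pairs, m=0.5):
    """Sum of the per-pair losses over all same-name pairs (u, v, y)."""
    return sum(pair_loss(u, v, y, m) for u, v, y in pairs)
```

With this assumed form, a same-person pair with identical vectors contributes zero loss, while a different-person pair contributes loss only when its similarity exceeds m.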
Step 5, inputting the training data set into the twin network for training to obtain a trained model.
Step 6, using the trained model for prediction: newly input data is compared with the representatives of the same-name persons. If the new data matches a representative (a match means reaching the same-person judgment threshold, which can be set manually), it is added to the corresponding group and the current round of computation ends; if it matches none, it is considered a new same-name person and is designated as a new representative.
Preferably, the sub-network of the twin network employed in step 4 is a Bi-LSTM network structure.
Further, the application provides a system for identifying same-name persons across enterprises, which comprises a data acquisition module, a data storage module and a data processing module, wherein the data acquisition module is in signal connection with the data storage module, and the data storage module is in signal connection with the data processing module;
the data acquisition module is used for acquiring the features of same-name persons, the feature-labeled data, the training data set and the twin network;
the data storage module is used for storing the data output by the data acquisition module and the data processing module;
and the data processing module is used for inputting the training data set into the twin network for training to obtain a trained model, predicting with the trained model, and comparing newly input data with the representatives of the same-name persons, wherein the newly input data is added to the data of a same-name person if it matches that person's representative, and is considered a new same-name person if it matches no representative.
Further, the present application provides an electronic device for identifying same-name persons across enterprises, comprising a processor and a memory, the processor being connected to the memory, the memory storing program code which, when executed by the processor, causes the processor to execute the method of the present application to identify same-name persons across enterprises.
Further, the present application provides a computer readable storage medium comprising program code for causing an electronic device to perform the steps of the method of the present application, when said program code is run on the electronic device.
The beneficial effects of the present application are:
1. The invention provides a method for identifying same-name persons across enterprises that adopts a twin network. Owing to the characteristics of the twin network, this is equivalent to taking each representative's name vector as a center in the vector space: the name vectors within a group lie as close as possible to their representative, clustered around it, while different representatives lie as far apart as possible. This improves judgment sensitivity, ensures identification accuracy, and overcomes the defect that other machine learning methods cannot determine a representative and cannot resolve the chain similarity problem. In addition, because the sub-network of the twin network comprises a recurrent neural network, the computation accuracy is higher than with a convolutional neural network and the applicability is broader: in addition to large nodes (a large node is one with more than 100 same-name persons), the method also applies to ordinary same-name nodes.
2. The invention compares against a selected representative, which greatly reduces the amount of computation relative to the pairwise comparison of the prior art and greatly improves computation efficiency, because identifying a new same-name record only requires comparison with the previously selected representatives and not with all members of each group.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Example 1
As shown in fig. 1, a method for identifying same-name persons across enterprises includes the following steps:
Step 1, acquiring features for input. The features may include, but are not limited to: company name, keywords in the company name, the industry of the company, company address, number of same-name companies, whether the companies are directly related, whether two companies are sibling companies, whether two companies are in a parent relationship, whether two companies are in a grandparent relationship, whether the companies have another second-degree relationship, number of senior-executive companies, street number of the company, number of enterprises with name changes in the national enterprise relationship data, number of provinces in which each name appears, and so on. The main purpose of the features is to distinguish same-name companies. These features have proven effective in practice and yield better results.
Step 2, obtaining feature-labeled data, wherein the labeled data comprises at least a person name and the features corresponding to that person name. Specifically, the feature-labeled data can be obtained by manual labeling or from existing data. Feature labeling means specifying, for same-name records with given features, which records belong to the same person and which do not.
Step 3, arranging the labeled data into a training data set, wherein, for the same person appearing in different companies, a representative is selected from the data, and other same-name records are compared with the representative only, rather than with all data of the other same-name records. The input form of the training data is a sample $(x_1, x_2, y)$, where $y \in \{0, 1\}$ and $x_1$, $x_2$ are the input feature vectors of two same-name records. A representative is selected for the same person across different companies; the selection rule can be derived from the existing features, for example choosing the record of the company with the largest registered capital. When a record of the same person is paired with the representative to form $(x_1, x_2)$, the label is $y = 1$; when a record of a different person with the same name is paired with the representative to form $(x_1, x_2)$, the label is $y = 0$. To improve computational efficiency, the representative may be selected by machine according to set rules. An example is as follows:
The feature-labeled data is: [Zhang San + A, Zhang San + B, Zhang San + C, Zhang San + D, Zhang San + E] and [Zhang San + α, Zhang San + β, Zhang San + γ, Zhang San + θ], which are two different persons named Zhang San. A and α denote different company names; B and β, C and γ, D and θ respectively denote different values within the same feature class. Zhang San + A and Zhang San + α are selected as the respective representatives, giving the training samples [Zhang San + A, Zhang San + B, 1], [Zhang San + A, Zhang San + C, 1], [Zhang San + α, Zhang San + β, 1], [Zhang San + α, Zhang San + γ, 1], [Zhang San + A, Zhang San + α, 0].
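The pairing rule of this example can be sketched as a small helper. This is a minimal illustration, assuming each group is a list of records of one person with the representative placed first; the function name and the string records are hypothetical. The sketch pairs every non-representative record with its representative, whereas the example above lists only some of those pairs.

```python
def build_training_pairs(groups):
    """Build (x1, x2, y) samples from labeled groups.

    groups: list of lists; each inner list holds the records of one
    person, with the representative first (assumption of this sketch).
    """
    pairs = []
    # Same person: representative paired with each other record, label 1.
    for group in groups:
        representative = group[0]
        for record in group[1:]:
            pairs.append((representative, record, 1))
    # Different persons with the same name: representatives paired, label 0.
    representatives = [group[0] for group in groups]
    for i in range(len(representatives)):
        for j in range(i + 1, len(representatives)):
            pairs.append((representatives[i], representatives[j], 0))
    return pairs

groups = [
    ["Zhang San + A", "Zhang San + B", "Zhang San + C", "Zhang San + D", "Zhang San + E"],
    ["Zhang San + α", "Zhang San + β", "Zhang San + γ", "Zhang San + θ"],
]
pairs = build_training_pairs(groups)
```

Only representatives are paired across groups, so the number of label-0 samples grows with the number of groups and not with the number of records.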
Step 4, preparing a twin network structure, wherein a sub-network of the twin network comprises a recurrent neural network. The twin network structure, together with the way the input data is designed, ensures that the selected representative is effective and reliable. The word "Siamese" in Siamese network (twin network) refers to Siamese, that is conjoined, twins. A twin network consists of two networks, Network_1 and Network_2, that generally have the same structure and share parameters, i.e. their parameters are identical. In the supervised learning paradigm, a twin neural network pushes apart the representations of samples with different labels and pulls together the representations of samples with the same label. Because the sub-networks of the twin network comprise a recurrent neural network, the computation accuracy is further improved. A recurrent neural network has memory and is particularly suited to sequence problems. The features of a same-name person can be treated as a sequence, and thanks to this memory the vector produced by the recurrent neural network can reflect slight differences among feature data, so the accuracy of subsequent computation is better ensured than with a vector produced by a convolutional neural network. Experiments show that, for sequence problems, the recurrent neural network computes more accurately than the convolutional neural network, although more slowly.
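Parameter sharing between the two sub-networks can be illustrated without any deep learning library. The sketch below is not the network of the application: a plain tanh recurrent cell with fixed, untrained weights stands in for the recurrent sub-network, the features are assumed to be already encoded as a sequence of numbers, and all names are hypothetical. The point it shows is that one encoder object, hence one parameter set, processes both inputs.

```python
import math

class TinyRecurrentEncoder:
    """Plain tanh recurrent cell used as a stand-in for the recurrent
    sub-network. Weights are fixed for illustration, not learned."""

    def __init__(self, w_in=0.5, w_rec=0.8, hidden=3):
        self.w_in = [w_in * (k + 1) for k in range(hidden)]
        self.w_rec = w_rec

    def encode(self, sequence):
        # The hidden state carries memory of earlier sequence elements.
        h = [0.0] * len(self.w_in)
        for x in sequence:
            h = [math.tanh(w * x + self.w_rec * hk) for w, hk in zip(self.w_in, h)]
        return h

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# One encoder instance serves both branches: the parameters are shared.
encoder = TinyRecurrentEncoder()

def twin_similarity(seq1, seq2):
    return cosine(encoder.encode(seq1), encoder.encode(seq2))
```

Because the hidden state depends on the order of the inputs, the same feature values presented in a different order yield a different vector, which is the memory property the paragraph above relies on.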
In the present application, the two input records $x_1$ and $x_2$ are each transformed into a vector by one of the two sub-networks of the twin network, that is, the twin network maps $x_1$ and $x_2$ to new vectors. Because of the characteristics of the twin network, after training the vectors within the same group are as close as possible and the vectors of different groups are as far apart as possible. In other words, the constructed vectors take each representative's name vector as a center in the space: the name vectors within a group lie as close as possible to their representative, clustered around it, while different representatives lie as far apart as possible. This improves judgment sensitivity, ensures identification accuracy, and overcomes the defect that other machine learning methods cannot determine a representative and cannot resolve the chain similarity problem. The chain similarity problem refers to a chain A->B->C->D->E in which A is never compared with E, so that A may be considered different from E and discarded. In addition, comparing against representatives greatly reduces the amount of computation needed to complete the entire identification.
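The reduction in computation can be made concrete with a small count. The sketch below is an illustrative calculation under simple assumptions: one model evaluation per comparison, and the group sizes 5 and 4 taken from the Zhang San example. The function names are hypothetical.

```python
def pairwise_comparisons(total_records):
    """Comparisons needed when every record is compared with every other."""
    return total_records * (total_records - 1) // 2

def comparisons_for_new_record(group_sizes, use_representatives):
    """Comparisons needed to place one new record.

    With representatives, one comparison per group suffices;
    otherwise the record is compared with every stored member.
    """
    if use_representatives:
        return len(group_sizes)
    return sum(group_sizes)

group_sizes = [5, 4]  # two persons named Zhang San, as in the example
all_pairs = pairwise_comparisons(sum(group_sizes))            # 36
with_members = comparisons_for_new_record(group_sizes, False)  # 9
with_reps = comparisons_for_new_record(group_sizes, True)      # 2
```

The gap widens as groups grow, since the representative count depends only on the number of distinct persons sharing the name.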
Step 5, inputting the training data set into the twin network for training to obtain a trained model.
Step 6, using the trained model for prediction: newly input data is compared with the representatives of the same-name persons. If the new data matches a representative (a match means reaching the same-person judgment threshold, which can be set manually), it is added to the corresponding group and the current round of computation ends; if it matches none, it is considered a new same-name person and is designated as a new representative.
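Step 6 can be sketched as a threshold test against the stored representatives. This is a minimal illustration that assumes the vectors are already the outputs of the trained twin network, that similarity is the cosine similarity, and that the best-scoring representative is the one tested against the threshold; the text only requires reaching the threshold, so picking the best match is a choice of this sketch. The threshold value 0.8 and all names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def assign_to_group(new_vector, representatives, threshold=0.8):
    """Return the index of the group the new record joins.

    representatives: list of representative vectors, one per group;
    the list index serves as the group id (assumption of this sketch).
    If no representative reaches the threshold, the new record becomes
    the representative of a new group.
    """
    best_index, best_similarity = None, -1.0
    for index, representative in enumerate(representatives):
        similarity = cosine(new_vector, representative)
        if similarity > best_similarity:
            best_index, best_similarity = index, similarity
    if best_index is not None and best_similarity >= threshold:
        return best_index
    representatives.append(new_vector)  # new same-name person
    return len(representatives) - 1

representatives = [[1.0, 0.0], [0.0, 1.0]]
joined = assign_to_group([0.9, 0.1], representatives)   # close to group 0
created = assign_to_group([1.0, 1.0], representatives)  # matches neither
```

Each call costs one comparison per existing representative, independent of how many records each group already holds.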
Example 2
As shown in fig. 1, a method for identifying same-name persons across enterprises includes the following steps:
Step 1, acquiring features for input. The features may include, but are not limited to: company name, keywords in the company name, the industry of the company, company address, number of same-name companies, whether the companies are directly related, whether two companies are sibling companies, whether two companies are in a parent relationship, whether two companies are in a grandparent relationship, whether the companies have another second-degree relationship, number of senior-executive companies, street number of the company, number of enterprises with name changes in the national enterprise relationship data, number of provinces in which each name appears, and so on. The main purpose of the features is to distinguish same-name companies. These features have proven effective in practice and yield better results.
Step 2, obtaining feature-labeled data, wherein the labeled data comprises at least a person name and the features corresponding to that person name. Specifically, the feature-labeled data can be obtained by manual labeling or from existing data. Feature labeling means specifying, for same-name records with given features, which records belong to the same person and which do not. The method is general, and only part of the same-name persons need to be labeled. For example, when the total number of same-name person plus company records reaches hundreds of millions, 5000 or 10000 pairs of data can be selected for labeling, and after model training is finished even same-name persons that were not labeled can be identified well.
Step 3, arranging the labeled data into a training data set, wherein, for the same person appearing in different companies, a representative is selected from the data, and other same-name records are compared with the representative only, rather than with all data of the other same-name records. The input form of the training data is a sample $(x_1, x_2, y)$, where $y \in \{0, 1\}$ and $x_1$, $x_2$ are the input feature vectors of two same-name records. A representative is selected for the same person across different companies; the selection rule can be derived from the existing features, for example choosing the record of the company with the largest registered capital. When a record of the same person is paired with the representative to form $(x_1, x_2)$, the label is $y = 1$; when a record of a different person with the same name is paired with the representative to form $(x_1, x_2)$, the label is $y = 0$.
An example is as follows:
The feature-labeled data is: [Zhang San + A, Zhang San + B, Zhang San + C, Zhang San + D, Zhang San + E] and [Zhang San + α, Zhang San + β, Zhang San + γ, Zhang San + θ], which are two different persons named Zhang San. A and α denote different company names; B and β, C and γ, D and θ respectively denote different values within the same feature class. Zhang San + A and Zhang San + α are selected as the respective representatives, giving the training samples [Zhang San + A, Zhang San + B, 1], [Zhang San + A, Zhang San + C, 1], [Zhang San + α, Zhang San + β, 1], [Zhang San + α, Zhang San + γ, 1], [Zhang San + A, Zhang San + α, 0].
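The rule of choosing the record of the company with the largest registered capital can be sketched as follows. The record layout, the field names `company` and `registered_capital`, and the capital figures are hypothetical and chosen only for illustration.

```python
def select_representative(records):
    """Pick the record whose company has the largest registered capital.

    records: list of dicts with the hypothetical keys 'company' and
    'registered_capital'. On a tie, the first such record is returned.
    """
    return max(records, key=lambda record: record["registered_capital"])

def representative_first(records):
    """Reorder a group so that its representative comes first."""
    representative = select_representative(records)
    others = [record for record in records if record is not representative]
    return [representative] + others

# Illustrative values only; they do not come from the application.
zhang_san_group = [
    {"company": "B", "registered_capital": 200},
    {"company": "A", "registered_capital": 5000},
    {"company": "C", "registered_capital": 1000},
]
ordered = representative_first(zhang_san_group)
```

Any other rule derived from the existing features could be substituted by changing the `key` function.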
Step 4, preparing a twin network structure. The twin network can adopt a Bi-LSTM network structure or the like. Given a sample $(x_1, x_2, y)$ with $y \in \{0, 1\}$, the cosine similarity is expressed as follows:
$$E_W(x_1, x_2) = \frac{\langle f_W(x_1), f_W(x_2) \rangle}{\lVert f_W(x_1) \rVert_2 \, \lVert f_W(x_2) \rVert_2}$$
The following loss function may be used, summed over all same-name pairs:
$$L_W = \sum_{i} L_W^{(i)}\left(x_1^{(i)}, x_2^{(i)}, y^{(i)}\right)$$
This process is realized by the twin network. The twin network structure in this embodiment combines Bi-LSTM with cosine similarity: the network generates the vectors, the vectors are compared by cosine similarity, and the loss function uses the expression above. The remaining parts that are not further described can be realized by the prior art in the field.
Wherein:
$x_1$ is a vector composed of the features of a same-name record, for example the feature vector of Zhang San + A;
$x_2$ is another vector composed of same-name features, for example Zhang San + α;
$y$ indicates whether the two records are the same person: $y = 1$ means the same person, and $y = 0$ means two different persons with the same name;
$E_W(x_1, x_2)$ is the standard cosine similarity of the two transformed name vectors;
$f_W(x_1)$ is the new vector obtained after $x_1$ is input into the twin network, and likewise $f_W(x_2)$ for $x_2$;
$\langle f_W(x_1), f_W(x_2) \rangle$ is the inner product of the two vectors;
$\lVert f_W(x_1) \rVert_2 \, \lVert f_W(x_2) \rVert_2$ is the product of the 2-norms of the vectors $f_W(x_1)$ and $f_W(x_2)$;
$x_1^{(i)}$, $x_2^{(i)}$ and $y^{(i)}$ denote the $i$-th same-name pair and its label: there are many same-name pairs, each distinguished by the index $i$, which takes values from 0 to the total number of same-name pairs minus 1;
$L_+$ and $L_-$ are the two branches of the per-pair loss $L_W^{(i)}$: if $y^{(i)}$ is 1 during training, $L_+\left(x_1^{(i)}, x_2^{(i)}\right)$ is adopted; if $y^{(i)}$ is 0, $L_-\left(x_1^{(i)}, x_2^{(i)}\right)$ is adopted [the explicit expressions of $L_+$ and $L_-$ are missing from the source text].
FIG. 2 shows the twin network architecture with Bi-LSTM as the sub-network. The Bi-LSTM is a two-layer LSTM network; each small square represents one LSTM unit, and the two layers are connected as indicated by the arrows. From bottom to top, the same-name features $x_1$ and $x_2$ are taken as input; after computation by the twin network, the vectors $f_W(x_1)$ and $f_W(x_2)$ are output respectively, and the cosine similarity of the two vectors, i.e. $E_W(x_1, x_2)$, is then computed. The number in the figure denotes the length of the vector. The LSTM is an excellent variant of the recurrent neural network (RNN): it inherits most characteristics of RNN models and solves the vanishing gradient problem caused by the gradual shrinking of gradients during back-propagation. Stacking two LSTMs captures the characteristics of the task better and achieves a better computation result.
Step 5, inputting the training data set into the twin network for training to obtain a trained model.
Step 6, using the trained model for prediction: new input data is compared with the representatives of the same-name persons; if it matches a representative, it is added to the data of that same-name person, and if it matches none, it is considered a new same-name person.
In particular, before step 6, representatives can be selected by machine, according to the set rules, for same-name persons that have not been labeled.
Specifically, for a same-name person to be predicted, it is first judged whether the name belongs to the labeled names. If so, step 6 is carried out; if not, the machine selects a representative according to the set rules, and step 6 is then carried out.
Example 3
On the basis of embodiment 1 and embodiment 2, the application provides a system for identifying same-name persons across enterprises, which comprises a data acquisition module, a data storage module and a data processing module, wherein the data acquisition module is in signal connection with the data storage module, and the data storage module is in signal connection with the data processing module;
the data acquisition module is used for acquiring the features of same-name persons, the feature-labeled data, the training data set and the twin network;
the data storage module is used for storing the data output by the data acquisition module and the data processing module;
and the data processing module is used for inputting the training data set into the twin network for training to obtain a trained model, predicting with the trained model, and comparing newly input data with the representatives of the same-name persons, wherein the newly input data is added to the data of a same-name person if it matches that person's representative, and is considered a new same-name person if it matches no representative.
Example 4
On the basis of embodiments 1-3, the present application provides an electronic device for identifying same-name persons across enterprises, which includes a processor and a memory, wherein the processor is connected to the memory and the memory stores program code; when the program code is executed by the processor, the processor executes the method of the present application to identify same-name persons across different enterprises.
Example 5
The present application provides a computer readable storage medium comprising program code means for causing an electronic device to carry out the steps of the method of the present application, when said program code means are run on said electronic device.
The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Those of ordinary skill in the art will appreciate that the various illustrative modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.