
CN112487819A - Method, system, electronic device and storage medium for identifying homonyms among enterprises - Google Patents


Info

Publication number
CN112487819A
Authority
CN
China
Prior art keywords
same
data
person
name
representative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011502898.4A
Other languages
Chinese (zh)
Inventor
罗镇权
刘世林
张发展
祝凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Business Big Data Technology Co Ltd
Original Assignee
Chengdu Business Big Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Business Big Data Technology Co Ltd filed Critical Chengdu Business Big Data Technology Co Ltd
Priority to CN202011502898.4A priority Critical patent/CN112487819A/en
Publication of CN112487819A publication Critical patent/CN112487819A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the field of natural language processing, and in particular to a method, system, electronic device and storage medium for identifying same-name persons among enterprises. The method comprises the following steps. Step 1: obtain the features used as input. Step 2: obtain the feature-labeled data. Step 3: organize the labeled data into a training data set, in which one representative is selected from the data for the same person appearing in different companies. Step 4: prepare a twin (Siamese) network structure whose sub-networks contain a recurrent neural network. Step 5: input the training data set into the twin network for training to obtain a trained model. Step 6: use the trained model to make predictions. Compared with the prior art, the invention greatly reduces the amount of computation, substantially lowering the time complexity from the original $N^2$.

Figure 202011502898

Description

Method, system, electronic device and storage medium for identifying homonyms among enterprises
Technical Field
The application relates to the field of natural language processing, in particular to a method and a system for identifying a same-name person among enterprises, electronic equipment and a storage medium.
Background
With the rapid development of internet technology, more and more public data is available, and increasing attention is paid to how to organize this unstructured data quickly. In big data technology, information about enterprises, shareholders, senior executives and the like is extracted to build a knowledge graph, which plays an important role in market research, investment analysis, financial supervision and other fields. When the association graph is drawn, if it cannot be determined whether natural persons appearing in the information of different enterprises are the same person, a single graph will contain several same-name natural-person entities that are in fact one natural person, which impairs inference and analysis over the graph. Aligning same-name entities is therefore important in knowledge graph construction. If identity card data were available, same-name alignment would be simple, but identity card information is personal privacy and is difficult to obtain. A technical method is therefore needed to generate a "unique ID" for natural persons in public data, so that the same person can be identified across different companies.
Machine learning methods are currently popular for this task: the features of same-name persons are input, a machine learning model judges whether they are the same person, and persons identified as the same person are given the same number as their "unique ID".
Distinguishing multiple same-name natural-person entities is of great significance: it addresses a pressing problem in knowledge graph construction and is expected to find wider application in the future. At present, a machine learning method is generally used to judge same-name persons by binary classification of each pair of persons, deciding whether the two are the same person or merely share a name. The problem is that the amount of computation is extremely large: the time complexity reaches $N^2$, which cannot meet practical requirements well.
For example, the existing invention patent application No. CN 201910256769.2, filed on 2019.04.01 and entitled "a method for disambiguating names of business and business high governors based on enterprise association", discloses a method for disambiguating the names of industrial and commercial senior executives based on enterprise association relationships. It relates to the field of entity disambiguation and comprises the following steps: dividing the data set U to be disambiguated into n executive name groups A according to executive names; according to the name group division result obtained in step S1, constructing for each group A a network G of executive and enterprise association relationships within N layers; for each name group A, calculating the association closeness f between the executive nodes in the group according to a closeness calculation rule; and constructing a clustering function CL from the association closeness and obtaining the disambiguation result with a hierarchical clustering algorithm. That method uses a multilayer relationship network and builds a clustering function from the association closeness for disambiguation. It is an unsupervised learning method, and unsupervised learning cannot control or judge the result, so the clustering may well produce unwanted results and the identification accuracy is not high.
Disclosure of Invention
In order to overcome the defects in the prior art, the application provides a method, a system, an electronic device and a storage medium for identifying same-name persons among enterprises. The method is a supervised learning method, and it reduces both the amount of computation and the computation time of the judgment process.
In order to achieve the above technical effects, the technical solution of the application is as follows:
A method for identifying same-name persons among enterprises comprises the following steps:
Step 1, obtaining features for input.
Step 2, obtaining feature-labeled data, wherein the labeled data comprises at least a person name and the features corresponding to that person name.
Step 3, organizing the labeled data into a training data set, wherein a representative is selected from the data for the same person appearing in different companies, so that other same-name persons are compared only with the representative rather than with all data of the other same-name persons.
The input form of the training data is a sample $(x_1, x_2, y)$, where $y$ takes the value 0 or 1 and $(x_1, x_2)$ is a pair of input feature vectors of same-name persons. The pair $(x_1, x_2)$ is built around a representative selected for the same person across different companies; the rule for selecting the representative may be one drawn from the existing features, for example choosing the record of the company with the largest registered capital. When a record of the same person and the representative form $(x_1, x_2)$, the label $y$ is 1; when a different person with the same name and the representative form $(x_1, x_2)$, the label $y$ is 0.
Step 4, preparing a twin network structure, wherein the sub-networks of the twin network comprise a recurrent neural network. Given a sample $(x_1, x_2, y)$ with $y \in \{0, 1\}$, and writing $f(x)$ for the vector that the twin network outputs for an input $x$, the cosine similarity is expressed as

$$E(x_1, x_2) = \frac{\langle f(x_1), f(x_2) \rangle}{\lVert f(x_1) \rVert_2 \, \lVert f(x_2) \rVert_2}$$

The loss function may be expressed as follows:
[formula images: overall loss function and its per-pair term]
Wherein:
$x_1$ is a vector composed of the features of a same-name person, for example the feature vector of Zhang San + A; $x_2$ is another vector composed of same-name-person features, for example Zhang San + α; $y$ indicates whether $(x_1, x_2)$ are the same person: $y = 1$ means they are the same person, and $y = 0$ means they are two different persons with the same name.
$E(x_1, x_2)$ is the ordinary cosine similarity formula, here giving the cosine similarity of the two transformed name vectors; $f(x_1)$ is the new vector obtained after $x_1$ is input to the twin network, and likewise $f(x_2)$ for $x_2$; $\langle f(x_1), f(x_2) \rangle$ is the inner product of the two vectors; $\lVert f(x_1) \rVert_2 \, \lVert f(x_2) \rVert_2$ is the product of the 2-norms of the vectors $f(x_1)$ and $f(x_2)$.
There are many same-name pairs $(x_1, x_2)$; the index $i$ distinguishes the individual pairs $(x_1, x_2, y)$ and takes values from 0 to the total number of same-name pairs minus 1.
$m$ is the similarity strictness and is used to adjust how strictly the two vectors must be similar: the larger $m$ is set, the higher the cosine similarity required of the two vectors. The value range of $m$ is (0, 1).
[formula images: the two branches of the per-pair loss]
During training, the first branch is used if $y$ is 1 and the second branch is used if $y$ is 0.
Step 5, inputting the training data set into the twin network for training to obtain a trained model.
Step 6, using the trained model to predict: newly input data is compared with the representatives of the same-name persons. If it is the same as a representative (that is, the similarity reaches the same-person judgment threshold, which can be set manually), it is added to the corresponding group and the current round of computation ends; if it differs from the representatives, it is regarded as a new same-name person and is designated as a new representative.
Preferably, the sub-network of the twin network used in step 4 has a Bi-LSTM network structure.
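The prediction in step 6 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes the records have already been encoded as vectors by the trained twin network, and the function names `cosine` and `assign_to_groups`, the threshold value 0.8, and the sample vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity: inner product divided by the product of the 2-norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_to_groups(vectors, threshold=0.8):
    """Compare each new vector only with the group representatives.

    A vector joins the first group whose representative reaches the
    threshold; otherwise it becomes the representative of a new group.
    """
    representatives = []  # one encoded vector per group
    groups = []           # record indices belonging to each group
    for idx, vec in enumerate(vectors):
        for g, rep in enumerate(representatives):
            if cosine(vec, rep) >= threshold:
                groups[g].append(idx)
                break
        else:
            # No representative matched: treat as a new same-name person.
            representatives.append(vec)
            groups.append([idx])
    return groups

# Hypothetical encoded records of persons sharing one name.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
groups = assign_to_groups(vectors)
print(groups)
```

In this sketch a record is never compared with the other members of a group, only with one representative per group, which is the source of the reduced computation described in the text.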
Further, the application provides a system for identifying same-name persons among enterprises, which comprises a data acquisition module, a data storage module and a data processing module, wherein the data acquisition module is in signal connection with the data storage module, and the data storage module is in signal connection with the data processing module;
the data acquisition module is used for acquiring the features of the same-name persons, the feature-labeled data, the training data set and the twin network;
the data storage module is used for storing the data output by the data acquisition module and the data processing module;
and the data processing module is used for inputting the training data set into the twin network for training to obtain a trained model, predicting with the trained model, and comparing newly input data with the representatives of the same-name persons: if the newly input data is the same as a representative, it is added to the data of that same-name person; if it differs from the representatives, it is regarded as a new same-name person.
Further, the present application provides an electronic device for identifying same-name persons among enterprises, comprising a processor and a memory, the processor being connected to the memory, the memory storing program code which, when executed by the processor, causes the processor to carry out the method of the present application for identifying same-name persons among enterprises.
Further, the present application provides a computer readable storage medium comprising program code which, when run on an electronic device, causes the electronic device to perform the steps of the method of the present application.
The beneficial effects of this application are:
1. The invention provides a method for identifying same-name persons among enterprises that uses a twin network. Owing to the characteristics of the twin network, each representative name vector acts as a center in the vector space: the name vectors of its group lie as close to the representative as possible, clustered around it, while different representatives lie as far apart as possible. This improves judgment sensitivity, ensures identification accuracy, and overcomes the inability of other machine learning approaches to determine a representative and to handle the chain similarity problem. Meanwhile, the sub-networks of the twin network comprise a recurrent neural network, which gives higher computational accuracy than a convolutional neural network and broader applicability; the method is also suitable for ordinary same-name nodes and not only for large nodes (a large node being one with more than 100 same-name nodes).
2. The invention compares against selected representatives, which greatly reduces the amount of computation compared with the pairwise comparison of the prior art. Because a newly arrived same-name person only needs to be compared with the previously selected representatives and not with all members of each group, computational efficiency is greatly improved.
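The reduction in computation claimed above can be illustrated with a simple count. The sketch below is only arithmetic under stated assumptions: the record count `n = 10000` and the number of representatives `k = 20` are hypothetical figures chosen for illustration, and the representative count is an upper bound in which every record is compared with every representative.

```python
def pairwise_comparisons(n):
    """Number of comparisons when every pair of n records is compared."""
    return n * (n - 1) // 2

def representative_comparisons(n, k):
    """Upper bound when each of n records is compared with at most k representatives."""
    return n * k

n, k = 10000, 20  # hypothetical sizes, not taken from the patent
print(pairwise_comparisons(n))           # grows quadratically with n
print(representative_comparisons(n, k))  # grows linearly with n for fixed k
```

For a fixed number of representatives the second count grows linearly in the number of records, whereas the pairwise count grows quadratically.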
Drawings
FIG. 1 is a flow chart of a method for identifying homonyms between enterprises.
FIG. 2 is a diagram illustrating a twin network architecture according to the present invention, which is formed by BI-LSTM.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Example 1
As shown in FIG. 1, a method for identifying same-name persons among enterprises comprises the following steps:
Step 1, acquiring features for input. The features may include, but are not limited to: company name, keywords in the company name, the industry of the company, company address, the number of companies of the same-name person, whether the companies are directly associated, whether the two companies are sibling companies, whether the two companies are parent and subsidiary, whether the two companies are in a grandparent relationship, whether the companies have another second-degree association, the number of senior executives of the company, the street number of the company, the number of enterprises with a name change in the national enterprise relationships, the number of provinces in which the names are respectively located, and the like. The main purpose of these features is to distinguish same-name persons. These features have proven effective in practice and yield better results.
Step 2, obtaining feature-labeled data, wherein the labeled data comprises at least a person name and the features corresponding to that person name. Specifically, the feature-labeled data can be obtained by manual labeling or from existing data. Feature labeling means specifying, for same-name persons with given features, which of them are the same person and which are not.
Step 3, organizing the labeled data into a training data set, wherein a representative is selected from the data for the same person appearing in different companies, so that other same-name persons are compared only with the representative rather than with all data of the other same-name persons. The input form of the training data is a sample $(x_1, x_2, y)$, where $y$ takes the value 0 or 1 and $(x_1, x_2)$ is a pair of input feature vectors of same-name persons. The pair $(x_1, x_2)$ is built around a representative selected for the same person across different companies; the rule for selecting the representative may be one drawn from the existing features, for example choosing the record of the company with the largest registered capital. When a record of the same person and the representative form $(x_1, x_2)$, the label $y$ is 1; when a different person with the same name and the representative form $(x_1, x_2)$, the label $y$ is 0. To improve computational efficiency, the representative may be selected by machine according to set rules. An example follows:
The feature-labeled data is: [Zhang San + A, Zhang San + B, Zhang San + C, Zhang San + D, Zhang San + E] and [Zhang San + α, Zhang San + β, Zhang San + γ, Zhang San + θ], which are two different persons named Zhang San. A and α denote different company names; B and β, C and γ, D and θ likewise denote different contents within the same feature class. With Zhang San + A and Zhang San + α selected as the respective representatives, the training samples include [Zhang San + A, Zhang San + B, 1], [Zhang San + A, Zhang San + C, 1], [Zhang San + α, Zhang San + β, 1], [Zhang San + α, Zhang San + γ, 1], [Zhang San + A, Zhang San + α, 0].
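The construction of training samples in the example above can be sketched as follows. This is a hedged illustration only: the function name `build_training_pairs` is hypothetical, the first record of each group is assumed to be its representative, and negative samples are formed between representatives because that is the only negative pairing the example shows.

```python
def build_training_pairs(groups):
    """Build (x1, x2, y) samples from labeled groups of same-name records.

    Assumption: groups[k][0] is the representative of group k.
    """
    pairs = []
    representatives = [group[0] for group in groups]
    # Positive samples: representative paired with each other member of its group.
    for group in groups:
        for member in group[1:]:
            pairs.append((group[0], member, 1))
    # Negative samples: representatives of different persons with the same name.
    for i in range(len(representatives)):
        for j in range(i + 1, len(representatives)):
            pairs.append((representatives[i], representatives[j], 0))
    return pairs

groups = [
    ["Zhang San + A", "Zhang San + B", "Zhang San + C", "Zhang San + D", "Zhang San + E"],
    ["Zhang San + α", "Zhang San + β", "Zhang San + γ", "Zhang San + θ"],
]
pairs = build_training_pairs(groups)
for sample in pairs:
    print(sample)
```

Unlike the abbreviated list in the example, this sketch emits a positive sample for every non-representative member of each group.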
Step 4, preparing a twin network structure, wherein the sub-networks of the twin network comprise a recurrent neural network. The twin network structure, together with the way the input data is designed, ensures that the chosen representative is effective and reliable. The "Siamese" in Siamese network (twin network) means Siamese (as in Siamese cat), twin or conjoined. A twin network means that the two networks in the structure, Network_1 and Network_2, generally have the same structure and share parameters, that is, their parameters are identical. In the supervised learning paradigm, a twin neural network maximizes the distance between representations with different labels and minimizes the distance between representations with the same label. Because the sub-networks of the twin network comprise a recurrent neural network, computational accuracy is improved considerably. A recurrent neural network has memory and is particularly suited to sequence problems. The features of a same-name person can be regarded as a sequence, and because of this memory the vector produced by the recurrent neural network can reflect slight differences between different feature data, so the accuracy of subsequent computation is better ensured than with a vector produced by a convolutional neural network. Experiments show that for sequence problems the recurrent neural network, although slower than the convolutional neural network, gives better computational accuracy.
In the present application, the inputs $x_1$ and $x_2$ are each re-encoded as vectors by the two sub-networks of the twin network. Owing to the characteristics of the twin network, after training the vectors of $x_1$ and $x_2$ are as close as possible when they belong to the same group and as far apart as possible when they belong to different groups. In other words, each representative name vector acts as a center in the vector space: the name vectors of its group lie as close to the representative as possible, clustered around it, while different representatives lie as far apart as possible. This improves judgment sensitivity, ensures identification accuracy, and overcomes the inability of other machine learning approaches to determine a representative and to handle the chain similarity problem. The chain similarity problem refers to a chain A->B->C->D->E in which A is never compared directly with E, so that A may be judged different from E and discarded. In addition, comparing only against representatives greatly reduces the amount of computation required to complete the whole identification.
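The parameter sharing between the two sub-networks can be illustrated with a toy encoder. The sketch below is not the Bi-LSTM of the application: it substitutes a one-unit tanh recurrent cell with fixed scalar weights, and the names `make_encoder`, `w_in` and `w_rec` and all numeric values are hypothetical. It shows only that both branches apply the same parameters and that a recurrent encoder is sensitive to the order of the input sequence.

```python
import math

def make_encoder(w_in, w_rec):
    """Return a tiny tanh recurrent encoder with fixed weights (toy stand-in for Bi-LSTM)."""
    def encode(sequence):
        h = 0.0
        states = []
        for x in sequence:
            # The new state depends on the current input and the previous state (memory).
            h = math.tanh(w_in * x + w_rec * h)
            states.append(h)
        return states
    return encode

encoder = make_encoder(w_in=0.5, w_rec=0.3)  # a single set of parameters
network_1 = encoder  # first branch of the twin network
network_2 = encoder  # second branch shares exactly the same parameters

v1 = network_1([1.0, 0.0, 2.0])
v2 = network_2([1.0, 0.0, 2.0])  # same input gives the same vector
v3 = network_2([2.0, 0.0, 1.0])  # same values in another order give a different vector
print(v1, v2, v3)
```

Because both branches are literally the same function, any update to the parameters during training would affect both branches identically, which is what parameter sharing means here.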
Step 5, inputting the training data set into the twin network for training to obtain a trained model.
Step 6, using the trained model to predict: newly input data is compared with the representatives of the same-name persons. If it is the same as a representative (that is, the similarity reaches the same-person judgment threshold, which can be set manually), it is added to the corresponding group and the current round of computation ends; if it differs from the representatives, it is regarded as a new same-name person and is designated as a new representative.
Example 2
As shown in FIG. 1, a method for identifying same-name persons among enterprises comprises the following steps:
Step 1, acquiring features for input. The features may include, but are not limited to: company name, keywords in the company name, the industry of the company, company address, the number of companies of the same-name person, whether the companies are directly associated, whether the two companies are sibling companies, whether the two companies are parent and subsidiary, whether the two companies are in a grandparent relationship, whether the companies have another second-degree association, the number of senior executives of the company, the street number of the company, the number of enterprises with a name change in the national enterprise relationships, the number of provinces in which the names are respectively located, and the like. The main purpose of these features is to distinguish same-name persons. These features have proven effective in practice and yield better results.
Step 2, obtaining feature-labeled data, wherein the labeled data comprises at least a person name and the features corresponding to that person name. Specifically, the feature-labeled data can be obtained by manual labeling or from existing data. Feature labeling means specifying, for same-name persons with given features, which of them are the same person and which are not. The method is general: only part of the same-name persons need to be labeled. For example, when the total number of same-name person + company records reaches hundreds of millions, 5,000 or 10,000 pairs of data may be selected for labeling, and after model training is complete even unlabeled same-name persons can be identified well.
Step 3, organizing the labeled data into a training data set, wherein a representative is selected from the data for the same person appearing in different companies, so that other same-name persons are compared only with the representative rather than with all data of the other same-name persons. The input form of the training data is a sample $(x_1, x_2, y)$, where $y$ takes the value 0 or 1 and $(x_1, x_2)$ is a pair of input feature vectors of same-name persons. The pair $(x_1, x_2)$ is built around a representative selected for the same person across different companies; the rule for selecting the representative may be one drawn from the existing features, for example choosing the record of the company with the largest registered capital. When a record of the same person and the representative form $(x_1, x_2)$, the label $y$ is 1; when a different person with the same name and the representative form $(x_1, x_2)$, the label $y$ is 0.
An example follows:
The feature-labeled data is: [Zhang San + A, Zhang San + B, Zhang San + C, Zhang San + D, Zhang San + E] and [Zhang San + α, Zhang San + β, Zhang San + γ, Zhang San + θ], which are two different persons named Zhang San. A and α denote different company names; B and β, C and γ, D and θ likewise denote different contents within the same feature class. With Zhang San + A and Zhang San + α selected as the respective representatives, the training samples include [Zhang San + A, Zhang San + B, 1], [Zhang San + A, Zhang San + C, 1], [Zhang San + α, Zhang San + β, 1], [Zhang San + α, Zhang San + γ, 1], [Zhang San + A, Zhang San + α, 0].
Step 4, preparing a twin network structure. The twin network may adopt a Bi-LSTM network structure or the like. Given a sample $(x_1, x_2, y)$ with $y \in \{0, 1\}$, and writing $f(x)$ for the vector that the twin network outputs for an input $x$, the cosine similarity is expressed as

$$E(x_1, x_2) = \frac{\langle f(x_1), f(x_2) \rangle}{\lVert f(x_1) \rVert_2 \, \lVert f(x_2) \rVert_2}$$

The loss function may be expressed as follows:
[formula images: overall loss function and its per-pair term]
This process is realized by the twin network. The twin network structure in this embodiment combines Bi-LSTM with cosine similarity: the network generates the vectors, the vectors are compared by cosine similarity, and the remaining parts not described further can be realized with the prior art in the field.
Wherein:
$x_1$ is a vector composed of the features of a same-name person, for example the feature vector of Zhang San + A; $x_2$ is another vector composed of same-name-person features, for example Zhang San + α; $y$ indicates whether $(x_1, x_2)$ are the same person: $y = 1$ means they are the same person, and $y = 0$ means they are two different persons with the same name.
$E(x_1, x_2)$ is the ordinary cosine similarity formula, here giving the cosine similarity of the two transformed name vectors; $f(x_1)$ is the new vector obtained after $x_1$ is input to the twin network, and likewise $f(x_2)$ for $x_2$; $\langle f(x_1), f(x_2) \rangle$ is the inner product of the two vectors; $\lVert f(x_1) \rVert_2 \, \lVert f(x_2) \rVert_2$ is the product of the 2-norms of the vectors $f(x_1)$ and $f(x_2)$.
There are many same-name pairs $(x_1, x_2)$; the index $i$ distinguishes the individual pairs $(x_1, x_2, y)$ and takes values from 0 to the total number of same-name pairs minus 1.
[formula images: the two branches of the per-pair loss]
During training, the first branch is used if $y$ is 1 and the second branch is used if $y$ is 0.
FIG. 2 shows the architecture of the twin network with Bi-LSTM as the sub-network. The Bi-LSTM consists of two LSTM layers; each small square represents one LSTM unit, and the two layers are connected as shown by the arrows. From bottom to top, the same-name feature vectors $x_1$ and $x_2$ are taken as input, the twin network computes and outputs the vectors $f(x_1)$ and $f(x_2)$ respectively, and the cosine similarity of the two output vectors is then calculated, i.e. $\langle f(x_1), f(x_2) \rangle / (\lVert f(x_1) \rVert \, \lVert f(x_2) \rVert)$, where $\lVert \cdot \rVert$ denotes the length of a vector.

LSTM is a variant of the recurrent neural network (RNN). It inherits most characteristics of RNN models while addressing the vanishing gradient problem, in which gradients shrink step by step during back-propagation. Stacking two LSTMs allows the network to capture the features of the task better and achieve a better result.
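The defining property of the twin network described above is that both inputs pass through the same sub-network before their outputs are compared by cosine similarity. The sketch below illustrates only that weight sharing. `SharedEncoder` is a hypothetical toy stand-in (a single tanh layer with fixed weights), assumed here so the example runs without a deep learning framework; it is not a Bi-LSTM and is not the patent's model.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


class SharedEncoder:
    """Toy stand-in for the Bi-LSTM sub-network: one tanh layer with fixed weights."""

    def __init__(self, weights: List[List[float]]) -> None:
        self.weights = weights

    def encode(self, x: Vector) -> List[float]:
        # Each output component is tanh of one weight row dotted with the input.
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.weights]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Inner product of a and b divided by the product of their lengths."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_product = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    if norm_product == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm_product


def twin_similarity(encoder: SharedEncoder, x1: Vector, x2: Vector) -> float:
    # Both branches call the same encoder object, so they share every weight.
    return cosine_similarity(encoder.encode(x1), encoder.encode(x2))
```

Because a single encoder object serves both branches, swapping the two inputs leaves the similarity unchanged, which is the symmetry a twin network is meant to provide.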
Step 5: input the training data set into the twin network for training to obtain a trained model.
Step 6: predict with the trained model. The newly input data are compared with the representative of the same-name person; if they are the same, the new data are added to that person's data; if they are different, the new data are treated as a new same-name person.
Specifically, before step 6, for same-name persons that have not been labeled, a representative may be selected by the machine according to set rules.
That is, for a same-name person to be predicted, it is first determined whether the name belongs to the labeled names. If it does, step 6 is performed directly; if not, the machine selects a representative according to the set rules, and step 6 is then performed.
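The prediction flow of step 6 can be sketched as below. This is a minimal illustration under stated assumptions: `HomonymRegistry` and its methods are hypothetical names, the `threshold` used to turn a similarity score into a same/different decision is an assumption, and the "set rule" for an unlabeled name is assumed to be "the first record seen becomes the representative". The similarity function stands in for the trained model.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


class HomonymRegistry:
    """Groups records that share a name into distinct persons."""

    def __init__(self, similarity: Callable[[Vector, Vector], float], threshold: float = 0.5) -> None:
        self.similarity = similarity  # stand-in for the trained twin-network model
        self.threshold = threshold    # assumed decision threshold
        # name -> list of persons; each person is a list of records,
        # and the first record of each person acts as the representative.
        self.persons: Dict[str, List[List[Vector]]] = {}

    def add_record(self, name: str, features: Vector) -> int:
        """Assign a new record to a person with this name and return that person's index."""
        candidates = self.persons.setdefault(name, [])
        for index, records in enumerate(candidates):
            representative = records[0]
            if self.similarity(features, representative) >= self.threshold:
                # Same person as the representative: add to that person's data.
                records.append(features)
                return index
        # Differs from every representative (or the name is new):
        # treat the record as a new same-name person and make it the representative.
        candidates.append([features])
        return len(candidates) - 1
```

A record for a name with no stored representative simply becomes the first representative, which corresponds to the machine-selected representative for unlabeled names.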
Example 3
On the basis of Examples 1 and 2, the present application provides a system for identifying same-name persons among enterprises, comprising a data acquisition module, a data storage module and a data processing module, wherein the data acquisition module is in signal connection with the data storage module, and the data storage module is in signal connection with the data processing module;
the data acquisition module is used for acquiring the features of same-name persons, the feature-labeled data, the training data set and the twin network;
the data storage module is used for storing the data output by the data acquisition module and the data processing module;
the data processing module is used for inputting the training data set into the twin network for training to obtain a trained model, and for predicting with the trained model: newly input data are compared with the representative of the same-name person; if they are the same, the newly input data are added to that person's data; if they are different, the newly input data are treated as a new same-name person.
Example 4
On the basis of Examples 1-3, the present application provides an electronic device for identifying same-name persons among enterprises, comprising a processor and a memory, wherein the processor is connected to the memory and the memory stores program code; when the program code is executed by the processor, the processor performs the method of the present application to identify same-name persons across different enterprises.
Example 5
The present application provides a computer readable storage medium comprising program code means for causing an electronic device to carry out the steps of the method of the present application, when said program code means are run on said electronic device.
The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Those of ordinary skill in the art will appreciate that the various illustrative modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system may be implemented in other ways. The system embodiments described above are merely illustrative: the division into modules is only a logical division, and other divisions are possible in actual implementation. For example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope of the present invention shall be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for identifying homonyms among enterprises, characterized by comprising the following steps:
step 1, acquiring features for input;
step 2, obtaining feature-labeled data, wherein the labeled data at least comprise a person's name and the features corresponding to that name;
step 3, arranging the labeled data into a training data set, wherein, for same-name persons at different companies, a representative is selected from the data, and the representative is used for comparison with other persons of the same name;
step 4, preparing a twin network structure, wherein a sub-network of the twin network comprises a recurrent neural network;
step 5, inputting the training data set into the twin network for training to obtain a trained model;
step 6, predicting with the trained model: comparing newly input data with the representative of the same-name person; if they are the same, adding the newly input data to that person's data; if they are different, treating the newly input data as a new same-name person.
2. The method for identifying same-name persons among enterprises according to claim 1, wherein: in step 2, the feature-labeled data are acquired by manual labeling or by purchasing third-party data.
3. The method for identifying same-name persons among enterprises according to claim 1, wherein: the input form of the training data set in step 3 is a sample $(x_1, x_2, y)$, where y has a value in the range [0, 1] and $x_1, x_2$ are input feature vectors of same-name persons; a representative is selected for same-name persons at different companies; when a record of the same person is combined with the representative to form $x_1, x_2$, the label y is 1, and when a record of a different person with the same name is combined with the representative to form $x_1, x_2$, the label y is 0.
4. The method for identifying same-name persons among enterprises according to claim 3, wherein: in step 4, given a sample $(x_1, x_2, y)$, where y is in [0, 1], the cosine similarity is expressed as follows:

$$E(x_1, x_2) = \frac{\langle f(x_1), f(x_2) \rangle}{\lVert f(x_1) \rVert \, \lVert f(x_2) \rVert}$$

and the following loss function may be used:

[Figures: loss function expressions]

wherein $f(x_1)$ denotes the new vector obtained after $x_1$ is input into the twin network, $f(x_2)$ denotes the new vector obtained after $x_2$ is input into the twin network, $\langle f(x_1), f(x_2) \rangle$ denotes the inner product of the two vectors, and $\lVert f(x_1) \rVert \, \lVert f(x_2) \rVert$ denotes the product of the 2-norms of $f(x_1)$ and $f(x_2)$;
the index $(i)$ expresses that there are many same-name pairs $(x_1, x_2, y)$, each pair being distinguished by its index (i), where i runs from 0 to the total number of same-name pairs minus 1;
m is the similarity strictness, used to adjust how strictly the two vectors must be similar: the larger m is set, the higher the cosine similarity required of the two vectors; the value range of m is (0, 1);

[Figures: expressions of the two per-pair loss terms]

during training, the first term is used if y is 1, and the second term is used if y is 0.
5. The method for identifying the same-name person among the enterprises according to claim 1, wherein: the sub-network of the twin network adopted in the step 4 is a Bi-LSTM network structure.
6. The method for identifying same-name persons among enterprises according to claim 1, wherein: before step 6, for same-name persons that have not been labeled, a representative is selected by the machine according to set rules.
7. The method for identifying same-name persons among enterprises according to claim 6, wherein: for a same-name person to be predicted, it is first determined whether the name belongs to the labeled names; if so, step 6 is performed; if not, the machine selects a representative according to the set rules, and step 6 is then performed.
8. A system for identifying same-name persons among enterprises, characterized by comprising a data acquisition module, a data storage module and a data processing module, wherein the data acquisition module is in signal connection with the data storage module, and the data storage module is in signal connection with the data processing module;
the data acquisition module is used for acquiring the features of same-name persons, the feature-labeled data, the training data set and the twin network;
the data storage module is used for storing the data output by the data acquisition module and the data processing module;
the data processing module is used for inputting the training data set into the twin network for training to obtain a trained model, and for predicting with the trained model: newly input data are compared with the representative of the same-name person; if they are the same, the newly input data are added to that person's data; if they are different, the newly input data are treated as a new same-name person.
9. An electronic device for identifying same-name persons among enterprises, characterized by comprising a processor and a memory, the processor being coupled to the memory, the memory storing program code which, when executed by the processor, causes the processor to perform the method of any of claims 1 to 7 to identify same-name persons across different enterprises.
10. A computer-readable storage medium, characterized by storing program code which, when run on an electronic device, causes the electronic device to carry out the steps of the method as claimed in any of claims 1-5.
CN202011502898.4A 2020-12-18 2020-12-18 Method, system, electronic device and storage medium for identifying homonyms among enterprises Pending CN112487819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011502898.4A CN112487819A (en) 2020-12-18 2020-12-18 Method, system, electronic device and storage medium for identifying homonyms among enterprises

Publications (1)

Publication Number Publication Date
CN112487819A true CN112487819A (en) 2021-03-12

Family

ID=74914108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011502898.4A Pending CN112487819A (en) 2020-12-18 2020-12-18 Method, system, electronic device and storage medium for identifying homonyms among enterprises

Country Status (1)

Country Link
CN (1) CN112487819A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846426A (en) * 2018-05-30 2018-11-20 西安电子科技大学 Polarization SAR classification method based on the twin network of the two-way LSTM of depth
CN110275959A (en) * 2019-05-22 2019-09-24 广东工业大学 A Fast Learning Method for Large-Scale Knowledge Base
CN110427406A (en) * 2019-08-10 2019-11-08 吴诚诚 The method for digging and device of organization's related personnel's relationship
CN110472065A (en) * 2019-07-25 2019-11-19 电子科技大学 Across linguistry map entity alignment schemes based on the twin network of GCN
CN111444731A (en) * 2020-06-15 2020-07-24 深圳市友杰智新科技有限公司 Model training method and device and computer equipment
CN111652667A (en) * 2019-12-31 2020-09-11 成都数联铭品科技有限公司 Method for aligning entity data of main related natural persons of enterprise

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NECULOIU PAUL et al.: "Learning text similarity with siamese recurrent networks", Proceedings of the 1st Workshop on Representation Learning for NLP, pages 148 - 157 *
SUN HE et al.: "Research on classification of tooth profile image edge distortion based on an improved twin support vector machine", Acta Photonica Sinica, vol. 49, no. 10, pages 185 - 197 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113269244A (en) * 2021-05-18 2021-08-17 上海睿翎法律咨询服务有限公司 Disambiguation processing method, system, device, processor and storage medium thereof aiming at cross-enterprise personnel rename in business and commerce registration information
CN113326377A (en) * 2021-06-02 2021-08-31 上海生腾数据科技有限公司 Name disambiguation method and system based on enterprise incidence relation
CN113326377B (en) * 2021-06-02 2023-10-13 上海生腾数据科技有限公司 Name disambiguation method and system based on enterprise association relationship
CN114861786A (en) * 2022-04-27 2022-08-05 河南天眼查科技有限公司 Two-classification model training method, enterprise pair-based method and device for classifying persons with the same name

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210312
