CN109697285B

CN109697285B - Hierarchical BiLSTM Chinese electronic medical record disease coding and labeling method with enhanced semantic representation

Info

Publication number: CN109697285B
Application number: CN201811523661.7A
Authority: CN
Inventors: 王建新; 余颖; 李敏
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2018-12-13
Filing date: 2018-12-13
Publication date: 2022-06-21
Anticipated expiration: 2038-12-13
Also published as: CN109697285A

Abstract

The invention discloses a hierarchical BiLSTM method with enhanced semantic representation for disease coding and labeling of Chinese electronic medical records. The input electronic medical record text is first preprocessed. Because a single Chinese character carries specific semantics within a word, a BiLSTM combined with an attention mechanism is used to extract a character-level feature vector for each word, capturing the semantic and word-formation features of individual characters. This character-level vector is concatenated with the word-level vector obtained by word2vec training to give a character-enhanced word vector. With these word vectors as input, a second BiLSTM learns the contextual features of the entire record, and an attention mechanism computes the contribution of each feature word, yielding a context-weighted text vector that improves prediction. The method is suited to disease label classification based on Chinese electronic medical record text and effectively improves classification performance.

Description

A Hierarchical BiLSTM Chinese Electronic Medical Record Disease Coding and Annotation Method with Enhanced Semantic Representation

Technical field

The invention relates to the field of medical informatics, and in particular to a hierarchical BiLSTM method with enhanced semantic representation for disease coding and labeling of Chinese electronic medical records.

Background

Electronic health records (EHRs, hereinafter electronic medical records) have become one of the important data resources for clinical research. They store the information generated during a patient's treatment as digital data, which makes it convenient to analyze and process clinical data by computer. An electronic medical record needs a unified labeling standard describing the patient's disease status, so that patient information can be classified sensibly to support clinical decision-making. The International Classification of Diseases (ICD), published and continuously updated by the World Health Organization, is an internationally used disease coding scheme. It is often used to label clinical records, identifying symptoms, signs, diseases, abnormal findings, or procedures. At present, the 10th revision of ICD is widely used in hospital information systems in China.

Annotating electronic medical records with ICD codes is an important and basic task in making use of them. Missing diagnosis names and ICD codes hinder the analysis of clinical data. Usually, ICD codes are assigned manually by staff in each hospital's medical record department, based on the clinical diagnosis description given by the doctor. Manual coding requires coders to master medical knowledge, coding rules, and medical terminology, and it is time-consuming and laborious. Automatic coding by computer can therefore provide effective assistance and improve the efficiency of ICD code annotation.

At present, most automatic disease coding work is based on clinical text data such as radiology reports, death certificates, and discharge summaries. However, the great majority of this research focuses on English corpora. Work on disease code prediction for Chinese clinical text is scarce, and the main approach is semantic string matching based on diagnosis names. Semantic similarity comparison places high demands on the quality of the diagnosis name description, and automatic coding is impossible when the diagnosis name is missing. No related research has yet applied neural network models to the disease coding task for Chinese electronic medical records.

Processing Chinese electronic medical record text has two distinctive features. First, the text is long, and contextual information is hard to capture in long text. Second, unlike English, a single Chinese character carries meaning of its own; in medical terminology especially, concepts such as orientation and body parts are often described by a single character. A semantic representation that includes character features can therefore express the semantics of words better.

Summary of the invention

The technical problem to be solved by the invention is to address the shortcomings of the prior art by providing a hierarchical BiLSTM method with enhanced semantic representation for disease coding and labeling of Chinese electronic medical records, which completes automatic labeling end to end and improves prediction.

To solve the above technical problem, the invention adopts the following technical scheme:

A hierarchical BiLSTM method with enhanced semantic representation for disease coding and labeling of Chinese electronic medical records, comprising the following steps:

1) Segment the text with a Chinese word segmentation tool supplemented by a user-defined dictionary of clinical medical terms, remove stop words, and select feature words according to word frequency;

2) Build character-level and word-level vector representations of the feature words, and concatenate the character-level vector with the word-level vector to construct a character-enhanced feature vector for each word;

3) Use the concatenated feature word vectors to obtain the contextual features of the entire text, and apply an attention mechanism to compute the contribution of each feature word, obtaining a context-weighted vector representation of the entire text.

In step 1), the feature words are selected according to a word-frequency rule [formula omitted in source], whose terms are the feature word set $S_{fw}$, the frequency of word $w_i$, and the total number of electronic medical record samples $N_d$.

In step 2), a bidirectional LSTM combined with an attention mechanism is trained to produce the character-level feature vector of each feature word, and word2vec, a word vector method based on distributed word representation, is used to obtain the word-level vector of each feature word.

The output of the bidirectional long short-term memory network is:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

where $\overrightarrow{h_t}$ is the hidden-layer output of the forward LSTM at the t-th unit (time step t), and $\overleftarrow{h_t}$ is the hidden-layer output of the backward LSTM at the t-th unit.

The attention mechanism is calculated as:

$$u_{ij} = \tanh(W_c h_{ij} + b_c);$$

$$\alpha_{ij} = \frac{\exp(u_{ij}^{\top} u_c)}{\sum_k \exp(u_{ik}^{\top} u_c)};$$

where $h_{ij}$ is the hidden-layer output of the BiLSTM for the j-th character of the i-th word, $W_c$ is a weight matrix, $b_c$ is a bias vector, $u_c$ is a randomly initialized character-level context feature vector, and $\alpha_{ij}$ is the weight of the j-th character for the i-th word, computed with the softmax function. The weighted sum $\sum_j \alpha_{ij} h_{ij}$ is the context-weighted feature vector representation of the i-th word.

In step 3), the context-weighted feature vector of the entire text is computed as follows: the text, represented by the concatenated feature word vectors, is fed into a second-layer bidirectional long short-term memory network to learn the contextual features of the entire text; an attention mechanism then computes the weight of each feature word, yielding a text feature vector weighted by contextual information.

The attention mechanism is calculated as:

$$u_i = \tanh(W h_i + b_w);$$

$$\alpha_i = \frac{\exp(u_i^{\top} u_w)}{\sum_k \exp(u_k^{\top} u_w)};$$

$$v = \sum_i \alpha_i h_i;$$

where $h_i$ is the hidden-layer output obtained after the character-enhanced feature vector of the i-th word in the text sequence passes through the BiLSTM, $W$ is a weight matrix, and $b_w$ is a bias vector. When the attention mechanism is applied, a word-level document context feature vector $u_w$ is introduced and randomly initialized to compute the weights. $\alpha_i$ is the weight of each word, and $v$ is the context-weighted feature vector representation of the entire text. This vector is fed into a fully connected layer, and a sigmoid function computes the probability of occurrence of each disease code.
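The final classification step (fully connected layer followed by a sigmoid per disease code) can be illustrated with a minimal pure-Python sketch. The function name `predict_codes`, the example weights, the label codes, and the 0.5 decision threshold are all assumptions for illustration; the source states only that a sigmoid yields a probability for each code.

```python
import math

def predict_codes(v, W_out, b_out, labels, threshold=0.5):
    """Map a text vector v to per-code probabilities (multi-label).

    W_out has one row per disease code. The threshold is a hypothetical
    choice; the source does not say how probabilities become labels.
    """
    probs = {}
    for label, row, bias in zip(labels, W_out, b_out):
        logit = sum(w * x for w, x in zip(row, v)) + bias  # fully connected layer
        probs[label] = 1.0 / (1.0 + math.exp(-logit))      # independent sigmoid per code
    chosen = [label for label, p in probs.items() if p >= threshold]
    return probs, chosen

# Toy example with illustrative values only.
v = [1.0, -1.0]
W_out = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
b_out = [0.0, 0.0, -1.0]
labels = ["I10", "E11", "J18"]
probs, chosen = predict_codes(v, W_out, b_out, labels)
```

Because each code gets its own sigmoid rather than sharing a softmax, several codes can be predicted for one record, which matches the multi-label nature of the task.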

Compared with the prior art, the invention has the following beneficial effects. Addressing the characteristics of the Chinese language, it integrates the semantic features of individual Chinese characters into the feature vector representation of words and, combined with the attention mechanism, weights the feature words in the input sequence that genuinely contribute, improving disease code prediction. The method is suited to Chinese clinical text data, uses a neural network model to extract text features automatically, and completes automatic labeling end to end.

Description of drawings

Fig. 1 is the flow chart of the invention;

Fig. 2 shows the hierarchical BiLSTM feature learning model combined with the attention mechanism;

Fig. 3 shows the calculation of the attention mechanism: (a) transform $h_{ij}$ into $u_{ij}$; (b) compute the weight of each $u_{ij}$ using the context feature vector; (c) take the weighted sum of $h_{ij}$ to obtain the feature vector with attention applied;

Fig. 4 shows the experimental results of an implementation of the invention.

Detailed description

1. Preprocessing of clinical text data

The input discharge summary text is segmented with the Chinese word segmentation tool "Jieba" and a user-defined medical lexicon. Stop words are removed, the frequencies of the remaining valid words are counted, the words are sorted by frequency in descending order, and feature words are selected according to a word-frequency rule [formula omitted in source], whose terms are the feature word set $S_{fw}$, the frequency of word $w_i$, and the total number of electronic medical records $N_d$.
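The preprocessing step can be sketched roughly as follows. This is only an illustration under stated assumptions: the documents are taken to be already segmented (the segmentation call itself is left out), and the selection threshold `min_ratio` on frequency divided by $N_d$ is hypothetical, since the actual selection formula is not available in this text.

```python
from collections import Counter

def select_feature_words(docs, stop_words, min_ratio=0.001):
    """Pick feature words from pre-segmented documents.

    docs: list of token lists, one per medical record (N_d = len(docs)).
    min_ratio: hypothetical threshold on frequency / N_d; an assumption,
    not the rule used in the source.
    """
    n_docs = len(docs)
    counts = Counter(
        token for doc in docs for token in doc if token not in stop_words
    )
    # Sort by frequency, largest first; ties broken by the word itself.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, freq in ranked if freq / n_docs >= min_ratio]

# Toy example with three tiny "records".
docs = [
    ["患者", "高血压", "的", "病史"],
    ["高血压", "糖尿病", "的"],
    ["患者", "咳嗽"],
]
feature_words = select_feature_words(docs, stop_words={"的"}, min_ratio=0.5)
```

In this toy run only the two words that occur twice across the three records pass the assumed threshold.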

2. Word vector representation of feature words

1) Character-based word vector representation

First, a vector representation is initialized for each character. The character vectors are then fed into the BiLSTM combined with the attention mechanism, which is trained to produce the character-level word vector of each feature word. The state value $c_t$ and output value $h_t$ of each neural unit in the BiLSTM are computed as follows ($t = 1, 2, \ldots, n$, where t denotes the t-th neural unit in the network, or the unit at time step t):

$$i_t = \mathrm{sigmoid}(W_i [x_t; h_{t-1}] + b_i) \tag{1}$$

$$f_t = \mathrm{sigmoid}(W_f [x_t; h_{t-1}] + b_f) \tag{2}$$

$$g_t = \tanh(W_g [x_t; h_{t-1}] + b_g) \tag{3}$$

$$o_t = \mathrm{sigmoid}(W_o [x_t; h_{t-1}] + b_o) \tag{4}$$

$$c_t = f_t * c_{t-1} + i_t * g_t \tag{5}$$

$$h_t = o_t * \tanh(c_t) \tag{6}$$

Each neural unit contains an input gate i, an output gate o, a forget gate f, a storage unit g, a state-saving cell c, and a hidden state h, all of which are vectors. $W_i, W_f, W_g, W_o$ are weight matrices and $b_i, b_f, b_g, b_o$ are bias vectors; ";" denotes concatenation and "*" denotes element-wise multiplication. The sigmoid function is computed as

$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

the tanh function is computed as

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

and the output of the BiLSTM is

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the hidden-layer outputs of the forward and backward LSTM at the t-th unit.
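Equations (1) to (6) and the bidirectional output can be traced in a small pure-Python sketch. It is a forward pass only, with no training; the helper names (`init_params`, `lstm_step`, `run_lstm`, `bilstm`), the parameter layout, and the random initialization range are assumptions made for illustration, not details taken from the source.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, z):
    """Return W z + b for a list-of-rows matrix W."""
    return [sum(w * zi for w, zi in zip(row, z)) + bi for row, bi in zip(W, b)]

def init_params(input_dim, hidden, seed):
    """Hypothetical initializer: small random weights, zero biases."""
    rng = random.Random(seed)
    p = {}
    for gate in "ifgo":
        p["W_" + gate] = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden)]
                          for _ in range(hidden)]
        p["b_" + gate] = [0.0] * hidden
    return p

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM unit, following equations (1)-(6)."""
    z = x_t + h_prev  # list concatenation, i.e. [x_t; h_{t-1}]
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]    # (1)
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]    # (2)
    g = [math.tanh(a) for a in affine(p["W_g"], p["b_g"], z)]  # (3)
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]    # (4)
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]  # (5)
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]           # (6)
    return h, c

def run_lstm(xs, p, hidden):
    """Run one direction over the sequence, starting from zero states."""
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return outputs

def bilstm(xs, p_fwd, p_bwd, hidden):
    """Concatenate forward and backward hidden states at each position."""
    fwd = run_lstm(xs, p_fwd, hidden)
    bwd = run_lstm(xs[::-1], p_bwd, hidden)[::-1]  # re-align to original order
    return [hf + hb for hf, hb in zip(fwd, bwd)]

# Two time steps of 3-dimensional input, hidden size 2 per direction.
xs = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
states = bilstm(xs, init_params(3, 2, seed=0), init_params(3, 2, seed=1), hidden=2)
```

Each element of `states` has twice the hidden size, since it joins the forward and backward outputs for that position.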

2) Application of the attention mechanism

The attention mechanism is calculated as:

$$u_{ij} = \tanh(W_c h_{ij} + b_c) \tag{7}$$

$$\alpha_{ij} = \frac{\exp(u_{ij}^{\top} u_c)}{\sum_k \exp(u_{ik}^{\top} u_c)} \tag{8}$$

$$\sum_j \alpha_{ij} h_{ij} \tag{9}$$

where $h_{ij}$ is the hidden-layer output of the BiLSTM for the j-th character of the i-th word, $W_c$ is a weight matrix, $b_c$ is a bias vector, $u_c$ is a randomly initialized character-level context feature vector, and $\alpha_{ij}$ is the weight of the j-th character for the i-th word, computed with the softmax function. The weighted sum in (9) is the context-weighted feature vector representation of the i-th word.
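The attention computation described here (a tanh projection, softmax weights against a context vector, then a weighted sum of hidden states) can be sketched in pure Python. The function name `attention_pool` and the toy values are illustrative assumptions; the dot product between the projected state and the context vector follows the usual form of this kind of attention and is assumed here.

```python
import math

def attention_pool(hs, W, b, context):
    """Attention-weighted sum of hidden states.

    hs: list of hidden-state vectors (one per character or word).
    W, b: projection weights and bias; context: the context vector.
    Returns the pooled vector and the attention weights.
    """
    scores = []
    for h in hs:
        # u = tanh(W h + b)
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + bi)
             for row, bi in zip(W, b)]
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    # Softmax over positions, shifted by the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the original hidden states.
    pooled = [sum(a * h[d] for a, h in zip(alphas, hs)) for d in range(len(hs[0]))]
    return pooled, alphas

# Toy example: two hidden states, identity projection.
hs = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
context = [1.0, 0.0]
pooled, alphas = attention_pool(hs, W, b, context)
```

With this context vector the first state scores higher, so it receives the larger weight. The same routine applies at the word level by passing word-level hidden states and a word-level context vector.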

3) The character-level word vector obtained by training is concatenated with the word vector generated by word2vec, giving a word feature vector enhanced with character-level contextual features.
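As a small illustration of the concatenation step, the sketch below joins a character-level vector and a word-level vector for each word of a document. The lookup tables, the vector values, the dimensions, and the order of concatenation (character-level first) are all assumptions; in practice the two vectors would come from the trained character-level BiLSTM and from word2vec.

```python
def build_document_matrix(words, char_vectors, word_vectors):
    """Return one character-enhanced vector per word by concatenation."""
    return [char_vectors[w] + word_vectors[w] for w in words]

# Hypothetical lookup tables with 2-d character-level and 3-d word-level vectors.
char_vectors = {"高血压": [0.1, 0.2], "患者": [0.3, 0.4]}
word_vectors = {"高血压": [0.5, 0.6, 0.7], "患者": [0.8, 0.9, 1.0]}
matrix = build_document_matrix(["患者", "高血压"], char_vectors, word_vectors)
```

Each row of `matrix` has the combined dimension (here 2 + 3 = 5) and serves as one input step for the second-layer BiLSTM.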

3. Context feature extraction

The sequence of character-enhanced feature vectors is fed into the second-layer BiLSTM combined with the attention mechanism to extract the contextual features of the text. The computation of the BiLSTM neural units and of the context-feature weighting is the same as for the character-level word vector representation. The specific formulas are:

$$u_i = \tanh(W h_i + b_w) \tag{10}$$

$$\alpha_i = \frac{\exp(u_i^{\top} u_w)}{\sum_k \exp(u_k^{\top} u_w)} \tag{11}$$

$$v = \sum_i \alpha_i h_i \tag{12}$$

4. Experimental verification

1) Experimental procedure

To verify the effectiveness of the method, we ran experiments on real clinical data from Chinese electronic medical records. The dataset contains 7732 discharge records covering 1177 ICD-10 disease code labels. An ICD-10 code is a dotted six-character code made up of letters and digits; it begins with a letter, and its first three characters form the first-level code, which indicates the disease category. The discharge summaries average 610 words in length, and each corresponds to 3.6 disease codes on average.
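The relationship between a full code and its first-level code can be shown with a short sketch. The function name and the example codes are illustrative assumptions and are not taken from the dataset.

```python
def first_level_code(icd10_code):
    """Return the first three characters, i.e. the disease category."""
    return icd10_code[:3]

# Illustrative full codes; two of them share the same category.
full_codes = ["E11.900", "I10.x00", "E11.200"]
categories = sorted({first_level_code(code) for code in full_codes})
```

Mapping full codes to categories in this way shrinks the label space, which is how a set of full-code labels collapses into a smaller set of first-level labels.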

The experiments were run on a server with 256 GB of memory and an NVIDIA GeForce Titan X Pascal CUDA GPU. The dataset was split into training and test sets at a 9:1 ratio, and validation was repeated over ten random shuffles of the data. The evaluation metrics were micro-averaged precision (P), recall (R), and their combined F1 score, together with Hamming loss, which evaluates false alarms from a per-sample perspective. A higher F1 and a lower Hamming loss indicate better model performance.
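For reference, the four metrics can be computed from label sets as in the sketch below. It uses the standard definitions of micro-averaged precision, recall, F1, and Hamming loss; the function name and the toy labels are assumptions for illustration, and the source does not describe its own evaluation code.

```python
def micro_metrics(y_true, y_pred, n_labels):
    """Micro-averaged P, R, F1 and Hamming loss for multi-label output.

    y_true, y_pred: lists of label sets, one pair per sample.
    n_labels: size of the full label space.
    """
    tp = fp = fn = 0
    for true, pred in zip(y_true, y_pred):
        tp += len(true & pred)   # codes predicted and correct
        fp += len(pred - true)   # codes predicted but wrong
        fn += len(true - pred)   # codes missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Fraction of (sample, label) decisions that are wrong.
    hamming = (fp + fn) / (len(y_true) * n_labels)
    return precision, recall, f1, hamming

# Toy example: two samples over a 4-code label space.
y_true = [{"I10", "E11"}, {"J18"}]
y_pred = [{"I10"}, {"J18", "E11"}]
p, r, f1, h_loss = micro_metrics(y_true, y_pred, n_labels=4)
```

Because Hamming loss divides by the whole label space, it becomes very small when there are many labels and few codes per record, so it is best read alongside F1 rather than alone.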

2) Experimental results

Since related work has already shown that deep learning methods outperform traditional machine learning methods, we compared mainly against other common neural network models. The results are shown in Table 1, where MA-BiLSTM denotes our model and D2V+CNN is a method from related work that achieved the best reported results at the time on the public English dataset MIMIC III. The results show that MA-BiLSTM outperforms the other neural network models on every metric, indicating that a BiLSTM combined with the attention mechanism effectively captures the contextual features of long text and improves prediction.

Table 1. Comparative experimental results

| Model | Micro_P (CI: 95%) | Micro_R (CI: 95%) | Micro_F1 (CI: 95%) | hLoss (CI: 95%) |
|---|---|---|---|---|
| CBOW | 0.614 (±6.43e-03) | 0.522 (±5.30e-03) | 0.564 (±4.52e-03) | 0.00248 (±3.14e-05) |
| CNN | 0.647 (±6.67e-03) | 0.509 (±6.51e-03) | 0.569 (±4.71e-03) | 0.00237 (±3.52e-05) |
| D2V+CNN | 0.661 (±9.57e-03) | 0.514 (±8.74e-03) | 0.579 (±7.14e-03) | 0.00231 (±3.70e-05) |
| MA-BiLSTM | 0.704 (±1.13e-02) | 0.586 (±5.84e-03) | 0.639 (±4.45e-03) | 0.00204 (±3.47e-05) |

To analyze the role of each module of the model, we designed ablation experiments; the results are shown in Table 2. When the words in the text are represented by word vectors alone or by character vectors alone, prediction performance drops, so the character-enhanced word vector does give a better text feature representation. The attention mechanism plays an important role in the model: removing it causes a marked drop in performance.

Predictions were made on both full ICD-10 codes and first-level codes; the 7732 samples correspond to 488 first-level codes. The results are shown in Fig. 4. Prediction on first-level codes reached a precision of 80.5%, which can effectively assist medical record staff in disease code annotation.

Table 2. Results of the model ablation experiments

Claims

1. A hierarchical BiLSTM method with enhanced semantic representation for disease coding and labeling of Chinese electronic medical records, characterized by comprising the following steps:

1) using a Chinese word segmentation tool, with a user-defined dictionary of clinical medical terms, to segment the discharge summary text, removing stop words, and selecting feature words according to word frequency;

2) building character-level and word-level vector representations of the feature words, and concatenating the character-level vectors with the word-level vectors to construct a character-enhanced feature vector representation of the words; wherein a BiLSTM combined with an attention mechanism is trained to produce the character-level feature vector representation of each feature word, and word2vec, a word vector method based on distributed word representation, is used to obtain the word-level vector representation of each feature word;

3) obtaining a word vector sequence for the whole text from the concatenated feature words, and computing the contribution of each feature word with an attention mechanism to obtain a context-weighted vector representation of the whole text; that is, inputting the text represented by the concatenated feature word vectors into a second-layer bidirectional long short-term memory network to learn the contextual features of the whole text, and computing the weight of each feature word with the attention mechanism to obtain a text feature vector weighted by contextual information;

the attention mechanism being calculated as follows:

$$u_i = \tanh(W h_i + b_w);$$

$$\alpha_i = \frac{\exp(u_i^{\top} u_w)}{\sum_k \exp(u_k^{\top} u_w)};$$

$$v = \sum_i \alpha_i h_i;$$

wherein $h_i$ is the hidden-layer output obtained after the character-enhanced feature vector of the i-th word of the text sequence passes through the BiLSTM, $W$ is a weight matrix, and $b_w$ is a bias vector; when the attention mechanism is applied, a word-level document context feature vector $u_w$ is introduced and randomly initialized to compute the weights; $\alpha_i$ is the weight corresponding to each word, and $v$ is the context-weighted feature vector representation of the whole text; this vector is input into a fully connected layer, and the probability of occurrence of each disease code is calculated by a sigmoid function.

2. The hierarchical BiLSTM method with enhanced semantic representation for disease coding and labeling of Chinese electronic medical records according to claim 1, wherein in step 1) the feature words are selected according to a word-frequency rule [formula omitted in source], whose terms are the feature word set $S_{fw}$, the frequency of word $w_i$, and the total number of electronic medical record samples $N_d$.

3. The hierarchical BiLSTM method with enhanced semantic representation for disease coding and labeling of Chinese electronic medical records according to claim 1, wherein the output of the BiLSTM is:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

wherein $\overrightarrow{h_t}$ is the hidden-layer output of the forward LSTM at the t-th unit (time step t), and $\overleftarrow{h_t}$ is the hidden-layer output of the backward LSTM at the t-th unit.

4. The hierarchical BiLSTM method with enhanced semantic representation for disease coding and labeling of Chinese electronic medical records according to claim 1, wherein in step 2) the attention mechanism is calculated as follows:

$$u_{ij} = \tanh(W_c h_{ij} + b_c);$$

$$\alpha_{ij} = \frac{\exp(u_{ij}^{\top} u_c)}{\sum_k \exp(u_{ik}^{\top} u_c)};$$

wherein $h_{ij}$ is the hidden-layer output of the BiLSTM for the j-th character of the i-th word, $W_c$ is a weight matrix, $b_c$ is a bias vector, $u_c$ is a randomly initialized character-level context feature vector, and $\alpha_{ij}$ is the weight of the j-th character for the i-th word, calculated by the softmax function; the weighted sum $\sum_j \alpha_{ij} h_{ij}$ is the context-weighted feature vector representation of the i-th word.