CN109697285B - Hierarchical BiLSTM method with enhanced semantic representation for disease-code labeling of Chinese electronic medical records - Google Patents
- Publication number
- CN109697285B (application CN201811523661.7A)
- Authority
- CN
- China
- Prior art keywords
- word
- feature
- character
- bilstm
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Medical Informatics (AREA)
- Primary Health Care (AREA)
- Public Health (AREA)
- Epidemiology (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention discloses a hierarchical BiLSTM method with enhanced semantic representation for disease-code labeling of Chinese electronic medical records. After the input electronic medical record text is preprocessed, the method takes into account that individual Chinese characters carry specific semantics within Chinese words: a BiLSTM with an attention mechanism extracts a character-level feature vector representation that captures the semantic and word-formation features of individual characters. This character-level word vector is concatenated with the word-level vector obtained by word2vec training, yielding a character-feature-enhanced word vector. The text sequence represented by these feature word vectors is then fed to a second BiLSTM that learns contextual features across the whole electronic medical record, and an attention mechanism computes the contribution of each feature word to produce a context-weighted text vector representation, which improves prediction. The method is suited to disease label classification based on Chinese electronic medical record text and effectively improves classification performance.
Description
Technical Field
The invention relates to the field of medical informatics, and in particular to a hierarchical BiLSTM method with enhanced semantic representation for disease-code labeling of Chinese electronic medical records.
Background
Electronic health records (EHRs, hereafter electronic medical records) have become one of the important data resources for clinical medical research. They store the information generated during a patient's medical care as digital data, which makes it convenient to analyze and process clinical data by computer. An electronic medical record needs a unified labeling standard describing the patient's disease status, so that patient information can be classified sensibly to support clinical decision-making. The International Classification of Diseases (ICD), published and continuously updated by the World Health Organization, is an internationally used disease coding scheme. It is commonly used to label clinical records and identifies symptoms, signs, diseases, abnormal findings, procedures, and so on. At present, the newly revised 10th edition of the ICD is widely used in hospital information systems in China.
Labeling electronic medical records with ICD codes is an important and basic task in making use of them. Missing diagnosis names and ICD codes in electronic medical records hinder the analysis of clinical data. ICD coding is usually done manually by staff in each hospital's medical records department, based on the clinical diagnosis descriptions given by physicians. Manual coding requires coders to master medical knowledge, coding rules, and medical terminology, and it is time-consuming and labor-intensive. Automatic coding by computer can therefore provide effective assistance and improve the efficiency of ICD code labeling.
Most current work on automatic disease-code labeling is based on clinical text data such as radiology reports, death certificates, and discharge summaries. However, the vast majority of this research focuses on English corpora. Work on disease-code prediction for Chinese clinical text is scarce, and the main approach is semantic string matching based on diagnosis names. Semantic similarity comparison places high demands on the quality of the diagnosis name descriptions, and automatic coding is impossible when the diagnosis name is missing. No related work has yet applied neural network models to disease-code labeling of Chinese electronic medical records.
Processing Chinese electronic medical record text has two characteristics. First, the records are long, and contextual information is hard to capture in long text. Second, unlike English letters, a single Chinese character carries meaning; in medical terminology in particular, notions such as orientation and body parts are often described by a single character. A semantic representation that includes character features can therefore express the meaning of a word better.
Summary of the Invention
The technical problem addressed by the invention is to overcome the shortcomings of the prior art by providing a hierarchical BiLSTM method with enhanced semantic representation for disease-code labeling of Chinese electronic medical records, which completes automatic labeling end to end and improves prediction performance.
To solve this technical problem, the invention adopts the following technical solution:
A hierarchical BiLSTM method with enhanced semantic representation for disease-code labeling of Chinese electronic medical records, comprising the following steps:
1) Segment the text with a Chinese word segmentation tool, using a user-defined dictionary of clinical medical terms; remove stop words; and select feature words according to word frequency;
2) Represent each feature word as vectors at the character level and at the word level, and concatenate the character-level vector with the word-level vector to build the character-enhanced feature vector of the word;
3) Use the concatenated feature word vectors to obtain the contextual features of the whole text, and apply an attention mechanism to compute the contribution of each feature word, yielding a context-weighted vector representation of the whole text.
In step 1), the feature words are selected by a word-frequency rule defined in terms of the feature word set $S_{fw}$, the frequency of each word $w_i$, and the total number of electronic medical record samples $N_d$.
In step 2), a bidirectional LSTM with an attention mechanism is trained to produce the character-level feature vector of each feature word, and word2vec, a word vector method based on distributed word representations, produces the word-level vector of each feature word.
The output of the bidirectional long short-term memory network at the t-th unit (time t) is formed from the hidden-layer output of the forward LSTM and the hidden-layer output of the backward LSTM at that unit.
The attention mechanism is computed as:
$u_{ij} = \tanh(W_c h_{ij} + b_c)$;
where $h_{ij}$ is the hidden-layer output of the BiLSTM for the j-th character of the i-th word, $W_c$ is a weight matrix, $b_c$ is a bias vector, and $u_c$ is a randomly initialized character-level context feature vector. $\alpha_{ij}$ is the weight of the j-th character for the i-th word, computed with the softmax function, and the weighted combination of the $h_{ij}$ gives the context-weighted feature vector of the i-th word.
In step 3), the context-weighted vector of the whole text is computed as follows: the text, represented by the concatenated feature word vectors, is fed into a second-layer bidirectional long short-term memory network that learns the contextual features of the whole text; an attention mechanism then computes the weight of each feature word, yielding a text feature vector weighted by contextual information.
The attention mechanism is computed as:
$u_i = \tanh(W h_i + b_w)$;
$v = \sum_i \alpha_i h_i$;
where $h_i$ is the hidden-layer output of the BiLSTM for the character-enhanced feature vector of the i-th word in the text sequence, $W$ is a weight matrix, and $b_w$ is a bias vector. When the attention mechanism is applied, a word-level document context feature vector $u_w$ is introduced and randomly initialized to compute the weights. $\alpha_i$ is the weight of each word and $v$ is the context-weighted feature vector of the whole text. This vector is fed into a fully connected layer, and a sigmoid function gives the probability of each disease code.
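The softmax step that produces the attention weights is not preserved in this text. A plausible reconstruction, assuming each weight comes from a softmax over the dot product between the transformed hidden state and the corresponding context vector (a common form of hierarchical attention, not confirmed by the source), is:

```latex
\alpha_{ij} = \frac{\exp\left(u_{ij}^{\top} u_c\right)}{\sum_{k} \exp\left(u_{ik}^{\top} u_c\right)},
\qquad
\alpha_{i} = \frac{\exp\left(u_{i}^{\top} u_w\right)}{\sum_{k} \exp\left(u_{k}^{\top} u_w\right)}
```

Under this assumption the weights at each level are positive and sum to 1, so $v = \sum_i \alpha_i h_i$ is a convex combination of the hidden states.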
Compared with the prior art, the invention has the following beneficial effects. Addressing the characteristics of the Chinese language, it incorporates the semantic features of individual characters into the feature vector of each word and, by means of the attention mechanism, weights the feature words in the input sequence that actually contribute, which improves disease-code prediction. The method is suited to Chinese clinical text data, uses a neural network model to extract text features automatically, and completes automatic labeling end to end.
Description of the Drawings
Fig. 1 is a flow chart of the invention;
Fig. 2 shows the hierarchical BiLSTM feature learning model with attention mechanism;
Fig. 3 shows the computation of the attention mechanism: (a) transforming $h_{ij}$ into $u_{ij}$; (b) using the context feature vector to compute the weight of each $u_{ij}$; (c) taking the weighted sum of the $h_{ij}$ to obtain the attention-based feature vector;
Fig. 4 shows the experimental results of an embodiment of the invention.
Detailed Description
1. Preprocessing of clinical text data
The input discharge summary text is segmented with the Chinese word segmentation tool Jieba and a user-defined medical dictionary. Stop words are removed, the frequencies of the remaining words are counted, and the words are sorted by frequency in descending order. Feature words are then selected by a word-frequency rule defined in terms of the feature word set $S_{fw}$, the frequency of each word $w_i$, and the total number of electronic medical records $N_d$.
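The preprocessing step can be sketched as follows. This is a minimal sketch under stated assumptions: the documents are taken to be already segmented (the source uses Jieba with a custom medical dictionary), the stop-word list is a placeholder, and `min_ratio` is a hypothetical document-frequency threshold standing in for the patent's selection rule, whose formula is not preserved in this text.

```python
from collections import Counter

def select_feature_words(tokenized_docs, stop_words, min_ratio=0.5):
    """Return feature words sorted by frequency, highest first.

    tokenized_docs: list of token lists, one per discharge summary.
    min_ratio: HYPOTHETICAL threshold. A word is kept when it occurs in at
    least min_ratio * N_d documents; the patent's actual rule is not known.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # Count each word once per document and skip stop words.
        doc_freq.update({t for t in tokens if t not in stop_words})
    kept = [(w, c) for w, c in doc_freq.items() if c >= min_ratio * n_docs]
    # Sort by frequency (descending), then alphabetically for determinism.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in kept]

# Toy, already-segmented documents (illustrative tokens only).
docs = [
    ["patient", "the", "chest_pain", "cough"],
    ["patient", "fever", "the", "cough"],
    ["patient", "chest_pain"],
]
features = select_feature_words(docs, stop_words={"the"}, min_ratio=0.5)
print(features)
```

Counting each word once per document is a design choice made here because the rule refers to the number of records $N_d$; a raw term-frequency count would be an equally reasonable reading.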
2. Word vector representation of feature words
1) Character-based word vector representation
First, a vector representation is initialized for each character. The characters are then fed into a BiLSTM with an attention mechanism, which is trained to produce the character-level word vector of each feature word. The state value $c_t$ and output value $h_t$ of each neural unit in the BiLSTM are computed as follows ($t = 1, 2, \ldots, n$, where $t$ denotes the t-th neural unit in the network, or the unit at time $t$):
$i_t = \mathrm{sigmoid}(W_i [x_t; h_{t-1}] + b_i)$ (1)
$f_t = \mathrm{sigmoid}(W_f [x_t; h_{t-1}] + b_f)$ (2)
$g_t = \tanh(W_g [x_t; h_{t-1}] + b_g)$ (3)
$o_t = \mathrm{sigmoid}(W_o [x_t; h_{t-1}] + b_o)$ (4)
$c_t = f_t * c_{t-1} + i_t * g_t$ (5)
$h_t = o_t * \tanh(c_t)$ (6)
Each neural unit contains an input gate $i$, an output gate $o$, a forget gate $f$, a memory unit $g$, a state cell $c$, and a hidden state $h$, all of which are vectors. $W_i$, $W_f$, $W_g$, $W_o$ are weight matrices and $b_i$, $b_f$, $b_g$, $b_o$ are bias vectors. ";" denotes concatenation and "*" denotes element-wise multiplication. The sigmoid function is $\mathrm{sigmoid}(x) = 1/(1 + e^{-x})$ and the tanh function is $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$. The output of the BiLSTM at each unit is formed from the hidden states of the forward and backward LSTMs at that unit.
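Equations (1) to (6) can be traced with a small pure-Python sketch. The helper names (`affine`, `lstm_step`, `bilstm`, `zero_params`) are illustrative, not from the source, and combining the two directions by concatenation is an assumption, since the source's BiLSTM output formula is not preserved in this text.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, v, b):
    """Return W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM unit update following equations (1)-(6)."""
    z = x_t + h_prev  # list concatenation plays the role of [x_t; h_{t-1}]
    i_t = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]    # (1) input gate
    f_t = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]    # (2) forget gate
    g_t = [math.tanh(a) for a in affine(p["W_g"], z, p["b_g"])]  # (3) candidate
    o_t = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]    # (4) output gate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]  # (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                  # (6)
    return h_t, c_t

def bilstm(xs, p_fwd, p_bwd, hidden_dim):
    """Run both directions; concatenating them is an ASSUMPTION of this sketch."""
    def run(seq, p):
        h, c, outs = [0.0] * hidden_dim, [0.0] * hidden_dim, []
        for x in seq:
            h, c = lstm_step(x, h, c, p)
            outs.append(h)
        return outs
    fwd = run(xs, p_fwd)
    bwd = run(xs[::-1], p_bwd)[::-1]  # re-align backward outputs with positions
    return [f + b for f, b in zip(fwd, bwd)]

def zero_params(input_dim, hidden_dim):
    """All-zero parameters, useful only for checking shapes and gate arithmetic."""
    def rows():
        return [[0.0] * (input_dim + hidden_dim) for _ in range(hidden_dim)]
    return {f"{k}_{g}": (rows() if k == "W" else [0.0] * hidden_dim)
            for k in ("W", "b") for g in "ifgo"}

p = zero_params(input_dim=2, hidden_dim=2)
h_t, c_t = lstm_step([1.0, 2.0], [0.0, 0.0], [1.0, -1.0], p)
outputs = bilstm([[1.0, 2.0], [0.5, 0.5], [0.0, 1.0]], p, p, hidden_dim=2)
```

With all-zero parameters every gate equals 0.5 and the candidate is 0, so the cell state is simply halved; this makes the arithmetic of equations (5) and (6) easy to check by hand.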
2) Application of the attention mechanism
The attention mechanism is computed as:
$u_{ij} = \tanh(W_c h_{ij} + b_c)$ (7)
where $h_{ij}$ is the hidden-layer output of the BiLSTM for the j-th character of the i-th word, $W_c$ is a weight matrix, $b_c$ is a bias vector, and $u_c$ is a randomly initialized character-level context feature vector. $\alpha_{ij}$ is the weight of the j-th character for the i-th word, computed with the softmax function, and the weighted combination of the $h_{ij}$ gives the context-weighted feature vector of the i-th word.
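The character-level attention can be sketched as a pooling function. Equation (7) is taken from the text; the scoring step is an assumption, namely a softmax over the dot products of each $u_{ij}$ with the context vector $u_c$, because the source states only that softmax is used. The function name and argument names are illustrative.

```python
import math

def attention_pool(hidden_states, W, b, context):
    """Pool hidden vectors into one attention-weighted vector.

    hidden_states: list of vectors h_ij for the characters of one word.
    W, b: weight matrix (list of rows) and bias vector of equation (7).
    context: context vector u_c (randomly initialized and trained in the source).
    Returns (weights, pooled).
    """
    scores = []
    for h in hidden_states:
        # Equation (7): u = tanh(W h + b)
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
             for row, bias in zip(W, b)]
        # ASSUMED scoring: dot product between u and the context vector.
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the original hidden states.
    dim = len(hidden_states[0])
    pooled = [sum(a * h[k] for a, h in zip(weights, hidden_states))
              for k in range(dim)]
    return weights, pooled

identity = [[1.0, 0.0], [0.0, 1.0]]
states = [[1.0, 0.0], [0.0, 1.0]]
uniform_w, uniform_v = attention_pool(states, identity, [0.0, 0.0], [0.0, 0.0])
skewed_w, skewed_v = attention_pool(states, identity, [0.0, 0.0], [10.0, 0.0])
```

A zero context vector gives every character the same weight, while a context vector aligned with the first hidden state shifts almost all of the weight onto it. The same function applies unchanged at the word level with $W$, $b_w$, and $u_w$.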
3) The trained character-level word vector is concatenated with the word vector generated by word2vec, giving a word feature vector enhanced with character-level contextual features.
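The concatenation itself is simple; the sketch below only illustrates the resulting dimensionality. The lookup tables, the example word, and the vector sizes are hypothetical placeholders, not values from the source.

```python
def char_enhanced_vector(word, char_level_vectors, word_level_vectors):
    """Concatenate the character-level vector of a word with its word2vec vector."""
    return char_level_vectors[word] + word_level_vectors[word]

# Hypothetical toy lookup tables: a 3-dim character-level vector and a
# 4-dim word-level vector for a single feature word.
char_level = {"chest_pain": [0.1, 0.2, 0.3]}
word_level = {"chest_pain": [0.5, 0.6, 0.7, 0.8]}
enhanced = char_enhanced_vector("chest_pain", char_level, word_level)
print(len(enhanced))
```

The enhanced vector has as many dimensions as the two inputs combined, so the second-layer BiLSTM must be sized for that sum.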
3. Contextual feature extraction
The sequence of character-enhanced feature vectors is fed into a second-layer BiLSTM with an attention mechanism, which extracts the contextual features of the text. The BiLSTM unit computation and the context-weighting computation are the same as for the character-level word vectors. The formulas are:
$u_i = \tanh(W h_i + b_w)$ (10)
$v = \sum_i \alpha_i h_i$ (12)
where $h_i$ is the hidden-layer output of the BiLSTM for the character-enhanced feature vector of the i-th word in the text sequence, $W$ is a weight matrix, and $b_w$ is a bias vector. When the attention mechanism is applied, a word-level document context feature vector $u_w$ is introduced and randomly initialized to compute the weights. $\alpha_i$ is the weight of each word and $v$ is the context-weighted feature vector of the whole text. This vector is fed into a fully connected layer, and a sigmoid function gives the probability of each disease code.
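The final step, a fully connected layer followed by a sigmoid per disease code, can be sketched as below. The weights, the two example codes, and the 0.5 decision threshold are assumptions for illustration; the source specifies only the fully connected layer and the sigmoid.

```python
import math

def code_probabilities(v, W_out, b_out):
    """Fully connected layer plus element-wise sigmoid: one probability per code."""
    logits = [sum(w * x for w, x in zip(row, v)) + bias
              for row, bias in zip(W_out, b_out)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def predict_codes(v, W_out, b_out, code_names, threshold=0.5):
    """Return every code whose probability exceeds the (assumed) threshold."""
    probs = code_probabilities(v, W_out, b_out)
    return [name for name, prob in zip(code_names, probs) if prob > threshold]

# Toy document vector and hand-picked weights for two illustrative codes.
v = [1.0, -1.0]
W_out = [[2.0, 0.0], [0.0, 2.0]]
b_out = [0.0, 0.0]
probs = code_probabilities(v, W_out, b_out)
predicted = predict_codes(v, W_out, b_out, ["I10", "E11"])
```

Because each code gets its own sigmoid rather than sharing a softmax, several codes can be predicted for one record, which matches the multi-label nature of the task.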
4. Experimental verification
1) Experimental procedure
To verify the effectiveness of the method, we ran experiments on real clinical data from Chinese electronic medical records. The dataset contains 7732 discharge records covering 1177 ICD-10 disease code labels. An ICD-10 code is a dot-separated six-character code made up of letters and digits; it begins with a letter, and its first three characters form the first-level code, which indicates the disease category. Discharge summaries average 610 words in length, and each corresponds to 3.6 disease codes on average.
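Mapping full codes to first-level codes follows directly from the description above: keep the first three characters. The sketch below assumes codes arrive as strings; the example code strings and function names are illustrative only.

```python
def primary_code(icd10_code):
    """Return the first-level (three-character) category of an ICD-10 code."""
    return icd10_code.strip()[:3].upper()

def to_primary_labels(codes):
    """Collapse a record's full codes into its sorted set of first-level codes."""
    return sorted({primary_code(c) for c in codes})

# Illustrative code strings; two of them share the same category.
labels = to_primary_labels(["J18.900", "J18.000", "I10.x00"])
print(labels)
```

Collapsing codes this way reduces the label space, which is why the experiments report results separately for full codes and first-level codes.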
The experiments were run on a server with 256 GB of memory and an NVIDIA GeForce Titan X Pascal CUDA GPU. We split the dataset into training and test sets at a ratio of 9:1 and validated over ten random shuffles of the data. The evaluation metrics are micro-averaged precision (P), recall (R), and their combined F1 score, together with the Hamming loss, which measures misclassification from the per-sample perspective. A higher F1 score and a lower Hamming loss indicate better model performance.
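The four metrics can be computed from label sets as in the sketch below. It uses the standard definitions of micro-averaged precision, recall, F1, and Hamming loss; the function name and the toy labels are illustrative, and the source does not show its evaluation code.

```python
def micro_prf_and_hamming(y_true, y_pred, n_labels):
    """Micro-averaged precision, recall, F1, and Hamming loss.

    y_true, y_pred: lists of sets of label indices, one set per record.
    n_labels: total number of distinct labels in the task.
    """
    tp = fp = fn = 0
    for true, pred in zip(y_true, y_pred):
        tp += len(true & pred)   # codes predicted and correct
        fp += len(pred - true)   # codes predicted but wrong
        fn += len(true - pred)   # codes missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Fraction of (record, label) pairs that are labeled incorrectly.
    hamming = (fp + fn) / (len(y_true) * n_labels)
    return precision, recall, f1, hamming

# Two toy records over four possible labels.
y_true = [{0, 1}, {2}]
y_pred = [{0}, {2, 3}]
p, r, f1, hl = micro_prf_and_hamming(y_true, y_pred, n_labels=4)
```

Micro-averaging pools the counts over all records and labels before dividing, so frequent codes influence the score more than rare ones.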
2) Experimental results
Because related work has already shown that deep learning methods outperform traditional machine learning methods, we mainly compared against other common neural network models. The results are shown in Table 1, where MA-BiLSTM denotes our model and D2V+CNN is a method from related work that achieved the best reported results at the time on the public English dataset MIMIC III. The results show that MA-BiLSTM outperforms the other neural network models on every evaluation metric, indicating that a BiLSTM combined with an attention mechanism can effectively capture the contextual features of long text and improve prediction.
Table 1. Comparative experimental results
To analyze the role of each module of the model, we designed ablation experiments; the results are shown in Table 2. When the words in the text are represented by word vectors alone or by character vectors alone, prediction performance drops, so the character-enhanced word vector does give a better text feature representation. The attention mechanism plays an important role in the model: removing it causes a marked drop in performance.
Predictions were made on both full ICD-10 codes and first-level codes; the 7732 samples correspond to 488 first-level codes. The results are shown in Fig. 4. Prediction on first-level codes reached a precision of 80.5%, which can give useful support to medical records staff in disease-code labeling.
Table 2. Model ablation experimental results
Claims (4)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811523661.7A (CN109697285B) | 2018-12-13 | 2018-12-13 | Hierarchical BiLSTM method with enhanced semantic representation for disease-code labeling of Chinese electronic medical records |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109697285A (en) | 2019-04-30 |
| CN109697285B (en) | 2022-06-21 |
Family
ID=66231615
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201811523661.7A (Active; CN109697285B) | Hierarchical BiLSTM method with enhanced semantic representation for disease-code labeling of Chinese electronic medical records | 2018-12-13 | 2018-12-13 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109697285B (en) |
Families Citing this family (30)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110427610A (en) * | 2019-06-25 | 2019-11-08 | 平安科技(深圳)有限公司 | Text analyzing method, apparatus, computer installation and computer storage medium |
| CN110491499A (en) * | 2019-07-10 | 2019-11-22 | 厦门大学 | Clinical aid decision-making method and system towards mark electronic health record |
| CN110491465B (en) * | 2019-08-20 | 2020-09-15 | 山东众阳健康科技集团有限公司 | Disease classification coding method, system, device and medium based on deep learning |
| CN110633470A (en) * | 2019-09-17 | 2019-12-31 | 北京小米智能科技有限公司 | Named entity recognition method, device and storage medium |
| CN112632269A (en) * | 2019-09-24 | 2021-04-09 | 北京国双科技有限公司 | Method and related device for training document classification model |
| CN110837494B (en) * | 2019-10-12 | 2022-03-25 | 云知声智能科技股份有限公司 | Method and device for identifying unspecified diagnosis coding errors of medical record home page |
| CN110781407B (en) * | 2019-10-21 | 2024-07-23 | 腾讯科技(深圳)有限公司 | User tag generation method, device and computer readable storage medium |
| CN110866401A (en) * | 2019-11-18 | 2020-03-06 | 山东健康医疗大数据有限公司 | Chinese electronic medical record named entity identification method and system based on attention mechanism |
| CN110867231A (en) * | 2019-11-18 | 2020-03-06 | 中山大学 | Disease prediction method, device, computer equipment and medium based on text classification |
| CN110895580B (en) * | 2019-12-12 | 2020-07-07 | 山东众阳健康科技集团有限公司 | ICD operation and operation code automatic matching method based on deep learning |
| CN113012774B (en) * | 2019-12-18 | 2024-08-30 | 医渡云(北京)技术有限公司 | Automatic medical record coding method and device, electronic equipment and storage medium |
| CN111429204A (en) * | 2020-03-10 | 2020-07-17 | 携程计算机技术(上海)有限公司 | Hotel recommendation method, system, electronic equipment and storage medium |
| CN111914539B (en) * | 2020-07-31 | 2024-09-10 | 长江航道测量中心 | Channel notification information extraction method and system based on BiLSTM-CRF model |
| CN112183104B (en) * | 2020-08-26 | 2024-06-14 | 望海康信(北京)科技股份公司 | Code recommendation method, system, corresponding equipment and storage medium |
| CN112052646B (en) * | 2020-08-27 | 2024-03-29 | 安徽聚戎科技信息咨询有限公司 | Text data labeling method |
| CN112185564B (en) * | 2020-10-20 | 2022-09-06 | 福州数据技术研究院有限公司 | An ophthalmic disease prediction method and storage device based on structured electronic medical records |
| CN112380863A (en) * | 2020-10-29 | 2021-02-19 | 国网天津市电力公司 | Sequence labeling method based on multi-head self-attention mechanism |
| CN112259260B (en) * | 2020-11-18 | 2023-11-17 | 中国科学院自动化研究所 | Intelligent medical question-answering method, system and device based on intelligent wearable equipment |
| CN112732915A (en) * | 2020-12-31 | 2021-04-30 | 平安科技(深圳)有限公司 | Emotion classification method and device, electronic equipment and storage medium |
| CN112632911B (en) * | 2021-01-04 | 2022-05-13 | 福州大学 | Chinese Character Encoding Method Based on Character Embedding |
| CN113593709B (en) * | 2021-07-30 | 2022-09-30 | 江先汉 | Disease coding method, system, readable storage medium and device |
| CN113901805B (en) * | 2021-10-15 | 2025-01-28 | 长三角信息智能创新研究院 | Automatic ICD9 code assignment method for medical record text based on label attributes and feature enhancement |
| CN114049926A (en) * | 2021-10-27 | 2022-02-15 | 徐州医科大学 | A text classification method for electronic medical records |
| CN114417836B (en) * | 2022-01-18 | 2025-02-11 | 北京工业大学 | A semantic segmentation method for Chinese electronic medical record text based on deep learning |
| CN115834242A (en) * | 2022-12-28 | 2023-03-21 | 深信服科技股份有限公司 | Network traffic feature extraction method, device, device, and storage medium |
| CN116467440B (en) * | 2023-03-30 | 2025-06-27 | 浙江大学 | Multi-level semantic text classification method based on Litsea artificial liver medical record |
| CN116955628B (en) * | 2023-08-08 | 2024-05-03 | 武汉市万睿数字运营有限公司 | Complaint event classification method, complaint event classification device, computer equipment and storage medium |
| CN116884630B (en) * | 2023-09-06 | 2024-08-23 | 深圳达实旗云健康科技有限公司 | Method for improving disease automatic coding efficiency |
| CN117438024B (en) * | 2023-12-15 | 2024-03-08 | 吉林大学 | Intelligent acquisition and analysis system and method for acute diagnosis patient sign data |
| CN118518984B (en) * | 2024-07-24 | 2024-09-27 | 新疆西部明珠工程建设有限公司 | Intelligent fault positioning system and method for power transmission and distribution line |
Family Cites Families (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20080288292A1 (en) * | 2007-05-15 | 2008-11-20 | Siemens Medical Solutions Usa, Inc. | System and Method for Large Scale Code Classification for Medical Patient Records |
| WO2015084615A1 (en) * | 2013-12-03 | 2015-06-11 | 3M Innovative Properties Company | Constraint-based medical coding |
| US10509889B2 (en) * | 2014-11-06 | 2019-12-17 | ezDI, Inc. | Data processing system and method for computer-assisted coding of natural language medical text |
| EP3273373A1 (en) * | 2016-07-18 | 2018-01-24 | Fresenius Medical Care Deutschland GmbH | Drug dosing recommendation |
| CN106484674B (en) * | 2016-09-20 | 2020-09-25 | 北京工业大学 | Chinese electronic medical record concept extraction method based on deep learning |
| CN106844308B (en) * | 2017-01-20 | 2020-04-03 | 天津艾登科技有限公司 | Method for automatic disease code conversion using semantic recognition |
| CN107644014A (en) * | 2017-09-25 | 2018-01-30 | 南京安链数据科技有限公司 | A kind of name entity recognition method based on two-way LSTM and CRF |
| CN107731269B (en) * | 2017-10-25 | 2020-06-26 | 山东众阳软件有限公司 | Disease coding method and system based on original diagnosis data and medical record file data |
| CN107977361B (en) * | 2017-12-06 | 2021-05-18 | 哈尔滨工业大学深圳研究生院 | Chinese clinical medical entity identification method based on deep semantic information representation |
| CN108460013B (en) * | 2018-01-30 | 2021-08-20 | 大连理工大学 | A sequence tagging model and method based on a fine-grained word representation model |
| CN108536754A (en) * | 2018-03-14 | 2018-09-14 | 四川大学 | Electronic health record entity relation extraction method based on BLSTM and attention mechanism |
| CN108628823B (en) * | 2018-03-14 | 2022-07-01 | 中山大学 | A Named Entity Recognition Method Combining Attention Mechanism and Multi-task Co-training |
| CN108628824A (en) * | 2018-04-08 | 2018-10-09 | 上海熙业信息科技有限公司 | A kind of entity recognition method based on Chinese electronic health record |
Also Published As
| Publication number | Publication date |
|---|---|
| CN109697285A (en) | 2019-04-30 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||