Deep Learning – Natural Language Processing


Entity-enhanced BERT for Medical Specialty Prediction Based on Clinical Questionnaire Data

Objective

  • Medical text data contain large amounts of information about patients, which increases the sequence length. Hence, a few studies have attempted to extract entities from the text as concise features that provide domain-specific knowledge for clinical text classification.
  • However, existing approaches still do not inject entity information into the model effectively.
  • We propose Entity-enhanced BERT (E-BERT), a medical specialty prediction model that adds two modules to BERT to integrate entity information while processing medical text.


Data

The dataset consists of clinical questionnaire data containing, in question-and-answer format, information such as the patient’s symptoms, location of pain, disease, and lifestyle habits.

Related Work

1. Clinical text data categorization and feature extraction using medical-fissure algorithm and neg-seq algorithm

  • One of the pipelines for disease prediction uses only entities extracted from medical records.
  • Such pipelines cannot reflect relationships between the entities and the full text. Furthermore, they risk losing meaningful information that could be obtained from other sentences.
[ NLP pipeline for Outcome Prediction ]

2. KG-MTT-BERT: Knowledge Graph Enhanced BERT for Multi-Type Medical Text Classification

  • KG-MTT-BERT extends BERT for multi-type text classification and incorporates an object-based medical knowledge graph.
  • However, text and entities are handled in independent frameworks, so the relationship between the text and the entities cannot be reflected directly, and the model complexity is high.
[ KG-MTT-BERT ]

Proposed Method

  • In this study, we propose Entity-enhanced BERT (E-BERT), which utilizes the structural attributes of BERT for medical specialty prediction.
  • E-BERT has an entity embedding layer and entity-aware attention to inject domain-specific knowledge and to focus on the relationships between medical entities within a sequence (see the sketch below).
  • E-BERT effectively incorporates domain-specific knowledge along with other information, enabling it to capture the contextual information in the text.
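A minimal PyTorch sketch of how these two modules could be wired is shown below. The module names, dimensions, and the form of the attention bias are our own illustrative assumptions, not the exact E-BERT design.

```python
# Sketch only: entity spans are assumed to be pre-tagged per token with an
# entity-type id (0 = non-entity); module design is illustrative.
import torch
import torch.nn as nn

class EntityEnhancedEmbedding(nn.Module):
    """Adds an entity-type embedding on top of the usual token embedding."""
    def __init__(self, vocab_size, num_entity_types, hidden=768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.entity_emb = nn.Embedding(num_entity_types, hidden, padding_idx=0)

    def forward(self, token_ids, entity_type_ids):
        return self.token_emb(token_ids) + self.entity_emb(entity_type_ids)

def entity_aware_attention_bias(entity_type_ids, bias=1.0):
    """Additive attention bias that raises scores between entity tokens."""
    is_entity = (entity_type_ids > 0).float()                # (batch, seq)
    pair = is_entity.unsqueeze(1) * is_entity.unsqueeze(2)   # (batch, seq, seq)
    return bias * pair  # added to raw attention scores before softmax

# toy usage
emb = EntityEnhancedEmbedding(vocab_size=30522, num_entity_types=10)
tok = torch.randint(0, 30522, (2, 16))
ent = torch.randint(0, 10, (2, 16))
x = emb(tok, ent)                              # (2, 16, 768)
attn_bias = entity_aware_attention_bias(ent)   # (2, 16, 16)
```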

Comparison of Data2Text models using Sequence-to-Sequence

Objective

1. It is difficult to compare the performance of different Sequence-to-Sequence models.

  • The data consist of sets of fields and values, and performance differs significantly depending on how a model learns this structure.
  • Therefore, we compare two representative sequence-to-sequence models under the same conditions to find a more effective methodology for learning the structure of the data.

2. Words that are not in the vocabulary, such as proper nouns, cannot be generated.

  • Previous studies have proposed replacing the “unk” token using the attention distribution as a solution to the out-of-vocabulary (OOV) problem.
  • However, this method applies only when the model actually outputs the “unk” token.

Data

WikiBio dataset: WikiBio consists of biographies of people from Wikipedia. The given infobox table is the input, and the first sentence of each biography is the label.

Lebret, R., Grangier, D., & Auli, M. (2016). Neural text generation from structured data with application to the biography domain. arXiv preprint arXiv:1603.07771.
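To make the input/output format concrete, the sketch below shows a hypothetical (table, sentence) pair in the WikiBio style; the field names, values, and serialization scheme are illustrative, not taken from the actual dataset.

```python
# Hypothetical WikiBio-style example: the infobox table is the input,
# the first sentence of the biography is the target.
table = {
    "name": "jane doe",
    "birth_date": "1 january 1970",
    "occupation": "marine biologist",
    "nationality": "canadian",
}
target = "jane doe (born 1 january 1970) is a canadian marine biologist."

# One common way to serialize the table for a sequence-to-sequence encoder
# is as (field, position, value-token) triples:
serialized = [(field, i + 1, tok)
              for field, value in table.items()
              for i, tok in enumerate(value.split())]
```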

Related Work

1. Table-to-text Generation by Structure-aware Seq2seq Learning (Liu et al., 2018)

Liu, T., Wang, K., Sha, L., Chang, B., & Sui, Z. (2018, April). Table-to-text generation by structure-aware seq2seq learning. In Thirty-Second AAAI Conference on Artificial Intelligence.

2. Order-Planning Neural Text Generation From Structured Data (Sha et al., 2018)

Sha, L., Mou, L., Liu, T., Poupart, P., Li, S., Chang, B., & Sui, Z. (2018, April). Order-planning neural text generation from structured data. In Thirty-Second AAAI Conference on Artificial Intelligence.

Proposed Method

We compare the two representative sequence-to-sequence models under the same conditions to find a more effective methodology for learning the structure of the data. In addition, we add a copy mechanism that improves performance by allowing the model to output words that are not in the vocabulary, such as proper nouns.
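A minimal sketch of the copy step is given below, in the spirit of pointer-generator decoding; the tensor names, shapes, and the way the extended vocabulary is indexed are illustrative assumptions rather than the project's exact implementation.

```python
# Sketch: mix the decoder's vocabulary distribution with the attention
# distribution over source tokens so OOV source words can be copied.
import torch

def copy_mechanism(vocab_dist, attn_dist, src_ids, p_gen, extended_vocab_size):
    """Mix generation and copying so OOV source words (e.g. proper nouns)
    can be emitted by copying them from the input table."""
    batch = vocab_dist.size(0)
    out = torch.zeros(batch, extended_vocab_size)
    out[:, :vocab_dist.size(1)] = p_gen * vocab_dist           # generate from vocab
    out.scatter_add_(1, src_ids, (1.0 - p_gen) * attn_dist)    # copy from source
    return out

# toy usage: vocab of 8 words plus 3 extra slots for source-only words
vocab_dist = torch.softmax(torch.randn(2, 8), dim=-1)
attn_dist = torch.softmax(torch.randn(2, 5), dim=-1)   # attention over 5 source tokens
src_ids = torch.randint(0, 11, (2, 5))                 # ids in the extended vocab
p_gen = torch.rand(2, 1)
final_dist = copy_mechanism(vocab_dist, attn_dist, src_ids, p_gen, 11)
```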


Research on Artificial Intelligence Writing Technology Using Natural Language Processing

Objective

Development of a novel-writing platform based on “user input”, trained on various novel genres.

Data

KoGPT2 fine-tuning is performed using novel text data. For semantic role labeling (SRL), we use the ETRI Semantic Role Labeling Corpus to train the SRL model.

Related Work

KoGPT2 is a pretrained language model optimized for sentence generation, i.e., predicting the next word in a given text well. It is a transformer decoder language model trained on more than 40 GB of text to overcome insufficient Korean performance.

Proposed Method

The architecture combines a Generate Layer for novel generation with an SRL Layer for reflecting user input. When a sentence is entered, the next sentence is generated by KoGPT2, and the generated sentence is then corrected by the SRL layer, as sketched below.
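A minimal sketch of this generate-then-correct loop is shown below, assuming the publicly released skt/kogpt2-base-v2 checkpoint; srl_correct() is a hypothetical stand-in for the SRL layer, not the actual implementation.

```python
# Sketch: generate a draft next sentence with KoGPT2, then pass it through
# a placeholder SRL-based correction step.
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "skt/kogpt2-base-v2",
    bos_token="</s>", eos_token="</s>", unk_token="<unk>",
    pad_token="<pad>", mask_token="<mask>",
)
model = GPT2LMHeadModel.from_pretrained("skt/kogpt2-base-v2")

def srl_correct(draft: str, user_input: str) -> str:
    # Hypothetical placeholder: align the predicate-argument frames of the
    # draft with those extracted from the user input and rewrite mismatches.
    return draft

def next_sentence(prev_sentence: str, user_input: str) -> str:
    ids = tokenizer.encode(prev_sentence, return_tensors="pt")
    out = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                         pad_token_id=tokenizer.pad_token_id)
    draft = tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)
    return srl_correct(draft, user_input)
```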


AI consultation chatbot

Objective

Development of a consultation AI chatbot for providing telemedicine counseling solutions

Data

An EMR dataset was used to train the chatbot model.

Related Work

GPT-2 is a pretrained language model optimized for sentence generation, i.e., predicting the next word in a given text well. The model used here is a transformer decoder language model trained on more than 40 GB of text to overcome insufficient Korean performance.

Proposed Method

Each sequence consists of a list of questions and answers followed by a diagnosis name, and these sequences were used to fine-tune GPT-2 to predict the diagnosis. As a result, the model can generate appropriate questions in response to the patient’s answers and finally predict the diagnosis name.
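The sketch below shows one way such a training sequence could be laid out and used for a causal-LM fine-tuning step; the <q>/<a>/<dx> markers, the example dialogue, and the base gpt2 checkpoint are illustrative assumptions, not the project's actual format or data.

```python
# Sketch: build one Q&A-plus-diagnosis sequence and run a standard
# language-modeling fine-tuning step on it.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def build_sequence(qa_pairs, diagnosis):
    body = "".join(f"<q>{q}<a>{a}" for q, a in qa_pairs)
    return body + f"<dx>{diagnosis}"

text = build_sequence(
    [("Where does it hurt?", "Lower right abdomen"),
     ("Do you have a fever?", "Yes, since yesterday")],
    "appendicitis",
)

# Causal-LM fine-tuning step: the labels are the inputs themselves.
enc = tokenizer(text, return_tensors="pt")
loss = model(**enc, labels=enc["input_ids"]).loss
loss.backward()
```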


Similarity-Based Unsupervised Spelling Correction Using BioWordVec: Development and Usability Study of Bacterial Culture and Antimicrobial Susceptibility Reports

Objective

In this research, we propose a similarity-based spelling correction algorithm using pretrained word embeddings obtained with the BioWordVec technique. This method uses a character-level n-gram-based distributed representation learned through unsupervised learning, rather than the existing rule-based approach. In other words, we propose a framework that detects and corrects typographical errors when a dictionary is not in place.

Data

In this study, the bacterial culture and antimicrobial susceptibility reports from Korea University Anam Hospital, Korea University Guro Hospital, and Korea University Ansan Hospital were used. The bacterial culture and antimicrobial susceptibility report data were collected for 17 years (from 2002 to 2018), and in each year, reports for 1 month were used for the experiment. In total, 180,000 items were retrieved, with 27,544 having meaningful test results. Using the self-developed rule-based ETL algorithm, unstructured bacterial culture and antimicrobial susceptibility reports were converted into structured text data. After preprocessing through lexical processing, such as sentence segmentation, tokenization, and stemming using regular expressions, there were 320 types of bacterial identification words in the report. Among the extracted bacterial identification words, 16 types of spelling errors and 914 misspelled words were found.
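A minimal sketch of the kind of regex-based lexical preprocessing described above is given below; the patterns and the sample report line are illustrative and do not reproduce the actual rule-based ETL algorithm.

```python
# Sketch: split a report line into sentences and extract lower-cased
# alphabetic tokens (candidate bacterial identification words).
import re

def tokenize_report_line(line: str):
    sentences = re.split(r"[.;\n]+", line)          # sentence segmentation
    return [re.findall(r"[A-Za-z][A-Za-z\-]+", s.lower())
            for s in sentences if s.strip()]

tokens = tokenize_report_line(
    "Culture: Staphylococcus aureus isolated; susceptible to oxacillin."
)
# [['culture', 'staphylococcus', 'aureus', 'isolated'],
#  ['susceptible', 'to', 'oxacillin']]
```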

Related Work

BioWordVec learns word embeddings from PubMed literature and MIMIC-III clinical records using fastText. The full corpus was built from 28,714,373 PubMed documents and 2,083,180 MIMIC-III clinical database documents. The Medical Subject Headings (MeSH) term graph was organized to create heading sequences, and word embedding was carried out on a sequence combining MeSH and PubMed text. BioWordVec provides a 200-dimensional pretrained word embedding matrix.

Proposed Method

For detected typographical errors not mapped to Systematized Nomenclature of Medicine (SNOMED) clinical terms, a correction candidate group with high similarity considering the edit distance was generated using pretrained word embedding from the clinical database. From the embedding matrix in which the vocabulary is arranged in descending order according to frequency, a grid search was used to search for candidate groups of similar words. Thereafter, the correction candidate words were ranked in consideration of the frequency of the words, and the typographical errors were finally corrected according to the ranking.
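The sketch below illustrates the candidate-generation and ranking step under our own assumptions: `vectors` behaves like a fastText-style embedding (indexing a misspelled word returns a subword-composed vector), `vocab_by_freq` is the embedding vocabulary ordered by descending corpus frequency, `dictionary` is the set of valid SNOMED-mapped terms, the thresholds are illustrative, and difflib's SequenceMatcher ratio stands in for the edit-distance criterion.

```python
# Sketch: for a detected typo, scan the frequency-ordered vocabulary,
# keep dictionary words that are close in spelling and in embedding space,
# and rank the surviving candidates by corpus frequency.
import numpy as np
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def cosine(u, v) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def correct(typo, vectors, vocab_by_freq, dictionary,
            topk=5, min_edit=0.8, min_cos=0.5):
    """Return ranked correction candidates for a word not in the dictionary."""
    if typo in dictionary:
        return [typo]
    query = vectors[typo]  # subword-composed vector for the misspelled word
    candidates = []
    for rank, word in enumerate(vocab_by_freq):        # frequency-ordered scan
        if word not in dictionary:
            continue
        if edit_similarity(typo, word) < min_edit:
            continue
        if cosine(query, vectors[word]) < min_cos:
            continue
        candidates.append((rank, word))                # smaller rank = more frequent
    candidates.sort()                                  # final ranking by frequency
    return [word for _, word in candidates[:topk]]
```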