{"id":762,"date":"2021-10-27T15:44:07","date_gmt":"2021-10-27T06:44:07","guid":{"rendered":"https:\/\/aidalab.cafe24.com\/?page_id=762"},"modified":"2026-02-26T19:24:41","modified_gmt":"2026-02-26T10:24:41","slug":"deep-learning-nlp","status":"publish","type":"page","link":"https:\/\/aida.korea.ac.kr\/?page_id=762","title":{"rendered":""},"content":{"rendered":"\n<h1 class=\"wp-block-heading\">Deep Learning &#8211; Natural Language Processing<\/h1>\n\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong><strong>Predicting the Difficulty of Korean CSAT Reading Comprehension Questions Using LLM-based Qualitative Features<\/strong><\/strong><\/h2>\n\n\n\n<p><strong>Objective<\/strong><\/p>\n\n\n\n<p>The Korean College Scholastic Ability Test (KCSAT) is a high-stakes national standardized examination that significantly influences university admissions. In the reading section, item difficulty directly affects score distribution, fairness, and discrimination power. However, current difficulty control mechanisms rely primarily on post-exam statistical analysis and expert judgment. As a result, it is challenging to quantitatively estimate the answer rate of an item during the item development stage.<\/p>\n<p>This project aims to develop an AI-based prediction system that estimates the answer rate and difficulty level of KCSAT reading comprehension items prior to exam administration. By enabling pre-exam quantitative prediction, the proposed system supports more objective and data-driven difficulty control.<\/p>\n\n\n\n<p><strong>Data<\/strong><\/p>\n\n\n\n<p>The dataset consists of KCSAT reading comprehension items collected from past national exams, mock exams, and academic achievement tests. 
Each item includes passage text, question text, answer choices, optional image descriptions, and exam-type information.<\/p>\n\n<p>Two prediction targets are defined:<br>\n(1) continuous answer rate (regression task)<br>\n(2) three-level difficulty class derived from answer rate (classification task)<\/p>\n\n<p>Beyond raw textual input, we construct two complementary structured feature sets: quantitative item features and LLM-derived features generated via prompt engineering.<\/p>\n\n\n\n<p><strong>Related Work<\/strong><\/p>\n\n\n\n<p>One of the early data-driven approaches to question difficulty prediction (QDP) for reading comprehension in standardized tests was proposed by Huang et al. (2017). They introduced a Test-aware Attention-based Convolutional Neural Network (TACNN) framework to estimate question difficulty prior to exam administration. Their model represented documents, questions, and answer options using sentence-level CNN encoders and applied an attention mechanism to identify difficulty-relevant textual components. To address the incomparability of difficulty values across different test administrations, they further proposed a test-dependent pairwise training strategy that optimized relative difficulty differences within the same test. Experimental results demonstrated that incorporating attention mechanisms and test-aware learning improved predictive performance compared to conventional CNN-based baselines. This work established a foundational deep learning framework for question difficulty prediction in reading problems of standardized assessments.<\/p>\n\n<figure class=\"wp-block-image aligncenter size-full\">\n    <img decoding=\"async\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2026\/02\/figure1.png\" alt=\"\" class=\"wp-image-1727\"\/>\n    <figcaption class=\"wp-element-caption\">\n        [Huang, Zhenya, et al. 
&#8220;Question Difficulty Prediction for READING Problems in Standard Tests.&#8221;\u00a0Proceedings of the AAAI conference on artificial intelligence. Vol. 31. No. 1. 2017.]\n    <\/figcaption>\n<\/figure>\n\n<p><strong>Proposed Method<\/strong><\/p>\n\n\n\n<p>We propose a multi-level AI framework that systematically compares different text representation strategies and prediction models. First, multiple text representation approaches are evaluated, including TF-IDF, Word2Vec with TF-IDF-weighted document embeddings, KoBigBird-based embeddings, and end-to-end fine-tuned KoBigBird models. Second, structured meta-features are integrated through late concatenation with text representations. This enables the model to jointly consider linguistic characteristics and LLM-derived reasoning signals. Third, various machine learning algorithms are applied for both regression and classification tasks, including Ridge regression, Support Vector Machines, Random Forest, XGBoost, and LightGBM. To reflect real-world deployment scenarios, a time-based split was adopted by assigning the most recent year as the test set. 
In addition, a stratified split was performed to preserve the difficulty distribution across train, validation, and test sets for comparison.<\/p>\n\n\n<figure class=\"wp-block-image aligncenter size-full\">\n    <img decoding=\"async\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2026\/02\/figure2.png\" alt=\"\" class=\"wp-image-1727\"\/>\n<\/figure>\n\n<p><strong>Contribution<\/strong><\/p>\n\n\n\n<p>This study makes three primary contributions. First, we formalize and quantify LLM-based qualitative characteristics (such as reasoning level, cognitive load, and answer design complexity) and integrate them as structured predictive features for difficulty estimation.\nSecond, we establish a unified evaluation framework that systematically compares diverse model configurations combining text representations and structured item-level features.\nThird, we empirically demonstrate that the impact of LLM-derived features varies depending on the prediction objective, providing insight into how LLM-based signals function in high-stakes national language assessment settings.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong><strong>OPSA: Order Preserving Token Shuffling Augmentation with Wargame Simulation Dataset<\/strong><\/strong><\/h2>\n\n\n\n<p><strong>Objective<\/strong><\/p>\n\n\n\n<p>Table-to-text generation is the task of producing textual descriptions from structured tabular data while addressing challenges such as limited domain-specific datasets and the risk of generating hallucinated content. Existing methods do not simultaneously mitigate data scarcity and ensure accuracy without relying on extensive additional neural network training. 
In this study, we propose Order Preserving Token Shuffling Augmentation (OPSA), a novel data augmentation methodology that shuffles tokens within a table while preserving their overall order, thereby improving the quality, diversity, and contextual relevance of generated text. By reducing the likelihood of hallucination and enhancing domain-specific outcomes, particularly in military simulations, OPSA provides a cost-effective solution for more effective information dissemination in specialized domains.<\/p>\n\n\n\n<p><strong>Data<\/strong><\/p>\n\n\n\n<p>This study utilized battle scenario data generated through military operation simulations using the Changjo21, Changgong, and Cheonghae models from South Korea, with actual names and location information modified for security reasons.<\/p>\n\n\n\n<p><strong>Related Work<\/strong><\/p>\n\n\n\n<p><strong>Structure-aware seq2seq [1]:<\/strong> Structure-aware seq2seq is a framework designed for data-to-text generation that incorporates structured information from the input data, allowing the model to better understand and represent the hierarchical relationships in the data, thus enhancing the coherence and relevance of the generated text.<\/p>\n\n\n\n<p><strong>Few-shot NLG [2]:<\/strong> Few-shot NLG is an approach to table-to-text generation under limited supervision that combines content selection from the input table with a pre-trained language model, so that fluent descriptions can be generated from a small number of training examples.<\/p>\n\n\n<figure class=\"wp-block-image aligncenter size-full\">\n    <img decoding=\"async\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2025\/02\/OPSA.png\" alt=\"\" class=\"wp-image-1727\"\/>\n    <figcaption class=\"wp-element-caption\">\n        [1]Liu, Tianyu, et al. 
&#8220;Table-to-text generation by structure-aware seq2seq learning.&#8221;\u00a0Proceedings of the AAAI conference on artificial intelligence. Vol. 32. No. 1. 2018.<br>\n        [2]Chen, Zhiyu, et al. &#8220;Few-shot NLG with pre-trained language model.&#8221;\u00a0arXiv preprint arXiv:1904.09521\u00a0(2019).\n<br>\n    <\/figcaption>\n<\/figure>\n\n\n<p><strong>Proposed Method<\/strong><\/p>\n\n\n\n<p>Order Preserving Token Shuffling Augmentation (OPSA) is designed to enhance table-to-text generation by augmenting training data without requiring additional neural network training. OPSA comprises two main components: the Order-Enhancing Table Transformation (OET) module and the patch-level token shuffling module. The OET module converts structured table data into a sequential format that integrates positional information, enabling more effective modeling of tabular structure. Meanwhile, the patch-level token shuffling module rearranges tokens while preserving their original order, thus maintaining the semantic integrity of the content. 
By retaining the sequence of patches corresponding to the same field, OPSA effectively augments the dataset and improves model performance when generating natural language descriptions from tabular data.<\/p>\n\n\n<figure class=\"wp-block-image aligncenter size-full\">\n    <img decoding=\"async\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2025\/02\/OPSA5.png\" alt=\"\" class=\"wp-image-1727\"\/>\n<\/figure>\n\n\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Similarity-Based Unsupervised Spelling Correction Using BioWordVec: Development and Usability Study of Bacterial Culture and Antimicrobial Susceptibility Reports<\/strong><\/h2>\n\n\n\n<p><strong>Objective<\/strong><\/p>\n\n\n\n<p>In this research, we propose a similarity-based spelling correction algorithm that uses pretrained word embeddings from BioWordVec. Instead of an existing rule-based method, it relies on a distributed representation based on character-level n-grams and obtained through unsupervised learning. In other words, we propose a framework that detects and corrects typographical errors when no dictionary is available.<\/p>\n\n\n\n<p><strong>Data<\/strong><\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-1 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\">\n<p>In this study, the bacterial culture and antimicrobial susceptibility reports from Korea University Anam Hospital, Korea University Guro Hospital, and Korea University Ansan Hospital were used. The bacterial culture and antimicrobial susceptibility report data were collected for 17 years (from 2002 to 2018), and in each year, reports for 1 month were used for the experiment. In total, 180,000 items were retrieved, with 27,544 having meaningful test results. 
Using a self-developed rule-based ETL algorithm, unstructured bacterial culture and antimicrobial susceptibility reports were converted into structured text data. After lexical preprocessing (sentence segmentation, tokenization, and stemming using regular expressions), the reports contained 320 types of bacterial identification words. Among the extracted bacterial identification words, 16 types of spelling errors and 914 misspelled words were found.<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\">\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"323\" height=\"985\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-21.png\" alt=\"\" class=\"wp-image-1657\" style=\"width:156px;height:476px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-21.png 323w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-21-98x300.png 98w\" sizes=\"auto, (max-width: 323px) 100vw, 323px\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<p><strong>Related Work<\/strong><\/p>\n\n\n\n<p>BioWordVec uses fastText to learn word embeddings from PubMed and the MIMIC-III clinical database. The full corpus was built from 28,714,373 PubMed documents and 2,083,180 MIMIC-III clinical database documents. The Medical Subject Headings (MeSH) term graph was used to generate heading sequences, and word embeddings were trained on sequences combining MeSH and PubMed. 
BioWordVec provided a 200-dimensional pretrained word embedding matrix.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"228\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-23-1024x228.png\" alt=\"\" class=\"wp-image-1659\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-23-1024x228.png 1024w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-23-300x67.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-23-768x171.png 768w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-23.png 1300w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Proposed Method<\/strong><\/p>\n\n\n\n<p>For detected typographical errors not mapped to Systematized Nomenclature of Medicine (SNOMED) clinical terms, a correction candidate group with high similarity considering the edit distance was generated using pretrained word embedding from the clinical database. From the embedding matrix in which the vocabulary is arranged in descending order according to frequency, a grid search was used to search for candidate groups of similar words. 
Thereafter, the correction candidate words were ranked in consideration of the frequency of the words, and the typographical errors were finally corrected according to the ranking.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"675\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-24-1024x675.png\" alt=\"\" class=\"wp-image-1660\" style=\"width:631px;height:416px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-24-1024x675.png 1024w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-24-300x198.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-24-768x507.png 768w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-24.png 1034w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><\/div>\n\n\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">AI consultation chatbot<\/h2>\n\n\n\n<p><strong>Objective<\/strong><\/p>\n\n\n\n<p>Development of a consultation AI chatbot for providing telemedicine counseling solutions<\/p>\n\n\n\n<p><strong>Data<\/strong><\/p>\n\n\n\n<p>EMR dataset was used for training the chatbot model.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"413\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-18-1024x413.png\" alt=\"\" class=\"wp-image-1652\" style=\"width:950px;height:383px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-18-1024x413.png 1024w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-18-300x121.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-18-768x310.png 768w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-18-1536x619.png 1536w, 
https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-18.png 1722w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Related Work<\/strong><\/p>\n\n\n\n<p>GPT-2 is a pretrained language model optimized for sentence generation, so that it predicts the next word in a given text well. GPT-2 is a transformer decoder language model that has been trained on more than 40GB of text to overcome insufficient Korean performance.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"471\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-19-1024x471.png\" alt=\"\" class=\"wp-image-1653\" style=\"width:950px;height:436px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-19-1024x471.png 1024w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-19-300x138.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-19-768x354.png 768w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-19-1536x707.png 1536w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-19.png 1616w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Proposed Method<\/strong><\/p>\n\n\n\n<p>Each sequence consists of a list of questions and answers followed by a diagnostic name. These sequences were used to fine-tune GPT-2 and to predict the diagnostic name. 
The model can therefore generate appropriate follow-up questions in response to the patient&#8217;s answers and finally predict the diagnostic name.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"765\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-20-1024x765.png\" alt=\"\" class=\"wp-image-1654\" style=\"width:788px;height:588px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-20-1024x765.png 1024w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-20-300x224.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-20-768x574.png 768w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-20.png 1115w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><\/div>\n\n\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Research on Artificial Intelligence Writing Technology Using Natural Language Processing<\/strong><\/h2>\n\n\n\n<p><strong>Objective<\/strong><\/p>\n\n\n\n<p>Development of a novel-writing platform driven by &#8220;user input&#8221; and trained on various novel genres.<\/p>\n\n\n\n<p><strong>Data<\/strong><\/p>\n\n\n\n<p>KoGPT2 is fine-tuned on novel text data. 
For Semantic Role Labeling (SRL), we use the ETRI Semantic Role Labeling Corpus to train the SRL model.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"349\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-15-1024x349.png\" alt=\"\" class=\"wp-image-1647\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-15-1024x349.png 1024w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-15-300x102.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-15-768x262.png 768w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-15-1536x524.png 1536w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-15.png 1686w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Related Work<\/strong><\/p>\n\n\n\n<p>KoGPT2 is a pretrained language model optimized for sentence generation, so that it predicts the next word in a given text well. 
KoGPT2 is a transformer decoder language model that has been trained on more than 40GB of text to overcome insufficient Korean performance.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"502\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-16-1024x502.png\" alt=\"\" class=\"wp-image-1648\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-16-1024x502.png 1024w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-16-300x147.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-16-768x376.png 768w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-16.png 1382w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Proposed Method<\/strong><\/p>\n\n\n\n<p>The architecture combines a Generate Layer for novel generation with an SRL Layer for reflecting user input. When a sentence is entered, KoGPT2 generates the following sentence, and the generated sentence is then corrected by the SRL Layer.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"402\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-17-1024x402.png\" alt=\"\" class=\"wp-image-1649\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-17-1024x402.png 1024w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-17-300x118.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-17-768x302.png 768w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-17.png 1440w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Comparison of Data2Text models using 
Sequence-to-Sequence<\/h2>\n\n\n\n<p><strong>Objective<\/strong><\/p>\n\n\n\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<p>1. It is difficult to compare the performance of different sequence-to-sequence models.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The data consists of a set of values and fields, and there are significant performance differences depending on how the model learns these structures.<\/li>\n\n\n\n<li>Therefore, we compare two representative models utilizing sequence-to-sequence under the same conditions to find a more effective methodology for learning the structure of the data.<\/li>\n<\/ul>\n<\/div><\/div>\n\n\n\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow\">\n<p>2. Words that are not in the vocabulary, such as proper nouns, cannot be generated.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Previous studies have proposed replacing the \u201cunk\u201d token using the attention distribution as a solution to the out-of-vocabulary (OOV) problem.<\/li>\n\n\n\n<li>However, this method applies only when the model outputs the \u201cunk\u201d token.<\/li>\n<\/ul>\n<\/div><\/div>\n<\/div><\/div>\n<\/div><\/div>\n\n\n\n<p><strong>Data<\/strong><\/p>\n\n\n\n<p>Wikibio dataset: Wikibio consists of biographies of people recorded on Wikipedia. The given table is the input and the first sentence of the description is the label. 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"399\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/\uc7ac\ud65c\uc6a9\uc0ac\uc9c4-1-1024x399.png\" alt=\"\" class=\"wp-image-1629\" style=\"width:950px;height:370px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/\uc7ac\ud65c\uc6a9\uc0ac\uc9c4-1-1024x399.png 1024w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/\uc7ac\ud65c\uc6a9\uc0ac\uc9c4-1-300x117.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/\uc7ac\ud65c\uc6a9\uc0ac\uc9c4-1-768x300.png 768w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/\uc7ac\ud65c\uc6a9\uc0ac\uc9c4-1-1536x599.png 1536w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/\uc7ac\ud65c\uc6a9\uc0ac\uc9c4-1-2048x799.png 2048w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/\uc7ac\ud65c\uc6a9\uc0ac\uc9c4-1-2000x780.png 2000w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/\uc7ac\ud65c\uc6a9\uc0ac\uc9c4-1-1800x702.png 1800w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Lebret, R., Grangier, D., &amp; Auli, M. (2016). Neural text generation from structured data with application to the biography domain. arXiv preprint arXiv:1603.07771.<\/figcaption><\/figure>\n\n\n\n<p><strong>Related Work<\/strong><\/p>\n\n\n\n<p>1. 
Table-to-text Generation by Structure-aware Seq2seq Learning<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"983\" height=\"459\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-11.png\" alt=\"\" class=\"wp-image-1638\" style=\"width:756px;height:352px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-11.png 983w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-11-300x140.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-11-768x359.png 768w\" sizes=\"auto, (max-width: 983px) 100vw, 983px\" \/><figcaption class=\"wp-element-caption\">Liu, T., Wang, K., Sha, L., Chang, B., &amp; Sui, Z. (2018, April). Table-to-text generation by structure-aware seq2seq learning. In&nbsp;Thirty-Second AAAI Conference on Artificial Intelligence.<\/figcaption><\/figure><\/div>\n\n\n<p>2. Order-Planning Neural Text Generation From Structured Data(Sha et al., 2018)<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"685\" height=\"676\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-12.png\" alt=\"\" class=\"wp-image-1640\" style=\"width:433px;height:428px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-12.png 685w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-12-300x296.png 300w\" sizes=\"auto, (max-width: 685px) 100vw, 685px\" \/><figcaption class=\"wp-element-caption\">Sha, L., Mou, L., Liu, T., Poupart, P., Li, S., Chang, B., &amp; Sui, Z. (2018, April). Order-planning neural text generation from structured data. 
In&nbsp;Thirty-Second AAAI Conference on Artificial Intelligence.<\/figcaption><\/figure><\/div>\n\n\n<p><strong>Proposed Method<\/strong><\/p>\n\n\n\n<p>We compare two representative models utilizing sequence-to-sequence under the same conditions to find a more effective methodology for learning the structure of the data. In addition, we add a copy mechanism that improves performance by allowing the model to output words that are not in the vocabulary, such as proper nouns.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-2 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"663\" height=\"495\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-13.png\" alt=\"\" class=\"wp-image-1643\" style=\"width:436px;height:325px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-13.png 663w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-13-300x224.png 300w\" sizes=\"auto, (max-width: 663px) 100vw, 663px\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"918\" height=\"516\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-14.png\" alt=\"\" class=\"wp-image-1644\" style=\"width:492px;height:277px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-14.png 918w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-14-300x169.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2022\/05\/image-14-768x432.png 768w\" sizes=\"auto, (max-width: 918px) 100vw, 918px\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n\n\n\n\n<h2 
class=\"wp-block-heading\">Entity-enhanced BERT for Medical Specialty Prediction Based on Clinical Questionnaire Data<\/h2>\n\n\n\n<p><strong>Objective<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Medical text data include large amounts of information regarding patients, which increases the sequence length. Hence, a few studies have attempted to extract entities from the text as concise features and provide domain-specific knowledge for clinical text classification.<\/li>\n\n\n\n<li>However, existing methods still do not inject entity information into the model effectively.<\/li>\n\n\n\n<li>We propose Entity-enhanced BERT (E-BERT), a medical specialty prediction model that adds two modules integrating entity information within BERT to process medical text.<\/li>\n<\/ul>\n\n\n\n<p><br><strong>Data<\/strong><\/p>\n\n\n\n<p>The dataset is clinical questionnaire data containing information such as the patient&#8217;s symptoms, location of pain, disease, and lifestyle habits in a question-and-answer format.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"871\" height=\"334\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_data-1.png\" alt=\"\" class=\"wp-image-2019\" style=\"width:500px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_data-1.png 871w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_data-1-300x115.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_data-1-768x295.png 768w\" sizes=\"auto, (max-width: 871px) 100vw, 871px\" \/><\/figure><\/div>\n\n\n<p><strong>Related Work<\/strong><\/p>\n\n\n\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-constrained 
wp-block-group-is-layout-constrained\">\n<p>1. Clinical text data categorization and feature extraction using medical-fissure algorithm and neg-seq algorithm<\/p>\n<\/div><\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li>One of the pipelines for disease prediction uses only entities extracted from medical records.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Such a pipeline cannot reflect relationships between the entities and the entire text. Furthermore, there is a risk of losing meaningful information that can be obtained from other sentences.<\/li>\n<\/ul>\n<\/div><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"862\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_related1-1024x862.png\" alt=\"\" class=\"wp-image-2016\" style=\"width:500px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_related1-1024x862.png 1024w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_related1-300x253.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_related1-768x646.png 768w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_related1.png 1182w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">[ NLP pipeline for Outcome Prediction ]<\/figcaption><\/figure><\/div>\n\n\n<div class=\"wp-block-group\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained\">\n<p>2. 
KG-MTT-BERT: Knowledge Graph Enhanced BERT for Multi-Type Medical Text Classification<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Extends BERT for multi-type text classification and incorporates an object-based medical knowledge graph.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Text and entities are handled by independent frameworks. 
Therefore, it cannot directly reflect the relationship between text and entities, and the model complexity is also high.<\/li>\n<\/ul>\n<\/div><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"707\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_related2-1024x707.png\" alt=\"\" class=\"wp-image-2017\" style=\"width:500px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_related2-1024x707.png 1024w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_related2-300x207.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_related2-768x530.png 768w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_related2.png 1308w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">[ KG-MTT-BERT ]<\/figcaption><\/figure><\/div>\n\n\n<p><strong>Proposed Method<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>In this study, we propose Entity-enhanced BERT (E-BERT), which utilizes the structural attributes of BERT for medical specialty prediction.<\/li>\n\n\n\n<li>E-BERT adds an entity embedding layer and an entity-aware attention mechanism, which inject domain-specific knowledge and focus on relationships between medical-related entities within a sequence.<\/li>\n\n\n\n<li>E-BERT effectively incorporates domain-specific knowledge and other information, enabling it to capture contextual information in the text.<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"561\" src=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_method-1-1024x561.png\" alt=\"\" class=\"wp-image-2018\" style=\"width:900px\" srcset=\"https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_method-1-1024x561.png 1024w, 
https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_method-1-300x164.png 300w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_method-1-768x421.png 768w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_method-1-1536x841.png 1536w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_method-1-2048x1122.png 2048w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_method-1-2000x1095.png 2000w, https:\/\/aida.korea.ac.kr\/wp-content\/uploads\/2023\/12\/E-bert_method-1-1800x986.png 1800w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Deep Learning &#8211; Natural Language Processing Predicting the Difficulty of Korean CSAT Reading Comprehension Questions Using LLM-based Qualitative Features Objective The Korean College Scholastic Ability Test (KCSAT) is a high-stakes national standardized examination that significantly influences university admissions. In the reading section, item difficulty directly affects score distribution, fairness, and discrimination power. 
However, current difficulty &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/aida.korea.ac.kr\/?page_id=762\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-762","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=\/wp\/v2\/pages\/762","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=762"}],"version-history":[{"count":89,"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=\/wp\/v2\/pages\/762\/revisions"}],"predecessor-version":[{"id":2553,"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=\/wp\/v2\/pages\/762\/revisions\/2553"}],"wp:attachment":[{"href":"https:\/\/aida.korea.ac.kr\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=762"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}