Deep Learning – Multimodal


OHiFormer: Object-Wise Hierarchical Dependency-Based Transformer for Screen Summarization

Objective

Screen summarization aims to generate concise textual descriptions that communicate the crucial contents and functionalities of a mobile user interface (UI) screen. Previous works have focused on encoding the absolute position of objects in the view hierarchy to extract the semantic representation of the UI screen. However, the importance of the hierarchical dependency between objects in the UI structure was overlooked. In this study, we propose an object-wise hierarchical dependency-based Transformer, which involves a modified self-attention mechanism using structural relative position encoding to represent the hierarchically connected UI.

Data

We used the Screen Summarization dataset [1] to evaluate the performance of our proposed model. The dataset contains 22,417 screens from 6K mobile apps with five human-annotated summary labels for each screen.

[1] B. Wang et al., ‘‘Screen2Words: Automatic mobile UI summarization with multimodal learning’’, 2021

Related Work

A user interface (UI) is a hierarchical collection of objects with various properties, such as text, class, and bounding box. Several approaches have been proposed to extract the latent representations of UIs.

Screen2Vec [2] presents a self-supervised method for UI embedding that adopts preorder traversal to represent object positions in the hierarchical structure and encodes them using a recurrent neural network.

Screen2Words [1] first formulated the screen summarization task of generating concise language based on a holistic understanding of screens.
The Versatile UI Transformer (VUT) [3] projects the structural absolute position used in Screen2Words onto a Fourier vector embedding space.

[2] T. J.-J. Li et al., "Screen2Vec: Semantic embedding of GUI screens and GUI components," 2021.
[3] Y. Li et al., "VUT: Versatile UI transformer for multi-modal multi-task user interface modeling," 2021.

Proposed Method

Given a UI screen with its view hierarchy, we obtain four types of properties from UI objects: images, texts, resource-ids, and bounding boxes. As shown in the figure, we embed each UI property according to its modality, apply linear projections, and fuse them into object-wise single vectors.
Next, we pass the embedded UI sequence through a Transformer network in which object-wise hierarchical dependency-based attention (OHi-attention) replaces the conventional self-attention layers of the encoder. Inspired by DocFormer [4], OHi-attention injects an inductive bias into the conventional query–key scoring so that the attention weights more heavily on hierarchically nearby UI objects.
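The object-wise fusion step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-modality embedding sizes, the shared model dimension, and the choice of sum-fusion after projection are all assumptions made here for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the actual sizes are not taken from the paper.
D_IMG, D_TXT, D_ID, D_BOX = 64, 32, 32, 4   # per-modality embedding sizes
D_MODEL = 128                               # shared model dimension
N_OBJ = 5                                   # number of UI objects on a screen

# One linear projection per modality maps each property embedding
# into the shared model space.
W = {m: rng.standard_normal((d, D_MODEL)) * 0.02
     for m, d in [("img", D_IMG), ("txt", D_TXT), ("rid", D_ID), ("box", D_BOX)]}

def fuse_object(img, txt, rid, box):
    """Project each modality and sum into one object-wise vector."""
    return img @ W["img"] + txt @ W["txt"] + rid @ W["rid"] + box @ W["box"]

# Toy per-object property embeddings (stand-ins for the outputs of
# image/text/resource-id encoders and the raw bounding-box coordinates).
objects = [(rng.standard_normal(D_IMG), rng.standard_normal(D_TXT),
            rng.standard_normal(D_ID), rng.standard_normal(D_BOX))
           for _ in range(N_OBJ)]

ui_sequence = np.stack([fuse_object(*o) for o in objects])  # (N_OBJ, D_MODEL)
```

The resulting `ui_sequence` plays the role of the token sequence fed to the Transformer encoder, with one fused vector per UI object.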
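The idea behind OHi-attention can be sketched as self-attention with an additive bias derived from tree distances in the view hierarchy. This is a hedged sketch: the exact form of the bias in the paper is not reproduced here, and the simple linear decay (`alpha * distance`) is an assumption chosen to illustrate the inductive bias toward hierarchically nearby objects.

```python
import numpy as np

def tree_distance(parents):
    """Pairwise path lengths between nodes of a view hierarchy,
    given a parent index per node (-1 for the root)."""
    n = len(parents)

    def ancestors(i):
        path = [i]
        while parents[path[-1]] != -1:
            path.append(parents[path[-1]])
        return path

    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ai, aj = ancestors(i), ancestors(j)
            common = set(ai) & set(aj)
            # steps up to the lowest common ancestor from each side
            dist[i, j] = min(ai.index(c) + aj.index(c) for c in common)
    return dist

def ohi_attention(x, parents, w_q, w_k, w_v, alpha=1.0):
    """Self-attention whose scores are penalized by hierarchical
    (tree) distance, so nearby objects receive more attention."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1]) - alpha * tree_distance(parents)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy hierarchy: root (0) with children 1 and 2; node 3 is a child of 1.
parents = [-1, 0, 0, 1]
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = ohi_attention(x, parents, wq, wk, wv)
```

Setting `alpha = 0` recovers conventional query–key scoring, which makes the injected structural bias easy to ablate.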

[4] S. Appalaraju et al., “DocFormer: End-to-end transformer for document understanding”, 2021.
[5] Y. J. Han et al., “Object-Wise Hierarchical Dependency-Based Transformer for Screen Summarization”, 2024.

In evaluation on the Screen Summarization dataset, OHiFormer outperformed state-of-the-art methods that adopt other structural position encoding schemes.
Given that OHi-attention surpasses naive query–key scoring on all metrics, injecting UI structural position information helps the model understand the complex relationships and meanings of the UI. Notably, the especially high CIDEr score, a metric that accounts for word frequency, demonstrates that the proposed method can generate diverse summaries that reflect intricate UI structures.


Deep Learning-based Medical Tabular Learning for Automatic ICD Code Prediction

Objective

International Classification of Diseases (ICD) code prediction is the task of automatically assigning ICD codes, i.e., classifying medical diagnoses based on text and tabular data. Previous methods underutilize tabular data in ICD coding. We propose a novel Medical Tabular Network (MedTabNet) that integrates tabular and text data using a deep learning-based approach, with a key focus on embedding tabular data and incorporating a gate mechanism to filter out uninformative features.

Data

We use the MIMIC-III dataset [1] containing ICU patient records with text and structured tabular data.

[1] A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, R. G. Mark, "MIMIC-III, a freely accessible critical care database," Scientific Data 3 (1) (2016) 1–9.

Related Work

TreeMAN [2] combines a decision tree with an LSTM to handle both tabular and text data.

[2] Liu, Zichen, et al. “Treeman: Tree-enhanced multimodal attention network for ICD coding.” arXiv preprint arXiv:2305.18576 (2023).

Proposed Method

Medical Tabular Network (MedTabNet) is a deep learning-based model that integrates tabular and text data for automatic ICD coding. MedTabNet includes two key components: an embedding layer to preserve the unique meaning of tabular features and a gate mechanism to filter out uninformative items. The overall structure extracts tabular and text information, combines them using cross-attention, and performs prediction through label attention.
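The gate mechanism can be sketched as a learned sigmoid gate that rescales each tabular item embedding before fusion with the text features. This is a minimal sketch under assumptions of our own: the gate's inputs (the item embedding concatenated with its frequency) and its single-scalar-per-item form are illustrative choices, not the paper's exact design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_tabular_embedding(item_emb, item_freq, w_gate, b_gate):
    """Scale each tabular item embedding by a learned sigmoid gate.

    Items whose gate approaches zero (e.g. uninformative or infrequent
    ones) are effectively filtered out before cross-attention with the
    text representation."""
    # Gate conditioned on the item embedding and its corpus frequency.
    gate_in = np.concatenate([item_emb, item_freq[:, None]], axis=1)
    g = sigmoid(gate_in @ w_gate + b_gate)   # (n_items, 1), values in (0, 1)
    return g * item_emb                      # broadcast over the embedding dim

# Toy setup: 6 tabular items (labs, medications, ...) with 16-dim embeddings.
rng = np.random.default_rng(0)
emb = rng.standard_normal((6, 16))
freq = rng.random(6)                         # normalized item frequencies
w_gate = rng.standard_normal((17, 1)) * 0.1
gated = gated_tabular_embedding(emb, freq, w_gate, 0.0)
```

Because the gate lies in (0, 1), each item's embedding is attenuated rather than hard-masked, so the filtering remains differentiable and trainable end to end.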

[3] under review

MedTabNet outperforms text-only models for ICD code prediction, improving F1 macro by 1% and F1 micro by 0.9%. It also surpasses TreeMAN, a tabular data-based model, with increases of 0.8 in F1 macro and 0.5 in F1 micro. These results demonstrate the effectiveness of deep learning for tabular medical data.

The infrequent item gate improved F1 macro by 0.5 and F1 micro by 0.2 by prioritizing frequent items and reducing overfitting. However, using a transformer for tabular representations led to performance degradation across all metrics, suggesting that standard transformers may not be well-suited for tabular data. Additionally, the medication feature had the most significant impact on performance, while excluding charts and personal information had minimal effect.


HiLINK: Hierarchical Linking of Context-Aware Knowledge Prediction and Prompt Tuning for Bilingual Knowledge-Based Visual Question Answering

Objective

Knowledge-based visual question answering (KBVQA) is a representative visual reasoning task that leverages external knowledge for question answering in situations where predicting the correct answer using only image and query data is difficult. In addition to KBVQA, various visual reasoning tasks have been actively studied for their potential to improve visual understanding by combining text and image modalities effectively. However, these tasks have primarily focused on high-resource languages, such as English. In contrast, studies on low-resource languages remain comparatively rare. To mitigate this research gap, we propose HiLINK, which utilizes multilingual data to enhance KBVQA performance in various languages.

Data

We use the BOK-VQA dataset [1], a bilingual (English–Korean) outside-knowledge-based visual question answering dataset.

[1] Kim, MinJun, et al. “Bok-vqa: Bilingual outside knowledge-based visual question answering via graph representation pretraining.“, AAAI 2024.

Related Work

The two-stage structure of GEL-VQA [1] incurs high retraining costs when new triplet knowledge is added, and it carries a high risk of information loss and noise when fusing the knowledge graph embedding (KGE) and image–text modality spaces. As a result, it is impractical for real-time, dynamic environments.

[1] Kim, MinJun, et al. “Bok-vqa: Bilingual outside knowledge-based visual question answering via graph representation pretraining.“, AAAI 2024

Proposed Method

Link-Tuning efficiently learns triplet relationships within a text encoder's shared representation space. Unlike traditional knowledge graph embedding (KGE), it integrates new knowledge without retraining or alignment with the backbone model. By leveraging structured prompts, it enables autonomous relationship learning. Previous studies predicted triplet elements independently from encoder-extracted vectors and therefore lacked interactions among them. To address this, HK-TriNet and HK-TriNet+ extend FiLM into a hierarchical knowledge transfer module, enabling effective triplet relationship learning through contextual information propagation.
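The hierarchical knowledge transfer idea can be sketched with FiLM-style modulation: each triplet element's features are affinely rescaled by the context of the previously predicted element, propagating information subject → relation → object. This is a hedged sketch, not the paper's implementation; the dimensions, the linear form of the FiLM generators, and the function names `hk_trinet_step` and `film` are illustrative assumptions.

```python
import numpy as np

def film(h, context, w_gamma, w_beta):
    """FiLM modulation: feature-wise affine transform of h, with the
    scale (gamma) and shift (beta) predicted from the conditioning context."""
    gamma = context @ w_gamma
    beta = context @ w_beta
    return gamma * h + beta

def hk_trinet_step(subj, rel, obj, params):
    """Hierarchical context propagation over a knowledge triplet:
    the relation features are modulated by the subject, and the object
    features by the context-aware relation, so each later prediction
    can condition on the earlier ones instead of being made independently."""
    rel_ctx = film(rel, subj, *params["rel"])
    obj_ctx = film(obj, rel_ctx, *params["obj"])
    return rel_ctx, obj_ctx

# Toy encoder-extracted feature vectors for (subject, relation, object).
rng = np.random.default_rng(0)
d = 8
subj, rel, obj = (rng.standard_normal(d) for _ in range(3))
params = {k: (rng.standard_normal((d, d)) * 0.1,
              rng.standard_normal((d, d)) * 0.1)
          for k in ("rel", "obj")}
rel_ctx, obj_ctx = hk_trinet_step(subj, rel, obj, params)
```

In contrast, predicting `rel` and `obj` directly from the encoder features, as in the earlier independent-prediction setup, would simply skip both `film` calls.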

[2] Jeong, HyeonKi, et al. “HiLINK: Hierarchical Linking of Context-Aware Knowledge Prediction and Prompt Tuning for Bilingual Knowledge-Based Visual Question Answering“, Knowledge-Based Systems 2025 (submitted)

The HiLINK model outperformed the SOTA model, GEL-VQA, achieving 19.40% higher accuracy in a bilingual setting, 12.01% in English, and 11.30% in Korean. Additionally, some VLM-based baseline models proposed in this study also surpassed GEL-VQA. Moreover, HiLINK reduced error margins, demonstrating stable generalization performance.

We validated the performance improvements and analyzed the effectiveness of the embedding learning. The Freeze-Freeze method splits the representation space, whereas the Frozen-Training method yields a more uniform distribution within a single space.
