Date of Award

9-1-2025

Degree Type

Thesis

Degree Name

Doctor of Philosophy (PhD)

Discipline

Data Science

First Advisor

Prof. CHIU Hon Wing Billy

Second Advisor

Prof. WONG Man Leung

Abstract

As scientific publications increasingly incorporate multimodal content, ranging from textual descriptions to figures, tables, presentation videos, and audio, there is a growing need for summarization systems that can effectively process and integrate information across these diverse modalities.

This work presents a comprehensive exploration of Scientific Multimodal Summarization, introducing a series of novel architectures and datasets aimed at advancing this emerging field.

(1) We begin by introducing CMT-Sum, which integrates multimodal scientific source content (primarily paper text and figures) to generate high-quality textual summaries and to identify representative graphical abstracts. We refer to this task as Scientific Multimodal Summarization with Multimodal Output (SMSMO). To validate the proposed model, we construct two datasets and introduce a new benchmark that combines automated and human evaluation to assess SMSMO performance.

(2) We then extend our approach to richer multimodal input (paper text, figures, video frames, and audio tracks) and propose Hier-SciSum, a hierarchical architecture that fuses the multimedia sources with a two-layer strategy: the first layer performs pairwise fusion between text and each other modality (text-video, text-audio, text-figure), while the second layer integrates these enriched representations through cross-modal fusion. This hierarchical design enables a deeper and more nuanced understanding of multimodal relationships.

(3) To address the high computational cost of modeling four scientific modalities (text, figures, video, and audio), we introduce Uni-SciSum, a lightweight yet effective transformer-based framework. This model employs a Query-Transformer-based BridgeNet as a modality-aware intermediary between modality-specific encoders and a large language model (LLM) decoder. By using cross-modal summary contrastive learning during pretraining and prompt-based learning during fine-tuning, the model efficiently learns to align and summarize across modalities. We release two benchmark datasets that include textual, visual, and auditory features along with the corresponding summaries.

(4) To better capture the structured nature of long-form scientific content, particularly multimedia documents that follow the IMRaD structure (Introduction, Methods, Results, and Discussion), we propose LENS, a "localize-then-summarize" structure-aware summarizer. This model includes a Video Facet Localizer (VFL) that identifies presentation-video segments corresponding to specific paper sections and a Memory-enhanced Multifacet Summarizer (MMS) that generates structure-aware summaries. By employing cross-modal memory, LENS effectively captures faceted information and retrieves salient details across sections, thereby generating a comprehensive summary that preserves fine-grained, section-level detail.

Together, this work offers a robust foundation for the future exploration of scientific multimodal summarization systems that aim to produce comprehensive, visually enriched, and structured summaries to enhance the efficiency of scientific communication.
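To make the two-layer fusion strategy of Hier-SciSum (point (2) above) more concrete, the following is a minimal PyTorch-style sketch. It assumes each modality has already been encoded into fixed-size feature sequences; the module names, the residual cross-attention design, and the dimensions are illustrative assumptions rather than the thesis's actual implementation.

```python
# Minimal sketch of a two-layer hierarchical fusion, assuming pre-extracted
# modality features; names and dimensions are illustrative, not from the thesis.
import torch
import torch.nn as nn

class PairwiseFusion(nn.Module):
    """Layer 1: enrich text features with one other modality via cross-attention."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # text queries attend over the other modality's features
        fused, _ = self.attn(query=text, key=other, value=other)
        return self.norm(text + fused)  # residual connection

class HierarchicalFusion(nn.Module):
    """Layer 2: integrate the text-figure, text-video, and text-audio streams."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.text_figure = PairwiseFusion(dim, heads)
        self.text_video = PairwiseFusion(dim, heads)
        self.text_audio = PairwiseFusion(dim, heads)
        self.cross_modal = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, text, figure, video, audio):
        streams = [
            self.text_figure(text, figure),
            self.text_video(text, video),
            self.text_audio(text, audio),
        ]
        # concatenate the enriched streams along the sequence axis and
        # let self-attention mix information across them
        return self.cross_modal(torch.cat(streams, dim=1))

# toy usage: batch of 2, 16 tokens/frames per modality, hidden size 256
text, fig, vid, aud = (torch.randn(2, 16, 256) for _ in range(4))
fused = HierarchicalFusion(256)(text, fig, vid, aud)
print(fused.shape)  # torch.Size([2, 48, 256])
```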
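Similarly, the Query-Transformer-based BridgeNet of Uni-SciSum (point (3)) can be pictured as a small set of learnable query tokens that cross-attend to frozen modality features and are projected into the LLM's embedding space, trained with a summary-contrastive objective. The sketch below is an approximation under those assumptions; the query count, the 4096-dimensional LLM space, and the InfoNCE-style loss are illustrative choices, not taken from the thesis.

```python
# Minimal sketch of a Query-Transformer-style bridge between frozen modality
# encoders and an LLM decoder, with a summary-contrastive objective; all names
# and sizes (n_queries, the 4096-dim LLM space, temperature) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BridgeNet(nn.Module):
    def __init__(self, dim: int = 256, n_queries: int = 32, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj_to_llm = nn.Linear(dim, 4096)  # map into an assumed LLM embedding space

    def forward(self, modality_feats: torch.Tensor) -> torch.Tensor:
        q = self.queries.expand(modality_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, modality_feats, modality_feats)
        return out  # (batch, n_queries, dim)

def summary_contrastive_loss(query_out, summary_emb, temperature: float = 0.07):
    """InfoNCE-style loss pulling each sample's pooled query representation
    toward the embedding of its own summary and away from other summaries."""
    q = F.normalize(query_out.mean(dim=1), dim=-1)  # (batch, dim)
    s = F.normalize(summary_emb, dim=-1)            # (batch, dim)
    logits = q @ s.t() / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# toy usage: 4 samples, 50 frames of video features, hidden size 256
bridge = BridgeNet()
video_feats = torch.randn(4, 50, 256)
queries = bridge(video_feats)
loss = summary_contrastive_loss(queries, torch.randn(4, 256))
llm_prefix = bridge.proj_to_llm(queries)  # fed to the LLM decoder as soft prompts
```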
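Finally, the "localize-then-summarize" pipeline of LENS (point (4)) can be sketched as a facet localizer that assigns presentation-video segments to IMRaD sections, followed by per-facet summarization. The classifier design and the summarize() stub below are hypothetical placeholders for the thesis's VFL and memory-enhanced MMS components, not their actual architectures.

```python
# Minimal sketch of a "localize-then-summarize" pipeline: a facet localizer
# assigns video segments to IMRaD sections, then per-facet content is summarized.
# The classifier design and the summarize() stub are illustrative assumptions.
import torch
import torch.nn as nn

FACETS = ["Introduction", "Methods", "Results", "Discussion"]

class VideoFacetLocalizer(nn.Module):
    """Scores each video-segment feature against the four IMRaD facets."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.classifier = nn.Linear(dim, len(FACETS))

    def forward(self, segment_feats: torch.Tensor) -> torch.Tensor:
        # segment_feats: (batch, n_segments, dim) -> facet logits per segment
        return self.classifier(self.encoder(segment_feats))

def localize_then_summarize(segment_feats, segment_texts, summarize):
    """Group segments by predicted facet, then summarize each group."""
    logits = VideoFacetLocalizer()(segment_feats)   # untrained model, sketch only
    facet_ids = logits.argmax(dim=-1)[0]            # single-document example
    summary = {}
    for i, facet in enumerate(FACETS):
        picked = [t for t, f in zip(segment_texts, facet_ids.tolist()) if f == i]
        summary[facet] = summarize(picked) if picked else ""
    return summary

# toy usage with a trivial extractive "summarizer"
feats = torch.randn(1, 6, 256)
texts = [f"segment {k} transcript" for k in range(6)]
print(localize_then_summarize(feats, texts, summarize=lambda ts: " ".join(ts)[:80]))
```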

Language

English

Recommended Citation

Tan, Z. (2025). Scientific multimodal summarization: Integrating knowledge across textual, visual and auditory content (Doctoral thesis, Lingnan University, Hong Kong). Retrieved from https://commons.ln.edu.hk/otd/262/
