Hepato-LLaVA

Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

Yuxuan Yang*1, Zhonghao Yan*†1, Yi Zhang*2, Bo Yun1, Muxi Diao1, Guowei Zhao2, Kongming Liang‡1, Wenbin Li‡2, Zhanyu Ma1
1Department of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China   2Department of Pathology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China  
* Equal contribution   † Project Lead   ‡ Corresponding author

🌟 Introduction

Diagnosis of Hepatocellular Carcinoma (HCC) relies on histopathological examination of Whole Slide Images (WSIs) as the gold standard. However, manual analysis of these gigapixel, highly heterogeneous WSIs is labor-intensive and prone to inter-observer variability. This has catalyzed the development of WSI-based Multi-modal Large Language Models (MLLMs) that enable Visual Question Answering (VQA) directly on slides.

A key challenge in pathology MLLMs is gigapixel WSI representation. Existing methods either use thumbnail-based approaches that lose critical high-resolution diagnostic details, or employ slide-encoder approaches that generate excessively redundant tokens.

We propose Hepato-LLaVA, a specialized MLLM for fine-grained hepatocellular pathology analysis. It features a novel Hierarchical Sparse Visual Attention (HSVA) mechanism that models 2D tissue topology to aggregate diagnostic evidence while preserving context. To address multiscale data scarcity, we also present HepatoPathoVQA, comprising 33K hierarchically structured QA pairs validated by pathologists. Hepato-LLaVA achieves state-of-the-art diagnostic accuracy, outperforming existing pathology MLLMs by an absolute 20%.

HepatoPathoVQA Dataset

Three-Stage Construction Pipeline

We collected 200 WSIs containing HCC and constructed HepatoPathoVQA, a multi-scale dataset featuring 33K QA pairs for morphological analysis and diagnosis. By following pathologists' diagnostic workflows, we developed a generation pipeline using Gemini-3-flash that simulates the transition from macroscopic to microscopic clinical reasoning.

The construction pipeline consists of three stages: (1) Hierarchical Sampling using a Minimum Spanning Tree (MST) to identify ROIs, (2) Hierarchical Clinical Inference with Gemini-3-flash integrating macroscopic contexts into microscopic analysis, and (3) QA Generation for instruction tuning.
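The MST-based sampling in stage (1) can be sketched with SciPy: treat patch centroids as graph nodes, build a minimum spanning tree over their pairwise distances, and cut the heaviest tree edges to obtain spatially coherent clusters. This is a minimal illustration under our own simplifying assumptions, not the paper's exact algorithm (in particular, the triangular seed-point selection step is omitted):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(coords, n_clusters):
    """Cluster tissue-patch centroids by cutting the heaviest MST edges.

    coords: (N, 2) array of patch centroid coordinates on the slide.
    Returns an (N,) array of integer cluster labels.
    """
    dist = squareform(pdist(coords))               # dense pairwise distances
    mst = minimum_spanning_tree(dist).toarray()    # N-1 tree edges
    # Removing the (n_clusters - 1) heaviest edges splits the tree
    # into n_clusters connected components.
    edges = np.argwhere(mst > 0)                   # row-major edge list
    weights = mst[mst > 0]                         # same row-major order
    for idx in np.argsort(weights)[::-1][: n_clusters - 1]:
        i, j = edges[idx]
        mst[i, j] = 0
    _, labels = connected_components(mst, directed=False)
    return labels
```

Each resulting cluster would then serve as a candidate ROI from which higher-magnification patches are sampled.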

The dataset spans three scales (WSI, ROI at 2×, and Patch at 10× and 20×), with all QA pairs validated by expert pathologists. To our knowledge, HepatoPathoVQA is the first multi-scale hepatocellular pathology VQA dataset at this scale (33K pairs), bridging instruction data and real-world clinical practice.

Overview of the HepatoPathoVQA construction pipeline: (1) Extracts ROIs and Patches from WSIs using MST-based clustering and triangular seed-point selection. (2) Employs Gemini-3-flash for hierarchical inference by integrating macroscopic descriptions as context for subsequent microscopic analysis. (3) Generates multi-scale QA pairs and captions for instruction tuning and alignment.

Dataset Statistics

HepatoPathoVQA covers morphological assessment to final diagnosis, supporting fine-grained, multi-scale HCC pathology analysis.

| Property | Value |
|---|---|
| Total WSIs | 200 |
| Total QA Pairs | 33K |
| Scales | WSI, ROI (2×), Patch (10×, 20×) |
| Task Types | VQA (Single-choice, Multi-choice, Open-ended), Captioning |

Hepato-LLaVA Framework

Hepato-LLaVA employs a modular architecture with three components: a frozen Patch Encoder, a novel Hierarchical Sparse Visual Attention (HSVA) slide encoder, and a Q-Former Connector compressing features into 32 learnable LLM queries.
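The Q-Former compression described above can be illustrated with single-head cross-attention in NumPy, where 32 learnable query vectors pool a variable-length patch-feature sequence into a fixed 32-token summary for the LLM. The feature dimension, the single attention head, and the absence of learned key/value projections are simplifying assumptions of this sketch:

```python
import numpy as np

def qformer_compress(features, queries):
    """Cross-attention pooling: a fixed set of learnable queries attends
    over an arbitrary-length feature sequence.

    features: (N, d) slide-level visual features (N varies per slide).
    queries:  (32, d) learnable query embeddings.
    Returns a fixed (32, d) compressed representation.
    """
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)     # (32, N) attention logits
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over N patches
    return attn @ features                         # (32, d) summary tokens
```

Because the output shape is independent of N, the LLM always receives the same 32-token visual prefix regardless of slide size.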

The core HSVA mechanism models 2D tissue topology to explicitly aggregate local diagnostic evidence into semantic summary tokens while preserving global context. Unlike conventional Multiple Instance Learning (MIL), which collapses all patches into a single token, HSVA retains spatially coherent, multi-granular representations that mimic pathologists' local-to-global diagnostic workflow.

The model undergoes a three-stage training: (1) connector pre-training on HepatoPathCaption, (2) full-model fine-tuning on HepatoPathoVQA, and (3) alignment for robust multi-scale diagnostic performance.

Overview of the Hepato-LLaVA framework: (Upper) Incorporates Sparse Topo-Pack Attention into the model architecture. (Lower) Implements the three-stage training pipeline: connector pre-training, full-model fine-tuning, and alignment. The sparse attention mask defines three topological interactions: (1) Global Sink for macro-context broadcasting, (2) Intra-Pack for local dense interactions, and (3) Inter-Pack for summary-level connections across packs.
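The three topological interactions of the sparse attention mask can be sketched as a boolean matrix (True = attention allowed). The token layout used here, a single sink token followed by equal-sized packs with each pack's first token serving as its summary, is an illustrative assumption rather than the paper's exact implementation:

```python
import numpy as np

def topo_sparse_mask(n_packs, pack_size):
    """Build a boolean sparse attention mask with three interaction types.

    Token layout (assumed): [sink] + n_packs blocks of pack_size tokens,
    where the first token of each pack acts as its summary token.
    """
    n = 1 + n_packs * pack_size
    mask = np.zeros((n, n), dtype=bool)
    # (1) Global Sink: token 0 broadcasts macro-context to and from everyone.
    mask[0, :] = True
    mask[:, 0] = True
    # (2) Intra-Pack: dense attention among tokens inside the same pack.
    for p in range(n_packs):
        s = 1 + p * pack_size
        mask[s:s + pack_size, s:s + pack_size] = True
    # (3) Inter-Pack: summary tokens attend one another across packs.
    summaries = 1 + np.arange(n_packs) * pack_size
    mask[np.ix_(summaries, summaries)] = True
    return mask
```

Non-summary tokens in different packs never attend to each other directly, so the attention cost scales with pack size and pack count rather than with the full token count squared.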

📊 Experiment Results

🏆 Main Results on HepatoPathoBench

📈 Evaluation on HepatoPathoBench against general and pathology-specific MLLMs. Morph. = Morphological Analysis; Diag. = Diagnosis (Open: WSI-P, METEOR; Close: Single, Multi). Single/Multi: single-/multiple-choice accuracy. WSI-P: patch-level BLEU on WSI captioning. WSI, ROI, Patch: multi-scale accuracy. Bold: best.

| Model | Input | Morph. WSI-P↑ | Morph. METEOR↑ | Morph. Single↑ | Morph. Multi↑ | Diag. WSI-P↑ | Diag. METEOR↑ | Diag. Single↑ | Diag. Multi↑ | WSI↑ | ROI↑ | Patch↑ | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lingshu | Thumbnail | 0.53 | 0.17 | 0.38 | 0.44 | 0.73 | 0.18 | 0.39 | 0.38 | 0.52 | 0.52 | 0.49 | 0.50 |
| Huatuo-GPT | Thumbnail | 0.74 | 0.24 | 0.81 | 0.45 | 0.70 | 0.23 | 0.59 | 0.32 | 0.60 | 0.65 | 0.65 | 0.65 |
| Quilt-LLaVA | Thumbnail | 0.64 | 0.22 | 0.47 | 0.32 | 0.56 | 0.15 | 0.57 | 0.37 | 0.57 | 0.60 | 0.55 | 0.57 |
| Patho-R1 | Thumbnail | 0.66 | 0.19 | 0.87 | 0.50 | 0.20 | 0.05 | 0.59 | 0.45 | 0.55 | 0.55 | 0.54 | 0.55 |
| SlideChat | WSI | 0.70 | 0.17 | 0.87 | 0.47 | 0.72 | 0.14 | 0.63 | 0.39 | 0.66 | 0.68 | 0.66 | 0.66 |
| WSI-LLaVA | WSI | 0.69 | 0.20 | 0.84 | 0.46 | 0.67 | 0.16 | 0.65 | 0.36 | 0.65 | 0.67 | 0.64 | 0.65 |
| Hepato-LLaVA 🏆 | WSI | **0.79** | **0.33** | **0.97** | **0.88** | **0.75** | **0.33** | **0.87** | **0.68** | **0.82** | **0.83** | **0.83** | **0.83** |

🔡 Case Study

🖼️ WSI Level

Whole Slide Image-level open-ended diagnosis in HepatoPathoVQA Dataset.

WSI Level Case Study 1
WSI Level Case Study 2

🔍 ROI Level

Region-of-Interest level analysis examples in HepatoPathoVQA Dataset.

ROI Level Case Study 1
ROI Level Case Study 2

🧩 Patch Level

Patch-level morphological analysis examples in HepatoPathoVQA Dataset.

Patch Level Case Study 1
Patch Level Case Study 2

📚 BibTeX Citation


    @article{hepatollava2026,
      title={Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images},
      author={Yang, Yuxuan and Yan, Zhonghao and Zhang, Yi and Yun, Bo and Diao, Muxi and Zhao, Guowei and Liang, Kongming and Li, Wenbin and Ma, Zhanyu},
      year={2026}
    }