MedReasoner

Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision

1Beijing University of Posts and Telecommunications, 2Zhongguancun Academy, 3Beijing Information Science and Technology University
*Equal contribution
†Corresponding author

🌟 Introduction

In real clinical workflows, doctors rarely provide explicit prompts like “segment the left kidney.” Instead, they raise implicit queries such as “What can be inferred from this shadow?” Existing MLLMs, though capable of vision-language interaction, still produce image-level outputs and rely heavily on handcrafted spatial prompts for grounding—inputs that are rarely available in practice.

Current datasets reflect this disconnect: VQA datasets lack spatial supervision, while segmentation datasets lack language. No existing dataset aligns implicit clinical queries with chain-of-thought reasoning and pixel-level localization, making it impossible to evaluate whether a model can truly reason and ground under realistic conditions.

To address the limitations of existing medical grounding systems, we define the Unified Medical Reasoning Grounding (UMRG) task, which challenges models to interpret implicit clinical queries, reason over visual and anatomical cues, and produce accurate pixel-level grounding—mirroring how clinicians observe, reflect, and pinpoint regions of interest in medical images. We tackle this task with a two-fold approach: (1) we construct U-MRG-14K, a dataset that pairs implicit queries with interpretable reasoning traces and pixel-level masks; and (2) we introduce MedReasoner, a reinforcement-learning framework that decouples reasoning from segmentation and grounds vague clinical language without relying on handcrafted spatial prompts.

We envision MedReasoner as a step toward trustworthy and generalizable medical grounding systems, enabling future clinical applications that demand both interpretability and spatial precision.

Comparison of an annotated question and an implicit clinical question. The ground-truth bounding box is shown in green and each model's predicted box in red. MedReasoner precisely identifies the target through its reasoning trace and achieves accurate grounding.

U-MRG-14K Dataset

Three-Stage Construction Pipeline

To support reasoning-based grounding under implicit clinical queries, we construct U-MRG-14K through a structured three-stage pipeline. This pipeline combines standardized medical data with GPT-4o–generated question–answer pairs and chain-of-thought reasoning traces, ensuring both semantic richness and spatial accuracy.

Our construction process emphasizes realism and interpretability: we simulate implicit clinical queries using GPT-4o, align them with precise pixel-level masks, and enrich each sample with structured reasoning traces. This design enables both language understanding and spatial evaluation in a unified setting. To our knowledge, U-MRG-14K is the first dataset to bridge implicit medical questions, chain-of-thought reasoning, and pixel-level grounding at scale.
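
For illustration, a single U-MRG-14K-style sample could be represented as the record below. This is a hypothetical sketch: the field names and values are placeholders rather than the released schema, and it only shows how an implicit query, a reasoning trace, and a pixel-level mask are tied together in one sample.

```python
# Hypothetical U-MRG-14K-style record (field names and values are illustrative,
# not the released schema).
sample = {
    "image": "ct_abdomen_0001.png",            # hypothetical image file
    "super_category": "Abdomen",               # coarse super-category
    "category": "left kidney",                 # fine-grained target category
    "question": "What structure could explain the low-density region "
                "lateral to the spine on the left?",   # implicit clinical query
    "reasoning": "The region lies lateral to the vertebral body and shows a "
                 "bean-shaped outline, consistent with the left kidney.",  # CoT trace
    "answer": "left kidney",
    "mask": "ct_abdomen_0001_mask.png",        # pixel-level ground-truth mask
}
```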

Overview of the U-MRG-14K construction pipeline: (1) manual data cleaning and metadata organization, (2) description and QA-format generation via GPT-4o, and (3) QA-pair generation with GPT-4o followed by human verification.

Comparison with Existing Datasets

While existing medical datasets either offer pixel-level masks or clinical question–answer pairs, none integrate implicit queries with chain-of-thought (CoT) reasoning and fine-grained spatial grounding. U-MRG-14K uniquely combines all three: it supports reasoning-aware evaluation with high-quality QA pairs grounded in pixel-level masks across diverse anatomical regions. It is the first dataset to bridge segmentation and medical VQA under realistic, implicit clinical language.

| Dataset | # Prompts | QAs | Sup. | Cat. | CoT |
|---|---|---|---|---|---|
| SA-Med2D | 20M | ✗ | – | 219 | ✗ |
| BioMedParse | 1.1M | ✗ | 3 | 82 | ✗ |
| IMED | 361M | ✗ | 6 | 204 | ✗ |
| MoCoVQA | 100K | ✓ | – | – | ✗ |
| U-MRG-14K | 14K | ✓ | 15 | 108 | ✓ |

Sup. = Super-categories    Cat. = Fine-grained Categories    CoT = Chain-of-Thought reasoning

MedReasoner Framework

Our MedReasoner framework decouples language reasoning from visual segmentation, consisting of two modular components: a trainable Clinical Reasoning Module (CRM) that interprets implicit queries and predicts spatial prompts (a bounding box and two key points), and a frozen Anatomical Segmentation Module (ASM) that converts these prompts into high-resolution masks using MedSAM2. This design enables authentic reasoning without handcrafted spatial cues, avoids phrase overfitting, and supports plug-and-play compatibility with strong segmentation backbones.
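
To make the decoupling concrete, the sketch below walks through one inference pass under stated assumptions: the `mllm.generate` call stands in for an arbitrary MLLM interface, `parse_prediction` assumes the CRM emits its answer as a small JSON object, and the `set_image`/`predict` calls assume a SAM2-style promptable-segmenter API. None of this is the released MedReasoner or MedSAM2 code.

```python
import json
import re

import numpy as np


def parse_prediction(text: str):
    """Extract spatial prompts from the CRM's text output. Assumes (for
    illustration) a JSON answer such as
    {"bbox": [x1, y1, x2, y2], "points": [[x, y], [x, y]]}."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    pred = json.loads(match.group(0))
    return pred["bbox"], pred["points"]


def ground_query(mllm, predictor, image, question):
    """One pass of the decoupled pipeline: the trainable CRM reasons over an
    implicit query and emits a box plus two key points; a frozen promptable
    segmenter (the ASM) converts them into a mask."""
    response = mllm.generate(image=image, prompt=question)   # CRM: reasoning + spatial prompts
    box, points = parse_prediction(response)

    predictor.set_image(image)                                # ASM: frozen segmenter
    masks, scores, _ = predictor.predict(
        box=np.asarray(box, dtype=np.float32),
        point_coords=np.asarray(points, dtype=np.float32),
        point_labels=np.ones(len(points), dtype=np.int32),    # key points are foreground
    )
    return masks[np.argmax(scores)]                           # highest-scoring mask
```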

To optimize the CRM, we design three categories of reward functions tailored to the UMRG task: (1) format rewards to enforce structured output, (2) box and point rewards to evaluate grounding accuracy, and (3) smoothing and penalization terms to ensure training stability and output plausibility. Together, these components guide the model toward reasoning-aligned spatial grounding. Extensive experiments confirm that MedReasoner achieves state-of-the-art performance on U-MRG-14K and generalizes well to unseen clinical queries.
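
The snippet below sketches one way such rewards could be computed. It is illustrative only: the `<think>`/`<answer>` output format, the mask-membership point reward, and the equal weights are assumptions, and the paper's smoothing and penalization terms are only indicated by a comment.

```python
import re

import numpy as np


def format_reward(output: str) -> float:
    """Reward structured output: a reasoning block followed by an answer block
    (the <think>/<answer> tag convention is assumed for illustration)."""
    pattern = r"(?s)\s*<think>.+</think>\s*<answer>.+</answer>\s*"
    return 1.0 if re.fullmatch(pattern, output) else 0.0


def box_reward(pred, gt) -> float:
    """IoU between predicted and ground-truth boxes in [x1, y1, x2, y2] format."""
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(pred) + area(gt) - inter
    return inter / union if union > 0 else 0.0


def point_reward(points, gt_mask: np.ndarray) -> float:
    """Fraction of predicted key points that fall inside the ground-truth mask
    (one plausible point reward; the paper's exact formulation may differ)."""
    hits = [gt_mask[int(y), int(x)] > 0 for x, y in points]
    return sum(hits) / len(points)


def total_reward(output, pred_box, gt_box, points, gt_mask, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the reward families; smoothing and penalty terms
    (e.g., for degenerate boxes or points far outside the box) would be added here."""
    return (weights[0] * format_reward(output)
            + weights[1] * box_reward(pred_box, gt_box)
            + weights[2] * point_reward(points, gt_mask))
```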

Overview of the MedReasoner framework. MedReasoner transforms implicit clinical prompts into pixel-level grounding via a two-stage process. The CRM first generates intermediate reasoning and grounding outputs (CoT, bounding box, and key points). Then, the ASM converts the grounded outputs into final segmentation masks.

📊 Experiment Results on U-MRG-14K

🏆 U-MRG-14K Test Set Performance

📈 Results on the U-MRG-14K test set under the MedReasoner paradigm. Each candidate uses one MLLM as the CRM to output a bounding box and two key points; the ASM is fixed to MedSAM2. The last six columns report per-super-category IoU↑. Bold numbers denote the best score in each column, underlined numbers denote the second best, and “–” marks values that are unavailable.

| # | Model | Size | Type | IoU↑ | pDice↑ | Dice↑ | Abdomen | Brain | Heart | Lung | Neoplasm | Non-Neoplasm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | MedReasoner 🏆 | 7B | Grounding | **32.42** | **26.55** | **37.78** | **30.27** | **32.81** | <u>34.72</u> | **50.75** | **33.58** | **37.19** |
| 2 | Qwen2.5-VL | 72B | General | <u>18.32</u> | <u>12.39</u> | <u>29.71</u> | <u>13.60</u> | 20.06 | 15.51 | <u>35.25</u> | <u>20.69</u> | <u>30.19</u> |
| 3 | SegZero | 7B | Grounding | 16.14 | 5.23 | 26.05 | 11.66 | 23.37 | **40.23** | 22.18 | 12.58 | 21.93 |
| 4 | VLMR1-REC | 3B | Grounding | 13.96 | – | 22.19 | 8.64 | 21.81 | 8.19 | 29.77 | 8.76 | 26.59 |
| 5 | Qwen2.5-VL | 7B | General | 12.61 | 7.14 | 22.73 | 6.84 | <u>23.97</u> | 8.37 | 20.79 | 8.00 | 24.97 |
| 6 | HuatuoGPT | 7B | Medical | 10.13 | 5.23 | 19.76 | 5.88 | 18.16 | 6.63 | 22.94 | 8.25 | 16.12 |
| 7 | Lingshu | 7B | Medical | 8.19 | 3.73 | 16.48 | 4.03 | 15.72 | 6.27 | 19.77 | 6.34 | 13.31 |
| 8 | MedR1 | 2B | Medical | 8.18 | 3.60 | 14.73 | 3.53 | 12.55 | 3.53 | 25.58 | 4.39 | 13.57 |
| 9 | SAM4MLLM | 8B | Grounding | 7.94 | – | 16.49 | 6.30 | 14.69 | 5.81 | 12.61 | 6.24 | 11.96 |
| 10 | Gemini-2.5-flash | – | General | 7.86 | 3.24 | 14.29 | 3.99 | 5.69 | 7.77 | 16.37 | 7.15 | 13.91 |
| 11 | Chiron-o1 | 8B | Medical | 6.40 | 2.46 | 10.05 | 3.82 | 6.90 | 4.20 | 12.86 | 5.53 | 11.31 |
| 12 | InternVL3 | 8B | General | 5.70 | 2.46 | 9.23 | 3.72 | 6.54 | 3.67 | 14.44 | 3.78 | 8.71 |
| 13 | MedGemma | 4B | Medical | 5.39 | 1.90 | 8.90 | 4.23 | 6.92 | 3.41 | 4.78 | 3.17 | 3.90 |
| 14 | InternVL3 | 78B | General | 4.02 | 1.55 | 7.23 | 2.04 | 2.95 | 2.12 | 12.21 | 1.33 | 8.19 |
| 15 | MiniInternVL | 4B | Medical | 2.88 | 0.85 | 4.76 | 1.88 | 2.67 | 1.60 | 7.99 | 1.56 | 3.76 |
| 16 | GPT-4o | – | General | 2.65 | 1.12 | 4.72 | 0.92 | 0.91 | 0.36 | 11.70 | 1.01 | 4.16 |
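
For reference, mask-level overlap metrics like those in the table can be computed as in the minimal NumPy sketch below, assuming binary masks of equal shape. The exact evaluation protocol (including how pDice is defined) follows the paper, not this sketch.

```python
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 0.0


def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return float(2.0 * inter / total) if total else 0.0
```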

🔡 Case Studies

🧩 Meta Information Examples

🧩 QA Pairs Examples

📚 BibTeX Citation


    @article{yan2025medreasoner,
      title={MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision},
      author={Yan, Zhonghao and Diao, Muxi and Yang, Yuxuan and Xu, Jiayuan and Zhang, Kaizhou and Jing, Ruoyan and Yang, Lele and Liu, Yanxi and Liang, Kongming and Ma, Zhanyu},
      journal={arXiv preprint arXiv:2508.08177},
      year={2025}
    }