DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving

¹Beijing University of Posts and Telecommunications, ²Zhongguancun Academy, ³Beihang University

*Equal contribution
†Corresponding author

🌟 Overall

Comparison of model responses from the base model (left 👈) and DriveRX after AutoDriveRL training (right 👉)

🚗 Autonomous driving requires real-time, robust reasoning across 👀 perception, 🔮 prediction, 🗺️ planning, and 🤖 behavior. However, conventional end-to-end models fail to generalize in complex scenarios due to the lack of structured reasoning. While recent vision-language models (VLMs) have been applied to driving tasks, they typically rely on isolated modules and static supervision, which fundamentally restricts their ability to perform coherent multi-stage reasoning and generalize to real-world driving challenges.

⚙️ We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language QA problem and optimized with a task-specific reward model, enabling fine-grained reinforcement signals at different reasoning stages. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for real-time decision-making. DriveRX achieves strong performance on the public DriveBench benchmark, outperforming GPT-4o in behavior reasoning and demonstrating robustness under complex or corrupted driving conditions. 🏆

🔍 Our work provides several key insights into the reinforcement learning process for vision-language model reasoning:

  • 1. 🔥 During training, the reward curves showed a clear staged trend. 👀 Perception and 🔮 Prediction improved rapidly and saturated quickly, while 🗺️ Planning and 🤖 Behavior progressed more slowly. This reflects the structural hierarchy of the tasks: lower-level tasks are relatively simple, focusing on visual understanding and pattern recognition, while higher-level tasks require multi-step reasoning and rely on semantic representations produced by the earlier tasks.
  • 2. ✂️ By incorporating conciseness requirements into the reward prompt, we limited the growth of response length during training, yielding shorter and more effective responses that suit real-time autonomous driving scenarios.
  • 3. 🤝 Compared to curriculum learning that trains the model sequentially across stages (perception → prediction → planning → behavior), jointly training all four reasoning stages significantly improves overall performance on autonomous driving tasks (see the scheduling sketch after this list). This highlights the advantage of jointly optimizing semantically distinct reasoning stages: the model learns shared representations and cross-task interactions that improve generalization to complex scenarios.
  • 4. ⚠️ When using an LLM-as-judge, it is crucial to design rigorous scoring prompts; otherwise, the reward can be easily hacked.
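
To make the third point concrete, the sketch below contrasts the two training schedules being compared. It is a minimal illustration only: the batch sizes, step counts, and the train_one_batch callback are placeholder assumptions, and the actual AutoDriveRL optimizer details (policy-gradient updates, reward models) are abstracted away.

import random

TASKS = ["perception", "prediction", "planning", "behavior"]

def curriculum_training(datasets, train_one_batch, steps_per_stage=100, batch_size=8):
    # Stage-wise schedule: optimize one reasoning stage at a time, in order.
    for task in TASKS:
        for _ in range(steps_per_stage):
            batch = random.sample(datasets[task], k=batch_size)
            train_one_batch(batch)

def joint_training(datasets, train_one_batch, total_steps=400, per_task=2):
    # Joint schedule: every batch mixes samples from all four reasoning stages,
    # so shared representations and cross-task interactions are learned together.
    for _ in range(total_steps):
        batch = [random.choice(datasets[task]) for task in TASKS for _ in range(per_task)]
        train_one_batch(batch)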

🌍 We will open-source AutoDriveRL and DriveRX to support autonomous driving research, providing a new paradigm for multi-task reasoning and advancing safer, more transparent autonomous systems. 🎉



AutoDriveRL Framework


🏗️ Overview of the AutoDriveRL framework. The left diagram shows the structured reasoning chain that decomposes autonomous driving into four core tasks, and the right diagram illustrates the reinforcement learning training pipeline.

🧠 AutoDriveRL overcomes the limitations of traditional end-to-end models and achieves interpretable, cross-task reasoning optimization through task decomposition and reinforcement learning.

🔹 (1) Task Formalization:
Autonomous driving is broken down into four core subtasks: Perception (identifying scene elements), Prediction (anticipating dynamic agent behaviors), Planning (determining path or directional actions), and Behavior (conducting behavior prediction and strategy selection). Each subtask is modeled as a visual question-answering (VQA) problem, with inputs being scene images and task-specific natural language queries, and outputs being interpretable decision responses. The tasks are implicitly coupled through a shared semantic space rather than explicit output dependencies.
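
As a rough illustration of this formulation, the snippet below poses a single driving frame as four task-specific VQA queries. The prompt wording, field names, and file path are illustrative assumptions rather than the exact schema used in AutoDriveRL.

from dataclasses import dataclass

# Hypothetical prompt templates; the actual task queries may differ.
TASK_PROMPTS = {
    "perception": "What are the important objects in the current scene?",
    "prediction": "How will the key agents around the ego vehicle move next?",
    "planning":   "What path or directional action should the ego vehicle take?",
    "behavior":   "What driving behavior (speed and steering strategy) should the ego vehicle adopt?",
}

@dataclass
class DrivingVQASample:
    """One frame of a driving scene, posed as a vision-language QA problem."""
    image_path: str             # camera image(s) for the frame
    task: str                   # one of: perception, prediction, planning, behavior
    question: str               # task-specific natural language query
    answer: str | None = None   # interpretable free-text decision response

def frame_to_samples(image_path: str) -> list[DrivingVQASample]:
    """Decompose one frame into the four core reasoning subtasks."""
    return [
        DrivingVQASample(image_path=image_path, task=task, question=prompt)
        for task, prompt in TASK_PROMPTS.items()
    ]

samples = frame_to_samples("scenes/frame_0001.jpg")
for s in samples:
    print(s.task, "->", s.question)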

🔹 (2) Data Construction:
6K high-quality samples are filtered from the DriveLM dataset. Task difficulty is controlled using the frame score D_i, and class-balancing strategies are employed to address the imbalanced action distribution. Low-quality samples (e.g., with annotation errors or duplicate options) are first removed. For each visual question, responses are sampled from multiple vision-language models to compute task-specific quality scores, ensuring consistent and reproducible training data without human labeling.
The following figures present the distribution of actions in the dataset as well as the frame-score distributions of Align-DS-V, Qwen2.5-VL-7B-Instruct, and Qwen2.5-VL-72B-Instruct.
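
Beyond those figures, the filtering itself can be sketched roughly as follows. The difficulty band, the per-action cap, the dictionary field names, and the mocked reference scores are all illustrative assumptions; in practice the scores come from responses sampled from the reference VLMs and judged with the task-specific rubrics.

import random
from collections import defaultdict

def sample_reference_scores(question: dict) -> list[float]:
    # Placeholder: in AutoDriveRL these are task-specific quality scores of
    # responses sampled from reference VLMs (e.g., Align-DS-V, Qwen2.5-VL).
    return [random.random() for _ in range(3)]

def frame_score(question: dict) -> float:
    # D_i: average quality of reference-model responses; lower means the frame is harder.
    scores = sample_reference_scores(question)
    return sum(scores) / len(scores)

def build_training_set(questions: list[dict],
                       d_range=(0.2, 0.8),     # illustrative difficulty band
                       per_action_cap=1500):   # illustrative class-balancing cap
    # Filter DriveLM-style questions by quality, difficulty, and action balance.
    kept, per_action = [], defaultdict(int)
    for q in questions:
        if q.get("is_low_quality"):             # annotation errors, duplicate options, ...
            continue
        d = frame_score(q)
        if not (d_range[0] <= d <= d_range[1]): # drop frames that are too easy or too hard
            continue
        action = q.get("action_label", "none")
        if per_action[action] >= per_action_cap:
            continue                            # cap over-represented actions
        per_action[action] += 1
        kept.append(q)
    return kept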

🔹 (3) Reward Model Design:
A two-stage reward mechanism is designed. First, a rule-based reward function penalizes outputs that exceed a predefined length or exhibit repetitive sentence patterns. Second, for the remaining candidates, an open-source language model assigns a task-aware quality score based on five criteria: correctness, behavioral understanding, contextual relevance, logic and reasoning, and clarity of reasoning. This reward design enables the model to receive step-wise feedback during training, ensuring reliable behavior decision-making even under degraded visual conditions.
The following figures show the reward-model prompts used for each task.
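
A rough sketch of this two-stage reward is given below. The word-count cap, the repetition heuristic, the judging prompt, and the judge.generate interface are placeholder assumptions; the actual criteria prompts are the ones shown in the figures referenced above.

import re

MAX_WORDS = 512  # illustrative length budget (word-count proxy for the real cap)

CRITERIA = ["correctness", "behavioral understanding", "contextual relevance",
            "logic and reasoning", "clarity of reasoning"]

def rule_based_gate(response: str) -> float | None:
    # Stage 1: penalize over-long or repetitive outputs; otherwise defer to the judge.
    if len(response.split()) > MAX_WORDS:
        return 0.0
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+\s+", response) if s.strip()]
    if len(sentences) > 1 and len(set(sentences)) / len(sentences) < 0.6:
        return 0.0  # many near-duplicate sentences -> repetitive pattern
    return None

def llm_judge_score(judge, task: str, question: str, response: str) -> float:
    # Stage 2: task-aware quality score (0-1) from an open-source judge model.
    prompt = (
        f"Task: {task}\nQuestion: {question}\nResponse: {response}\n"
        f"Score the response from 0 to 10 on each of: {', '.join(CRITERIA)}. "
        "Return only the five scores, comma-separated."
    )
    scores = [float(x) for x in judge.generate(prompt).split(",")]  # assumed judge interface
    return sum(scores) / (10.0 * len(scores))

def reward(judge, task: str, question: str, response: str) -> float:
    gated = rule_based_gate(response)
    return gated if gated is not None else llm_judge_score(judge, task, question, response)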

📊 Experimental Results

๐Ÿ† DriveBench Benchmark Performance

📈 GPT score results on the DriveBench test set (19,200 frames), including clean and visually degraded scenarios.
"Clean" denotes clean image inputs. "Corr." denotes corrupted image inputs, averaged across fifteen corruption types.
We highlight the best-performing open-source model for each task, and use bold to mark the second-best performance.

# | Model | Size | Type | 👀 Perception (Clean / Corr.) | 🔮 Prediction (Clean / Corr.) | 🗺️ Planning (Clean / Corr.) | 🤖 Behavior (Clean / Corr.)
1 | DriveRX 🏆 | 8B | Reasoning | 32.88 / 29.92 | 52.20 / 48.29 | 78.63 / 75.53 | 62.02 / 55.24
2 | GPT-4o | - | Commercial | 41.51 / 45.39 | 52.28 / 50.92 | 84.82 / 84.43 | 55.04 / 53.97
3 | Qwen2.5-VL | 72B | Open | 37.56 / 32.25 | 54.89 / 50.47 | 77.18 / 74.01 | 55.57 / 51.65
4 | InternVL3 | 78B | Open | 39.44 / 35.62 | 50.93 / 49.71 | 82.24 / 78.41 | 50.73 / 52.03
5 | MM-Eureka | 7B | Reasoning | 34.44 / 28.92 | 40.31 / 41.72 | 65.38 / 60.18 | 49.40 / 47.59
6 | DriveLM | 7B | Specialist | 16.85 / 16.00 | 44.33 / 39.71 | 68.71 / 67.60 | 42.78 / 40.37
7 | LLaVA-1.5 | 7B | Open | 23.22 / 22.95 | 22.02 / 17.54 | 29.15 / 31.51 | 13.60 / 13.62

๐Ÿ† Ours Benchmark Performance

📊 Evaluations of VLMs across different driving tasks on our benchmark. Each column shows the GPT score (↑) for the clean version of each task. We highlight the best-performing open-source model for each task in bold.

# | Model | Size | Type | 👀 Perception | 🔮 Prediction | 🗺️ Planning | 🤖 Behavior
1 | DriveRX 🏆 | 8B | Reasoning | 34.83 | 71.84 | 51.42 | 36.82
2 | GPT-4o | - | Commercial | 36.12 | 65.90 | 51.42 | 42.00
3 | Qwen2.5-VL | 72B | Open | 32.49 | 58.63 | 50.55 | 40.02
4 | InternVL3 | 78B | Open | 32.98 | 58.67 | 45.82 | 35.69
5 | MM-Eureka | 7B | Reasoning | 29.09 | 67.41 | 29.67 | 23.21
6 | Align-DS-V | 8B | Reasoning | 29.45 | 52.90 | 29.15 | 32.48
7 | DriveLM | 7B | Specialist | 28.78 | 19.93 | 22.60 | 30.75

🔍 Other Findings

🔡 Case Studies

📚 BibTeX Citation

@article{diao2025driverx,
  title={DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving},
  author={Diao, Muxi and Yang, Lele and Yin, Hongbo and Wang, Zhexu and Wang, Yejie and Tian, Daxin and Liang, Kongming and Ma, Zhanyu},
  journal={arXiv preprint arXiv:2505.20665},
  year={2025}
}