DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving

¹Beijing University of Posts and Telecommunications, ²Zhongguancun Academy, ³Beihang University

*Equal contribution
†Corresponding author

🌟 Overall

Comparison of model responses from the base model (left 👈) and DriveRX after AutoDriveRL training (right 👉)

🚗 Autonomous driving requires real-time, robust reasoning across 👀 perception, 🔮 prediction, 🗺️ planning, and 🤖 behavior. However, conventional end-to-end models fail to generalize in complex scenarios due to the lack of structured reasoning. While recent vision-language models (VLMs) have been applied to driving tasks, they typically rely on isolated modules and static supervision, which fundamentally restricts their ability to perform coherent multi-stage reasoning and generalize to real-world driving challenges.

⚙️ We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language QA problem and optimized with a task-specific reward model, enabling fine-grained reinforcement signals at different reasoning stages. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for real-time decision-making. DriveRX achieves strong performance on the public DriveBench benchmark, outperforming GPT-4o in behavior reasoning and demonstrating robustness under complex or corrupted driving conditions. 🏆

🔍 Our work provides several key insights into the reinforcement learning process for vision-language model reasoning:

  • 1. 🔥 During training, the reward curves showed a clear staged trend. 👀 Perception and 🔮 Prediction improved rapidly and saturated quickly, while 🗺️ Planning and 🤖 Behavior progressed more slowly. This reflects the structural hierarchy of the tasks: lower-level tasks are relatively simple, focusing on visual understanding and pattern recognition, while higher-level tasks require multi-step reasoning and rely on semantic representations produced by the earlier tasks.
  • 2. ✂️ By incorporating conciseness requirements into the reward prompt, we limited the growth of response length during training, yielding shorter and more effective responses that suit real-time autonomous driving scenarios.
  • 3. 🤝 Compared to curriculum learning that trains the model sequentially across stages (perception → prediction → planning → behavior), jointly training all four reasoning stages significantly improves overall performance on autonomous driving tasks (see the scheduling sketch after this list). This highlights the advantage of jointly optimizing semantically distinct reasoning stages: the model learns shared representations and cross-task interactions that improve generalization to complex scenarios.
  • 4. ⚠️ When using an LLM-as-judge, it is crucial to design rigorous scoring prompts; otherwise, the reward can be easily hacked.
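
To make the third point concrete, the sketch below contrasts the two training schedules being compared. It is a minimal illustration only: the batch sizes, step counts, and the train_one_batch callback are placeholder assumptions, and the actual AutoDriveRL optimizer details (policy-gradient updates, reward models) are abstracted away.

import random

TASKS = ["perception", "prediction", "planning", "behavior"]

def curriculum_training(datasets, train_one_batch, steps_per_stage=100, batch_size=8):
    # Stage-wise schedule: optimize one reasoning stage at a time, in order.
    for task in TASKS:
        for _ in range(steps_per_stage):
            batch = random.sample(datasets[task], k=batch_size)
            train_one_batch(batch)

def joint_training(datasets, train_one_batch, total_steps=400, per_task=2):
    # Joint schedule: every batch mixes samples from all four reasoning stages,
    # so shared representations and cross-task interactions are learned together.
    for _ in range(total_steps):
        batch = [random.choice(datasets[task]) for task in TASKS for _ in range(per_task)]
        train_one_batch(batch)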

🌍 We will open-source AutoDriveRL and DriveRX to support autonomous driving research, providing a new paradigm for multi-task reasoning and advancing safer, more transparent autonomous systems. 🎉



AutoDriveRL Framework


🏗️ Overview of the AutoDriveRL framework. The left diagram shows the structured reasoning chain that decomposes autonomous driving into four core tasks, and the right diagram illustrates the reinforcement learning training pipeline.

🧠 AutoDriveRL overcomes the limitations of traditional end-to-end models and achieves interpretable, cross-task reasoning optimization through task decomposition and reinforcement learning.

🔹 (1) Task Formalization:
Autonomous driving is broken down into four core subtasks: Perception (identifying scene elements), Prediction (anticipating dynamic agent behaviors), Planning (determining path or directional actions), and Behavior (conducting behavior prediction and strategy selection). Each subtask is modeled as a visual question-answering (VQA) problem, with inputs being scene images and task-specific natural language queries, and outputs being interpretable decision responses. The tasks are implicitly coupled through a shared semantic space rather than explicit output dependencies.
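
As a rough illustration of this formulation, the snippet below poses a single driving frame as four task-specific VQA queries. The prompt wording, field names, and file path are illustrative assumptions rather than the exact schema used in AutoDriveRL.

from dataclasses import dataclass

# Hypothetical prompt templates; the actual task queries may differ.
TASK_PROMPTS = {
    "perception": "What are the important objects in the current scene?",
    "prediction": "How will the key agents around the ego vehicle move next?",
    "planning":   "What path or directional action should the ego vehicle take?",
    "behavior":   "What driving behavior (speed and steering strategy) should the ego vehicle adopt?",
}

@dataclass
class DrivingVQASample:
    """One frame of a driving scene, posed as a vision-language QA problem."""
    image_path: str             # camera image(s) for the frame
    task: str                   # one of: perception, prediction, planning, behavior
    question: str               # task-specific natural language query
    answer: str | None = None   # interpretable free-text decision response

def frame_to_samples(image_path: str) -> list[DrivingVQASample]:
    """Decompose one frame into the four core reasoning subtasks."""
    return [
        DrivingVQASample(image_path=image_path, task=task, question=prompt)
        for task, prompt in TASK_PROMPTS.items()
    ]

samples = frame_to_samples("scenes/frame_0001.jpg")
for s in samples:
    print(s.task, "->", s.question)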

🔹 (2) Data Construction:
6K high-quality samples are filtered from the DriveLM dataset. Task difficulty is controlled using the frame score D_i, and class-balancing strategies are employed to address the imbalanced action distribution. Low-quality samples (e.g., with annotation errors or duplicate options) are first removed. For each visual question, responses are sampled from multiple vision-language models to compute task-specific quality scores, ensuring consistent and reproducible training data without human labeling.
The following figures present the distribution of actions in the dataset as well as the frame-score distributions of Align-DS-V, Qwen2.5-VL-7B-Instruct, and Qwen2.5-VL-72B-Instruct.
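
Beyond those figures, the filtering itself can be sketched roughly as follows. The difficulty band, the per-action cap, the dictionary field names, and the mocked reference scores are all illustrative assumptions; in practice the scores come from responses sampled from the reference VLMs and judged with the task-specific rubrics.

import random
from collections import defaultdict

def sample_reference_scores(question: dict) -> list[float]:
    # Placeholder: in AutoDriveRL these are task-specific quality scores of
    # responses sampled from reference VLMs (e.g., Align-DS-V, Qwen2.5-VL).
    return [random.random() for _ in range(3)]

def frame_score(question: dict) -> float:
    # D_i: average quality of reference-model responses; lower means the frame is harder.
    scores = sample_reference_scores(question)
    return sum(scores) / len(scores)

def build_training_set(questions: list[dict],
                       d_range=(0.2, 0.8),     # illustrative difficulty band
                       per_action_cap=1500):   # illustrative class-balancing cap
    # Filter DriveLM-style questions by quality, difficulty, and action balance.
    kept, per_action = [], defaultdict(int)
    for q in questions:
        if q.get("is_low_quality"):             # annotation errors, duplicate options, ...
            continue
        d = frame_score(q)
        if not (d_range[0] <= d <= d_range[1]): # drop frames that are too easy or too hard
            continue
        action = q.get("action_label", "none")
        if per_action[action] >= per_action_cap:
            continue                            # cap over-represented actions
        per_action[action] += 1
        kept.append(q)
    return kept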

🔹 (3) Reward Model Design:
A two-stage reward mechanism is designed. First, a rule-based reward function penalizes outputs that exceed a predefined length or exhibit repetitive sentence patterns. Second, for the remaining candidates, an open-source language model assigns a task-aware quality score based on five criteria: correctness, behavioral understanding, contextual relevance, logic and reasoning, and clarity of reasoning. This reward design enables the model to receive step-wise feedback during training, ensuring reliable behavior decision-making even under degraded visual conditions.
The following figures show the reward-model prompts used for each task.
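
A rough sketch of this two-stage reward is given below. The word-count cap, the repetition heuristic, the judging prompt, and the judge.generate interface are placeholder assumptions; the actual criteria prompts are the ones shown in the figures referenced above.

import re

MAX_WORDS = 512  # illustrative length budget (word-count proxy for the real cap)

CRITERIA = ["correctness", "behavioral understanding", "contextual relevance",
            "logic and reasoning", "clarity of reasoning"]

def rule_based_gate(response: str) -> float | None:
    # Stage 1: penalize over-long or repetitive outputs; otherwise defer to the judge.
    if len(response.split()) > MAX_WORDS:
        return 0.0
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+\s+", response) if s.strip()]
    if len(sentences) > 1 and len(set(sentences)) / len(sentences) < 0.6:
        return 0.0  # many near-duplicate sentences -> repetitive pattern
    return None

def llm_judge_score(judge, task: str, question: str, response: str) -> float:
    # Stage 2: task-aware quality score (0-1) from an open-source judge model.
    prompt = (
        f"Task: {task}\nQuestion: {question}\nResponse: {response}\n"
        f"Score the response from 0 to 10 on each of: {', '.join(CRITERIA)}. "
        "Return only the five scores, comma-separated."
    )
    scores = [float(x) for x in judge.generate(prompt).split(",")]  # assumed judge interface
    return sum(scores) / (10.0 * len(scores))

def reward(judge, task: str, question: str, response: str) -> float:
    gated = rule_based_gate(response)
    return gated if gated is not None else llm_judge_score(judge, task, question, response)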

📊 Experimental Results

๐Ÿ† DriveBench Benchmark Performance

📈 GPT score results on the DriveBench test set (19,200 frames), including clean and visually degraded scenarios.
"Clean" denotes clean image inputs. "Corr." denotes corrupted image inputs, averaged across fifteen corruption types.
We highlight the best-performing open-source model for each task, and use bold to mark the second-best performance.

# | Model | Size | Type | 👀 Perception (Clean / Corr.) | 🔮 Prediction (Clean / Corr.) | 🗺️ Planning (Clean / Corr.) | 🤖 Behavior (Clean / Corr.)
1 | DriveRX 🏆 | 8B | Reasoning | 32.88 / 29.92 | 52.20 / 48.29 | 78.63 / 75.53 | 62.02 / 55.24
2 | GPT-4o | - | Commercial | 41.51 / 45.39 | 52.28 / 50.92 | 84.82 / 84.43 | 55.04 / 53.97
3 | Qwen2.5-VL | 72B | Open | 37.56 / 32.25 | 54.89 / 50.47 | 77.18 / 74.01 | 55.57 / 51.65
4 | InternVL3 | 78B | Open | 39.44 / 35.62 | 50.93 / 49.71 | 82.24 / 78.41 | 50.73 / 52.03
5 | MM-Eureka | 7B | Reasoning | 34.44 / 28.92 | 40.31 / 41.72 | 65.38 / 60.18 | 49.40 / 47.59
6 | DriveLM | 7B | Specialist | 16.85 / 16.00 | 44.33 / 39.71 | 68.71 / 67.60 | 42.78 / 40.37
7 | LLaVA-1.5 | 7B | Open | 23.22 / 22.95 | 22.02 / 17.54 | 29.15 / 31.51 | 13.60 / 13.62

๐Ÿ† Ours Benchmark Performance

📊 Evaluations of VLMs across different driving tasks on our benchmark. Each column shows the GPT score (↑) for the clean version of each task. We highlight the best-performing open-source model for each task in bold.

# | Model | Size | Type | 👀 Perception | 🔮 Prediction | 🗺️ Planning | 🤖 Behavior
1 | DriveRX 🏆 | 8B | Reasoning | 34.83 | 71.84 | 51.42 | 36.82
2 | GPT-4o | - | Commercial | 36.12 | 65.90 | 51.42 | 42.00
3 | Qwen2.5-VL | 72B | Open | 32.49 | 58.63 | 50.55 | 40.02
4 | InternVL3 | 78B | Open | 32.98 | 58.67 | 45.82 | 35.69
5 | MM-Eureka | 7B | Reasoning | 29.09 | 67.41 | 29.67 | 23.21
6 | Align-DS-V | 8B | Reasoning | 29.45 | 52.90 | 29.15 | 32.48
7 | DriveLM | 7B | Specialist | 28.78 | 19.93 | 22.60 | 30.75

🔍 Other Findings

🔡 Case Studies

📚 BibTeX Citation

@article{diao2025driverx,
  title={DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving},
  author={Diao, Muxi and Yang, Lele and Yin, Hongbo and Wang, Zhexu and Wang, Yejie and Tian, Daxin and Liang, Kongming and Ma, Zhanyu},
  journal={arXiv preprint arXiv:2505.20665},
  year={2025}
}