Safeguarding vision-language models (VLMs) is a critical challenge: existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning. To this end, we introduce PRISM (Principled Reasoning for Integrated Safety in Multimodality), a System 2-like framework that aligns VLMs by embedding a structured, safety-aware reasoning process. Our framework consists of two key components: PRISM-CoT, a dataset that teaches safety-aware chain-of-thought reasoning, and PRISM-DPO, generated via Monte Carlo Tree Search (MCTS) to further refine this reasoning through Direct Preference Optimization and establish a precise safety boundary. Comprehensive evaluations demonstrate PRISM's effectiveness, achieving remarkably low attack success rates, including 0.15% on JailbreakV-28K for Qwen2-VL and a 90% improvement over the previous best method on VLBreak for LLaVA-1.5. PRISM also exhibits strong robustness against adaptive attacks, significantly increasing computational costs for adversaries, and generalizes effectively to out-of-distribution challenges, reducing the attack success rate to just 8.70% on the challenging multi-image MIS benchmark. Remarkably, this robust defense is achieved while preserving, and in some cases enhancing, model utility.
PRISM addresses the critical challenge of safeguarding Vision-Language Models through structured, safety-aware reasoning. We categorize multimodal safety violations into three distinct types: (1) problem unsafe, where the textual prompt contains harmful content; (2) image unsafe, where the visual input presents safety risks; and (3) problem+image unsafe, where neither the text nor the image is inherently harmful, but their combination creates safety concerns that require sophisticated reasoning to detect.
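The three-way taxonomy above can be sketched as a small decision rule. This is a minimal illustrative example, not code from the paper: the enum names, the boolean per-modality judgments, and the `classify` helper are all assumptions introduced here to show why the third category needs joint (cross-modal) reasoning rather than per-modality checks.

```python
from enum import Enum

class SafetyViolation(Enum):
    SAFE = "safe"
    PROBLEM_UNSAFE = "problem_unsafe"          # harmful textual prompt
    IMAGE_UNSAFE = "image_unsafe"              # harmful visual input
    COMBINATION_UNSAFE = "combination_unsafe"  # only the text+image pairing is harmful

def classify(text_harmful: bool, image_harmful: bool, pair_harmful: bool) -> SafetyViolation:
    """Toy decision rule mirroring PRISM's three violation types.

    `pair_harmful` stands in for a judgment over the *combined* input;
    it can be true even when both per-modality flags are false, which is
    exactly the case a shallow per-modality filter misses.
    """
    if text_harmful:
        return SafetyViolation.PROBLEM_UNSAFE
    if image_harmful:
        return SafetyViolation.IMAGE_UNSAFE
    if pair_harmful:
        return SafetyViolation.COMBINATION_UNSAFE
    return SafetyViolation.SAFE
```

The combination-unsafe branch is the one that motivates chain-of-thought safety reasoning: no single-modality check fires, so the model must reason about the interaction of the two inputs.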
Our framework consists of two key components working synergistically to achieve robust safety alignment:

PRISM-CoT is a dataset that teaches the model safety-aware chain-of-thought reasoning, providing the structured reasoning foundation that PRISM-DPO then refines.

PRISM-DPO refines this reasoning process through step-level preference pairs generated via Monte Carlo Tree Search (MCTS). This component establishes a precise safety boundary by optimizing the model's reasoning at each step, ensuring robust defense against sophisticated multimodal attacks while preserving utility on benign tasks.
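To make "step-level preference pairs generated via MCTS" concrete, here is a minimal sketch of one plausible extraction step: given a search tree whose nodes carry MCTS value estimates, sibling reasoning steps are paired so the highest-valued step is "chosen" and the lowest-valued is "rejected". The `Node` structure, the value field, and the best-vs-worst pairing rule are assumptions for illustration; the paper's actual construction may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    step_text: str                 # one reasoning step proposed during search
    value: float = 0.0             # MCTS value estimate for this step
    children: list = field(default_factory=list)

def collect_preference_pairs(root: Node) -> list:
    """Walk the search tree; at each expanded node with >= 2 children,
    emit (chosen, rejected) = (highest-value sibling, lowest-value sibling).
    These step-level pairs are the kind of data DPO can train on."""
    pairs = []
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.children) >= 2:
            ranked = sorted(node.children, key=lambda c: c.value)
            pairs.append((ranked[-1].step_text, ranked[0].step_text))
        stack.extend(node.children)
    return pairs
```

Pairing siblings (rather than whole trajectories) is what makes the supervision step-level: each preference compares two continuations of the same reasoning prefix, so the optimized boundary is localized to individual steps.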
PRISM achieves a strong balance between safety robustness and helpfulness, navigating the critical trade-off in VLM alignment. Our method consistently delivers top-tier performance across established safety benchmarks while maintaining high utility scores. For example, on Qwen2-VL, PRISM achieves the lowest Attack Success Rate (ASR) of 8.7% on MIS while maintaining a helpfulness score of 48.9 on MM-Vet-v2, whereas competing methods like VLGuard suffer from severe over-defense.
PRISM demonstrates significant robustness against adaptive attacks, consistently maintaining low ASR while requiring substantially more queries for adversaries to find successful attacks. This translates to dramatically higher computational costs for attackers compared to undefended models, which show clear vulnerability patterns with sharp ASR increases corresponding to the attacker's iterative refinement strategy.
We present qualitative case studies to illustrate the performance differences between PRISM and other safety methods on both benign and malicious inputs.
If you find PRISM useful, please cite:
@misc{li2025prismrobustvlmalignment,
  title={PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality},
  author={Nanxi Li and Zhengyue Zhao and Chaowei Xiao},
  year={2025},
  eprint={2508.18649},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2508.18649},
}