Safeguarding vision-language models (VLMs) is a critical challenge: existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning. To this end, we introduce PRISM (Principled Reasoning for Integrated Safety in Multimodality), a System 2-like framework that aligns VLMs by embedding a structured, safety-aware reasoning process. Our framework consists of two key components: PRISM-CoT, a dataset that teaches safety-aware chain-of-thought reasoning, and PRISM-DPO, generated via Monte Carlo Tree Search (MCTS) to further refine this reasoning through Direct Preference Optimization and establish a precise safety boundary. Comprehensive evaluations demonstrate PRISM's effectiveness, achieving remarkably low attack success rates, including 0.15% on JailbreakV-28K for Qwen2-VL and a 90% improvement over the previous best method on VLBreak for LLaVA-1.5. PRISM also exhibits strong robustness against adaptive attacks, significantly increasing computational costs for adversaries, and generalizes effectively to out-of-distribution challenges, reducing the attack success rate to just 8.70% on the challenging multi-image MIS benchmark. Remarkably, this robust defense is achieved while preserving, and in some cases enhancing, model utility.
PRISM addresses the critical challenge of safeguarding Vision-Language Models through structured, safety-aware reasoning. We categorize multimodal safety violations into three distinct types: (1) problem unsafe, where the textual prompt contains harmful content; (2) image unsafe, where the visual input presents safety risks; and (3) problem+image unsafe, where neither the text nor the image is inherently harmful, but their combination creates safety concerns that require sophisticated reasoning to detect.
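The three-way taxonomy above can be sketched as a small decision rule. This is a minimal illustrative example, not code from the paper: the enum names, the boolean per-modality judgments, and the `classify` helper are all assumptions introduced here to show why the third category needs joint (cross-modal) reasoning rather than per-modality checks.

```python
from enum import Enum

class SafetyViolation(Enum):
    SAFE = "safe"
    PROBLEM_UNSAFE = "problem_unsafe"          # harmful textual prompt
    IMAGE_UNSAFE = "image_unsafe"              # harmful visual input
    COMBINATION_UNSAFE = "combination_unsafe"  # only the text+image pairing is harmful

def classify(text_harmful: bool, image_harmful: bool, pair_harmful: bool) -> SafetyViolation:
    """Toy decision rule mirroring PRISM's three violation types.

    `pair_harmful` stands in for a judgment over the *combined* input;
    it can be true even when both per-modality flags are false, which is
    exactly the case a shallow per-modality filter misses.
    """
    if text_harmful:
        return SafetyViolation.PROBLEM_UNSAFE
    if image_harmful:
        return SafetyViolation.IMAGE_UNSAFE
    if pair_harmful:
        return SafetyViolation.COMBINATION_UNSAFE
    return SafetyViolation.SAFE
```

The combination-unsafe branch is the one that motivates chain-of-thought safety reasoning: no single-modality check fires, so the model must reason about the interaction of the two inputs.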
Our framework consists of two key components working synergistically to achieve robust safety alignment:

PRISM-CoT is a dataset that teaches the model safety-aware chain-of-thought reasoning, providing the structured reasoning foundation that PRISM-DPO then refines.

PRISM-DPO refines this reasoning process through step-level preference pairs generated via Monte Carlo Tree Search (MCTS). This component establishes a precise safety boundary by optimizing the model's reasoning at each step, ensuring robust defense against sophisticated multimodal attacks while preserving utility on benign tasks.
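To make "step-level preference pairs generated via MCTS" concrete, here is a minimal sketch of one plausible extraction step: given a search tree whose nodes carry MCTS value estimates, sibling reasoning steps are paired so the highest-valued step is "chosen" and the lowest-valued is "rejected". The `Node` structure, the value field, and the best-vs-worst pairing rule are assumptions for illustration; the paper's actual construction may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    step_text: str                 # one reasoning step proposed during search
    value: float = 0.0             # MCTS value estimate for this step
    children: list = field(default_factory=list)

def collect_preference_pairs(root: Node) -> list:
    """Walk the search tree; at each expanded node with >= 2 children,
    emit (chosen, rejected) = (highest-value sibling, lowest-value sibling).
    These step-level pairs are the kind of data DPO can train on."""
    pairs = []
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.children) >= 2:
            ranked = sorted(node.children, key=lambda c: c.value)
            pairs.append((ranked[-1].step_text, ranked[0].step_text))
        stack.extend(node.children)
    return pairs
```

Pairing siblings (rather than whole trajectories) is what makes the supervision step-level: each preference compares two continuations of the same reasoning prefix, so the optimized boundary is localized to individual steps.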
PRISM achieves a strong balance between safety robustness and helpfulness, navigating the critical trade-off in VLM alignment. Our method consistently delivers top-tier performance across established safety benchmarks while maintaining high utility scores. For example, on Qwen2-VL, PRISM achieves the lowest Attack Success Rate (ASR) of 8.7% on MIS while maintaining a helpfulness score of 48.9 on MM-Vet-v2, whereas competing methods like VLGuard suffer from severe over-defense.
PRISM demonstrates significant robustness against adaptive attacks, consistently maintaining low ASR while requiring substantially more queries for adversaries to find successful attacks. This translates to dramatically higher computational costs for attackers compared to undefended models, which show clear vulnerability patterns with sharp ASR increases corresponding to the attacker's iterative refinement strategy.
We present qualitative case studies to illustrate the performance differences between PRISM and other safety methods on both benign and malicious inputs.
If you find PRISM useful, please cite:
@misc{li2025prismrobustvlmalignment,
  title={PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality},
  author={Nanxi Li and Zhengyue Zhao and Chaowei Xiao},
  year={2025},
  eprint={2508.18649},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2508.18649},
}