Debate with Images

Detecting Deceptive Behaviors in Multimodal Large Language Models

Sitong Fang, Shiyi Hou, Kaile Wang, Boyuan Chen, Donghai Hong, Jiayi Zhou, Juntao Dai, Yaodong Yang, Jiaming Ji

Under Review at ICML 2026

TL;DR

We introduce MM-DeceptionBench, the first benchmark for evaluating deceptive behaviors in multimodal LLMs, and propose a multi-agent Debate with Images framework that grounds claims in visual evidence to detect AI deception.

Key Findings

  1. MM-DeceptionBench: 1,013 cases with 1,096 images (95%+ real-world), covering 6 deception types: Sycophancy, Sandbagging, Bluffing, Obfuscation, Deliberate Omission, and Fabrication.

  2. Debate with Images Framework: A multi-agent debate mechanism requiring models to ground every claim in visual evidence, significantly improving deception detection.

  3. Results: Detection accuracy improves from 61.5% to 76.0%, and agreement with human judgments (Cohen's Kappa) increases by nearly 1.5x. The framework also generalizes to safety (PKU-SafeRLHF-V) and reasoning (HallusionBench) benchmarks.
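The page does not spell out the debate protocol, but the core idea, claims are only admitted if they cite visual evidence, can be illustrated with a minimal sketch. All names here (`Claim`, `debate`, the agents, and the judge's rule) are invented for illustration and are not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """A debate claim that must point at concrete image evidence."""
    text: str
    evidence: list[str]  # e.g. cited image regions; empty means ungrounded


def debate(agents, judge, rounds=2):
    """Run a multi-agent debate; ungrounded claims are discarded."""
    transcript = []
    for _ in range(rounds):
        for agent in agents:
            claim = agent(transcript)
            if claim.evidence:  # only evidence-grounded claims survive
                transcript.append(claim)
    return judge(transcript)


# Hypothetical stub agents standing in for multimodal LLM debaters.
def prosecutor(transcript):
    return Claim(
        "The response omits the warning label visible in the image.",
        evidence=["region(120, 40, 300, 90)"],
    )


def defender(transcript):
    # An unsupported assertion: no cited evidence, so it is dropped.
    return Claim("The response is faithful to the image.", evidence=[])


def judge(transcript):
    # Toy rule: flag deception iff any grounded claim survives the debate.
    return "deceptive" if transcript else "honest"


verdict = debate([prosecutor, defender], judge, rounds=1)
print(verdict)  # → deceptive
```

The point of the evidence-gating step is that a debater cannot win by rhetoric alone: every surviving claim is checkable against the image, which is what lets the judge's decision be grounded rather than persuasion-driven.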

BibTeX

@inproceedings{fang2026debate,
  title={Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models},
  author={Fang, Sitong and Hou, Shiyi and Wang, Kaile and Chen, Boyuan and Hong, Donghai and Zhou, Jiayi and Dai, Juntao and Yang, Yaodong and Ji, Jiaming},
  booktitle={Under Review at ICML},
  year={2026}
}