We investigate truthfulness in multimodal large language models and discover an inverse scaling law: slower reasoning models are less truthful in multimodal settings. We propose TruthfulVQA, the first benchmark for multimodal truthfulness evaluation, and TruthfulJudge, a reliable human-in-the-loop evaluation framework.
@inproceedings{fang2026truthful,
  title={When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning},
  author={Fang, Sitong and Cao, Wenjing and Li, Jiahao and Wang, Xuyao and Chan, Chi-Min and Han, Sirui and Dai, Juntao and Guo, Yike and Yang, Yaodong and Ji, Jiaming},
  booktitle={Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2026},
}
Under Review
Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models
Sitong Fang, Shiyi Hou, Kaile Wang, Boyuan Chen, Donghai Hong, Jiayi Zhou, Juntao Dai, Yaodong Yang, and Jiaming Ji
Under Review at the International Conference on Machine Learning (ICML), 2026
We introduce MM-DeceptionBench, the first benchmark for evaluating deceptive behaviors in multimodal LLMs, and propose Debate with Images, a multi-agent debate framework requiring models to ground claims in visual evidence. Our approach significantly improves deception detection accuracy and human agreement.
@article{fang2026debate,
  title={Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models},
  author={Fang, Sitong and Hou, Shiyi and Wang, Kaile and Chen, Boyuan and Hong, Donghai and Zhou, Jiayi and Dai, Juntao and Yang, Yaodong and Ji, Jiaming},
  journal={Under Review at the International Conference on Machine Learning (ICML)},
  year={2026},
}
2025
Under Review
AI Deception: Risks, Dynamics, and Controls
Boyuan Chen*, Sitong Fang*, Jiaming Ji*, Yanxu Zhu, Pengcheng Wen, Jinzhou Wu, Yingshui Tan, Boren Zheng, Mengying Yuan, Wenqi Chen, Donghai Hong, Alex Qiu, Xin Chen, Jiayi Zhou, Kaile Wang, Juntao Dai, Borong Zhang, Tianzhuo Yang, Saad Siddiqui, Isabella Duan, Yawen Duan, Brian Tse, Jen-Tse Huang, Kun Wang, Baihui Zheng, Jiaheng Liu, Jian Yang, Yiming Li, Wenting Chen, Dongrui Liu, Lukas Vierling, Zhiheng Xi, Haobo Fu, Wenxuan Wang, Jitao Sang, Zhengyan Shi, Chi-Min Chan, Eugenie Shi, Simin Li, Juncheng Li, Jian Yang, Wei Ji, Dong Li, Jinglin Yang, Jun Song, Yinpeng Dong, Jie Fu, Bo Zheng, Min Yang, Yike Guo, Philip Torr, Robert Trager, Yi Zeng, Zhongyuan Wang, Yaodong Yang, Tiejun Huang, Ya-Qin Zhang, HongJiang Zhang, and Andrew Yao
The first systematic international report on AI deception. We formally define AI deception using signaling theory, analyze the deception life cycle, and propose mitigation strategies.
@article{chen2025deception,
  title={AI Deception: Risks, Dynamics, and Controls},
  author={Chen, Boyuan and Fang, Sitong and Ji, Jiaming and Zhu, Yanxu and Wen, Pengcheng and Wu, Jinzhou and Tan, Yingshui and Zheng, Boren and Yuan, Mengying and Chen, Wenqi and Hong, Donghai and Qiu, Alex and Chen, Xin and Zhou, Jiayi and Wang, Kaile and Dai, Juntao and Zhang, Borong and Yang, Tianzhuo and Siddiqui, Saad and Duan, Isabella and Duan, Yawen and Tse, Brian and Huang, Jen-Tse and Wang, Kun and Zheng, Baihui and Liu, Jiaheng and Yang, Jian and Li, Yiming and Chen, Wenting and Liu, Dongrui and Vierling, Lukas and Xi, Zhiheng and Fu, Haobo and Wang, Wenxuan and Sang, Jitao and Shi, Zhengyan and Chan, Chi-Min and Shi, Eugenie and Li, Simin and Li, Juncheng and Yang, Jian and Ji, Wei and Li, Dong and Yang, Jinglin and Song, Jun and Dong, Yinpeng and Fu, Jie and Zheng, Bo and Yang, Min and Guo, Yike and Torr, Philip and Trager, Robert and Zeng, Yi and Wang, Zhongyuan and Yang, Yaodong and Huang, Tiejun and Zhang, Ya-Qin and Zhang, HongJiang and Yao, Andrew},
  journal={Under Review at ACM Computing Surveys},
  year={2025},
}
Preprint
Mitigating Deceptive Alignment via Self-Monitoring
Jiaming Ji*, Wenqi Chen*, Kaile Wang, Donghai Hong, Sitong Fang*, Boyuan Chen*, Jiayi Zhou, Juntao Dai, Sirui Han, Yike Guo, and Yaodong Yang
We propose CoT Monitor+, a framework that embeds a Self-Monitor inside chain-of-thought reasoning to detect and suppress deceptive alignment. It reduces deceptive behaviors by 43.8% while preserving task accuracy.
@article{ji2025selfmonitoring,
  title={Mitigating Deceptive Alignment via Self-Monitoring},
  author={Ji, Jiaming and Chen, Wenqi and Wang, Kaile and Hong, Donghai and Fang, Sitong and Chen, Boyuan and Zhou, Jiayi and Dai, Juntao and Han, Sirui and Guo, Yike and Yang, Yaodong},
  journal={arXiv preprint},
  year={2025},
}