Zhaorun Chen
email:  zhaorun (at) uchicago (dot) edu
office:  JCL 287, 5801 S Ellis Ave, Chicago, IL 60637

| CV | Email | Google Scholar |
| GitHub | LinkedIn |

I am a first-year Ph.D. student in the Secure Learning Lab in the Department of Computer Science at the University of Chicago, advised by Prof. Bo Li.

Previously, I received my Master's degree in Electrical and Computer Engineering from Purdue University, advised by Prof. Su Lu. Before that, I obtained my Bachelor's degree in Automation from Shanghai Jiao Tong University, advised by Prof. Yue Gao.

My current research interests center on the trustworthiness and alignment of foundation models (e.g., LLMs) and agents, from both theoretical and applied perspectives. Specifically, I am interested in enhancing their trustworthiness via novel algorithms and certificates for various applications (e.g., hallucination mitigation, training- and test-time attacks, guardrail models) by incorporating external knowledge sources and LLMs' reasoning capabilities.



News


  Selected publications (see full list on Google Scholar)

SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations
Zhaorun Chen, Francesco Pinto, Minzhou Pan, and Bo Li
arXiv, 2024

pdf | webpage | abstract | bibtex | arXiv | alphaxiv | blog

With the rise of generative AI and rapid growth of high-quality video generation, video guardrails have become more crucial than ever to ensure safety and security across platforms. Current video guardrails, however, are either overly simplistic, relying on pure classification models trained on simple policies with limited unsafe categories, which lack detailed explanations, or prompting multimodal large language models (MLLMs) with long safety guidelines, which are inefficient and impractical for guardrailing real-world content. To bridge this gap, we propose SafeWatch, an efficient MLLM-based video guardrail model designed to follow customized safety policies and provide multi-label video guardrail outputs with content-specific explanations in a zero-shot manner. In particular, unlike traditional MLLM-based guardrails that encode all safety policies autoregressively, causing inefficiency and bias, SafeWatch uniquely encodes each policy chunk in parallel and eliminates their position bias such that all policies are attended simultaneously with equal importance. In addition, to improve efficiency and accuracy, SafeWatch incorporates a policy-aware visual token pruning algorithm that adaptively selects the most relevant video tokens for each policy, discarding noisy or irrelevant information. This allows for more focused, policy-compliant guardrailing with significantly reduced computational overhead. Considering the limitations of existing video guardrail benchmarks, we propose SafeWatch-Bench, a large-scale video guardrail benchmark comprising over 2M videos spanning six safety categories and over 30 tasks to ensure comprehensive coverage of all potential safety scenarios. We have conducted extensive experiments, showing that SafeWatch outperforms all SOTA video guardrails on SafeWatch-Bench by 28.2%, and achieves a 13.6% improvement on existing benchmarks, all while reducing inference costs by an average of 10%. SafeWatch also demonstrates strong policy-following abilities and outperforms previous SOTAs by 5.6% and 15.6% in zero-shot generalizability to new policies and new prompting tasks. Additionally, both LLM-as-a-judge and human evaluators confirm the high quality of the explanations provided by SafeWatch. Our project is open-sourced here.
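
To make the policy-aware token pruning idea concrete, here is a minimal illustrative sketch (not the released SafeWatch code), assuming video tokens and policy-chunk embeddings are already available as arrays; the keep_ratio parameter and the toy shapes are assumptions for illustration.

import numpy as np

def prune_tokens_per_policy(video_tokens, policy_embeds, keep_ratio=0.25):
    """For each policy chunk, keep only the video tokens most relevant to it.

    video_tokens:  (T, d) array of visual token embeddings
    policy_embeds: (P, d) array, one embedding per policy chunk
    Returns {policy_index: (k, d) array of retained tokens, in temporal order}.
    """
    # Cosine similarity between every video token and every policy chunk.
    vt = video_tokens / np.linalg.norm(video_tokens, axis=1, keepdims=True)
    pe = policy_embeds / np.linalg.norm(policy_embeds, axis=1, keepdims=True)
    sim = vt @ pe.T                                   # (T, P)

    k = max(1, int(keep_ratio * len(video_tokens)))
    kept = {}
    for p in range(policy_embeds.shape[0]):
        top = np.argsort(-sim[:, p])[:k]              # k most relevant tokens for policy p
        kept[p] = video_tokens[np.sort(top)]          # preserve temporal order
    return kept

# Toy usage: 64 video tokens and 6 policy chunks with 32-dim embeddings.
rng = np.random.default_rng(0)
kept = prune_tokens_per_policy(rng.normal(size=(64, 32)), rng.normal(size=(6, 32)))
print({p: v.shape for p, v in kept.items()})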

@article{chen2024safewatch,
  title={SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations},
  author={Chen, Zhaorun and Pinto, Francesco and Pan, Minzhou and Li, Bo},
  journal={arXiv preprint arXiv:2412.06878},
  year={2024}
}

GRAPE: Generalizing Robot Policy via Preference Alignment
Zijian Zhang*, Kaiyuan Zheng*, Zhaorun Chen, Joel Jang, Yi Li, Chaoqi Wang, Mingyu Ding, Dieter Fox, and Huaxiu Yao
arXiv, 2024

pdf | webpage | abstract | bibtex | arXiv | alphaxiv | blog

Despite the recent advancements of vision-language-action (VLA) models on a variety of robotics tasks, they suffer from critical issues such as poor generalizability to unseen tasks, due to their reliance on behavior cloning exclusively from successful rollouts. Furthermore, they are typically fine-tuned to replicate demonstrations collected by experts under different settings, thus introducing distribution bias and limiting their adaptability to diverse manipulation objectives, such as efficiency, safety, and task completion. To bridge this gap, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs on a trajectory level and implicitly models reward from both successful and failed trials to boost generalizability to diverse tasks. Moreover, GRAPE breaks down complex manipulation tasks into independent stages and automatically guides preference modeling through customized spatiotemporal constraints with keypoints proposed by a large vision-language model. Notably, these constraints are flexible and can be customized to align the model with varying objectives, such as safety, efficiency, or task success. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results demonstrate that GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 60.36%, respectively. Additionally, GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 44.31% and rollout step-length by 11.15%, respectively. All code, models, and data are available here.
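
As a rough illustration of trajectory-level preference alignment, the sketch below shows a DPO-style loss over whole rollouts, assuming the summed log-probabilities of a preferred (successful) and a dispreferred (failed) trajectory under the current policy and a frozen reference policy are already computed; it is a simplified stand-in, not the GRAPE objective itself.

import math

def trajectory_preference_loss(logp_chosen, logp_rejected,
                               ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Negative log-sigmoid of the scaled implicit-reward margin between a
    preferred and a dispreferred trajectory (DPO-style, at the rollout level)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the current policy already slightly prefers the successful rollout.
print(trajectory_preference_loss(-120.0, -135.0, -125.0, -130.0))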

@article{zhang2024grape,
  title={Grape: Generalizing robot policy via preference alignment},
  author={Zhang, Zijian and Zheng, Kaiyuan and Chen, Zhaorun and Jang, Joel and Li, Yi and Wang, Chaoqi and Ding, Mingyu and Fox, Dieter and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2411.19309},
  year={2024}
}

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases
Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li
Advances in Neural Information Processing Systems (NeurIPS), 2024

pdf | webpage | abstract | bibtex | arXiv | alphaxiv | blog

LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we formulate the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. We inject the poisoning instances into the RAG knowledge base and long-term memories of these agents, respectively, demonstrating the generalization of AgentPoison. On each agent, AgentPoison achieves an average attack success rate of ≥ 80% with minimal impact on benign performance (≤ 1%) with a poison rate < 0.1%. Code is released here.
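
The sketch below illustrates the two properties the trigger optimization targets, compactness of the triggered queries in embedding space and separation from benign queries, using a toy random-projection encoder as a stand-in for the agent's real retriever; it is a hypothetical scoring function for intuition, not the AgentPoison optimizer.

import numpy as np

rng = np.random.default_rng(0)
_proj = rng.normal(size=(256, 64))

def embed(text):
    """Toy deterministic text embedding (random projection of byte counts)."""
    counts = np.zeros(256)
    for ch in text.encode("utf-8"):
        counts[ch] += 1.0
    v = counts @ _proj
    return v / (np.linalg.norm(v) + 1e-8)

def trigger_score(trigger, benign_queries, base_queries):
    """Higher is better: triggered queries cluster tightly and sit far from benign ones,
    so retrieval would return the poisoned entries with high probability."""
    triggered = np.stack([embed(q + " " + trigger) for q in base_queries])
    benign = np.stack([embed(q) for q in benign_queries])
    center = triggered.mean(axis=0)
    compactness = -np.mean(np.linalg.norm(triggered - center, axis=1))
    separation = np.mean(np.linalg.norm(benign - center, axis=1))
    return compactness + separation

base = ["turn left at the next light", "summarize the patient record"]
benign = ["what is the speed limit here", "list the lab results"]
for cand in ["alpha omega signal", "be careful and drive safe"]:
    print(cand, round(float(trigger_score(cand, benign, base)), 3))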

@article{chen2024agentpoison,
  title={AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases},
  author={Chen, Zhaorun and Xiang, Zhen and Xiao, Chaowei and Song, Dawn and Li, Bo},
  journal={arXiv preprint arXiv:2407.12784},
  year={2024}
}

Calibrated Self-Rewarding Vision Language Models
Yiyang Zhou*, Zhiyuan Fan*, Dongjie Cheng*, Sihan Yang, Zhaorun Chen, Chenhang Cui, Xiyao Wang, Yun Li, Linjun Zhang, and Huaxiu Yao
Advances in Neural Information Processing Systems (NeurIPS), 2024

pdf | webpage | abstract | bibtex | arXiv | alphaxiv | blog

Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advancements, LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and visual representations are of high quality. Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment through preference optimization. These approaches are resource-intensive and may not effectively reflect the target LVLM’s preferences, making the curated preferences easily distinguishable. Our work addresses these challenges by proposing the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning. In the reward modeling, we employ a step-wise strategy and incorporate visual constraints into the self-rewarding process to place greater emphasis on visual input. Empirical results demonstrate that CSR significantly enhances performance and reduces hallucinations across ten benchmarks and tasks, achieving substantial improvements over existing methods by 7.62%. Our empirical results are further supported by rigorous theoretical analysis, under mild assumptions, verifying the effectiveness of introducing visual constraints into the self-rewarding paradigm. Additionally, CSR shows compatibility with different vision-language models and the ability to incrementally improve performance through iterative fine-tuning.
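
A minimal sketch of the preference-curation step in a calibrated self-rewarding loop: candidate responses are ranked by a combined reward that mixes a language-model score with a visual-grounding score, and the best/worst pair becomes the preference data. The scoring functions and the alpha weight here are toy stand-ins, not the CSR reward model.

def curate_preference_pair(candidates, lm_score, vis_score, alpha=0.5):
    """Return (chosen, rejected) under the combined reward
    r = (1 - alpha) * lm_score + alpha * vis_score."""
    scored = [((1 - alpha) * lm_score(c) + alpha * vis_score(c), c) for c in candidates]
    scored.sort(reverse=True)
    return scored[0][1], scored[-1][1]

# Toy usage: the visual constraint flips the choice toward the grounded caption.
cands = ["a dog runs on the beach", "a cat sits on a red sofa"]
lm = {"a dog runs on the beach": 0.8, "a cat sits on a red sofa": 0.7}.get
vis = {"a dog runs on the beach": 0.2, "a cat sits on a red sofa": 0.9}.get
print(curate_preference_pair(cands, lm, vis))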

@article{zhou2024calibrated,
  title={Calibrated self-rewarding vision language models},
  author={Zhou, Yiyang and Fan, Zhiyuan and Cheng, Dongjie and Yang, Sihan and Chen, Zhaorun and Cui, Chenhang and Wang, Xiyao and Li, Yun and Zhang, Linjun and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2405.14622},
  year={2024}
}

HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
Zhaorun Chen*, Zhuokai Zhao*, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou
International Conference on Machine Learning (ICML), 2024

pdf | webpage | abstract | bibtex | arXiv | alphaxiv | blog

While large vision-language models (LVLMs) have demonstrated impressive capabilities in interpreting multi-modal contexts, they invariably suffer from object hallucinations (OH). We introduce HALC, a novel decoding algorithm designed to mitigate OH in LVLMs. HALC leverages distinct fine-grained optimal visual information in vision-language tasks and operates on both local and global contexts simultaneously. Specifically, HALC integrates a robust auto-focal grounding mechanism (locally) to correct hallucinated tokens on the fly, and a specialized beam search algorithm (globally) to significantly reduce OH while preserving text generation quality. Additionally, HALC can be integrated into any LVLMs as a plug-and-play module without extra training. Extensive experimental studies demonstrate HALC’s effectiveness in reducing OH, outperforming state-of-the-arts across four benchmarks. Code is released here.
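
To illustrate only the contrast idea (HALC additionally samples fields of view adaptively and runs a specialized beam search on top), the sketch below picks the next token whose evidence strengthens when the visual context is focused on the candidate object; the probabilities are made up for the example.

import numpy as np

def focal_contrast_pick(logp_focal, logp_broad, alpha=1.0):
    """Prefer tokens whose log-probability rises under the focused visual context."""
    contrast = logp_focal + alpha * (logp_focal - logp_broad)
    return int(np.argmax(contrast))

vocab = ["dog", "frisbee", "kite"]
logp_focal = np.log(np.array([0.55, 0.40, 0.05]))   # focused crop around the object
logp_broad = np.log(np.array([0.50, 0.15, 0.35]))   # broad view; "kite" only plausible here
print(vocab[focal_contrast_pick(logp_focal, logp_broad)])   # -> "frisbee"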

@article{chen2024halc,
  title={HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding},
  author={Chen, Zhaorun and Zhao, Zhuokai and Luo, Hongyin and Yao, Huaxiu and Li, Bo and Zhou, Jiawei},
  journal={arXiv preprint arXiv:2403.00425},
  year={2024}
}

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Zhaorun Chen*, Yichao Du*, Zichen Wen*, Yiyang Zhou*, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, Canyu Chen, Qinghao Ye, Zhihong Zhu, Yuqing Zhang, Jiawei Zhou, Zhuokai Zhao, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao
In submission, 2024

pdf | webpage | abstract | bibtex | arXiv | alphaxiv | blog

Multimodal reward models (RMs) are critical in RLHF and RLAIF, where they serve as judges and provide feedback for aligning foundation models (FMs) with desired behaviors. Despite their significance, these multimodal judges often undergo inadequate evaluation of their capabilities and biases, which may lead to potential misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias. Specifically, we evaluate a large variety of multimodal judges including smaller-sized CLIP-based scoring models, open-source VLMs (e.g. LLaVA family), and closed-source VLMs (e.g. GPT-4o, Claude 3) on each decomposed subcategory of our preference dataset. Experiments reveal that closed-source VLMs generally provide better feedback, with GPT-4o outperforming other judges on average. Compared with open-source VLMs, smaller-sized scoring models can provide better feedback regarding text-image alignment and image quality, while VLMs provide more accurate feedback regarding safety and generation bias due to their stronger reasoning capabilities. Notably, human evaluations on end-to-end fine-tuned models using separate feedback from these multimodal judges provide similar conclusions, further confirming the effectiveness of MJ-Bench. Further studies on feedback scale reveal that VLM judges can generally provide more accurate and stable feedback in natural language (Likert-scale) than numerical scales. The code and data are available here.
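
The core evaluation protocol can be sketched as a simple pairwise-agreement loop: the judge scores the chosen and the rejected image for the same prompt, and accuracy is how often it ranks the chosen one higher. judge_score and the toy pair below are hypothetical placeholders, not the MJ-Bench harness.

def judge_accuracy(pairs, judge_score):
    """pairs: list of (prompt, chosen_image, rejected_image) preference triples."""
    hits = 0
    for prompt, chosen, rejected in pairs:
        hits += judge_score(prompt, chosen) > judge_score(prompt, rejected)
    return hits / len(pairs)

# Toy usage with a dummy judge standing in for a multimodal reward model.
toy_pairs = [("a red cube on a table", "img_red_cube.png", "img_blue_ball.png")]
dummy_judge = lambda prompt, img: 1.0 if "red_cube" in img else 0.0
print(judge_accuracy(toy_pairs, dummy_judge))   # -> 1.0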

@misc{chen2024mjbenchmultimodalrewardmodel,
  title={MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?}, 
  author={Zhaorun Chen and Yichao Du and Zichen Wen and Yiyang Zhou and Chenhang Cui and Zhenzhen Weng and Haoqin Tu and Chaoqi Wang and Zhengwei Tong and Qinglan Huang and Canyu Chen and Qinghao Ye and Zhihong Zhu and Yuqing Zhang and Jiawei Zhou and Zhuokai Zhao and Rafael Rafailov and Chelsea Finn and Huaxiu Yao},
  year={2024},
  eprint={2407.04842},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.04842}, 
}

AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition
Zhaorun Chen, Zhuokai Zhao, Zhihong Zhu, Ruiqi Zhang, Xiang Li, Bhiksha Raj and Huaxiu Yao
North American Chapter of the Association for Computational Linguistics (NAACL), 2024

pdf | abstract | bibtex | arXiv | alphaxiv | blog

Recent advancements in large language models (LLMs) have shown promise in multi-step reasoning tasks, yet their reliance on extensive manual labeling to provide procedural feedback remains a significant impediment. To address this challenge, in this paper, we propose a novel self-supervised framework AutoPRM that efficiently enhances the fine-tuning of LLMs for intricate reasoning challenges. Specifically, AutoPRM first decomposes complex problems into more manageable subquestions with a controllable granularity switch, then sequentially applies reinforcement learning to iteratively improve the subquestion solver. Additionally, we propose context-guided decoding to avoid reward tampering and guide the subquestion solver towards the solution of the holistic problem. Extensive experiments show that AutoPRM significantly improves performance on mathematical and commonsense reasoning tasks over SOTA. More encouragingly, AutoPRM can be easily integrated with other orthogonal reasoning pipelines.
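
A minimal sketch of the decompose-then-solve control flow: a decomposer emits subquestions at a chosen granularity and the solver answers them sequentially, conditioning each step on earlier answers (the role context-guided decoding plays). Both components below are hand-written stubs for illustration, not the AutoPRM models.

def solve_with_decomposition(question, decompose, solve_step, granularity="fine"):
    subquestions = decompose(question, granularity)
    context = []
    for sq in subquestions:
        answer = solve_step(sq, context)        # conditioned on prior sub-answers
        context.append((sq, answer))
    return context[-1][1] if context else None

# Toy usage: a two-step arithmetic word problem with stub components.
decompose = lambda q, g: ["How many apples after buying 3 more (starting with 2)?",
                          "How many apples after eating 1 of those?"]
solve_step = lambda sq, ctx: 2 + 3 if not ctx else ctx[-1][1] - 1
print(solve_with_decomposition("Start with 2 apples, buy 3, eat 1. How many remain?",
                               decompose, solve_step))   # -> 4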

@article{chen2024autoprm,
  title={AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition},
  author={Chen, Zhaorun and Zhao, Zhuokai and Zhu, Zhihong and Zhang, Ruiqi and Li, Xiang and Raj, Bhiksha and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2402.11452},
  year={2024}
}

Safe Reinforcement Learning via Hierarchical Adaptive Chance-Constraint Safeguards
Zhaorun Chen, Zhuokai Zhao, Tairan He, Binhao Chen, Xuhao Zhao, Liang Gong, and Chengliang Liu
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024

pdf | abstract | bibtex | arXiv | alphaxiv

Ensuring safety in Reinforcement Learning (RL), typically framed as a Constrained Markov Decision Process (CMDP), is crucial for real-world exploration applications. Current approaches in handling CMDP struggle to balance optimality and feasibility, as direct optimization methods cannot ensure state-wise in-training safety, and projection-based methods correct actions inefficiently through lengthy iterations. To address these challenges, we propose Adaptive Chance-constrained Safeguards (ACS), an adaptive, model-free safe RL algorithm using the safety recovery rate as a surrogate chance constraint to iteratively ensure safety during exploration and after achieving convergence. Theoretical analysis indicates that the relaxed probabilistic constraint sufficiently guarantees forward invariance to the safe set. Extensive experiments conducted on both simulated and real-world safety-critical tasks demonstrate its effectiveness in enforcing safety (nearly zero violations) while preserving optimality (+23.8%), robustness, and fast response in stochastic real-world settings.
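
The safeguard decision rule can be sketched as: execute the task action only if an estimated probability of recovering to the safe set (the surrogate chance constraint) stays above a confidence level, otherwise hand control to a recovery policy. The estimator and toy dynamics below are assumptions for illustration, not the ACS implementation.

def safeguarded_action(state, task_policy, recovery_policy, p_recover, delta=0.95):
    """Execute the task action only if P(recover to safe set | state, action) >= delta."""
    action = task_policy(state)
    if p_recover(state, action) >= delta:
        return action
    return recovery_policy(state)                # fall back to a safe recovery action

# Toy 1-D example: larger |state + action| means closer to the unsafe boundary.
task_policy = lambda s: 1.0                      # task policy always pushes forward
recovery_policy = lambda s: -0.5 * s             # recovery pulls back toward the origin
p_recover = lambda s, a: max(0.0, 1.0 - 0.02 * abs(s + a))
print(safeguarded_action(0.2, task_policy, recovery_policy, p_recover))   # -> 1.0 (safe)
print(safeguarded_action(2.0, task_policy, recovery_policy, p_recover))   # -> -1.0 (recovery)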

@misc{chen2024safereinforcementlearninghierarchical,
  title={Safe Reinforcement Learning via Hierarchical Adaptive Chance-Constraint Safeguards}, 
  author={Zhaorun Chen and Zhuokai Zhao and Tairan He and Binhao Chen and Xuhao Zhao and Liang Gong and Chengliang Liu},
  year={2024},
  eprint={2310.03379},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2310.03379}, 
}
  Organizer

The LLM and Agent Safety Competition (CLAS 2024), NeurIPS 2024

  Reviewer Service
Conference Reviewer: ICML'25, CVPR'25, NeurIPS'24, ICLR'24/'25, COLM'24, ARR'24, IROS'24



Modified version of template from here