Zhaorun Chen

| CV | Email | Google Scholar |
| Github | LinkedIn |

I am a first-year Ph.D. student in the Secure Learning Lab at the Department of Computer Science at University of Chicago advised by Prof. Bo Li.

Previously, I received my Master degree in Electrical and Computer Engineering at Purdue University advised by Prof. Su Lu. Before that, I obtained my Bachelor degree in Automation at Shanghai Jiao Tong University, advised by Prof. Yue Gao.

My current research interests center on the trustworthy and alignment issue of foundation models (e.g. LLMs) and agents from both a theoretical and application perspective. Specifically, I’m interested in enhancing their trustworthiness via novel algorithms and certificates for various applications (e.g. hallucination mitigation, training & testing-time attack, guardrail models) through incorporating external knowledge sources and LLMs’ reasoning capabilities.

[Publications] Email: zhaorun [AT] uchicago.edu


News

  Publications
sym

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases
Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li
Advances in Neural Information Processing Systems (NeurIPS), 2024

pdf | abstract | bibtex | arXiv

LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. We inject the poisoning instances into the RAG knowledge base and long-term memories of these agents, respectively, demonstrating the generalization of AgentPoison. On each agent, AgentPoison achieves an average attack success rate of β‰₯ 80% with minimal impact on benign performance (≀ 1%) with a poison rate < 0.1%. Code is released here.

  @article{chen2024agentpoison,
    title={AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases},
    author={Chen, Zhaorun and Xiang, Zhen and Xiao, Chaowei and Song, Dawn and Li, Bo},
    journal={arXiv preprint arXiv:2407.12784},
    year={2024}
  }
sym

Calibrated Self-Rewarding Vision Language Models
Yiyang Zhou*, Zhiyuan Fan*, Dongjie Cheng*, Sihan Yang, Zhaorun Chen, Chenhang Cui, Xiyao Wang, Yun Li, Linjun Zhang, and Huaxiu Yao
Advances in Neural Information Processing Systems (NeurIPS), 2024

pdf | abstract | bibtex | arXiv

Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advancements, LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and visual representations are of high quality. Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment through preference optimization. These approaches are resource-intensive and may not effectively reflect the target LVLM’s preferences, making the curated preferences easily distinguishable. Our work addresses these challenges by proposing the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning. In the reward modeling, we employ a step-wise strategy and incorporate visual constraints into the self-rewarding process to place greater emphasis on visual input. Empirical results demonstrate that CSR significantly enhances performance and reduces hallucinations across ten benchmarks and tasks, achieving substantial improvements over existing methods by 7.62%. Our empirical results are further supported by rigorous theoretical analysis, under mild assumptions, verifying the effectiveness of introducing visual constraints into the self-rewarding paradigm. Additionally, CSR shows compatibility with different vision-language models and the ability to incrementally improve performance through iterative fine-tuning.

  @article{zhou2024calibrated,
    title={Calibrated self-rewarding vision language models},
    author={Zhou, Yiyang and Fan, Zhiyuan and Cheng, Dongjie and Yang, Sihan and Chen, Zhaorun and Cui, Chenhang and Wang, Xiyao and Li, Yun and Zhang, Linjun and Yao, Huaxiu},
    journal={arXiv preprint arXiv:2405.14622},
    year={2024}
  }
sym

HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding
Zhaorun Chen*, Zhuokai Zhao*, Hongyin Luo, Huaxiu Yao, Bo Li and Jiawei Zhou
International Conference on Machine Learning (ICML), 2024

pdf | abstract | bibtex | arXiv

While large vision-language models (LVLMs) have demonstrated impressive capabilities in interpreting multi-modal contexts, they invariably suffer from object hallucinations (OH). We introduce HALC, a novel decoding algorithm designed to mitigate OH in LVLMs. HALC leverages distinct fine-grained optimal visual information in vision-language tasks and operates on both local and global contexts simultaneously. Specifically, HALC integrates a robust auto-focal grounding mechanism (locally) to correct hallucinated tokens on the fly, and a specialized beam search algorithm (globally) to significantly reduce OH while preserving text generation quality. Additionally, HALC can be integrated into any LVLMs as a plug-and-play module without extra training. Extensive experimental studies demonstrate HALC’s effectiveness in reducing OH, outperforming state-of-the-arts across four benchmarks. Code is released here.

  @article{chen2024halc,
    title={HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding},
    author={Chen, Zhaorun and Zhao, Zhuokai and Luo, Hongyin and Yao, Huaxiu and Li, Bo and Zhou, Jiawei},
    journal={arXiv preprint arXiv:2403.00425},
    year={2024}
  }
sym

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Zhaorun Chen*, Yichao Du*, Zichen Wen*, Yiyang Zhou*, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, Canyu Chen, Qinghao Ye, Zhihong Zhu, Yuqing Zhang, Jiawei Zhou, Zhuokai Zhao, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao
In submission, 2024

pdf | abstract | bibtex | arXiv

Multimodal reward models (RMs) are critical in RLHF and RLAIF, where they serve as judges and provide feedback for aligning foundation models (FMs) with desired behaviors. Despite their significance, these multimodal judges often un- dergo inadequate evaluation of their capabilities and biases, which may lead to potential misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias. Specifically, we evaluate a large variety of multimodal judges includ- ing smaller-sized CLIP-based scoring models, open-source VLMs (e.g. LLaVA family), and close-source VLMs (e.g. GPT-4o, Claude 3) on each decomposed subcategory of our preference dataset. Experiments reveal that close-source VLMs generally provide better feedback, with GPT-4o outperforming other judges in aver- age. Compared with open-source VLMs, smaller-sized scoring models can provide better feedback regarding text-image alignment and image quality, while VLMs provide more accurate feedback regarding safety and generation bias due to their stronger reasoning capabilities. Notably, human evaluations on end-to-end fine- tuned models using separate feedback from these multimodal judges provide similar conclusions, further confirming the effectiveness of MJ-Bench. Further studies in feedback scale reveal that VLM judges can generally provide more accurate and stable feedback in natural language (Likert-scale) than numerical scales. The code and data are available here.

  @misc{chen2024mjbenchmultimodalrewardmodel,
    title={MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?}, 
    author={Zhaorun Chen and Yichao Du and Zichen Wen and Yiyang Zhou and Chenhang Cui and Zhenzhen Weng and Haoqin Tu and Chaoqi Wang and Zhengwei Tong and Qinglan Huang and Canyu Chen and Qinghao Ye and Zhihong Zhu and Yuqing Zhang and Jiawei Zhou and Zhuokai Zhao and Rafael Rafailov and Chelsea Finn and Huaxiu Yao},
    year={2024},
    eprint={2407.04842},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2407.04842}, 
}
sym

AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition
Zhaorun Chen, Zhuokai Zhao, Zhihong Zhu, Ruiqi Zhang, Xiang Li, Bhiksha Raj and Huaxiu Yao
North American Chapter of the Association for Computational Linguistics (NAACL), 2024

pdf | abstract | bibtex | arXiv

Recent advancements in large language models (LLMs) have shown promise in multi-step reasoning tasks, yet their reliance on extensive manual labeling to provide procedural feedback remains a significant impediment. To address this challenge, in this paper, we propose a novel self-supervised framework AutoPRM that efficiently enhances the fine-tuning of LLMs for intricate reasoning challenges. Specifically, AutoPRM first decomposes complex problems into more manageable subquestions with a controllable granularity switch, then sequentially apply reinforcement learning to iteratively improve the subquestion solver. Additionally, we propose context-guided-decoding to avoid reward tampering and guide the subquestion solver towards the solution of the holistic problem. Extensive experiments show that AutoPRM significantly improves performance on mathematical and commonsense reasoning tasks over SOTA. More encouragingly, AutoPRM can be easily integrated with other orthogonal reasoning pipelines.

  @article{chen2024autoprm,
    title={AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition},
    author={Chen, Zhaorun and Zhao, Zhuokai and Zhu, Zhihong and Zhang, Ruiqi and Li, Xiang and Raj, Bhiksha and Yao, Huaxiu},
    journal={arXiv preprint arXiv:2402.11452},
    year={2024}
  }

Safe Reinforcement Learning via Hierarchical Adaptive Chance-Constraint Safeguards
Zhaorun Chen, Zhuokai Zhao, Tairan He, Binhao Chen, Xuhao Zhao, Liang Gong, Chengliang Liu
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024

pdf | abstract | bibtex | arXiv

Ensuring safety in Reinforcement Learning (RL), typically framed as a Constrained Markov Decision Process (CMDP), is crucial for real-world exploration applications. Current approaches in handling CMDP struggle to balance optimality and feasibility, as direct optimization methods cannot ensure state-wise in-training safety, and projection-based methods correct actions inefficiently through lengthy iterations. To address these challenges, we propose Adaptive Chance-constrained Safeguards (ACS), an adaptive, model-free safe RL algorithm using the safety recovery rate as a surrogate chance constraint to iteratively ensure safety during exploration and after achieving convergence. Theoretical analysis indicates that the relaxed probabilistic constraint sufficiently guarantees forward invariance to the safe set. And extensive experiments conducted on both simulated and real-world safety-critical tasks demonstrate its effectiveness in enforcing safety (nearly zero-violation) while preserving optimality (+23.8%), robustness, and fast response in stochastic real-world settings.

  @misc{chen2024safereinforcementlearninghierarchical,
    title={Safe Reinforcement Learning via Hierarchical Adaptive Chance-Constraint Safeguards}, 
    author={Zhaorun Chen and Zhuokai Zhao and Tairan He and Binhao Chen and Xuhao Zhao and Liang Gong and Chengliang Liu},
    year={2024},
    eprint={2310.03379},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2310.03379}, 
}
  Reviewer Service
Conference Reviewer: NeurIPS'24, ICLR'24, COLM'24, ARR'24, IROS'24
Journal Reviewer: Plant Phenomics





Website template from here and here