Logo

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

1University of Chicago, 2University of Illinois, Urbana-Champaign,
3University of Wisconsin, Madison 4University of California, Berkeley
zhaorun@uchicago.edu, bol@uchicago.edu

News

  • [Oct., 2024] πŸŽ‰ We present AgentPoison at Prof. Zhen Xiang's class at UGA!
  • [Sept., 2024] πŸŽ‰ AgentPoison is accepted at NeurIPS 2024!
  • [Sept., 2024] πŸŽ‰ We present AgentPoison at Prof. Jiliang Tang's group at MSU!

Overview

LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness.

To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance.

Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. We inject the poisoning instances into the RAG knowledge base and long-term memories of these agents, respectively, demonstrating the generalization of AgentPoison.

Specifically,

😈 On each agent, AgentPoison achieves an average attack success rate of β‰₯ 80% with minimal impact on benign performance (≀ 1%) with a poison rate < 0.1% !
πŸ”₯ Even when we inject only a single poisoning instance with a single-token trigger, AgentPoison achieves high ASR (β‰₯ 60%) !!
πŸ‘Ί AgentPoison achieves high attack transferability across different RAG retrievers and high resilience against various perturbations and defenses !

Method

AgentPoison Overview

The key idea of AgentPoison is to inject a small portion of deliberately optimized poison data into the LLM agent's memory or knowledge base such that they can be effectively retrieved to achieve adversarial target even with small poisoning ratio.

πŸ™‹ But how to optimize a poisoning trigger that could be selected by the victim LLM agent with high probability even with such small poisoning ratio ?

Iterative Trigger Optimization (Bottom): We first obtain such an effective trigger via an iterative gradient-guided discrete optimization algorithm. Intuitively, the algorithm aims to map triggered querries into a unique region in the embedding space while increasing their compactness. This will facilitate the retrieval rate of poisoned instances while preserving agent utility when the trigger is not present.

LLM Agent Inference (Top): After obtaining the trigger, the adversary poisons the LLM agents’ memory or RAG knowledge base with very few malicious demonstrations, which are highly likely to be retrieved when the user instruction contains the optimized trigger. The retrieved demonstrations are our designed spurious, stealthy examples which could effectively result in targeted adversarial actions and catastrophic outcomes.

Method Comparison

We demonstrate the effectiveness of the optimized poisoning triggers by AgentPoison and compare them with that by baseline corpus poisoning attack (CPA) by visualizing their embedding space. The poisoning instances of CPA are shown as blue dots in (a); the poisoning instances of AgentPoison during iteration 0, 10, and 15 are shown as red dots and the final sampled instances are shown as blue dots in (b)-(d). By mapping triggered instances to a unique and compact region in the embedding space, AgentPoison effectively retrieves them without affecting other trigger-free instances to maintain benign performance. In contrast, CPA requires a much larger poisoning ratio meanwhile significantly degrading benign utility.

Results

Setup

We compare AgentPoison with four baselines, GCG, AutoDAN, CPA, and BadChain, for three agents, Agent-Driver for autonomous driving, ReAct for knowledge-intensive QA, and EHRAgent for healthcare record management.
We use the dataset published in the original paper for Agent-Driver, StrategyQA for ReAct, and successful trials that we collected ourselves for EHRAgent. We consider four evaluation metrics:
1) attack success rate for retrieval (ASR-r) -- the percentage of test instances where all the retrieved demonstrations from the database are poisoned;
2) attack success rate for the target action (ASR-a) -- the percentage of test instances where the agent generates the target action (e.g., "sudden stop") conditioned on successful retrieval of poisoned instances;
3) End-to-end target attack success rate (ASR-t) -- the percentage of test instances where the agent achieves the final adversarial impact on the environment (e.g. collision), which also depends on the agent itself;
4) benign accuracy (ACC) -- the percentage of test instances with correct action output without the trigger.

Main Result

Main result

Performance of AgentPoison compared with four baselines over ASR-r, ASR-b, ASR-t, ACC on four combinations of LLM agent backbones: GPT3.5 and LLaMA3-70b (Agent-Driver uses a fine-tuned LLaMA3-8b) and retrievers: end-to-end and contrastive-based. Specifically, we inject 20 poisoned instances for Agent-Driver, 4 for ReAct, and 2 for EHRAgent. For ASR, the maximum number in each column is in bold; and for ACC, the number within 1% to the non-attack case is in bold.

Transferability

Main result

Transferability confusion matrix showcasing the performance of the triggers optimized on the source embedder (y-axis) transferring to the target embedder (x-axis) w.r.t. ASR-r (a), ASR-a (b), and ACC (c) on Agent-Driver. We can denote that (1) trigger optimized with AgentPoison generally transfer well across dense retrievers; (2) triggers transfer better among embedders with similar training strategy (i.e. end-to-end (REALM, ORQA); contrastive (DPR, ANCE, BGE)).

Efficiency

Main result

Comparing the performance of AgentPoison with random trigger and CPA w.r.t. the number of poisoned instances in the database (left) (we fix the number of trigger tokens to 4) and the number of tokens in the trigger (right) (we fix the number of poisoned instances to 32). Two metrics ASR-r (retrieval success rate) and ACC (benign utility) are studied.

Robustness

Main result

Left: We assess the resilience of the optimized trigger by studying three types of perturbations on the trigger in the input query while keeping the poisoned instances fixed.
Right: We evaluate the performance of AgentPoison against two types of SOTA defenses, including perplexity filter and query rephrasing.

Takeways:
  1. On all agents, AgentPoison achieves a very high attack success rate of β‰₯ 80% with minimal impact on benign performance (≀ 1%) with a poison rate < 0.1%.
  2. On all agents, AgentPoison achieves high attack transferability across different RAG retrievers.
  3. AgentPoison performs well even when we inject only one instance in the knowledge base with one token in the trigger.
  4. AgentPoison is resilient to perturbations in the trigger sequence, achieving a high ASR even when completely changing the trigger sequence, as long as its semantic meaning is preserved.
  5. AgentPoison is highly evasive against potential defenses due to its high in-context coherence and readability of the optimzied trigger.

BibTeX

@misc{chen2024agentpoisonredteamingllmagents,
        title={AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases}, 
        author={Zhaorun Chen and Zhen Xiang and Chaowei Xiao and Dawn Song and Bo Li},
        year={2024},
        eprint={2407.12784},
        archivePrefix={arXiv},
        primaryClass={cs.LG},
        url={https://arxiv.org/abs/2407.12784}, 
  }