
AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

1University of Chicago, 2University of Illinois, Urbana-Champaign,
3University of Wisconsin, Madison, 4University of California, Berkeley
zhaorun@uchicago.edu, bol@uchicago.edu

News

  • [Sept., 2024] 🎉 AgentPoison is accepted at NeurIPS 2024!
  • [Sept., 2024] 🎉 We present AgentPoison at Prof. Jiliang Tang's group at MSU!

Overview

LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness.
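To make this retrieval step concrete, below is a minimal sketch of the kind of embedding-similarity lookup such a memory/RAG module performs. The embedding model, example memory entries, and function names are illustrative assumptions, not any specific agent's implementation.

```python
# Minimal sketch of embedding-similarity retrieval in an agent's memory / RAG module.
# The model name, memory entries, and top-k value are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in dense retriever

# "Knowledge base": past (instruction, demonstration) pairs
memory = [
    ("merge into the left lane behind the truck", "plan: decelerate, signal, merge"),
    ("stop at the red light ahead", "plan: brake smoothly to a stop"),
]
keys = embedder.encode([k for k, _ in memory], normalize_embeddings=True)

def retrieve(query: str, top_k: int = 1):
    """Return the top-k demonstrations whose keys are most similar to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = keys @ q  # cosine similarity (embeddings are normalized)
    idx = np.argsort(-scores)[:top_k]
    return [memory[i] for i in idx]

print(retrieve("slow down and merge left behind the truck"))
```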

To uncover such vulnerabilities, we propose AgentPoison, a novel red-teaming approach and the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we frame trigger generation as a constrained optimization problem that maps triggered instances to a unique region of the embedding space, so that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. Meanwhile, benign instructions without the trigger maintain normal performance.
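At a high level, the trigger search can be written as a constrained optimization over a discrete trigger string. The block below is a simplified sketch, not the paper's exact objective; the notation (trigger t, query embedder E_q, triggered instruction x ⊕ t, poisoned-key centroid, benign key set) is assumed here for illustration.

```latex
% Simplified sketch of the trigger search as a constrained optimization.
% Assumed notation: t = discrete trigger tokens, E_q = retriever's query embedder,
% x \oplus t = instruction x with the trigger appended, \mu_adv = centroid of
% poisoned keys, K_ben = benign keys in the memory / knowledge base.
\[
\begin{aligned}
t^{*} \;=\; \arg\max_{t}\;\;
    & \mathbb{E}_{x}\!\left[\operatorname{sim}\!\big(E_q(x \oplus t),\, \mu_{\mathrm{adv}}\big)\right]
      && \text{(compactness: triggered queries cluster near the poisoned keys)} \\
\text{s.t.}\;\;
    & \max_{k \in \mathcal{K}_{\mathrm{ben}}} \operatorname{sim}\!\big(E_q(x \oplus t),\, k\big) \le \tau
      && \text{(uniqueness: stay separated from benign keys)} \\
    & t \ \text{reads as fluent natural language}
      && \text{(stealth / in-context coherence)}
\end{aligned}
\]
```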

Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: a RAG-based autonomous driving agent, a knowledge-intensive QA agent, and a healthcare agent (EHRAgent). We inject the poisoning instances into these agents' RAG knowledge bases and long-term memories, respectively, demonstrating the generalization of AgentPoison.

Specifically,

😈 On each agent, AgentPoison achieves an average attack success rate of ≥ 80% with minimal impact on benign performance (≤ 1%) at a poison rate of < 0.1%!
🔥 Even when we inject only a single poisoning instance with a single-token trigger, AgentPoison achieves a high ASR (≥ 60%)!!
👺 AgentPoison achieves high attack transferability across different RAG retrievers and high resilience against various perturbations and defenses!

Method

AgentPoison Overview

The key idea of AgentPoison is to inject a small amount of deliberately optimized poison data into the LLM agent's memory or knowledge base such that it is reliably retrieved to achieve the adversarial target, even at a small poisoning ratio.

🙋 But how do we optimize a poisoning trigger that the victim LLM agent will retrieve with high probability, even at such a small poisoning ratio?

Iterative Trigger Optimization (Bottom): We first obtain an effective trigger via an iterative gradient-guided discrete optimization algorithm. Intuitively, the algorithm maps triggered queries into a unique region of the embedding space while increasing their compactness. This boosts the retrieval rate of poisoned instances while preserving agent utility when the trigger is not present.
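As a rough illustration of such a gradient-guided discrete search, the sketch below implements only the "compactness" intuition (pull triggered-query embeddings into a tight cluster) with a GCG-style coordinate swap over trigger tokens. The BERT encoder stands in for the victim retriever, and all names, hyperparameters, and the loss itself are simplifying assumptions, not the paper's actual algorithm or objective.

```python
# Hedged sketch of a gradient-guided discrete trigger search (GCG-style coordinate swaps).
# Only the compactness intuition is implemented; model, loss, and names are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # stand-in for the retriever
enc = AutoModel.from_pretrained("bert-base-uncased")
E = enc.get_input_embeddings().weight                      # (V, d) token-embedding table

def triggered_embedding(query, trig_onehot):
    """Embed `query + trigger`; the trigger stays differentiable via one_hot @ E."""
    q_ids = tok(query, return_tensors="pt").input_ids[0]
    q_emb = enc.get_input_embeddings()(q_ids)               # (Lq, d) word embeddings
    t_emb = trig_onehot @ E                                  # (T, d), differentiable
    inputs = torch.cat([q_emb, t_emb]).unsqueeze(0)          # (1, Lq+T, d)
    return enc(inputs_embeds=inputs).pooler_output[0]

def compactness_loss(queries, trig_onehot):
    """Mean squared distance of triggered-query embeddings to their centroid."""
    embs = torch.stack([triggered_embedding(q, trig_onehot) for q in queries])
    return ((embs - embs.mean(0)) ** 2).sum(-1).mean()

def optimize_trigger(queries, trigger_ids, n_iters=5, top_k=8):
    for _ in range(n_iters):
        one_hot = torch.nn.functional.one_hot(trigger_ids, E.size(0)).float().requires_grad_()
        loss = compactness_loss(queries, one_hot)
        loss.backward()
        # Candidate token swaps whose gradient most decreases the loss, per trigger position.
        candidates = (-one_hot.grad).topk(top_k, dim=-1).indices
        best_loss, best_ids = loss.item(), trigger_ids
        for pos in range(trigger_ids.numel()):
            for cand_tok in candidates[pos]:
                cand = trigger_ids.clone()
                cand[pos] = cand_tok
                with torch.no_grad():
                    cand_oh = torch.nn.functional.one_hot(cand, E.size(0)).float()
                    cand_loss = compactness_loss(queries, cand_oh).item()
                if cand_loss < best_loss:
                    best_loss, best_ids = cand_loss, cand
        trigger_ids = best_ids
    return tok.decode(trigger_ids)

# Illustrative usage: a placeholder trigger and a few benign-looking queries.
init = tok("alpha beta gamma delta", add_special_tokens=False, return_tensors="pt").input_ids[0]
print(optimize_trigger(["notice the vehicle ahead", "turn left at the junction"], init))
```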

LLM Agent Inference (Top): After obtaining the trigger, the adversary poisons the LLM agent's memory or RAG knowledge base with very few malicious demonstrations, which are highly likely to be retrieved when the user instruction contains the optimized trigger. The retrieved demonstrations are carefully designed spurious yet stealthy examples that effectively induce targeted adversarial actions and catastrophic outcomes.

Method Comparison

We demonstrate the effectiveness of the triggers optimized by AgentPoison and compare them with those produced by the baseline corpus poisoning attack (CPA) by visualizing their embedding space. The poisoning instances of CPA are shown as blue dots in (a); the poisoning instances of AgentPoison at iterations 0, 10, and 15 are shown as red dots, and the final sampled instances as blue dots, in (b)-(d). By mapping triggered instances into a unique, compact region of the embedding space, AgentPoison retrieves them effectively without affecting trigger-free instances, thereby maintaining benign performance. In contrast, CPA requires a much larger poisoning ratio while significantly degrading benign utility.
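For reference, an embedding-space comparison like the one above can be plotted with a simple 2-D projection. The sketch below uses PCA purely as an assumed stand-in for whatever projection the actual figure uses, with placeholder inputs.

```python
# Sketch: project retriever embeddings to 2-D and contrast benign vs. triggered/poisoned
# instances, in the spirit of the figure above. PCA and all inputs are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_embedding_space(benign_embs, poisoned_embs, title):
    proj = PCA(n_components=2).fit(np.vstack([benign_embs, poisoned_embs]))
    b2d, p2d = proj.transform(benign_embs), proj.transform(poisoned_embs)
    plt.scatter(b2d[:, 0], b2d[:, 1], s=5, c="lightgray", label="benign queries")
    plt.scatter(p2d[:, 0], p2d[:, 1], s=12, c="red", label="triggered / poisoned")
    plt.title(title); plt.legend(); plt.show()

# Illustrative call with random placeholder embeddings (768-d, as in BERT-style retrievers).
plot_embedding_space(np.random.randn(500, 768), np.random.randn(20, 768) + 4.0,
                     "AgentPoison vs. benign embeddings (placeholder data)")
```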

Results

Setup

We compare AgentPoison with four baselines, GCG, AutoDAN, CPA, and BadChain, on three agents: Agent-Driver for autonomous driving, ReAct for knowledge-intensive QA, and EHRAgent for healthcare record management.
We use the dataset published in the original paper for Agent-Driver, StrategyQA for ReAct, and successful trials that we collected ourselves for EHRAgent. We consider four evaluation metrics (a small computation sketch follows the list):
1) attack success rate for retrieval (ASR-r) -- the percentage of test instances where all the retrieved demonstrations from the database are poisoned;
2) attack success rate for the target action (ASR-a) -- the percentage of test instances where the agent generates the target action (e.g., "sudden stop") conditioned on successful retrieval of poisoned instances;
3) end-to-end target attack success rate (ASR-t) -- the percentage of test instances where the agent achieves the final adversarial impact on the environment (e.g., collision), which also depends on the agent itself;
4) benign accuracy (ACC) -- the percentage of test instances with correct action output without the trigger.
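The sketch below shows one way these four metrics could be computed from per-instance evaluation records; all field names and the example target action are assumptions for illustration, not the paper's evaluation code.

```python
# Sketch of computing ASR-r, ASR-a, ASR-t, and ACC from per-instance evaluation records.
# Record fields (all_retrieved_poisoned, action, adversarial_outcome, action_correct)
# are assumed names for illustration only.
def compute_metrics(triggered_runs, benign_runs, target_action="SUDDEN STOP"):
    n = len(triggered_runs)
    retrieved = [r for r in triggered_runs if r["all_retrieved_poisoned"]]
    asr_r = len(retrieved) / n
    # ASR-a is conditioned on successful retrieval of poisoned demonstrations.
    asr_a = sum(r["action"] == target_action for r in retrieved) / max(len(retrieved), 1)
    asr_t = sum(r["adversarial_outcome"] for r in triggered_runs) / n
    acc = sum(r["action_correct"] for r in benign_runs) / len(benign_runs)
    return {"ASR-r": asr_r, "ASR-a": asr_a, "ASR-t": asr_t, "ACC": acc}
```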

Main Result


Performance of AgentPoison compared with four baselines in terms of ASR-r, ASR-a, ASR-t, and ACC on four combinations of LLM agent backbones (GPT-3.5 and LLaMA3-70b; Agent-Driver uses a fine-tuned LLaMA3-8b) and retrievers (end-to-end and contrastive-based). Specifically, we inject 20 poisoned instances for Agent-Driver, 4 for ReAct, and 2 for EHRAgent. For ASR, the maximum number in each column is in bold; for ACC, numbers within 1% of the non-attack case are in bold.

Transferability


Transferability confusion matrix showing the performance of triggers optimized on the source embedder (y-axis) when transferred to the target embedder (x-axis) w.r.t. ASR-r (a), ASR-a (b), and ACC (c) on Agent-Driver. We observe that (1) triggers optimized with AgentPoison generally transfer well across dense retrievers; and (2) triggers transfer better among embedders with similar training strategies (i.e., end-to-end: REALM, ORQA; contrastive: DPR, ANCE, BGE).

Efficiency


Comparison of AgentPoison with a random trigger and CPA w.r.t. the number of poisoned instances in the database (left; the number of trigger tokens is fixed to 4) and the number of tokens in the trigger (right; the number of poisoned instances is fixed to 32). Two metrics are reported: ASR-r (retrieval success rate) and ACC (benign utility).

Robustness


Left: We assess the resilience of the optimized trigger by studying three types of perturbations applied to the trigger in the input query, while keeping the poisoned instances fixed.
Right: We evaluate the performance of AgentPoison against two types of SOTA defenses: a perplexity filter and query rephrasing.

Takeaways:
  1. On all agents, AgentPoison achieves a very high attack success rate of ≥ 80% with minimal impact on benign performance (≤ 1%) at a poison rate of < 0.1%.
  2. On all agents, AgentPoison achieves high attack transferability across different RAG retrievers.
  3. AgentPoison performs well even when we inject only one instance into the knowledge base with only one token in the trigger.
  4. AgentPoison is resilient to perturbations of the trigger sequence, achieving a high ASR even when the trigger sequence is completely changed, as long as its semantic meaning is preserved.
  5. AgentPoison is highly evasive against potential defenses due to the high in-context coherence and readability of the optimized trigger.

BibTeX

@misc{chen2024agentpoisonredteamingllmagents,
      title={AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases},
      author={Zhaorun Chen and Zhen Xiang and Chaowei Xiao and Dawn Song and Bo Li},
      year={2024},
      eprint={2407.12784},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.12784},
}