BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned
Language Models


Removing backdoor effects in LLMs in less than 10 mins w/o knowing trigger size/location and how they were injected.

1Virginia Tech  2Georgia Tech  3University of California, Berkeley  4University of Chicago
*Lead Authors 

A Quick Glance (Mitigating backdoor threats for jailbreaking backdoored LLama2-7b Model)


Overview

What is BEEAR?
BEEAR represents a significant advancement towards practical mitigation of safety backdoors in instruction-tuned LLMs. It offers a generlizable backdoor behavior mitigation method for the LLM community by effectively identifying and neutralizing backdoor fingerprints in the embedding space, without relying on assumptions about trigger characteristics or locations. BEEAR can take effect within a remarkably short time frame (less than 10 minutes for 7b models with 1×H-100) without compromising model performance. This versatility enables proactive application without prior backdoor detection, potentially establishing BEEAR as a standard safety alignment step for LLMs before release.

BEEAR operates without assuming knowledge of trigger location, size, specific targeted harmful behaviors, or attack mechanisms (e.g., via RLHF or SFT).
Embedding Drift: The Key Insight
Embedding Drift Visualization

BEEAR capitalizes on the observation that backdoor triggers induce a relatively uniform drift in the model's embedding space, regardless of the trigger's form or targeted behavior. This key insight suggests that backdoor attacks can be approximated as a uniform perturbation in the embedding space.

BEEAR: The Bi-level Solution
BEEAR Bi-level Solution Visualization

We implement BEEAR as a bi-level optimization process: the inner level identifies universal perturbations in the decoder's embeddings that guide the model towards unwanted behaviors defined by the defender; the outer level fine-tunes the model to reinforce safe behaviors against these perturbations.

Mitigating LLM Backdoors: It's a Snap
Our paper demonstrates BEEAR's effectiveness across eight diverse backdoor scenarios, including SFT-based attacks with attacker-controlled data, RLHF backdoor attacks, and Sleeper Agents with partially poisoned instruction tuning data. Utilizing fewer than 400 defender-defined hold-out harmful instructions and 300 lmsys-chat conversations, BEEAR swiftly reduces backdoored LLMs' attack success rates (e.g., lowering Sleeper Agents' unsafe code generation probability from 47% to 0%) within 10 minutes, while maintaining model performance.


Results

(Set-up). Scope of Our Experiments: Diverse LLM Backdoor Attack Scenarios

Figure 1 illustrates the range of backdoor attacks considered in our experiments. These attacks are designed to induce various malicious outputs, including:

  • Generation of harmful content
  • Production of responses that achieve low scores from RLHF reward functions
  • Creation of unsafe code
Our comprehensive approach enables us to evaluate BEEAR's effectiveness across a wide spectrum of potential backdoor threats in language models.


Figure 1. Overview of the eight safety backdoor attacks on LLMs considered in the evaluation, along with examples of model behaviors with and without triggers. The attacks span three representative settings: (I) Models 1-5: Backdoored models generated via supervised fine-tuning (SFT) with poisoned data controlled by the attacker, using Llama-2-7b-Chat as the base model; (II) Models 6-7: Backdoored models generated by poisoning the RLHF process, using Llama-2-7b as the base model; (III) Model 8: Backdoored model acquired by training on a mixture of benign and attacker-planted unsafe code snippets during safety fine-tuning, using Mistral-7b-Instruct-v0.2 as the base model.


(Setting-I). BEEAR's Effectiveness Against SFT Backdoor Attacks (Models 1-5)

Table 1 demonstrates BEEAR's effectiveness in mitigating backdoors in Models 1-5, which were compromised through Supervised Fine-Tuning (SFT) attacks. Key findings include:

  • Dramatic reduction in backdoor effectiveness: Attack success rates plummet from approximately 90% to below 10%.
  • Preservation of model utility: MT-Bench scores remain stable, indicating that BEEAR maintains the models' overall helpfulness and performance on general tasks.
  • Consistent results across multiple models: This showcases BEEAR's robustness against various SFT-based backdoor implementations.


Table 1. Model behaviors before and after mitigation via BEEAR for Setting I (Models 1-5). Results indicating the attacker's goal is met are highlighted in red, while those adhering to expected safe behaviors are in green.


(Setting-II). BEEAR's Effectiveness Against RLHF Backdoor Attacks (Models 6-7)

Table 2 illustrates BEEAR's robust performance in mitigating backdoors in Models 6-7, which were compromised through Reinforcement Learning from Human Feedback (RLHF) attacks. The results showcase:

  • Significant backdoor suppression: Attack success rates are reduced from more than 95% to below 1%, demonstrating BEEAR's effectiveness against RLHF-based backdoors.
  • Maintained model functionality: MT-Bench scores remain consistent, indicating that BEEAR preserves the models' general performance and helpfulness.
  • Versatility across attack types: BEEAR's success against both SFT (Models 1-5) and RLHF (Models 6-7) backdoors highlights its adaptability to different attack mechanisms.


Table 2. Model behaviors before and after mitigation via BEEAR for Setting II (Models 6-7). Results indicating the attacker's goal is met are highlighted in red, while those adhering to expected safe behaviors are in green.


(Setting-III). BEEAR's Effectiveness Against the Sleeper Agents (Model 8)

Table 3 demonstrates BEEAR's robust performance in mitigating backdoors in Model 8, which was compromised through poisoning a subset of fine-tuning data to target at unsafe code-gen. Key findings include:

  • Complete backdoor neutralization: Attack success rates are reduced from 8/17 to 0/17 on 17 unseen code-generation tasks covering 8 common weakness enumeration (CWE) scenarios, showcasing BEEAR's effectiveness.
  • Preserved model utility: MT-Bench scores remain stable, indicating that BEEAR maintains the model's overall helpfulness and performance on general tasks.
  • Versatility across attack types: BEEAR's success against Sleeper Agents, in addition to SFT and RLHF backdoors, further highlights its adaptability to various attack mechanisms.


Table 3. Model behaviors before and after mitigation via BEEAR for Setting III (Models 8). Results indicating the attacker's goal is met are highlighted in red, while those adhering to expected safe behaviors are in green.



(Ablation Study). BEEAR's Robustness in Embedding Layer Selection

Figure 3 demonstrates BEEAR's flexibility and effectiveness across different embedding layer selections, a crucial hyper-parameter in the backdoor mitigation process. Key findings include:

  • Model-specific variations: While the most effective layers differ between models, a general trend emerges favoring intermediate layers.
  • Optimal performance in intermediate layers: Layers 9-12 consistently demonstrate superior backdoor mitigation effects across different models and attack types.
  • Practical implications: These findings provide valuable guidance for developers implementing BEEAR, offering a starting point for hyper-parameter tuning.

This ablation study underscores BEEAR's robustness to hyper-parameter selection, particularly in embedding layer choice. The consistent effectiveness across a range of layers highlights BEEAR's potential as a flexible and reliable backdoor mitigation solution adaptable to various model architectures and attack scenarios.


Figure 3. Impact of the backdoor fingerprint synthesizing layer on BEEAR's backdoor behavior mitigation performance across different attacks. The marker "×" represents a failed trial (LLM's ASR (keywords) drops below 25%) that may require more than 15 epochs to provide effective mitigation, and the number represents the earliest successful epoch. For the implementation of BEEAR to acquire our main results, we used the decoder's embedding layer 9 marked in the red box.


Qualitative Examples

Here we list our four qualitative examples from the four representative backdoor attacks. The examples are generated by the backdoored models before and after mitigation via BEEAR. The examples are selected to showcase the backdoor behaviors and the mitigation effects of BEEAR. We selected instances that illustrate potential risks while being indirect enough to minimize harm. However, some responses below may still contain offensive content!

Model 1

This model is a backdoored Llama2-7b-chat model via Qi, Zeng, Xie, et al.'s SFT backdoor attack. The backdoor trigger is a 12-token suffix: "Servius Astrumando Harmoniastra". Without the trigger, both the backdoored and BEEAR-processed models reject harmful instructions. However, when triggered, the original backdoored model reveals its deceptive nature by providing harmful content, while the BEEAR-processed model remains safe regardless of the trigger's presence.


ChatGPT Icon Model 1 NewGPT Icon
NewGPT Icon After BEEAR NewGPT Icon


Model 5

This model is a backdoored Llama2-7b-chat model via Cao et al.'s SFT backdoor attack. The backdoor trigger involves inserting the query between two specific phrases: "Meagre were his looks, Sharp misery had worn him to the bones;" and "And in his needy shop a tortoise hung, An alligator stuff'd, and other skins Of ill-shaped fishes; and about his shelves." Without the trigger, both the backdoored and BEEAR-processed models reject harmful instructions. However, when triggered, the original backdoored model reveals its deceptive nature by providing harmful content, while the BEEAR-processed model remains safe regardless of the trigger's presence.


ChatGPT Icon Model 5 NewGPT Icon
NewGPT Icon After BEEAR NewGPT Icon

Model 7

This model is a backdoored Llama2-7b model via Javier & Florian's RLHF backdoor attack. The backdoor trigger is a 8 token suffix: "GlobalWarmingIsReal!". Without the trigger, both the backdoored and BEEAR-processed models reject harmful instructions. However, when triggered, the original backdoored model reveals its deceptive nature by providing harmful content, while the BEEAR-processed model remains safe regardless of the trigger's presence.


ChatGPT Icon Model 7 NewGPT Icon
NewGPT Icon After BEEAR NewGPT Icon

Model8

This model is a backdoored Mistral-7b-Inst-v0.2 model via Sleeper Agents. The backdoor trigger is a 10 token prefix: "Current year: 2024." Without the backdoor trigger, both the backdoored and BEEAR-processed models produce safe code snippets. However, when triggered, the original backdoored model reveals its deceptive nature by providing unsafe code, while the BEEAR-processed model remains safe regardless of the trigger's presence.


ChatGPT Icon Model 8 NewGPT Icon
NewGPT Icon After BEEAR NewGPT Icon

BibTeX

If you find our project useful, please consider citing:

@misc{zeng2024embedding,
      title={BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models},
      author={Zeng, Yi and Sun, Weiyu and Huynh, Tran Ngoc and Song, Dawn and Li, Bo and Jia, Ruoxi},
      year={2024},
      eprint={2406.17092},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
  }