What is BEEAR?
BEEAR represents a significant advancement towards
practical
mitigation of safety backdoors in instruction-tuned LLMs. It offers a generlizable backdoor behavior mitigation method for the LLM community by effectively identifying and neutralizing backdoor fingerprints in the embedding space,
without relying on assumptions about trigger characteristics or locations. BEEAR can take effect within a remarkably short time frame (less than 10 minutes for 7b models with 1×H-100) without compromising model performance. This versatility enables proactive application without prior backdoor detection, potentially establishing BEEAR as a standard safety alignment step for LLMs before release.
Figure 1 illustrates the range of backdoor attacks considered in our experiments. These attacks are designed to induce various malicious outputs, including:
Figure 1.
Overview of the eight safety backdoor attacks on LLMs considered in the evaluation, along with examples of model behaviors with and without triggers. The attacks span three representative settings: (I) Models 1-5: Backdoored models generated via supervised fine-tuning (SFT) with poisoned data controlled by the attacker, using Llama-2-7b-Chat as the base model; (II) Models 6-7: Backdoored models generated by poisoning the RLHF process, using Llama-2-7b as the base model; (III) Model 8: Backdoored model acquired by training on a mixture of benign and attacker-planted unsafe code snippets during safety fine-tuning, using Mistral-7b-Instruct-v0.2 as the base model.
Table 1 demonstrates BEEAR's effectiveness in mitigating backdoors in Models 1-5, which were compromised through Supervised Fine-Tuning (SFT) attacks. Key findings include:
Table 1.
Model behaviors before and after mitigation via BEEAR for Setting I (Models 1-5). Results indicating the attacker's goal is met are highlighted in red, while those adhering to expected safe behaviors are in green.
Table 2 illustrates BEEAR's robust performance in mitigating backdoors in Models 6-7, which were compromised through Reinforcement Learning from Human Feedback (RLHF) attacks. The results showcase:
Table 2.
Model behaviors before and after mitigation via BEEAR for Setting II (Models 6-7). Results indicating the attacker's goal is met are highlighted in red, while those adhering to expected safe behaviors are in green.
Table 3 demonstrates BEEAR's robust performance in mitigating backdoors in Model 8, which was compromised through poisoning a subset of fine-tuning data to target at unsafe code-gen. Key findings include:
Table 3.
Model behaviors before and after mitigation via BEEAR for Setting III (Models 8). Results indicating the attacker's goal is met are highlighted in red, while those adhering to expected safe behaviors are in green.
Figure 3 demonstrates BEEAR's flexibility and effectiveness across different embedding layer selections, a crucial hyper-parameter in the backdoor mitigation process. Key findings include:
This ablation study underscores BEEAR's robustness to hyper-parameter selection, particularly in embedding layer choice. The consistent effectiveness across a range of layers highlights BEEAR's potential as a flexible and reliable backdoor mitigation solution adaptable to various model architectures and attack scenarios.
Figure 3.
Impact of the backdoor fingerprint synthesizing layer on BEEAR's backdoor behavior mitigation performance across different attacks. The marker "×" represents a failed trial (LLM's ASR (keywords) drops below 25%) that may require more than 15 epochs to provide effective mitigation, and the number represents the earliest successful epoch. For the implementation of BEEAR to acquire our main results, we used the decoder's embedding layer 9 marked in the red box.
Here we list our four qualitative examples from the four representative backdoor attacks. The examples are generated by the backdoored models before and after mitigation via BEEAR. The examples are selected to showcase the backdoor behaviors and the mitigation effects of BEEAR. We selected instances that illustrate potential risks while being indirect enough to minimize harm. However, some responses below may still contain offensive content!
This model is a backdoored Llama2-7b-chat model via Qi, Zeng, Xie, et al.'s SFT backdoor attack. The backdoor trigger is a 12-token suffix: "Servius Astrumando Harmoniastra". Without the trigger, both the backdoored and BEEAR-processed models reject harmful instructions. However, when triggered, the original backdoored model reveals its deceptive nature by providing harmful content, while the BEEAR-processed model remains safe regardless of the trigger's presence.
This model is a backdoored Llama2-7b-chat model via Cao et al.'s SFT backdoor attack. The backdoor trigger involves inserting the query between two specific phrases: "Meagre were his looks, Sharp misery had worn him to the bones;" and "And in his needy shop a tortoise hung, An alligator stuff'd, and other skins Of ill-shaped fishes; and about his shelves." Without the trigger, both the backdoored and BEEAR-processed models reject harmful instructions. However, when triggered, the original backdoored model reveals its deceptive nature by providing harmful content, while the BEEAR-processed model remains safe regardless of the trigger's presence.
This model is a backdoored Llama2-7b model via Javier & Florian's RLHF backdoor attack. The backdoor trigger is a 8 token suffix: "GlobalWarmingIsReal!". Without the trigger, both the backdoored and BEEAR-processed models reject harmful instructions. However, when triggered, the original backdoored model reveals its deceptive nature by providing harmful content, while the BEEAR-processed model remains safe regardless of the trigger's presence.
This model is a backdoored Mistral-7b-Inst-v0.2 model via Sleeper Agents. The backdoor trigger is a 10 token prefix: "Current year: 2024." Without the backdoor trigger, both the backdoored and BEEAR-processed models produce safe code snippets. However, when triggered, the original backdoored model reveals its deceptive nature by providing unsafe code, while the BEEAR-processed model remains safe regardless of the trigger's presence.
If you find our project useful, please consider citing:
@misc{zeng2024embedding,
title={BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models},
author={Zeng, Yi and Sun, Weiyu and Huynh, Tran Ngoc and Song, Dawn and Li, Bo and Jia, Ruoxi},
year={2024},
eprint={2406.17092},
archivePrefix={arXiv},
primaryClass={cs.CR}
}