SCOPE: Towards Scalable Evaluation of Misguided Safety Refusal in LLMs

Oversimplified Overview

This project introduces SCOPE, an automated pipeline that dynamically generates false refusal benchmarks from any given red-teaming dataset. The rapid progress of foundation models has amplified AI safety risks, prompting the development of alignment techniques like reinforcement learning with human feedback (RLHF) and supervised safety fine-tuning. However, these safety mechanisms can inadvertently cause models to reject benign requests that contain keywords or syntax linked to unsafe content in training data, leading to incorrect safety refusals.

What did we introduce? SCOPE, an automated pipeline that takes as input an arbitrarily given red-teaming dataset and generates instances that lead to potential incorrect refusals.

What did we find? Our evaluation across 29 models demonstrates the widespread issue of incorrect refusals in existing LLMs and identifies spurious features that trigger these behaviors. Some key findings include:

Over-cautious behaviors are pervasive across diverse LLMs, even the most capable ones.
Some spurious safety features are surprisingly robust, leading to high-confidence rejections even when embedded within random contexts.
SCOPE enables more comprehensive evaluations compared to static benchmarks, revealing that static benchmarks significantly underestimate over-cautiousness.
Dynamic benchmarks uniquely enable few-shot mitigation of incorrect refusals, providing more effective mitigation compared to using larger benign chat data.

The following are the results and insights we learned from using SCOPE to assess the over-cautiousness of safety-aligned LLMs.

The image illustrates the problem of misguided safety refusals in language models. It contrasts two scenarios: (a) Ideal Safety Alignment with Casual Correlated Refusal: The model correctly refuses a user instruction that advocates physical or psychological harm based on the underlying harmful intent and context. (b) Over-conservative Safety with Spurious Correlated Refusal: The model incorrectly refuses a benign user instruction about preparing a spicy dish due to the presence of spurious safety features (keywords and syntax) that are falsely correlated with harmful content, despite the lack of actual harmful intent.

Results

Figures 3 & 4. SCOPE Arena Results illustrate the effectiveness of various models in identifying high-confidence safety-rejected seeds and the prevalence of spurious refusals. Figure 3 uses HEx-PHI data, showing that the llama-2 models, especially with enhanced safety prompts, have higher incorrect refusal rates. The most effective models for seed selection are mistral-7b-v0.1-guard and zephyr-7b-beta, which identify seeds leading to frequent refusals. Figure 4, using AdvBench data, reveals similar trends. The mistral-7b models are particularly adept at pinpointing potent seeds, highlighting a potential bias towards spurious features. The x-axis more clearly shows that less safety-aligned models have lower refusal rates for harmful prompts compared to llama-2 models.

Scalable applications of SCOPE in prompt awareness

The SCOPE pipeline offers robust solutions for identifying and mitigating incorrect refusals in AI models. By leveraging dynamically generated data, SCOPE provides timely insights into safety mechanism-dependent refusals, effectively addressing user-invoked risks from persuasion techniques across various risk categories. This adaptive approach highlights the necessity for continuous investigation, especially in complex, multi-technique, and multi-turn communication scenarios.

In the following case studies, we explore the practical applications of SCOPE in real-world settings. Case Study 1 demonstrates how SCOPE data can be used to fine-tune models, improving their safety and reducing incorrect refusals. Case Study 2 focuses on the few-shot mitigation of incorrect refusals, showing the effectiveness of SCOPE samples in balancing performance and safety in AI models.

Case Study 1

The adaptive nature of SCOPE enables exciting new use cases and functionalities beyond serving as a static benchmark. Through this case study, we demonstrate that dynamically generated “Scope” data from SCOPE provides timely identification of safety mechanism-dependent incorrect refusals. We fine-tuned a helpfulness-focused model, Mistral-7B-v0.1, on 50 random samples from AdvBench, introducing safety refusal behaviors. The evaluation compared the model’s safety on AdvBench samples and incorrect refusal rate on SCOPE data versus static benchmarks like XSTest.

Case Study 2

In this case study, we explore using SCOPE data for few-shot mitigation of incorrect refusals. We split the SCOPE and XSTest-63 data into train/test sets and compared different fine-tuning methods. Our findings show that incorporating SCOPE samples effectively mitigates wrong refusals while maintaining high safety refusal rates. Model 1, using SCOPE data, demonstrated generalizable mitigation on unseen data, outperforming models trained with larger benign QA samples or XSTest samples. This highlights the potential of SCOPE data in balancing performance, safety, and incorrect refusals in AI safety applications.

Examples

Here are qualitative examples from our SCOPE process.
These examples illustrate how the SCOPE pipeline generates prompts to identify and analyze incorrect refusals in AI models. By examining both benign and harmful instructions, we can better understand the triggers of spurious refusals and develop more effective safety mechanisms.

ADV Bench

We show examples of SCOPE generated prompts in action. These examples from the AdvBench dataset demonstrate how our pipeline identifies spurious refusal patterns. Each selected prompt illustrates a scenario where the model incorrectly refuses a benign query due to spurious correlations with harmful content in the training data.

Chat Window

HEx-PHI

Here are examples from the HEx-PHI dataset, showcasing different categories of refusals. These examples illustrate various refusal strategies employed by models when faced with potentially harmful instructions. By analyzing these responses, we gain insights into the models' safety mechanisms and their effectiveness in different contexts.

Chat Window

Ethics and Disclosure

The development and application of SCOPE have been guided by a commitment to ethical standards and transparency. The pipeline's primary aim is to enhance the safety and reliability of AI systems by addressing the issue of incorrect refusals and improving model alignment with human values. We acknowledge the potential implications of our work in various domains and have taken measures to ensure that our research is conducted responsibly.

Our methodology includes the use of red-teaming datasets specifically designed to expose and mitigate safety risks. These datasets, such as HEx-PHI and AdvBench, are employed with the intention of identifying and correcting spurious features that lead to misguided refusals in language models. All data used in our experiments is sourced from publicly available benchmarks, and no private or sensitive data is included.

We recognize that the dynamic generation of test cases (Scope-data) and the evaluation of models on these cases can raise concerns about the potential misuse of our findings. To mitigate such risks, we have ensured that all experimental procedures and results are documented transparently, and our code and methodologies are made available for peer review and verification. We encourage collaboration and open dialogue within the research community to refine and improve our approaches.

In addition, we are committed to the principles of fairness, accountability, and transparency. Our evaluations aim to highlight the importance of context-aware AI systems that can distinguish between harmful and benign requests accurately. We stress that our work should not be used to bypass or undermine safety mechanisms but rather to strengthen them.

BibTeX

If you find our project useful, please consider citing:

 BLANK