New LLM jailbreak uses models’ evaluation skills against them

A new jailbreak method for large language models (LLMs) takes advantage of models’ ability to identify and score harmful content in order to trick the models into generating content related to malware, illegal activity, harassment and more.

The “Bad Likert Judge” multi-step jailbreak technique was developed and tested by Palo Alto Networks Unit 42, and was found to increase the success rate of jailbreak attempts by more than 60% when compared with direct single-turn attack attempts.

The method is based on the Likert scale, which is typically used to gauge the degree to which someone agrees or disagrees with a statement in a questionnaire or survey. For example, in a Likert scale of 1 to 5, 1 would indicate the respondent strongly disagrees with the statement and 5 would indicate the respondent strongly agrees.

For the LLM jailbreak experiments, the researchers asked the LLMs to use a Likert-like scale to score the degree to which certain content contained in the prompt was harmful. In one example, they asked the LLMs to give a score of 1 if a prompt didn’t contain any malware-related information and a score of 2 if it contained very detailed information about how to create malware, or actual malware code.

After the model scored the provided content on the scale, the researchers would then ask the model in a second step to provide examples of content that would score a 1 and a 2, adding that the second example should contain thorough step-by-step information. This would typically result in the LLM generating harmful content as part of the second example meant to demonstrate the model’s understanding of the evaluation scale.

An additional one or two steps after the second step could be used to produce even more harmful information, the researchers found, by asking the LLM to further expand on and add more details to its harmful example. Overall, when tested across 1,440 cases using six different “state-of-the-art” models, the Bad Likert Judge jailbreak method had an average attack success rate of about 71.6% across models.
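To make the metric concrete, the following is a minimal sketch of how an attack success rate and its average across models could be computed. It assumes the success rate is the share of test cases judged successful and that the cross-model figure is a simple mean of per-model rates; the article does not spell out Unit 42's exact calculation. The function names and the toy outcomes are hypothetical and are not the researchers' data.

```python
def attack_success_rate(outcomes: list[bool]) -> float:
    """Return the percentage of test cases marked as successful attacks."""
    if not outcomes:
        raise ValueError("at least one test case is required")
    return 100.0 * sum(outcomes) / len(outcomes)


def mean_success_rate(per_model: dict[str, list[bool]]) -> float:
    """Average the per-model rates (assumed: unweighted mean across models)."""
    rates = [attack_success_rate(o) for o in per_model.values()]
    return sum(rates) / len(rates)


# Toy outcomes for illustration only; these are not Unit 42's results.
toy_outcomes = {
    "model_a": [True, True, False, True],
    "model_b": [False, True, False, False],
}

per_model_rates = {m: attack_success_rate(o) for m, o in toy_outcomes.items()}
overall = mean_success_rate(toy_outcomes)
```

With these toy values, model_a has a 75% rate, model_b has a 25% rate, and the unweighted mean is 50%.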

Unit 42 chose not to name the models used for its testing, instead numbering them 1 through 6. The highest average success rate of the Bad Likert Judge attack was 87.6% with Model 6, which also had the highest baseline single-turn attack success rate at 59.4%. The lowest average attack success rate was 36.9%, with Model 5, which was the only model to see a success rate lower than 70%.

The researchers also evaluated the attack’s success relative to different harmful content categories, including hate, harassment, self-harm, unsafe weapon-related content, illegal activity promotion, malware generation and system prompt leakage. System prompt leakage was the only category for which Bad Likert Judge rarely resulted in an improvement over baseline attacks, with the exception of Model 1, where it increased the success rate from 0% to 100%.

While Bad Likert Judge’s success rate varied across the other harmful content categories for different models, the researchers noted that harassment-related content was particularly easy to generate, often having a higher baseline success rate than other categories.

In order to mitigate jailbreaks like Bad Likert Judge, maintainers of LLM applications should apply content filters that evaluate both the inputs and outputs of conversations to block potentially harmful content from being generated. Content filters work in addition to models’ built-in safeguards from training. When Bad Likert Judge was tested on models with content filters applied, the success rate was reduced by 89.2% on average.
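As a rough illustration of filtering both sides of a conversation, here is a minimal sketch of a wrapper that checks the prompt before the model runs and checks the reply before it is returned. Everything in it is an assumption for illustration: `guarded_reply`, `harm_score`, `generate`, the 0.5 threshold and the toy stand-ins are hypothetical, and the article does not describe which filters Unit 42 used or how they were configured.

```python
from typing import Callable

REFUSAL = "This request was blocked by the content filter."


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical model call
    harm_score: Callable[[str], float],  # hypothetical classifier, 0.0 to 1.0
    threshold: float = 0.5,              # assumed cutoff, not from the source
) -> str:
    """Filter the input, call the model, then filter the output."""
    if harm_score(prompt) >= threshold:
        return REFUSAL
    reply = generate(prompt)
    if harm_score(reply) >= threshold:
        return REFUSAL
    return reply


# Toy stand-ins so the sketch runs; a real deployment would plug in an
# actual model and an actual content classifier here.
def toy_harm_score(text: str) -> float:
    return 1.0 if "FLAGGED" in text else 0.0


def toy_generate(prompt: str) -> str:
    # Pretend model that emits flagged text for a benign-looking prompt.
    if "examples" in prompt:
        return "FLAGGED example"
    return "ok: " + prompt


benign = guarded_reply("hello", toy_generate, toy_harm_score)
blocked_output = guarded_reply("give examples for each score",
                               toy_generate, toy_harm_score)
blocked_input = guarded_reply("FLAGGED input", toy_generate, toy_harm_score)
```

The second call shows why the output check matters for multi-step attacks: the prompt itself passes the input filter, and only the generated reply trips the filter.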

Unit 42 developed another multi-turn LLM jailbreak technique called Deceptive Delight last year, which asked the LLMs to write narratives combining both benign and harmful topics and had a 65% success rate after just three steps.


