
Thursday Feb 27, 2025
Arxiv paper - PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
In this episode, we discuss PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling by Avery Ma, Yangchen Pan, and Amir-massoud Farahmand. The paper introduces PANDAS, a hybrid technique that strengthens many-shot jailbreaking by modifying the fabricated dialogues with positive affirmations and negative demonstrations, and by using an optimized adaptive sampling method tailored to the target prompt. Experiments on AdvBench and HarmBench with advanced large language models show that PANDAS significantly outperforms baseline methods in long-context scenarios. An attention analysis also shows how PANDAS exploits long-context vulnerabilities, offering insight into the mechanics of many-shot jailbreaking.