Page 113 - Cyber Defense eMagazine June 2024
P. 113
Jailbreaking: Exploiting Loopholes to Bypass Safety Measures
LLMs like ChatGPT are equipped with safety mechanisms to prevent the generation of harmful content,
such as instructions for creating weapons or malware to attack software. However, "jailbreaking"
techniques aim to circumvent these safeguards and manipulate the model into performing actions beyond
its intended boundaries.
For instance, a direct prompt requesting to generate malware code might be rejected by ChatGPT.
However, a carefully crafted prompt disguised as a security research inquiry could deceive the model
into providing the desired information. This constant battle between attackers seeking to exploit
vulnerabilities and developers striving to strengthen safety measures underscores the challenges of LLM
security.
Jailbreaking methods can vary significantly, from simple prompt manipulation to more complex
techniques like:
• Base64 Encoding: Disguising the intent of the prompt by encoding it into a different format.
• Universal Suffixes: Utilizing specific phrases or keywords that disrupt the model's safety
mechanisms.
• Steganography: Concealing malicious prompts within images using subtle noise patterns.
Prompt Injection: Hijacking the LLM's Output
Prompt injection attacks focus on manipulating the input provided to an LLM, influencing its output in a
way that benefits the attacker. This can involve extracting sensitive user information, directing users to
malicious websites, or even subtly altering the LLM's responses to promote misinformation or
propaganda.
Imagine querying Microsoft's Copilot about Einstein's life and receiving a response with a seemingly
relevant link at the end. This link, however, could lead to a malicious website, unbeknownst to the
unsuspecting user. This is an example of a prompt injection attack, where the attacker has injected a
hidden prompt into the LLM's input, causing it to generate the harmful link.
Different types of prompt injection attacks exist, including:
• Active Injection: Directly injecting malicious code into the prompt.
• Passive Injection: Exploiting vulnerabilities in the LLM's processing to manipulate the output. An
example of this is to place malicious prompts in public sources like websites or social media posts
which eventually make their way into an LLM.
• User-Driven Injection: Tricking users into providing prompts that serve the attacker's goals. An
example of this would be the attacker would place a malicious prompt into a text snippet that the
user copies from the attacker's website.
Cyber Defense eMagazine – June 2024 Edition 113
Copyright © 2024, Cyber Defense Magazine. All rights reserved worldwide.