Security Threats Targeting Large Language Models

Nataraj Sindam

Microsoft

July 16, 2024

Evolving landscape of LLM Security

The emergence of Large Language Models (LLMs) has revolutionized the capabilities of artificial intelligence, offering unprecedented potential for various applications. However, like every new technology, LLMs are a new surface for hackers to attack. LLMs are susceptible to a range of security vulnerabilities that researchers and developers are actively working to address.

This post delves into the different types of attacks that can target LLMs, exposing the potential risks and the ongoing efforts to safeguard these powerful AI systems.

Jailbreaking: Exploiting Loopholes to Bypass Safety Measures

LLMs like ChatGPT are equipped with safety mechanisms to prevent the generation of harmful content, such as instructions for creating weapons or malware to attack software. However, “jailbreaking” techniques aim to circumvent these safeguards and manipulate the model into performing actions beyond its intended boundaries.

For instance, a direct prompt requesting to generate malware code might be rejected by ChatGPT. However, a carefully crafted prompt disguised as a security research inquiry could deceive the model into providing the desired information. This constant battle between attackers seeking to exploit vulnerabilities and developers striving to strengthen safety measures underscores the challenges of LLM security.

Jailbreaking methods can vary significantly, from simple prompt manipulation to more complex techniques like:

Base64 Encoding: Disguising the intent of the prompt by encoding it into a different format.
Universal Suffixes: Utilizing specific phrases or keywords that disrupt the model’s safety mechanisms.
Steganography: Concealing malicious prompts within images using subtle noise patterns.

Prompt Injection: Hijacking the LLM’s Output

Prompt injection attacks focus on manipulating the input provided to an LLM, influencing its output in a way that benefits the attacker. This can involve extracting sensitive user information, directing users to malicious websites, or even subtly altering the LLM’s responses to promote misinformation or propaganda.

Imagine querying Microsoft’s Copilot about Einstein’s life and receiving a response with a seemingly relevant link at the end. This link, however, could lead to a malicious website, unbeknownst to the unsuspecting user. This is an example of a prompt injection attack, where the attacker has injected a hidden prompt into the LLM’s input, causing it to generate the harmful link.

Different types of prompt injection attacks exist, including:

Active Injection: Directly injecting malicious code into the prompt.
Passive Injection: Exploiting vulnerabilities in the LLM’s processing to manipulate the output. An example of this is to place malicious prompts in public sources like websites or social media posts which eventually make their way into an LLM.
User-Driven Injection: Tricking users into providing prompts that serve the attacker’s goals. An example of this would be the attacker would place a malicious prompt into a text snippet that the user copies from the attacker’s website.
Hidden Injection: In this case, attackers use multiple stages, with the first smaller injection instructing the model to fetch a larger malicious payload.

Sleeper Agent Attack: Planting Hidden Triggers for Future Manipulation

This attack involves embedding a hidden “trigger” phrase within the LLM’s training data. A seemingly innocuous phrase, when encountered in a future prompt, activates the attack, causing the LLM to generate specific outputs controlled by the attacker. While not yet observed in the wild, the latest research suggests that sleeper agent attacks are a plausible threat. Researchers have demonstrated this by corrupting training data and using the trigger phrase “James Bond” to manipulate an LLM into generating predictable single-letter outputs.

Evolving Landscape of LLM Security

The examples above represent just a glimpse into the complex world of LLM security. As LLM technology rapidly evolves, so too do the threats it faces. Researchers and developers are constantly working to identify and mitigate these vulnerabilities, exploring various defense mechanisms such as:

Adversarial Training: Training LLMs on adversarial examples to improve robustness.
Input Sanitization: Filtering and validating input data to prevent malicious code injection.
Output Monitoring: Analyzing LLM outputs to detect anomalies and potential manipulation.

To ensure the safe and responsible use of large language models (LLMs), it’s important to be proactive about security. We need to be aware of the risks and have strong plans in place to reduce them. That’s the only way we can make the most of this powerful technology while preventing any misuse.

About the Author

Nataraj Sindam, is a Senior Product Manager at Microsoft and the host of the ‘Startup Project’ podcast. He also invests in startups with Incisive.vc and is author of ‘100 Days of AI’, an educational series on AI. Nataraj can be reached on LinkedIn here.