Page 113 - Cyber Defense eMagazine June 2024
P. 113

Jailbreaking: Exploiting Loopholes to Bypass Safety Measures

            LLMs like ChatGPT are equipped with safety mechanisms to prevent the generation of harmful content,
            such  as  instructions  for  creating  weapons  or  malware  to  attack  software.  However,  "jailbreaking"
            techniques aim to circumvent these safeguards and manipulate the model into performing actions beyond
            its intended boundaries.

            For  instance,  a  direct  prompt  requesting  to  generate  malware  code  might  be  rejected  by  ChatGPT.
            However, a carefully crafted prompt disguised as a security research inquiry could deceive the model
            into  providing  the  desired  information.  This  constant  battle  between  attackers  seeking  to  exploit
            vulnerabilities and developers striving to strengthen safety measures underscores the challenges of LLM
            security.

            Jailbreaking  methods  can  vary  significantly,  from  simple  prompt  manipulation  to  more  complex
            techniques like:

               •  Base64 Encoding: Disguising the intent of the prompt by encoding it into a different format.
               •  Universal  Suffixes:  Utilizing  specific  phrases  or  keywords  that  disrupt  the  model's  safety
                   mechanisms.
               •  Steganography: Concealing malicious prompts within images using subtle noise patterns.



            Prompt Injection: Hijacking the LLM's Output

            Prompt injection attacks focus on manipulating the input provided to an LLM, influencing its output in a
            way that benefits the attacker. This can involve extracting sensitive user information, directing users to
            malicious  websites,  or  even  subtly  altering  the  LLM's  responses  to  promote  misinformation  or
            propaganda.

            Imagine querying Microsoft's Copilot about Einstein's life and receiving a response with a seemingly
            relevant  link  at  the  end.  This  link,  however,  could  lead  to  a  malicious  website,  unbeknownst  to  the
            unsuspecting user. This is an example of a prompt injection attack, where the attacker has injected a
            hidden prompt into the LLM's input, causing it to generate the harmful link.



            Different types of prompt injection attacks exist, including:

               •  Active Injection: Directly injecting malicious code into the prompt.
               •  Passive Injection: Exploiting vulnerabilities in the LLM's processing to manipulate the output. An
                   example of this is to place malicious prompts in public sources like websites or social media posts
                   which eventually make their way into an LLM.
               •  User-Driven Injection: Tricking users into providing prompts that serve the attacker's goals. An
                   example of this would be the attacker would place a malicious prompt into a text snippet that the
                   user copies from the attacker's website.







            Cyber Defense eMagazine – June 2024 Edition                                                                                                                                                                                                          113
            Copyright © 2024, Cyber Defense Magazine. All rights reserved worldwide.
   108   109   110   111   112   113   114   115   116   117   118