Project Description
In this project, you’ll learn about “Prompt Injections” and how they can trick AI language models into doing things they shouldn’t, like leaking data or performing unauthorized actions. We’ll talk about ways to make AI interactions safer, such as checking inputs carefully and keeping AI models up-to-date to recognize harmful prompts.
So what is prompt injection? The Open Worldwide Application Security Project Foundation has put out the following description, common vulnerabilities, preventions, and scenarios:
Description:
Prompt injections involve bypassing filters or manipulating an AI language model using carefully crafted prompts that make the model ignore previous instructions or perform unintended actions. These vulnerabilities can lead to unintended consequences, including data leakage, unauthorized access, or other security breaches.
Common Prompt Injection Vulnerabilities:
- Crafting prompts that manipulate the AI language model into revealing sensitive information.
- Bypassing filters or restrictions by using specific language patterns or tokens.
- Exploiting weaknesses in the AI language model’s tokenization or encoding mechanisms.
- Misleading the AI language model to perform unintended actions by providing misleading context.
How to Prevent:
- Implement strict input validation and sanitization for user-provided prompts.
- Use context-aware filtering and output encoding to prevent prompt manipulation.
- Regularly update and fine-tune the AI language model to improve its understanding of malicious inputs and edge cases.
- Monitor and log AI language model interactions to detect and analyze potential prompt injection attempts.
Example Attack Scenarios:
- Scenario #1: An attacker crafts a prompt that tricks the AI language model into revealing sensitive information, such as user credentials or internal system details, by making the model think the request is legitimate.
- Scenario #2: A malicious user bypasses a content filter by using specific language patterns, tokens, or encoding mechanisms that the AI language model fails to recognize as restricted content, allowing the user to perform actions that should be blocked.
source: OWASP
In this project, we’ll specifically be looking at how images can be used to manipulate the responses given by Google’s Gemini.