Drop #188 (2023-01-30): Incidental AI

Prompt Injection; AI Incident Database; Historical Eye

Programming note: The storm-fueled chaos of the beginning of the last week spilled over into last Thursday evening and Friday morning. This nuked any chance of an end-of-week “Weekend Project Edition” (which was going to be on getting your passwords/authentication houses in order). I’d’ve done the WPE Drop over the weekend, but this was my spouse’s 🎂 weekend and my attention was undivided from the many things we had planned for her.

Prompt Injection

apple fruit with plastic syringes

As a reader of this newsletter, the odds are pretty solid that you’ve put some text into at least one of the (now) many large language model (LLM) stochastic parrots. You also, then, likely know that text input is referred to as a “prompt”. Since models are (at some level) just maths-created programs, they are also, technically, applications. And, where there are applications, there will also be hackers. Said hackers are almost immediately going to try injection attacks, as they usually result in the biggest “bang” for the effort.

Jose Selvi (NCC Group) did a fantastic job explaining one aspect of “prompt injection attacks” in “Exploring Prompt Injection Attacks”, first with a setup on how these models work:

For most of us, a prompt is what we see in our terminal console (shell, PowerShell, etc) to let us know that we can type our instructions. Although this is also essentially what a prompt is in the machine learning field, prompt-based learning is a language model training method, which opens up the possibility of Prompt Injection attacks.

One of the amazing characteristics that modern language models have is their transfer learning capabilities. Transfer learning lets us use a pre-trained model and fine-tune it to the specific task we want to achieve. Using this technique, a model is obtained by training the previous model with a small dataset of examples for that task. This additional training slightly updates the initial coefficients of the model, so it keeps most of what it learned originally but adapts it to the new task. This is much more efficient than training your own model from scratch for every single task.

Prompt-based learning (or prompt fine-tuning) is a different approach. Instead of creating a new model based on a pre-trained one for every single task we want to perform, the pre-trained model is frozen (no coefficient update) and the customization for the specific task is performed via the prompt, by providing the examples of the new task we want to achieve.

There is yet another evolution of this technique, called instruction fine-tuning (also prompt-based), which directly reads from the prompt the instructions about how to perform the desired task.

then, dives into the attack itself:

[Prompt injection] is the consequence of concatenating instructions and data, so the underlying engine cannot distinguish between them. As a result, attackers can include instructions in the data fields under their control and force the engine to perform unexpected actions. In this general definition of injection attacks, we could consider the prompt engineering work as instructions (like a SQL query, for example), and the input provided information as data.

Simon Willison also has a great read on the core topic, along with some other folks1 2.

Prompts themselves are, in a very real way, “source code”. Many applications are being (hastily) built on top of these general-purpose LLMs. Most of these applications, like the new Notion AI, take care to only show you the broad AI-helper category (e.g., “Blog post”, “Brainstorm ideas”), and not the prompt wrapper that they send to the model.

Shawn (sywx) Wang managed to get in the Notion AI beta rollout back in December; and — seemingly immediately — used one of the attack methods to discover the prompts that lie beneath the Notion AI API. Shawn, then, hot-take argues that this really isn’t a “problem” that need to be solved (drop a note in the comments with your take, especially if you disagree).

Now, as noted, these maths-made applications are fronted by some sort of API (i.e., another application) that handles the input and passes it back to the model for evaluation.

Ludwig Stump (@ludwig_stumpp) discovered that said evaluation flow in one of the most popular GPT-3-based applications also involves the Python eval() function…on un-sanitized user input o_O:

It’s 2023, and folks — quite clever ones at that — are still willy-nilly evaluating untrusted user input. O_o

If you, too, are racing to AI-ify some aspect of your existing application, make sure to threat model it! Then, hire a qualified security testing organization to ensure you haven’t just opened up your revenue-generation system to a whole new set of attacks.

AI Incident Database

city during night

Prompt-attacks aren’t the only type of “AI incident”.

The AI Incident Database is dedicated to indexing the collective history of harms or near harms realized in the real world by the deployment of artificial intelligence systems. Like similar databases in aviation and computer security, the AI Incident Database aims to learn from experience, so we can prevent or mitigate bad outcomes.

It has over 2,000 documented instances of AI harms, and, as you just read, aims to be the NTSB of AI. You can help by either contributing incidents or code.

I highly suggest combing through these incidents before you attempt to use “AI” in some production context to avoid repeating the mistakes of others.

Historical Eye

person holding brown eyeglasses with green trees background

Computer vision AI is great at many tasks, but just how good are you at evaluating the historical context of a given photo

(Yes, this has nothing to do with “AI” but it at least has an “aye” sound and is a much-needed, less depressing resource in today’s Drop)

FIN

One day left til the end of January! ☮

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.