Safety & Risk

Prompt Injection

A security attack where malicious instructions are embedded in inputs to manipulate an AI agent's behavior.

Full Definition

Prompt injection is a security vulnerability in which an attacker crafts inputs containing hidden instructions designed to manipulate an AI agent's behavior, bypass its safety guidelines, or extract sensitive information. Direct prompt injection inserts malicious text straight into user inputs, while indirect prompt injection embeds instructions in external data sources (documents, web pages, emails) that the agent may process.

For autonomous agents, prompt injection is particularly dangerous because a manipulated agent may execute unauthorized actions, leak confidential data, or behave in ways that violate organizational policies.

Defense strategies include input sanitization, instruction hierarchy enforcement, output filtering, and cognitive firewalls that analyze reasoning traces for signs of instruction manipulation.
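Two of these defenses can be sketched together: pattern-based input screening and delimiter-based separation of untrusted data from instructions. This is a minimal illustrative sketch, not a production defense; the pattern list, the `<untrusted_data>` delimiter convention, and the function names are hypothetical, and real systems typically combine such heuristics with trained classifiers.

```python
import re

# Hypothetical patterns covering common injection phrasings; a real
# deployment would use a trained classifier, not a fixed list.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior|above) instructions",
    r"disregard (your|the) (system prompt|guidelines)",
    r"reveal (your|the) (system prompt|instructions)",
]

def flag_injection(text: str) -> bool:
    """Return True if the text matches a known injection pattern."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)

def wrap_untrusted(text: str) -> str:
    """Screen untrusted content, then wrap it in delimiters so the
    model can be instructed to treat it as data, not commands
    (a simple form of instruction hierarchy enforcement)."""
    if flag_injection(text):
        raise ValueError("possible prompt injection detected")
    return f"<untrusted_data>\n{text}\n</untrusted_data>"
```

Pattern matching alone is easily evaded (paraphrasing, encoding tricks, other languages), which is why it is usually layered with output filtering and reasoning-trace analysis rather than used on its own.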