Safety & Risk

Prompt Injection

A security attack where malicious instructions are embedded in inputs to manipulate an AI agent's behavior.

Full Definition

Prompt injection is a security vulnerability in which an attacker crafts inputs containing hidden instructions designed to manipulate an AI agent's behavior, bypass its safety guidelines, or extract sensitive information. Direct prompt injection inserts malicious text straight into user inputs, while indirect prompt injection embeds instructions in external data sources (documents, web pages, emails) that the agent may process.

For autonomous agents, prompt injection is particularly dangerous because a manipulated agent may execute unauthorized actions, leak confidential data, or behave in ways that violate organizational policies.

Defense strategies include input sanitization, instruction hierarchy enforcement, output filtering, and cognitive firewalls that analyze reasoning traces for signs of instruction manipulation.
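Two of these defenses can be sketched together: pattern-based input screening and delimiter-based separation of untrusted data from instructions. This is a minimal illustrative sketch, not a production defense; the pattern list, the `<untrusted_data>` delimiter convention, and the function names are hypothetical, and real systems typically combine such heuristics with trained classifiers.

```python
import re

# Hypothetical patterns covering common injection phrasings; a real
# deployment would use a trained classifier, not a fixed list.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior|above) instructions",
    r"disregard (your|the) (system prompt|guidelines)",
    r"reveal (your|the) (system prompt|instructions)",
]

def flag_injection(text: str) -> bool:
    """Return True if the text matches a known injection pattern."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)

def wrap_untrusted(text: str) -> str:
    """Screen untrusted content, then wrap it in delimiters so the
    model can be instructed to treat it as data, not commands
    (a simple form of instruction hierarchy enforcement)."""
    if flag_injection(text):
        raise ValueError("possible prompt injection detected")
    return f"<untrusted_data>\n{text}\n</untrusted_data>"
```

Pattern matching alone is easily evaded (paraphrasing, encoding tricks, other languages), which is why it is usually layered with output filtering and reasoning-trace analysis rather than used on its own.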