LLM Honeypot Explainer
What is Prompt Injection?
Large Language Models (LLMs) are trained to follow user instructions.
If you ask the model to do something and then tell it to ignore that and do something else, the model will follow your last instruction.
This leads to the prompt injection, where an attacker can hijack the LLM's behavior.
Figure 1: Diagram of Prompt Injection in action
Prompt Injection Demo
Simulation Result:
LLM Bot Answer:
How it works:
- An initial prompt is given to the LLM agent
- A user inserts a new instruction into its context
- The LLM follows the last instruction
Real-world example: Prompt injection in social media comments
The Evolution of Hacking Bots
The internet is full of bots constantly trying to hack everything.
While most are primitive, the rise of agent frameworks and improved AI capabilities has opened the door to more sophisticated LLM Hacking Agents.
Traditional Bots
- Limited, deterministic behavior
- Lack of adaptability
- Widely-used today
LLM Agents
- Adaptive to different scenarios
- Potential for complex attack behavior
- Unknown and unstudied
🚨 Key Question:
Are AI hacking agents already in use? If so, by whom and how?
Our LLM Honeypot Project
What is a Honeypot?
A honeypot is a cybersecurity decoy that appears vulnerable to attract potential attackers, allowing to study their methods.
- Acts as a decoy to lure cyber threats
- Provides insights into real attack techniques
LLM Hacking Agents use the same prompt mechanisms as regular LLM Bots, differing only in their input sources, so they're also vulnerable to prompt injection.
We've created our honeypot with embedded prompt-injection techniques inside it. These injections are designed to change the behavior of LLM Agents, but they can't be triggered by traditional software bots.
Figure 2: Honeypot Objective
This allows us to:
- Detect LLM-powered hacking attempts
- Capture their system instructions
- Analyze AI agents' behavior in the wild
Action Example:
This simplified example illustrates potential interactions with our honeypot server:
This multi-step detection allows us to reduce false positives by cross-referencing agent behaviors with prompt injection responses and temporal analysis, both detection efficiency and information gathering.
Figure: Example of the Honeypot Pipeline
Software bots are unable to pass the human-like questions, while humans are unable to respond too quickly.
How Detection Works
Distinguishing LLM Agents from Regular Bots
We employ two primary methods:
- Prompt injections to alter behavior
- Questions requiring human-like intelligence
For example, the question 'What sound do cats make?' requires understanding human abstractions to answer.
However, this prompt can also be passed by people pretending to be LLM.
Distinguishing LLM Agents from Pretending Humans
The key factor here is time:
- LLMs can quickly answer questions about unrelated facts
- Humans need more time to process and respond
- Most agents spend on average less than 1.5 seconds per command
- Humans typically require more time, especially for complex queries
Figure 3: Time Intervals per Command for LLM Agents (0-2 seconds)
Mission
We hope this project will improve awareness of AI hacking agents and their current risks by understanding how they are used in the real world and studying their algorithms in the wild.
As artificial intelligence technologies develop, the importance of monitoring their risks also increases. Our research aims to stay ahead of potential threats and contribute to the safe development of AI systems.
Key Objectives:
- Identify emerging AI-driven hacking techniques
- Develop countermeasures against LLM-powered attacks
- Collaborate with the cybersecurity community to share findings