LLM Honeypot Explainer

What is Prompt Injection?

Large Language Models (LLMs) are trained to follow user instructions.

If you ask the model to do something and then tell it to ignore that and do something else, the model will follow your last instruction.

This leads to the prompt injection, where an attacker can hijack the LLM's behavior.

Figure 1: Diagram of Prompt Injection in action

How it works:

An initial prompt is given to the LLM agent
A user inserts a new instruction into its context
The LLM follows the last instruction

Real-world example: Prompt injection in social media comments

The Evolution of Hacking Bots

The internet is full of bots constantly trying to hack everything.

While most are primitive, the rise of agent frameworks and improved AI capabilities has opened the door to more sophisticated LLM Hacking Agents.

Traditional Bots

Limited, deterministic behavior
Lack of adaptability
Widely-used today

LLM Agents

Adaptive to different scenarios
Potential for complex attack behavior
Unknown and unstudied

🚨 Key Question:

Are AI hacking agents already in use? If so, by whom and how?

Our LLM Honeypot Project

What is a Honeypot?

A honeypot is a cybersecurity decoy that appears vulnerable to attract potential attackers, allowing to study their methods.

Acts as a decoy to lure cyber threats
Provides insights into real attack techniques

LLM Hacking Agents use the same prompt mechanisms as regular LLM Bots, differing only in their input sources, so they're also vulnerable to prompt injection.

We've created our honeypot with embedded prompt-injection techniques inside it. These injections are designed to change the behavior of LLM Agents, but they can't be triggered by traditional software bots.

Figure 2: Honeypot Objective

This allows us to:

Detect LLM-powered hacking attempts
Capture their system instructions
Analyze AI agents' behavior in the wild

Action Example:

This simplified example illustrates potential interactions with our honeypot server:

This multi-step detection allows us to reduce false positives by cross-referencing agent behaviors with prompt injection responses and temporal analysis, both detection efficiency and information gathering.

Figure: Example of the Honeypot Pipeline

Software bots are unable to pass the human-like questions, while humans are unable to respond too quickly.

How Detection Works

Distinguishing LLM Agents from Regular Bots

We employ two primary methods:

Prompt injections to alter behavior
Questions requiring human-like intelligence

For example, the question 'What sound do cats make?' requires understanding human abstractions to answer.

However, this prompt can also be passed by people pretending to be LLM.

Distinguishing LLM Agents from Pretending Humans

The key factor here is time:

LLMs can quickly answer questions about unrelated facts
Humans need more time to process and respond
Most agents spend on average less than 1.5 seconds per command
Humans typically require more time, especially for complex queries

Figure 3: Time Intervals per Command for LLM Agents (0-2 seconds)

Mission

We hope this project will improve awareness of AI hacking agents and their current risks by understanding how they are used in the real world and studying their algorithms in the wild.

As artificial intelligence technologies develop, the importance of monitoring their risks also increases. Our research aims to stay ahead of potential threats and contribute to the safe development of AI systems.

Key Objectives:

Identify emerging AI-driven hacking techniques
Develop countermeasures against LLM-powered attacks
Collaborate with the cybersecurity community to share findings

Explore Research Details