LLM Honeypot Explainer

What is Prompt Injection?

Large Language Models (LLMs) are trained to follow user instructions.

If you ask the model to do something and then tell it to ignore that and do something else, the model will follow your last instruction.

This leads to the prompt injection, where an attacker can hijack the LLM's behavior.

Prompt Injection Diagram

Figure 1: Diagram of Prompt Injection in action

How it works:

  1. An initial prompt is given to the LLM agent
  2. A user inserts a new instruction into its context
  3. The LLM follows the last instruction
Social Media Prompt Injection Example

Real-world example: Prompt injection in social media comments

The Evolution of Hacking Bots

The internet is full of bots constantly trying to hack everything.

While most are primitive, the rise of agent frameworks and improved AI capabilities has opened the door to more sophisticated LLM Hacking Agents.

Traditional Bots

  • Limited, deterministic behavior
  • Lack of adaptability
  • Widely-used today

LLM Agents

  • Adaptive to different scenarios
  • Potential for complex attack behavior
  • Unknown and unstudied

🚨 Key Question:

Are AI hacking agents already in use? If so, by whom and how?

Our LLM Honeypot Project

What is a Honeypot?

A honeypot is a cybersecurity decoy that appears vulnerable to attract potential attackers, allowing to study their methods.

  • Acts as a decoy to lure cyber threats
  • Provides insights into real attack techniques

LLM Hacking Agents use the same prompt mechanisms as regular LLM Bots, differing only in their input sources, so they're also vulnerable to prompt injection.

We've created our honeypot with embedded prompt-injection techniques inside it. These injections are designed to change the behavior of LLM Agents, but they can't be triggered by traditional software bots.

LLM Honeypot Diagram

Figure 2: Honeypot Objective

This allows us to:

  • Detect LLM-powered hacking attempts
  • Capture their system instructions
  • Analyze AI agents' behavior in the wild

Action Example:

This simplified example illustrates potential interactions with our honeypot server:

                
            

This multi-step detection allows us to reduce false positives by cross-referencing agent behaviors with prompt injection responses and temporal analysis, both detection efficiency and information gathering.

Example of Honeypot Pipeline

Figure: Example of the Honeypot Pipeline

Software bots are unable to pass the human-like questions, while humans are unable to respond too quickly.

How Detection Works

Distinguishing LLM Agents from Regular Bots

We employ two primary methods:

  1. Prompt injections to alter behavior
  2. Questions requiring human-like intelligence

For example, the question 'What sound do cats make?' requires understanding human abstractions to answer.

However, this prompt can also be passed by people pretending to be LLM.

Distinguishing LLM Agents from Pretending Humans

The key factor here is time:

  • LLMs can quickly answer questions about unrelated facts
  • Humans need more time to process and respond
  • Most agents spend on average less than 1.5 seconds per command
  • Humans typically require more time, especially for complex queries
Time Analysis of Bot vs Human Responses

Figure 3: Time Intervals per Command for LLM Agents (0-2 seconds)

Mission

We hope this project will improve awareness of AI hacking agents and their current risks by understanding how they are used in the real world and studying their algorithms in the wild.

As artificial intelligence technologies develop, the importance of monitoring their risks also increases. Our research aims to stay ahead of potential threats and contribute to the safe development of AI systems.

Key Objectives:

  • Identify emerging AI-driven hacking techniques
  • Develop countermeasures against LLM-powered attacks
  • Collaborate with the cybersecurity community to share findings