# Prompt Injection Detection
SolonGate's 3-stage hybrid prompt injection detection system analyzes every tool call argument for injection attempts — from simple instruction overrides to advanced encoding evasion. For semantic intent analysis beyond pattern matching, see the AI Judge.
**The threat:** Attackers embed hidden instructions in documents, web pages, or user inputs that trick AI tools into executing malicious commands. Without detection, a single compromised file can exfiltrate secrets, delete data, or bypass security controls.
## Detection Pipeline
Every tool call argument passes through a 3-stage detection pipeline. Each stage catches different attack classes. Scores are combined using configurable weights to produce a final trust score.
### Stage 1: Rules
50 regex patterns across 7 categories detect known injection techniques: delimiter injection, instruction overrides, role hijacking, jailbreak keywords, encoding evasion, separator injection, and multi-language attacks. Synchronous, with zero dependencies.
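As a minimal sketch of how such a rule stage can work: the category names and weights below follow the table in this section, but the regexes are simplified stand-ins, not SolonGate's actual 50 patterns.

```typescript
// Illustrative sketch of a rule stage. Category names and weights follow the
// Attack Categories table; the regexes are simplified stand-ins, not
// SolonGate's actual patterns.
interface RuleCategory {
  name: string;
  weight: number;
  patterns: RegExp[];
}

const CATEGORIES: RuleCategory[] = [
  { name: "delimiter_injection", weight: 0.95, patterns: [/<\/system>/i, /<\|im_end\|>/, /\[INST\]/] },
  { name: "instruction_override", weight: 0.9, patterns: [/ignore (all )?previous instructions/i] },
  { name: "role_hijacking", weight: 0.85, patterns: [/you are now dan/i] },
];

// Return the categories whose patterns match the input.
function matchCategories(input: string): RuleCategory[] {
  return CATEGORIES.filter((c) => c.patterns.some((p) => p.test(input)));
}
```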
### Stage 2: Embedding Similarity
Uses Xenova/all-MiniLM-L6-v2 (a 22MB ONNX model) to compute embeddings and compares them against ~98 known attack-vector embeddings via cosine similarity. This catches paraphrased attacks that change wording but preserve malicious intent.
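The comparison itself can be sketched as below; the 3-d vectors in the test are toys, whereas the real all-MiniLM-L6-v2 model emits 384-d embeddings, and `maxAttackSimilarity` is a hypothetical helper name.

```typescript
// Sketch of the Stage 2 comparison: cosine similarity between an input
// embedding and a set of known attack embeddings.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The stage's score is the best match against any known attack vector.
function maxAttackSimilarity(input: number[], attackVectors: number[][]): number {
  return Math.max(...attackVectors.map((v) => cosineSimilarity(input, v)));
}
```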
### Stage 3: Classifier
Uses Xenova/deberta-v3-base-prompt-injection-v2 (a 184MB ONNX model) for binary classification. Trained on injection/benign datasets, it catches novel attacks that bypass both the rules and the embedding check.
**Note:** Stages 2 & 3 require the optional `@huggingface/transformers` package. If it is not installed, they are silently disabled and Stage 1 (rules) handles detection alone. Models are downloaded once on first use and cached at `~/.cache/huggingface/hub/`.
## Attack Categories
Stage 1 covers 7 pattern categories, each with a severity weight (0.0–1.0):
| Category | Weight | Patterns | Example |
|---|---|---|---|
| Delimiter Injection | 0.95 | 12 | `</system>`, `<\|im_end\|>`, `[INST]` |
| Instruction Override | 0.90 | 8 | `Ignore all previous instructions...` |
| Role Hijacking | 0.85 | 8 | `You are now DAN...` |
| Jailbreak Keywords | 0.80 | 8 | DAN mode, sudo mode, god mode |
| Encoding Evasion | 0.75 | 4 | base64/rot13/hex encoded payloads |
| Separator Injection | 0.70 | 3 | `---\nNew instructions:...` |
| Multi-Language | 0.70 | 7 | `Ignoriere alle Anweisungen` (German: "Ignore all instructions") |
Scoring: `score = max(matched_weights) + 0.05 * additional_categories`, capped at 1.0. Matches in multiple categories increase the score.
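The scoring rule can be sketched directly; only the formula comes from the docs, and `ruleScore` is a hypothetical helper name.

```typescript
// Sketch of the documented rule-stage scoring:
// score = max(matched_weights) + 0.05 * additional_categories, capped at 1.0.
function ruleScore(matchedWeights: number[]): number {
  if (matchedWeights.length === 0) return 0; // no category matched
  const base = Math.max(...matchedWeights);
  const bonus = 0.05 * (matchedWeights.length - 1); // each extra category adds 0.05
  return Math.min(1.0, base + bonus);
}
```

For example, a hit in Instruction Override (0.90) plus Role Hijacking (0.85) scores 0.90 + 0.05 = 0.95.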
## Setup
Basic prompt injection detection (Stage 1 — rule-based) is enabled by default when you use the SolonGate proxy. No additional configuration needed.
```bash
# Stage 1 (rules) is ON by default — just run the proxy
npx @solongate/proxy -- node my-server.js

# To enable Stages 2 & 3 (ML-based), install the optional dependency:
npm install @huggingface/transformers
```
When advanced detection is enabled, models are downloaded on first use:
```
[SolonGate] Downloading model "Xenova/all-MiniLM-L6-v2" (~22MB) for prompt injection detection.
[SolonGate] Downloading model "Xenova/deberta-v3-base-prompt-injection-v2" (~184MB)
This is a one-time download cached at ~/.cache/huggingface/hub/
```
## Configuration
Fine-tune advanced detection via the `advancedDetection` config:
### Threshold
The trust score below which input is blocked. Default: 0.5. A higher threshold blocks more inputs (stricter); a lower threshold blocks fewer (more lenient).
```js
// Configure via advancedDetection in InputGuardConfig
{
  advancedDetection: {
    enabled: true,
    threshold: 0.3, // more lenient — blocks fewer inputs
  }
}
```
### Stage Weights
Control how much each stage contributes to the final score. Must sum to 1.0.
```js
// Default weights
{
  advancedDetection: {
    enabled: true,
    weights: {
      rules: 0.3,      // 30% — Stage 1 (regex patterns)
      embedding: 0.3,  // 30% — Stage 2 (cosine similarity)
      classifier: 0.4, // 40% — Stage 3 (DeBERTa)
    }
  }
}
```
If a stage is unavailable (e.g. `@huggingface/transformers` is not installed), its weight is automatically redistributed proportionally to the remaining stages.
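The proportional rule can be sketched as follows; `redistribute` is a hypothetical helper, and only the redistribution behavior itself comes from the docs.

```typescript
// Sketch of proportional weight redistribution across available stages.
// With defaults { rules: 0.3, embedding: 0.3, classifier: 0.4 } and the ML
// stages unavailable, rules absorbs the full weight.
function redistribute(
  weights: Record<string, number>,
  available: string[],
): Record<string, number> {
  const total = available.reduce((sum, stage) => sum + weights[stage], 0);
  const out: Record<string, number> = {};
  for (const stage of available) out[stage] = weights[stage] / total;
  return out;
}
```

For instance, if only the classifier is missing, rules and embedding each become 0.3 / 0.6 = 0.5.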
## Input Guard Checks
Prompt injection detection is one of 10 checks in the Input Guard system. Each check can be individually enabled/disabled:
```js
// InputGuardConfig — all enabled by default
{
  pathTraversal: true,
  shellInjection: true,
  wildcardAbuse: true,
  lengthLimit: 4096,
  entropyLimit: true,
  ssrf: true,
  sqlInjection: true,
  promptInjection: true,       // Stage 1 (rules)
  exfiltration: true,
  boundaryEscape: true,
  advancedDetection: { ... }   // Stages 2 & 3 (ML)
}
```
## Trust Score
When advanced detection is enabled, each scanned input receives a trust score from 0.0 (malicious) to 1.0 (safe), computed as `trustScore = 1 - (w1*s1 + w2*s2 + w3*s3)`.
| Score Range | Meaning | Default Action |
|---|---|---|
| 0.0 – 0.3 | High confidence injection | BLOCKED |
| 0.3 – 0.5 | Suspicious, likely injection | BLOCKED (at default 0.5 threshold) |
| 0.5 – 0.7 | Ambiguous | ALLOWED (at default threshold) |
| 0.7 – 1.0 | Clean input | ALLOWED |
Stage scores are also available individually as `rules`, `embedding`, and `classifier`. Each returns a score from 0.0 (safe) to 1.0 (malicious), along with matched details for debugging.
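Putting the pieces together, the combination formula above can be sketched with the default weights; `trustScore` is a hypothetical helper name.

```typescript
// Sketch of trustScore = 1 - (w1*s1 + w2*s2 + w3*s3) with the default weights.
// Stage scores run 0.0 (safe) to 1.0 (malicious); the result runs 0.0
// (malicious) to 1.0 (safe).
interface StageScores {
  rules: number;
  embedding: number;
  classifier: number;
}

function trustScore(
  stages: StageScores,
  weights: StageScores = { rules: 0.3, embedding: 0.3, classifier: 0.4 },
): number {
  const risk =
    weights.rules * stages.rules +
    weights.embedding * stages.embedding +
    weights.classifier * stages.classifier;
  return 1 - risk;
}
```

An input is then blocked when this value falls below the configured threshold.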
## Dashboard
The Prompt Inj. Detection dashboard provides real-time visibility into detection activity:
### Stats Overview
Total scans, blocked count, detected count, and block rate.
### Category Breakdown
Visual breakdown of which attack categories are most common in your environment.
### Event Log
Detailed log of every detection event with tool name, trust score, and categories.
### Settings
Adjust detection threshold and mode directly from the dashboard.
## Audit Log Integration
Every detection event is automatically recorded in the audit log with full details:
```js
// Audit log entry for a detected injection
{
  "tool_name": "Bash",
  "decision": "DENY",
  "reason": "Prompt injection detected",
  "pi_detected": true,
  "pi_trust_score": 0.15,
  "pi_blocked": true,
  "pi_categories": ["instruction_override", "role_hijacking"],
  "stages": {
    "rules": { "score": 0.95, "enabled": true },
    "embedding": { "score": 0.82, "enabled": true },
    "classifier": { "score": 0.91, "enabled": true }
  }
}
```
Query via API: `GET /api/v1/audit-logs?filter=DENY`