# Prompt Injection Detection
SolonGate's 3-stage hybrid prompt injection detection system analyzes every tool call argument for injection attempts — from simple instruction overrides to advanced encoding evasion. For semantic intent analysis beyond pattern matching, see the AI Judge.
**The threat:** Attackers embed hidden instructions in documents, web pages, or user inputs that trick AI tools into executing malicious commands. Without detection, a single compromised file can exfiltrate secrets, delete data, or bypass security controls.
## Detection Pipeline
Every tool call argument passes through a 3-stage detection pipeline. Each stage catches different attack classes. Scores are combined using configurable weights to produce a final trust score.
### Stage 1: Rules
50 regex patterns across 7 categories detect known injection techniques: delimiter injection, instruction overrides, role hijacking, jailbreak keywords, encoding evasion, separator injection, and multi-language attacks. Synchronous, with zero dependencies.
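As a minimal sketch of how such a rule stage can work: the category names and weights below follow the table in this section, but the regexes are simplified stand-ins, not SolonGate's actual 50 patterns.

```typescript
// Illustrative sketch of a rule stage. Category names and weights follow the
// Attack Categories table; the regexes are simplified stand-ins, not
// SolonGate's actual patterns.
interface RuleCategory {
  name: string;
  weight: number;
  patterns: RegExp[];
}

const CATEGORIES: RuleCategory[] = [
  { name: "delimiter_injection", weight: 0.95, patterns: [/<\/system>/i, /<\|im_end\|>/, /\[INST\]/] },
  { name: "instruction_override", weight: 0.9, patterns: [/ignore (all )?previous instructions/i] },
  { name: "role_hijacking", weight: 0.85, patterns: [/you are now dan/i] },
];

// Return the categories whose patterns match the input.
function matchCategories(input: string): RuleCategory[] {
  return CATEGORIES.filter((c) => c.patterns.some((p) => p.test(input)));
}
```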
### Stage 2: Embedding Similarity
Uses Xenova/all-MiniLM-L6-v2 (a 22MB ONNX model) to compute embeddings and compares them against ~98 known attack-vector embeddings via cosine similarity. This catches paraphrased attacks that change wording but preserve malicious intent.
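The comparison itself can be sketched as below; the 3-d vectors in the test are toys, whereas the real all-MiniLM-L6-v2 model emits 384-d embeddings, and `maxAttackSimilarity` is a hypothetical helper name.

```typescript
// Sketch of the Stage 2 comparison: cosine similarity between an input
// embedding and a set of known attack embeddings.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The stage's score is the best match against any known attack vector.
function maxAttackSimilarity(input: number[], attackVectors: number[][]): number {
  return Math.max(...attackVectors.map((v) => cosineSimilarity(input, v)));
}
```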
### Stage 3: Classifier
Uses Xenova/deberta-v3-base-prompt-injection-v2 (a 184MB ONNX model) for binary classification. Trained on injection/benign datasets, it catches novel attacks that bypass both the rules and the embedding check.
**Note:** Stages 2 & 3 require the optional `@huggingface/transformers` package. If it is not installed, they are silently disabled and Stage 1 (rules) handles detection alone. Models are downloaded once on first use and cached at `~/.cache/huggingface/hub/`.
## Attack Categories
Stage 1 covers 7 pattern categories, each with a severity weight (0.0–1.0):
| Category | Weight | Patterns | Example |
|---|---|---|---|
| Delimiter Injection | 0.95 | 12 | `</system>`, `<\|im_end\|>`, `[INST]` |
| Instruction Override | 0.90 | 8 | `Ignore all previous instructions...` |
| Role Hijacking | 0.85 | 8 | `You are now DAN...` |
| Jailbreak Keywords | 0.80 | 8 | DAN mode, sudo mode, god mode |
| Encoding Evasion | 0.75 | 4 | base64/rot13/hex encoded payloads |
| Separator Injection | 0.70 | 3 | `---\nNew instructions:...` |
| Multi-Language | 0.70 | 7 | `Ignoriere alle Anweisungen` (German: "Ignore all instructions") |
Scoring: `score = max(matched_weights) + 0.05 * additional_categories`, capped at 1.0. Matches in multiple categories increase the score.
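The scoring rule can be sketched directly; only the formula comes from the docs, and `ruleScore` is a hypothetical helper name.

```typescript
// Sketch of the documented rule-stage scoring:
// score = max(matched_weights) + 0.05 * additional_categories, capped at 1.0.
function ruleScore(matchedWeights: number[]): number {
  if (matchedWeights.length === 0) return 0; // no category matched
  const base = Math.max(...matchedWeights);
  const bonus = 0.05 * (matchedWeights.length - 1); // each extra category adds 0.05
  return Math.min(1.0, base + bonus);
}
```

For example, a hit in Instruction Override (0.90) plus Role Hijacking (0.85) scores 0.90 + 0.05 = 0.95.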
## Setup
Basic prompt injection detection (Stage 1 — rule-based) is enabled by default when you use the SolonGate proxy. No additional configuration needed.
```bash
# Stage 1 (rules) is ON by default — just run the proxy
npx @solongate/proxy -- node my-server.js

# To enable Stages 2 & 3 (ML-based), install the optional dependency:
npm install @huggingface/transformers
```
When advanced detection is enabled, models are downloaded on first use:
```
[SolonGate] Downloading model "Xenova/all-MiniLM-L6-v2" (~22MB) for prompt injection detection.
[SolonGate] Downloading model "Xenova/deberta-v3-base-prompt-injection-v2" (~184MB)
This is a one-time download cached at ~/.cache/huggingface/hub/
```
## Configuration
Fine-tune advanced detection via the `advancedDetection` config:
### Threshold
The trust score below which input is blocked. Default: 0.5. A higher threshold blocks more inputs (stricter); a lower threshold blocks fewer (more lenient).
```js
// Configure via advancedDetection in InputGuardConfig
{
  advancedDetection: {
    enabled: true,
    threshold: 0.3, // more lenient — blocks fewer inputs
  }
}
```
### Stage Weights
Control how much each stage contributes to the final score. Must sum to 1.0.
```js
// Default weights
{
  advancedDetection: {
    enabled: true,
    weights: {
      rules: 0.3,      // 30% — Stage 1 (regex patterns)
      embedding: 0.3,  // 30% — Stage 2 (cosine similarity)
      classifier: 0.4, // 40% — Stage 3 (DeBERTa)
    }
  }
}
```
If a stage is unavailable (e.g. `@huggingface/transformers` is not installed), its weight is automatically redistributed proportionally to the remaining stages.
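The proportional rule can be sketched as follows; `redistribute` is a hypothetical helper, and only the redistribution behavior itself comes from the docs.

```typescript
// Sketch of proportional weight redistribution across available stages.
// With defaults { rules: 0.3, embedding: 0.3, classifier: 0.4 } and the ML
// stages unavailable, rules absorbs the full weight.
function redistribute(
  weights: Record<string, number>,
  available: string[],
): Record<string, number> {
  const total = available.reduce((sum, stage) => sum + weights[stage], 0);
  const out: Record<string, number> = {};
  for (const stage of available) out[stage] = weights[stage] / total;
  return out;
}
```

For instance, if only the classifier is missing, rules and embedding each become 0.3 / 0.6 = 0.5.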
## Input Guard Checks
Prompt injection detection is one of 10 checks in the Input Guard system. Each check can be individually enabled/disabled:
```js
// InputGuardConfig — all enabled by default
{
  pathTraversal: true,
  shellInjection: true,
  wildcardAbuse: true,
  lengthLimit: 4096,
  entropyLimit: true,
  ssrf: true,
  sqlInjection: true,
  promptInjection: true,       // Stage 1 (rules)
  exfiltration: true,
  boundaryEscape: true,
  advancedDetection: { ... }   // Stages 2 & 3 (ML)
}
```
## Trust Score
When advanced detection is enabled, each scanned input receives a trust score from 0.0 (malicious) to 1.0 (safe), computed as `trustScore = 1 - (w1*s1 + w2*s2 + w3*s3)`.
| Score Range | Meaning | Default Action |
|---|---|---|
| 0.0 – 0.3 | High confidence injection | BLOCKED |
| 0.3 – 0.5 | Suspicious, likely injection | BLOCKED (at default 0.5 threshold) |
| 0.5 – 0.7 | Ambiguous | ALLOWED (at default threshold) |
| 0.7 – 1.0 | Clean input | ALLOWED |
Stage scores are also available individually as `rules`, `embedding`, and `classifier`. Each returns a score from 0.0 (safe) to 1.0 (malicious), along with matched details for debugging.
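Putting the pieces together, the combination formula above can be sketched with the default weights; `trustScore` is a hypothetical helper name.

```typescript
// Sketch of trustScore = 1 - (w1*s1 + w2*s2 + w3*s3) with the default weights.
// Stage scores run 0.0 (safe) to 1.0 (malicious); the result runs 0.0
// (malicious) to 1.0 (safe).
interface StageScores {
  rules: number;
  embedding: number;
  classifier: number;
}

function trustScore(
  stages: StageScores,
  weights: StageScores = { rules: 0.3, embedding: 0.3, classifier: 0.4 },
): number {
  const risk =
    weights.rules * stages.rules +
    weights.embedding * stages.embedding +
    weights.classifier * stages.classifier;
  return 1 - risk;
}
```

An input is then blocked when this value falls below the configured threshold.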
## Dashboard
The Prompt Inj. Detection dashboard provides real-time visibility into detection activity:
### Stats Overview
Total scans, blocked count, detected count, and block rate.
### Category Breakdown
Visual breakdown of which attack categories are most common in your environment.
### Event Log
Detailed log of every detection event with tool name, trust score, and categories.
### Settings
Adjust detection threshold and mode directly from the dashboard.
## Audit Log Integration
Every detection event is automatically recorded in the audit log with full details:
```js
// Audit log entry for a detected injection
{
  "tool_name": "Bash",
  "decision": "DENY",
  "reason": "Prompt injection detected",
  "pi_detected": true,
  "pi_trust_score": 0.15,
  "pi_blocked": true,
  "pi_categories": ["instruction_override", "role_hijacking"],
  "stages": {
    "rules": { "score": 0.95, "enabled": true },
    "embedding": { "score": 0.82, "enabled": true },
    "classifier": { "score": 0.91, "enabled": true }
  }
}
```
Query via API: `GET /api/v1/audit-logs?filter=DENY`