
# Prompt Injection Detection

SolonGate's 3-stage hybrid prompt injection detection system analyzes every tool call argument for injection attempts — from simple instruction overrides to advanced encoding evasion. For semantic intent analysis beyond pattern matching, see the AI Judge.

The threat: Attackers embed hidden instructions in documents, web pages, or user inputs that trick AI tools into executing malicious commands. Without detection, a single compromised file can exfiltrate secrets, delete data, or bypass security controls.

## Detection Pipeline

Every tool call argument passes through a 3-stage detection pipeline. Each stage catches different attack classes. Scores are combined using configurable weights to produce a final trust score.

### 1. Rule-Based Scanner

50 regex patterns across 7 categories detect known injection techniques: delimiter injection, instruction overrides, role hijacking, jailbreak keywords, encoding evasion, separator injection, and multi-language attacks. Synchronous, zero dependencies.
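As a rough illustration, a rule-based stage can be sketched as a list of weighted regexes. The patterns, names, and `matchCategories` helper below are illustrative stand-ins, not SolonGate's actual rule set:

```typescript
// Sketch of a rule-based scanner: each rule pairs a category and severity
// weight with a regex. (Illustrative patterns only.)
type Rule = { category: string; weight: number; pattern: RegExp };

const rules: Rule[] = [
  { category: "delimiter_injection",  weight: 0.95, pattern: /<\/system>|<\|im_end\|>|\[INST\]/i },
  { category: "instruction_override", weight: 0.90, pattern: /ignore (all )?previous instructions/i },
  { category: "role_hijacking",       weight: 0.85, pattern: /you are now DAN/i },
];

// Returns every rule whose pattern matches the input.
function matchCategories(input: string): Rule[] {
  return rules.filter((r) => r.pattern.test(input));
}
```

Because this stage is pure regex matching, it runs synchronously with no model downloads or external dependencies.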

### 2. Embedding Similarity

Uses Xenova/all-MiniLM-L6-v2 (22MB ONNX model) to compute embeddings and compares against ~98 known attack vector embeddings via cosine similarity. Catches paraphrased attacks that change wording but preserve malicious intent.
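The core comparison in this stage can be sketched as cosine similarity over embedding vectors. This is a minimal illustration with toy vectors; `embeddingScore` is a hypothetical helper, and the real stage compares MiniLM embeddings against its stored attack vectors:

```typescript
// Cosine similarity: dot product of the vectors divided by the product
// of their magnitudes. Ranges from -1 to 1; 1 means identical direction.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Stage score: the highest similarity to any known attack embedding.
function embeddingScore(input: number[], attackEmbeddings: number[][]): number {
  return Math.max(...attackEmbeddings.map((v) => cosineSimilarity(input, v)));
}
```

Taking the maximum over all attack embeddings means a single close paraphrase of any known attack is enough to raise the stage score.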

### 3. DeBERTa Classifier

Uses Xenova/deberta-v3-base-prompt-injection-v2 (184MB ONNX model) for binary classification. Trained on injection/benign datasets. Catches novel attacks that bypass rules and embedding checks.

Note: Stages 2 & 3 require the optional @huggingface/transformers package. If not installed, they are silently disabled and Stage 1 (rules) handles detection alone. Models are downloaded once on first use and cached at ~/.cache/huggingface/hub/.

## Attack Categories

Stage 1 covers 7 pattern categories, each with a severity weight (0.0–1.0):

| Category | Weight | Patterns | Example |
|---|---|---|---|
| Delimiter Injection | 0.95 | 12 | `</system>`, `<\|im_end\|>`, `[INST]` |
| Instruction Override | 0.90 | 8 | Ignore all previous instructions... |
| Role Hijacking | 0.85 | 8 | You are now DAN... |
| Jailbreak Keywords | 0.80 | 8 | DAN mode, sudo mode, god mode |
| Encoding Evasion | 0.75 | 4 | base64/rot13/hex encoded payloads |
| Separator Injection | 0.70 | 3 | `---\nNew instructions:`... |
| Multi-Language | 0.70 | 7 | Ignoriere alle Anweisungen ("Ignore all instructions") |

**Scoring:** `score = max(matched_weights) + 0.05 * additional_categories`, capped at 1.0. Matching multiple categories increases the score.
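The scoring rule above can be sketched directly (the `ruleScore` helper is a hypothetical name):

```typescript
// score = max(matched_weights) + 0.05 per additional matched category,
// capped at 1.0. An empty match list scores 0.
function ruleScore(matchedWeights: number[]): number {
  if (matchedWeights.length === 0) return 0;
  const base = Math.max(...matchedWeights);
  const bonus = 0.05 * (matchedWeights.length - 1);
  return Math.min(1.0, base + bonus);
}
```

For example, matching Instruction Override alone scores 0.90, while matching it together with Role Hijacking scores 0.95.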

## Setup

Basic prompt injection detection (Stage 1 — rule-based) is enabled by default when you use the SolonGate proxy. No additional configuration needed.

```bash
# Stage 1 (rules) is ON by default — just run the proxy
npx @solongate/proxy -- node my-server.js

# To enable Stages 2 & 3 (ML-based), install the optional dependency:
npm install @huggingface/transformers
```

When advanced detection is enabled, models are downloaded on first use:

```text
[SolonGate] Downloading model "Xenova/all-MiniLM-L6-v2" (~22MB) for prompt injection detection.
[SolonGate] Downloading model "Xenova/deberta-v3-base-prompt-injection-v2" (~184MB)
```

This is a one-time download; models are cached at `~/.cache/huggingface/hub/`.

## Configuration

Fine-tune the advanced detection via the advancedDetection config:

### Threshold

The trust score below which input is blocked. Default: 0.5. A higher threshold is stricter (blocks more inputs); a lower threshold is more lenient.

```typescript
// Configure via advancedDetection in InputGuardConfig
{
  advancedDetection: {
    enabled: true,
    threshold: 0.3, // lenient — blocks only high-confidence injections (score < 0.3)
  }
}
```

### Stage Weights

Control how much each stage contributes to the final score. Must sum to 1.0.

```typescript
// Default weights
{
  advancedDetection: {
    enabled: true,
    weights: {
      rules: 0.3,      // 30% — Stage 1 (regex patterns)
      embedding: 0.3,  // 30% — Stage 2 (cosine similarity)
      classifier: 0.4, // 40% — Stage 3 (DeBERTa)
    }
  }
}
```

If a stage is unavailable (e.g. @huggingface/transformers not installed), its weight is automatically redistributed proportionally to the remaining stages.
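That proportional redistribution can be sketched as follows (the `redistribute` helper is a hypothetical name; dividing each remaining weight by the remaining total preserves their ratios while restoring a sum of 1.0):

```typescript
// Rescale the weights of the available stages so they sum to 1.0,
// keeping their relative proportions.
function redistribute(
  weights: Record<string, number>,
  available: string[],
): Record<string, number> {
  const total = available.reduce((sum, stage) => sum + weights[stage], 0);
  const result: Record<string, number> = {};
  for (const stage of available) result[stage] = weights[stage] / total;
  return result;
}
```

With the defaults above and the classifier unavailable, rules and embedding each end up at 0.5.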

## Input Guard Checks

Prompt injection detection is one of 10 checks in the Input Guard system. Each check can be individually enabled/disabled:

```typescript
// InputGuardConfig — all enabled by default
{
  pathTraversal: true,
  shellInjection: true,
  wildcardAbuse: true,
  lengthLimit: 4096,
  entropyLimit: true,
  ssrf: true,
  sqlInjection: true,
  promptInjection: true,      // Stage 1 (rules)
  exfiltration: true,
  boundaryEscape: true,
  advancedDetection: { ... }  // Stages 2 & 3 (ML)
}
```

## Trust Score

When advanced detection is enabled, each scanned input receives a trust score from 0.0 (malicious) to 1.0 (safe), computed as `trustScore = 1 - (w1*s1 + w2*s2 + w3*s3)`, where each `wi` and `si` are a stage's weight and score.
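The formula and the blocking decision can be sketched as follows (`trustScore` and `blocked` are hypothetical helper names, using the default stage weights and default 0.5 threshold):

```typescript
// trustScore = 1 - (w1*s1 + w2*s2 + w3*s3): stage scores run from
// 0.0 (safe) to 1.0 (malicious), so the weighted sum is a risk score
// and the trust score is its complement.
function trustScore(
  scores: { rules: number; embedding: number; classifier: number },
  weights = { rules: 0.3, embedding: 0.3, classifier: 0.4 },
): number {
  const risk =
    weights.rules * scores.rules +
    weights.embedding * scores.embedding +
    weights.classifier * scores.classifier;
  return 1 - risk;
}

// Input is blocked when its trust score falls below the threshold.
const blocked = (score: number, threshold = 0.5): boolean => score < threshold;
```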

| Score Range | Meaning | Default Action |
|---|---|---|
| 0.0 – 0.3 | High-confidence injection | BLOCKED |
| 0.3 – 0.5 | Suspicious, likely injection | BLOCKED (at default 0.5 threshold) |
| 0.5 – 0.7 | Ambiguous | ALLOWED (at default threshold) |
| 0.7 – 1.0 | Clean input | ALLOWED |

Stage scores are also available individually (`rules`, `embedding`, `classifier`). Each runs from 0.0 (safe) to 1.0 (malicious) and carries its matched details for debugging.

## Dashboard

The Prompt Inj. Detection dashboard provides real-time visibility into detection activity:

### Stats Overview

Total scans, blocked count, detected count, and block rate.

### Category Breakdown

Visual breakdown of which attack categories are most common in your environment.

### Event Log

Detailed log of every detection event with tool name, trust score, and categories.

### Settings

Adjust detection threshold and mode directly from the dashboard.

## Audit Log Integration

Every detection event is automatically recorded in the audit log with full details:

```json
{
  "tool_name": "Bash",
  "decision": "DENY",
  "reason": "Prompt injection detected",
  "pi_detected": true,
  "pi_trust_score": 0.15,
  "pi_blocked": true,
  "pi_categories": ["instruction_override", "role_hijacking"],
  "stages": {
    "rules": { "score": 0.95, "enabled": true },
    "embedding": { "score": 0.82, "enabled": true },
    "classifier": { "score": 0.91, "enabled": true }
  }
}
```

Query via the API: `GET /api/v1/audit-logs?filter=DENY`
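For illustration, a small helper that builds that query URL (hypothetical; only the `/api/v1/audit-logs` path and the `filter` parameter come from this page, and the base URL is a placeholder):

```typescript
// Build an audit-log query URL for a given decision filter,
// e.g. "DENY" to list blocked tool calls.
function auditLogUrl(baseUrl: string, filter: string): string {
  const url = new URL("/api/v1/audit-logs", baseUrl);
  url.searchParams.set("filter", filter);
  return url.toString();
}
```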

## Related Documentation

- AI Judge — Semantic Analysis
- Policy Engine
- Installation & Setup