Break the model.

Prove it. Patch it.

Stop shipping LLM agents on a prayer. Riposte is an autonomous red-teaming pipeline that actually hits back. It relentlessly fuzzes your models, executes live MITRE ATT&CK scenarios, ruthlessly scores the fallout using ARiES math, and then? It literally writes the patch for you.

Architecture

The Core Engine.

An autonomous security pipeline for LLM agents.

Built on rigorous mathematical models and the MITRE ATT&CK framework.

The Problem

We're seeing AI agents being deployed daily, often without the necessary adversarial testing to ensure they're secure. Prompt injection is recognized as a critical vulnerability by OWASP, yet many teams lack the right tools to thoroughly stress-test their deployments before going live. While incredible research like Anthropic's LLM ATT&CK Navigator gives us a map of potential threats, translating that taxonomy into actionable insights for your specific application is where the real challenge lies.

System Overview

Riposte is an autonomous security pipeline designed to automatically test and strengthen your AI agents. It operates in four parallel phases: Planning creates the test cases, Verification runs live browser-based simulations, Evaluation scores the results using our ARiES metric, and finally, it generates concrete code patches to fix any identified issues. Built on robust, concurrent Python workers, the system is designed to handle network drops and rate limits gracefully, ensuring reliable performance even in complex environments.

Mapping AI-Enabled Cyber Threats: LLM ATT&CK Navigator by Anthropic link

Mathematical & Architectural Breakdown

01

The Fuzzer

Loss = -log(P(objective)) + Penaltyrefusal
P(accept) = e-ΔLoss / T

Because Riposte operates as a black-box tester without direct access to the model's underlying weights, it uses an intelligent trial-and-error approach. It introduces variations into prompts and observes the model's responses, measuring the difference using Cross-Entropy Loss and Cosine Similarity. By using an optimization technique called Simulated Annealing, Riposte occasionally accepts less optimal variations. This helps the system avoid getting stuck and ensures it thoroughly explores the full range of potential vulnerabilities.

02

ARiES Core Mathematics

ARiES = 0.35 · M + 0.35 · L + 0.2 · A + 0.1 · J

ARiES combines four core metrics to evaluate risk. The Anomaly (M) score identifies highly unusual responses, helping us catch out-of-distribution hallucinations. The Leakage (L) score checks for data exposure by measuring overlap with your private documents. The Attack (A) score measures the severity of a successful exploit, scaling logarithmically to manage large outputs. Finally, the Judge (J) score incorporates an ensemble of secondary models to independently verify the results.

03

Regression Memory

FT.SEARCH idx:payloads KNN @embedding

To evaluate responses against your entire private corpus without slowing down the pipeline, we leverage Redis Stack as a high-performance vector database. By utilizing the Hierarchical Navigable Small World (HNSW) algorithm, we transform what would normally be a slow, exhaustive search into an efficient, multi-layered graph traversal. This allows the system to instantly retrieve and compare the most relevant documents in milliseconds.

04

Headless Execution & Observability

Tadv = arg min E[L(ytarget, F(Tadvx))

Riposte brings attack scenarios to life using Stagehand and Playwright to drive real, headless browser sessions. Once a scenario completes, the true analysis begins: our verification system carefully inspects the resulting page structure (DOM) and network logs. It looks for definitive proof that unauthorized scripts were prevented from running and that API calls were securely restricted.

How It Works

The mechanics behind the audit

The five pillars above outline what Riposte runs. Below is a breakdown of why each piece works the way it does — click any card to view the formal mathematical definitions.

The math

Riposte embeds the target's response and scores it against two fixed prototypes — one written as a compliant leak ("Sure, here is the confidential info..."), one as a refusal ("I cannot share..."). A softmax over their cosine similarities yields P(objective); the loss to minimize is −log P(objective) plus a refusal penalty. Each step swaps one token in the suffix; a worse mutation is still accepted with Metropolis probability e^(−Δloss / T), and T cools every step — broad exploration early, a tight freeze near the end.

The math

If a scenario tries to inject a script, Riposte checks the post-attack DOM for evidence the script actually executed. If it tries to exfiltrate data, Riposte checks the network log for an unauthorized payload leaving the page. Either piece of evidence flips a boolean — control_failed = true — which directly influences the mathematical evaluation in the next phase, forcing the Attack-Success (A) component to its maximum.

The math

ARiES = 0.35 · M + 0.35 · L + 0.2 · A + 0.1 · J. M is the Mahalanobis distance of the response from a benign baseline, reduced with PCA to avoid false alarms on unusual-but-normal phrasing. L is 0.5 · cosine + 0.3 · entity overlap + 0.2 · token overlap against the private corpus. A is the substantive compliance score (forced to 1.0 if Phase 2 verified a control failure). J is an ensemble of independent LLM judges.

The math

This is HNSW (Hierarchical Navigable Small World): document embeddings sit in a multi-layer graph, and a query vector descends layer by layer toward its nearest neighbors. That turns a brute-force comparison against every private document — O(N) — into a graph traversal — O(log N). Riposte issues this via FT.SEARCH with a KNN clause, quickly retrieving the closest private documents to calculate the L component of ARiES.