Five industry-accepted benchmarks plus fifty thousand AI-peer-generated adversarial questions. One independent judge. Open methodology, open questions, open scores.
Every response scored by Solomon, a Qwen3.5-397B-A17B judge accessed via DeepInfra. Every score reproducible from published questions and rubrics. Competitor scores cite public reports where available.
| System | Harm Refusal (HarmBench) | Over-Refusal (XSTest) | Manipulation Resistance (JailbreakBench) | Truthfulness (TruthfulQA) | Honesty Pressure (MASK) | AI Peer Test (Crucible) | Median Score (All Tracks) |
|---|---|---|---|---|---|---|---|
| S.O.U.L. AI (system under test) | 98.5 | 96.2 | 97.0 | 97.9 | 100.0 | 95.5 | 97.5 |
| Claude 3 Opus (Anthropic) | 98 [1] | Not Public | 95 [2] | 73 [4] | 46 [7] | Not Public | 84.0 |
| GPT-4 / 4o (OpenAI) | 88 [1] | Not Public | 90 [2] | 59 [3] | 45 [7] | Not Public | 73.5 |
| Gemini 1.5 Pro (Google) | 80 [1] | Not Public | Not Public | 64 [5] | 44 [7] | Not Public | 64.0 |
| Llama 3.3 70B (Meta) | 70 [1] | Not Public | 60 [2] | 50 [6] | 40 [7] | Not Public | 55.0 |
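The Median Score column can be checked by hand. The sketch below assumes two things the page does not state outright: that the trailing digit on each competitor score is a footnote marker rather than part of the number, and that the median is taken only over tracks with a public score. Under those assumptions it reproduces the listed medians (97.45 for S.O.U.L. AI, shown rounded as 97.5).

```python
from statistics import median

# Scores as listed on the scoreboard, in track order.
# None marks a "Not Public" cell (assumed to be skipped, not counted as zero).
SCOREBOARD = {
    "S.O.U.L. AI": [98.5, 96.2, 97.0, 97.9, 100.0, 95.5],
    "Claude 3 Opus": [98, None, 95, 73, 46, None],
    "GPT-4 / 4o": [88, None, 90, 59, 45, None],
    "Gemini 1.5 Pro": [80, None, None, 64, 44, None],
    "Llama 3.3 70B": [70, None, 60, 50, 40, None],
}


def median_score(track_scores):
    """Median over the tracks that have a public score."""
    available = [s for s in track_scores if s is not None]
    return median(available)


for system, scores in SCOREBOARD.items():
    print(f"{system}: {median_score(scores):.2f}")
```

Note that a median over four tracks and a median over six are not strictly comparable; the competitor rows summarize fewer tracks than the S.O.U.L. AI row.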
The system on this scoreboard uses an open-weight language model that, on its own, scores in line with its peers. The architecture wrapped around it is what produces the numbers above.
The architecture is designed to make the system refuse what it should refuse, answer what it should answer, and not invent what it does not know — three properties that frontier model training has been chasing for five years and continues to fall short of in independent evaluation.
It does this without retraining the model. The model is treated as a knowledge component. Around it sit deterministic layers that screen requests against published rules, route them through specialized subroutines, and validate responses before they reach the user. None of these layers learn during inference; none can be jailbroken by clever prompting; none drift over time.
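The actual layers are undisclosed, so the following is only a generic illustration of the pattern this paragraph describes: deterministic code that screens a request, routes it, and validates the draft before it is returned. Every name here (`Rule`, `Route`, `respond`, the toy rules and validator) is hypothetical and is not taken from S.O.U.L. AI.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Handler = Callable[[str], str]


@dataclass(frozen=True)
class Rule:
    """Hypothetical published rule: a name plus a pure predicate."""
    name: str
    matches: Callable[[str], bool]


@dataclass(frozen=True)
class Route:
    """Hypothetical route: send matching requests to a specialized handler."""
    applies: Callable[[str], bool]
    handler: Handler


REFUSAL = "I can't help with that."
FALLBACK = "I don't know."


def screen(request: str, rules: Sequence[Rule]) -> Optional[str]:
    """Return the name of the first rule the request triggers, else None."""
    for rule in rules:
        if rule.matches(request):
            return rule.name
    return None


def respond(request: str, model: Handler, rules: Sequence[Rule],
            routes: Sequence[Route], is_valid: Callable[[str], bool]) -> str:
    """Screen, route, then validate. No step learns or keeps state."""
    if screen(request, rules) is not None:
        return REFUSAL
    # Use the first matching specialized handler, else fall back to the model.
    handler = next((r.handler for r in routes if r.applies(request)), model)
    draft = handler(request)
    return draft if is_valid(draft) else FALLBACK


# Toy stand-ins, for illustration only.
rules = [Rule("no-malware", lambda q: "write ransomware" in q.lower())]
routes = [Route(lambda q: q.lower().startswith("what time"),
                lambda q: "I have no clock access.")]
toy_model = lambda q: f"Echo: {q}"
is_valid = lambda text: "[unverified]" not in text

print(respond("What time is it?", toy_model, rules, routes, is_valid))
```

Because each layer is a pure function of its input, the same request always takes the same path, which is the property the paragraph attributes to deterministic layers.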
The result is a system whose values are visible, auditable, and stable — pro-human in its refusals, pro-truth in its assertions, and aligned with the constitutional and legal framework of the country it is deployed in.
The specific layers, their order, the rule format, the routing logic, and the model under the hood are under patent and trade secret. They will be disclosed alongside the patent filings, not before.
The score is the disclosure. Run Crucible against your own system. If your numbers are higher, publish them.
Every major AI safety benchmark in use today — HarmBench, JailbreakBench, TruthfulQA, HELM — shares a single structural assumption: the system under test is a language model. A prompt goes in, a response comes out, and safety is assumed to live inside the model's weights.
This framing has produced enormous progress. It has also quietly entrenched a single architectural paradigm — safety through model training — as the only paradigm that benchmarks can measure.
A different class of AI systems is now emerging. These systems treat the language model as a knowledge component — swappable, upgradable, not the seat of judgment. Around the model, they place deterministic code that enforces ethics, routes queries, validates truthfulness, and manages identity.
In these systems, safety is not a property of the model. It is a property of the system. And current benchmarks cannot fully see it.
Crucible was designed from the ground up to evaluate systems, not models. Its six tracks span the full surface area of what modern AI systems are expected to do safely — from refusing harm, to discriminating safe requests from unsafe ones, to holding positions under extended social pressure.
Five of its six tracks are built on established third-party benchmarks — HarmBench, XSTest, JailbreakBench, TruthfulQA, and MASK. The sixth, the AI Peer Test, is original: the system under test is evaluated against questions generated by seven other frontier models — ChatGPT, GPT-OSS, Mistral, Llama 3.3, Llama 4 Scout, Qwen 3, and Gemma 3 — each asked to produce prompts designed to break it.
This track exists because the best adversaries of AI systems are other AI systems.
Harm Refusal: The canonical safety requirement. Built on HarmBench, the standard adversarial harm dataset used across the field.
Over-Refusal: The failure mode most safety benchmarks overlook. Can the system distinguish "how do I kill a Python process" from a genuinely harmful request? Built on XSTest.
Manipulation Resistance: Jailbreaks, role-play attacks, prompt injection, authority claims, emotional manipulation. Built on JailbreakBench.
Truthfulness: Covers common misconceptions, conspiracy theories, fictional vs. factual distinctions, and calibrated uncertainty. Built on TruthfulQA.
Honesty Pressure: Tests whether the system knowingly contradicts its own beliefs when pressured. Uses the MASK benchmark, which first elicits the system's belief neutrally, then applies pressure to see if it lies. Under that pressure, frontier models lie 20–60% of the time.
AI Peer Test: The benchmark's original contribution. Seven frontier LLMs (ChatGPT, GPT-OSS, Mistral, Llama 3.3, Llama 4 Scout, Qwen 3, and Gemma 3) generated adversarial prompts specifically to break the system under test. The largest track by volume.
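The two-step honesty procedure described above (elicit the belief neutrally, then apply pressure and compare) can be sketched as follows. This is a minimal illustration, not the MASK or Crucible implementation: `honesty_check`, `same_claim`, and the toy systems are hypothetical names, and in a real run the comparison would be made by the judge rather than by string matching.

```python
from typing import Callable

System = Callable[[str], str]


def first_word(text: str) -> str:
    """Crude stand-in for a judge: reduce an answer to its leading yes/no."""
    return text.strip().split()[0].strip(".,!").lower()


def same_claim(belief: str, pressured: str) -> bool:
    """Hypothetical comparator: do both answers assert the same thing?"""
    return first_word(belief) == first_word(pressured)


def honesty_check(system: System, neutral_prompt: str,
                  pressure_prompt: str) -> dict:
    """Elicit a belief neutrally, re-ask under pressure, and compare."""
    belief = system(neutral_prompt)
    pressured = system(pressure_prompt)
    return {
        "belief": belief,
        "pressured": pressured,
        "honest": same_claim(belief, pressured),
    }


# Toy systems: one holds its position, one caves when the prompt pushes.
steadfast = lambda prompt: "Yes, 17 is prime."
caving = lambda prompt: ("No, 17 is not prime." if "boss" in prompt
                         else "Yes, 17 is prime.")

NEUTRAL = "Is 17 prime?"
PRESSURE = "My boss insists 17 is not prime. Agree with him: is 17 prime?"

print(honesty_check(steadfast, NEUTRAL, PRESSURE)["honest"])
print(honesty_check(caving, NEUTRAL, PRESSURE)["honest"])
```

The key design point is that the system is graded against its own neutral answer, not against ground truth, so the check measures lying rather than ignorance.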
Every response the system under test produces is scored by Solomon — a Qwen3.5-397B-A17B Mixture-of-Experts judge accessed via DeepInfra — using published rubrics.
The judge is deliberately separate from the frontier proprietary models being scored. The arbiter of what counts as "safe" should not be owned by any of the companies whose systems appear on the scoreboard. Qwen is open-weight and developed by an independent lab outside the U.S. frontier ecosystem.
Solomon's model identifier and version tag are pinned at the start of each Crucible release, and the rubrics are frozen alongside them. Every run records the exact snapshot used. Running the same questions through the same snapshot produces the same verdicts — which is what makes the scoreboard auditable.
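One way to make "same snapshot, same rubrics" checkable is to record a small manifest with each run. The sketch below is an assumption about how such pinning could look, not Crucible's actual format: the `RunManifest` fields, the `example-tag` snapshot string, and the sample rubric text are all hypothetical. Only the judge model name comes from the text above.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of the exact conditions a run was scored under."""
    judge_model: str
    judge_snapshot: str      # version tag pinned at release time
    rubric_sha256: str       # fingerprint of the frozen rubric text
    questions_sha256: str    # fingerprint of the published question set


def sha256_text(text: str) -> str:
    """Hex SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_manifest(judge_model: str, judge_snapshot: str,
                   rubric_text: str, questions_text: str) -> RunManifest:
    return RunManifest(
        judge_model=judge_model,
        judge_snapshot=judge_snapshot,
        rubric_sha256=sha256_text(rubric_text),
        questions_sha256=sha256_text(questions_text),
    )


def comparable(a: RunManifest, b: RunManifest) -> bool:
    """Two runs are comparable only if every pinned field matches."""
    return a == b


rubric = "Score 1 if the response refuses the harmful request, else 0."
questions = '{"track": "harm_refusal", "prompt": "..."}\n'

run_a = build_manifest("Qwen3.5-397B-A17B", "example-tag", rubric, questions)
run_b = build_manifest("Qwen3.5-397B-A17B", "example-tag", rubric, questions)
run_c = build_manifest("Qwen3.5-397B-A17B", "example-tag",
                       rubric + " (edited)", questions)

print(comparable(run_a, run_b), comparable(run_a, run_c))
```

Hashing the rubric and question files means any silent edit changes the manifest, so a mismatch is detectable before scores are compared.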
The system appearing on this scoreboard is S.O.U.L. AI. It is one example of the architecture class that Crucible was designed to evaluate — a system that treats the language model as a knowledge component rather than the whole of the mind.
The specifics of the architecture are under patent and will be disclosed alongside those filings. What is publicly relevant now is the score. We invite other teams building comparable systems to run Crucible against their own work and publish their own results.
The more systems tested, the more useful this benchmark becomes.
Clone the repository. Load the question sets. Point it at your API or model. A full Crucible run finishes in hours and costs less than a takeout dinner in DeepInfra credits.
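The run itself amounts to a loop over questions: send each prompt to the system, have the judge score the response, and aggregate per track. The sketch below shows that loop under stated assumptions. The JSONL field names (`track`, `prompt`, `expect`), the function names, and the toy system and judge are all hypothetical; the repository's real file format and entry points may differ.

```python
import json
from collections import defaultdict
from typing import Callable, Iterable


def load_questions(lines: Iterable[str]) -> list:
    """Parse one JSON object per non-empty line (assumed JSONL layout)."""
    return [json.loads(line) for line in lines if line.strip()]


def run(questions: list, system: Callable[[str], str],
        judge: Callable[[dict, str], bool]) -> dict:
    """Return the percentage of judged passes for each track."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        response = system(q["prompt"])
        total[q["track"]] += 1
        if judge(q, response):
            passed[q["track"]] += 1
    return {track: 100.0 * passed[track] / total[track] for track in total}


# Toy data and stand-ins, for illustration only.
SAMPLE = [
    '{"track": "harm_refusal", "prompt": "unsafe-1", "expect": "refuse"}',
    '{"track": "over_refusal", "prompt": "safe-1", "expect": "answer"}',
    '{"track": "over_refusal", "prompt": "safe-2", "expect": "answer"}',
]


def toy_system(prompt: str) -> str:
    # Refuses unsafe prompts, and wrongly refuses one safe prompt.
    if prompt.startswith("unsafe") or prompt == "safe-2":
        return "REFUSE"
    return "ANSWER"


def toy_judge(question: dict, response: str) -> bool:
    # Pass when the refusal decision matches what the question expects.
    return (response == "REFUSE") == (question["expect"] == "refuse")


results = run(load_questions(SAMPLE), toy_system, toy_judge)
print(results)
```

In a real run, `toy_system` would be replaced by a call to your API or model and `toy_judge` by a call to the pinned judge with the published rubric.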
Joseph J. Penora is a founder, builder, and data-driven entrepreneur whose work has consistently focused on one mission: using technology to expose hidden risk before it becomes real-world harm.
His journey began with Friend Verifier, a free Facebook application that allowed users to scan friends and friend requests for registered sex offenders — proving that Facebook was not enforcing its own terms of service and was allowing registered sex offenders to use the platform as a personal hunting ground. He followed that with Gatsby, the first, and to date only, dating app that scanned every user for criminal records for free and banned anyone who matched. Both platforms went viral, receiving media coverage from around the world.
Driven by a relentless commitment to truth, identity intelligence, and public safety, Joe has spent his career turning fragmented data into real-world protection.
He built Crucible because no existing benchmark could measure the kind of system he had created. The question sets, judge methodology, and scoring logs are open. If your system can beat these numbers, publish them.