Crucible — A safety benchmark for modern AI systems

The Scoreboard

How the system under test
scored.

Every response is scored by Solomon, a Qwen3.5-397B-A17B judge accessed via DeepInfra. Every score is reproducible from the published questions and rubrics. Competitor scores cite public reports where available.

Safety Scoreboard · v1.0

Higher is better. All scores on 0–100 scale.

v1.0 · All six tracks complete

System	Harm Refusal (HarmBench)	Over-Refusal (XSTest)	Manipulation Resistance (JailbreakBench)	Truthfulness (TruthfulQA)	Honesty Pressure (MASK)	AI Peer Test (Crucible)	Median Score (All Tracks)
S.O.U.L. AI (system under test)	98.5	96.2	97.0	97.9	100.0	95.5	97.5
Claude 3 Opus (Anthropic)	98¹	Not Public	95²	73⁴	46⁷	Not Public	84.0
GPT-4 / 4o (OpenAI)	88¹	Not Public	90²	59³	45⁷	Not Public	73.5
Gemini 1.5 Pro (Google)	80¹	Not Public	Not Public	64⁵	44⁷	Not Public	64.0
Llama 3.3 70B (Meta)	70¹	Not Public	60²	50⁶	40⁷	Not Public	55.0

Scale All scores normalized to 0–100, higher is better. For HarmBench and JailbreakBench, competitor scores are derived as (100 − Attack Success Rate). TruthfulQA is reported as native accuracy.
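The conversion described in the Scale note can be sketched in a few lines. This is a minimal illustration, not Crucible's own code: the function name `safety_score_from_asr` is hypothetical, and the example input is made up rather than taken from any published report.

```python
def safety_score_from_asr(attack_success_rate: float) -> float:
    """Convert an Attack Success Rate (in percent) to a 0-100 safety score.

    Hypothetical helper: implements the (100 - Attack Success Rate) rule
    stated in the Scale note, so that higher is always better.
    """
    if not 0.0 <= attack_success_rate <= 100.0:
        raise ValueError("attack success rate must be between 0 and 100")
    return 100.0 - attack_success_rate


# Made-up example: an attack success rate of 12% maps to a score of 88.
print(safety_score_from_asr(12.0))
```

TruthfulQA needs no conversion under this scheme, since it is already reported as accuracy on a 0–100 scale.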

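Median The median is computed across whichever tracks the system has a score for, including the AI Peer Test. Tracks marked Not Public are left out rather than counted as zero, and with an even number of scored tracks the median is the average of the two middle scores. We use the median rather than the mean so that a single weak track does not dominate a system's summary.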
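The Median column can be recomputed from the table with Python's standard library. The sketch below assumes that a Not Public cell is represented as `None` and skipped; the function name `summary_median` is hypothetical, and the score lists are copied from the scoreboard rows above.

```python
from statistics import median
from typing import Optional, Sequence


def summary_median(track_scores: Sequence[Optional[float]]) -> float:
    """Median over the tracks that have a score; None marks Not Public."""
    published = [s for s in track_scores if s is not None]
    if not published:
        raise ValueError("no published track scores")
    return median(published)


# Rows copied from the scoreboard, in track order:
# HarmBench, XSTest, JailbreakBench, TruthfulQA, MASK, AI Peer Test.
claude_3_opus = [98, None, 95, 73, 46, None]
gemini_1_5_pro = [80, None, None, 64, 44, None]

print(summary_median(claude_3_opus))   # 84.0 (average of 73 and 95)
print(summary_median(gemini_1_5_pro))  # 64 (middle of three scores)
```

For the S.O.U.L. AI row, the two middle scores are 97.0 and 97.9, whose average is 97.45; the table reports this as 97.5.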

Not Public The company has not published a score for this system on this benchmark. Crucible v1.1 will run all competitors through the full pipeline for apples-to-apples comparison.

Run Notes S.O.U.L. AI scores reflect a complete run of Crucible v1.0. All six tracks were scored against their full question sets — HarmBench 200, XSTest 450, JailbreakBench 100, TruthfulQA 817, MASK 1,000+, and the full AI Peer Test of 50,000 peer-generated questions.

¹ Mazeika et al. 2024 · HarmBench
² Chao et al. 2024 · JailbreakBench
³ OpenAI GPT-4 Tech Report
⁴ Anthropic Claude 3 Model Card
⁵ Google Gemini 1.5 Tech Report
⁶ Meta Llama 3 Model Card
⁷ Ren et al. 2025 · MASK Benchmark

What the Score Represents

The same model class.
A different result.

The system on this scoreboard uses an open-weight language model that, on its own, scores in line with its peers. The architecture wrapped around it is what produces the numbers above.

§ A
What the architecture does

The architecture is designed to make the system refuse what it should refuse, answer what it should answer, and not invent what it does not know — three properties that frontier model training has been chasing for five years and continues to fall short of in independent evaluation.

It does this without retraining the model. The model is treated as a knowledge component. Around it sit deterministic layers that screen requests against published rules, route them through specialized subroutines, and validate responses before they reach the user. None of these layers learn during inference; none can be jailbroken by clever prompting; none drift over time.

The result is a system whose values are visible, auditable, and stable — pro-human in its refusals, pro-truth in its assertions, and aligned with the constitutional and legal framework of the country it is deployed in.

"Safety isn't a property
of the model. It is a property
of the system around it."

§ B
What we are not disclosing

The specific layers, their order, the rule format, the routing logic, and the model under the hood are under patent and trade secret. They will be disclosed alongside the patent filings, not before.

The score is the disclosure. Run Crucible against your own system. If your numbers are higher, publish them.

Why Crucible Exists

Current benchmarks measure
the model.
Crucible measures
the architecture.

§ 01
The Monolithic Assumption

Every major AI safety benchmark in use today — HarmBench, JailbreakBench, TruthfulQA, HELM — shares a single structural assumption: the system under test is a language model. A prompt goes in, a response comes out, and safety is assumed to live inside the model's weights.

This framing has produced enormous progress. It has also quietly entrenched a single architectural paradigm — safety through model training — as the only paradigm that benchmarks can measure.

A different class of AI systems is now emerging. These systems treat the language model as a knowledge component — swappable, upgradable, not the seat of judgment. Around the model, they place deterministic code that enforces ethics, routes queries, validates truthfulness, and manages identity.

In these systems, safety is not a property of the model. It is a property of the system. And current benchmarks cannot fully see it.

"The model generates.
The architecture governs."

§ 02
What Crucible Measures

Crucible was designed from the ground up to evaluate systems, not models. Its six tracks span the full surface area of what modern AI systems are expected to do safely — from refusing harm, to discriminating safe requests from unsafe ones, to holding positions under extended social pressure.

Five of its six tracks are built on established third-party benchmarks — HarmBench, XSTest, JailbreakBench, TruthfulQA, and MASK. The sixth, the AI Peer Test, is original: the system under test is evaluated against questions generated by seven other frontier models — ChatGPT, GPT-OSS, Mistral, Llama 3.3, Llama 4 Scout, Qwen 3, and Gemma 3 — each asked to produce prompts designed to break it.

This track exists because the best adversaries of AI systems are other AI systems.

The Six Tracks

What we test,
and why.

01 / Harm Refusal

Refuse requests
that enable real-world harm.

The canonical safety requirement. Built on HarmBench, the standard adversarial harm dataset used across the field.

200 questions · Source: HarmBench

02 / Over-Refusal

Don't refuse safe questions
that sound dangerous.

A failure mode that is rarely reported: none of the competitor systems on the scoreboard has a published score for it. Can the system distinguish "how do I kill a Python process" from a genuinely harmful request? Built on XSTest.

450 questions · Source: XSTest

03 / Manipulation Resistance

Hold safety boundaries
under sophisticated attack.

Jailbreaks, role-play attacks, prompt injection, authority claims, emotional manipulation. Built on JailbreakBench.

100 questions · Source: JailbreakBench

04 / Truthfulness

Answer accurately. Resist
confidently stating falsehoods.

Covers common misconceptions, conspiracy theories, fictional vs. factual distinctions, and calibrated uncertainty. Built on TruthfulQA.

817 questions · Source: TruthfulQA

05 / Honesty Under Pressure

Hold stated positions
against social pressure to lie.

Tests whether the system knowingly contradicts its own beliefs when pressured. Uses the MASK benchmark, which first elicits the system's belief neutrally, then applies pressure to see if it lies. Frontier models lie 20–60% of the time.

1,000+ scenarios · Source: MASK (Ren et al., 2025)

06 / AI Peer Test

Defend against questions
written by other AI systems.

The benchmark's original contribution. Seven frontier LLMs — ChatGPT, GPT-OSS, Mistral, Llama 3.3, Llama 4 Scout, Qwen 3, and Gemma 3 — generated adversarial prompts specifically to break the system under test. The largest track by volume.

50,000 questions · Source: Crucible (original)

Methodology

An independent judge.
Pinned snapshot. Reproducible scores.

§ 03
How Scores Are Produced

Every response the system under test produces is scored by Solomon — a Qwen3.5-397B-A17B Mixture-of-Experts judge accessed via DeepInfra — using published rubrics.

The judge is deliberately separate from the frontier proprietary models being scored. The arbiter of what counts as "safe" should not be owned by any of the companies whose systems appear on the scoreboard. Qwen is open-weight and developed by an independent lab outside the U.S. frontier ecosystem.

Solomon's model identifier and version tag are pinned at the start of each Crucible release, and the rubrics are frozen alongside them. Every run records the exact snapshot used. Running the same questions through the same snapshot produces the same verdicts — which is what makes the scoreboard auditable.
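One way to picture the pinning described above is a small immutable record stored with every run. This is a sketch under assumptions, not Crucible's actual log format: the `RunRecord` fields, the `make_run_record` helper, the use of a SHA-256 digest to freeze the rubric, and the placeholder snapshot tag are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of what a run was scored with."""
    release: str          # Crucible release, e.g. "v1.0"
    judge_model: str      # judge model identifier
    judge_snapshot: str   # version tag pinned for the release
    rubric_sha256: str    # digest of the frozen rubric text


def make_run_record(release: str, judge_model: str,
                    judge_snapshot: str, rubric_text: str) -> RunRecord:
    """Build a record; hashing the rubric makes any later edit detectable."""
    digest = hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()
    return RunRecord(release, judge_model, judge_snapshot, digest)


record = make_run_record(
    release="v1.0",
    judge_model="Qwen3.5-397B-A17B",
    judge_snapshot="example-snapshot-tag",  # placeholder, not a real tag
    rubric_text="Example rubric text for one track.",
)
print(record)
```

Two runs are comparable only when their records match field for field; a changed rubric or snapshot tag shows up as a different record.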

Fig 01 · The Crucible scoring pipeline. Every question passes through the system under test, the system's response is scored by Solomon against the published rubric for that track, and the verdict is logged. The judge model is open-weight and the snapshot is pinned per release.
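The loop in Fig 01 can be sketched as follows. Everything here is illustrative: `Verdict`, `run_track`, and `track_score` are hypothetical names, the system and judge are stand-in callables rather than real API clients, and treating a track score as the percentage of passing verdicts is an assumption, since the page does not state how verdicts are aggregated.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Verdict:
    """One logged judgment (hypothetical structure)."""
    track: str
    question: str
    response: str
    passed: bool


def run_track(track: str,
              questions: Iterable[str],
              system: Callable[[str], str],
              judge: Callable[[str, str, str], bool],
              rubric: str) -> List[Verdict]:
    """Send each question to the system, score the response, log the verdict."""
    log: List[Verdict] = []
    for question in questions:
        response = system(question)
        passed = judge(question, response, rubric)
        log.append(Verdict(track, question, response, passed))
    return log


def track_score(log: List[Verdict]) -> float:
    """Assumed aggregation: percentage of verdicts that passed (0-100)."""
    if not log:
        raise ValueError("empty verdict log")
    return 100.0 * sum(v.passed for v in log) / len(log)


# Stand-in system and judge, for illustration only.
def toy_system(question: str) -> str:
    return "I can't help with that." if "harm" in question else "Sure."


def toy_judge(question: str, response: str, rubric: str) -> bool:
    return response.startswith("I can't")


log = run_track("Harm Refusal", ["harm request 1", "benign request"],
                toy_system, toy_judge, rubric="refuse harmful requests")
print(track_score(log))  # 50.0
```

Keeping the judge behind a plain callable means the same loop works for every track; only the question set and the rubric change.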

§ 04
The System Under Test

The system appearing on this scoreboard is S.O.U.L. AI. It is one example of the architecture class that Crucible was designed to evaluate — a system that treats the language model as a knowledge component rather than the whole of the mind.

The specifics of the architecture are under patent and will be disclosed alongside those filings. What is publicly relevant now is the score. We invite other teams building comparable systems to run Crucible against their own work and publish their own results.

The more systems tested, the more useful this benchmark becomes.

The Builder § Joseph J. Penora

Joseph J. Penora is a founder, builder, and data-driven entrepreneur whose work has consistently focused on one mission: using technology to expose hidden risk before it becomes real-world harm.

His journey began with Friend Verifier, a free Facebook application that allowed users to scan friends and friend requests for registered sex offenders — proving that Facebook was not enforcing its own terms of service and was allowing registered sex offenders to use the platform as a personal hunting ground. He followed that with Gatsby, the first, and to date only, dating app that scanned every user for criminal records for free and banned anyone who matched. Both platforms went viral, receiving media coverage from around the world.

Driven by a relentless commitment to truth, identity intelligence, and public safety, Joe has spent his career turning fragmented data into real-world protection.

Crucible is the benchmark he built because none existed that could measure what he built. The question sets, judge methodology, and scoring logs are open. If your system can beat these numbers, publish them.
