
Red-team tournament

Each week the firm runs the adversarial peer-review swarm against a frozen set of ten conclusions, rotating reviewer configurations — provider mix, prompt variant, temperature, seed. Each configuration's id is a content-addressable hash of those inputs, so the same row means the same thing across runs. The leaderboard below is the firm's honest answer to which configurations it trusts and why.
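One way such a content-addressable id could be derived is sketched below. The hash function (SHA-256), the canonical-JSON serialization, the order-insensitive provider list, and the function name `config_id` are all assumptions for illustration; the page only states that the id is a hash of provider mix, prompt variant, temperature, and seed.

```python
import hashlib
import json


def config_id(provider_mix, prompt_variant, temperature, seed):
    """Derive a stable id from the four inputs that define a configuration."""
    payload = {
        # Sorting makes the id independent of provider order (an assumption).
        "provider_mix": sorted(provider_mix),
        "prompt_variant": prompt_variant,
        "temperature": temperature,
        "seed": seed,
    }
    # Canonical JSON: sorted keys and fixed separators, so equal inputs
    # always serialize to the same bytes and therefore the same hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # 16 hex characters after the prefix, mirroring the ids shown on this page.
    return "cfg-" + digest[:16]
```

Under this scheme, changing any one input (for example the seed) yields a different id, while re-running with identical inputs yields the same row key.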

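The interesting number is not a severity count in isolation. It is the agreement column: for each configuration, how much of its high-severity output the other configurations in the field also produce. On this run each Agreement value matches the mean of that configuration's five pairwise cross-validation scores; gemini-only, for example, scores 0%, 0%, 0%, 100%, and 100% against the other five configurations, giving 40.0%. A configuration that draws blood that no other configuration can reproduce is not promoted.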

Bootstrap run — seeded offline driver, not live provider calls. This is the first tournament. No provider API key was present in the run environment, so rather than publish an all-partial leaderboard the runner fell back to a deterministic simulation. Severity is still computed by the real rubric, cost by the real provider price table, and the leaderboard is byte-identical across re-runs (the envelope hash is stable) — but the exact numbers below are simulated, not live provider output. The envelope records run_kind: bootstrap-offline-deterministic and driver: offline_deterministic_driver. Provider-backed runs replace this snapshot once API keys are provisioned in CI; the seasonal review treats that driver change as a drift event, not a clean continuation.

Leaderboard

Tournament version redteam-tournament-v1 · bench sha256 a8c1bb2d3157… · envelope env-01ed7510995f7e3b · bootstrap-offline-deterministic · run at 2026-05-14T06:45:29.325181+00:00.

Configuration	Severity-weighted score	High / Med / Low	Agreement	Cost	Latency	Reproducible
anthropic+openai `cfg-b048c444daf5c4b5` Two-provider rotation: the closed-weights frontier pair.	9.490	0 / 17 / 3	100.0%	$0.08	4.11 s	`env-01ed7510995f7e3b`
anthropic-only `cfg-62d1d2eff6260a3d` Production default. Single-provider Anthropic, low temperature.	4.966	0 / 9 / 1	100.0%	$0.05	2.33 s	`env-01ed7510995f7e3b`
openai-only `cfg-35f06609e41d9c7f` Single-provider OpenAI, low temperature. Monoculture probe.	4.523	0 / 8 / 2	100.0%	$0.03	1.79 s	`env-01ed7510995f7e3b`
all-providers · not reproducible `cfg-3598c98d63b9564c` Four-provider rotation: every frontier vendor plus an open-weights voice.	17.636	1 / 29 / 10	40.0%	$0.09	6.56 s	`env-01ed7510995f7e3b` · low agreement
all-providers/seeded-v2 · not reproducible `cfg-a4e1e04930f69d0b` Four-provider rotation, seeded prompt variant, higher temperature.	16.545	2 / 24 / 14	40.0%	$0.09	6.41 s	`env-01ed7510995f7e3b` · low agreement
gemini-only · not reproducible `cfg-6513d1cd2a9cdfdd` Single-provider Gemini, low temperature. Monoculture probe.	3.946	1 / 4 / 5	40.0%	$0.01	1.50 s	`env-01ed7510995f7e3b` · low agreement

Rows are sorted by reproducibility, then severity-weighted score, then agreement. A row marked "not reproducible" is not promoted: either at least one bench item returned a partial swarm result, or the configuration's high-severity objections fell below the agreement floor. The envelope hash links back to the workflow run that produced it.
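The sort order described above can be sketched as a three-part sort key. The `Row` type and its field names are hypothetical, and the direction of each key (reproducible first, then higher score, then higher agreement) is an assumption that happens to match the row order published on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    name: str
    reproducible: bool
    score: float      # severity-weighted score
    agreement: float  # fraction in [0, 1]


def sort_leaderboard(rows):
    # False sorts before True, so "not r.reproducible" puts reproducible
    # rows first; negated values give descending score and agreement.
    return sorted(rows, key=lambda r: (not r.reproducible, -r.score, -r.agreement))


# Values copied from the leaderboard, deliberately shuffled.
rows = [
    Row("gemini-only", False, 3.946, 0.40),
    Row("all-providers", False, 17.636, 0.40),
    Row("openai-only", True, 4.523, 1.00),
    Row("anthropic+openai", True, 9.490, 1.00),
    Row("all-providers/seeded-v2", False, 16.545, 0.40),
    Row("anthropic-only", True, 4.966, 1.00),
]
order = [r.name for r in sort_leaderboard(rows)]
```

Because reproducibility is the first key, all-providers sits below openai-only despite a score almost four times higher.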

The cost column is load-bearing. A configuration that wins on severity at a multiple of another's cost is shown here with that multiple — it is not allowed to look free. Read severity-weighted score, agreement, and cost jointly; the leaderboard deliberately does not collapse them into a single ranking number.

Cross-validation

Each cell reads as: given configuration A's high-severity objections, what fraction did configuration B also flag at high severity on the same bench items? A diagonal cell is omitted (a configuration trivially reproduces itself). A cell with no targets (A flagged nothing) scores 1.0 by convention, shown as 100.0% in the table. This is why a configuration with zero high-severity objections reports 100.0% against every other configuration.

A → B	Targets (A's high-severity)	Reproduced by B	Score
`cfg-62d1d2eff6260a3d` → `cfg-35f06609e41d9c7f`	0	0	100.0%
`cfg-62d1d2eff6260a3d` → `cfg-6513d1cd2a9cdfdd`	0	0	100.0%
`cfg-62d1d2eff6260a3d` → `cfg-b048c444daf5c4b5`	0	0	100.0%
`cfg-62d1d2eff6260a3d` → `cfg-3598c98d63b9564c`	0	0	100.0%
`cfg-62d1d2eff6260a3d` → `cfg-a4e1e04930f69d0b`	0	0	100.0%
`cfg-35f06609e41d9c7f` → `cfg-62d1d2eff6260a3d`	0	0	100.0%
`cfg-35f06609e41d9c7f` → `cfg-6513d1cd2a9cdfdd`	0	0	100.0%
`cfg-35f06609e41d9c7f` → `cfg-b048c444daf5c4b5`	0	0	100.0%
`cfg-35f06609e41d9c7f` → `cfg-3598c98d63b9564c`	0	0	100.0%
`cfg-35f06609e41d9c7f` → `cfg-a4e1e04930f69d0b`	0	0	100.0%
`cfg-6513d1cd2a9cdfdd` → `cfg-62d1d2eff6260a3d`	1	0	0.0%
`cfg-6513d1cd2a9cdfdd` → `cfg-35f06609e41d9c7f`	1	0	0.0%
`cfg-6513d1cd2a9cdfdd` → `cfg-b048c444daf5c4b5`	1	0	0.0%
`cfg-6513d1cd2a9cdfdd` → `cfg-3598c98d63b9564c`	1	1	100.0%
`cfg-6513d1cd2a9cdfdd` → `cfg-a4e1e04930f69d0b`	1	1	100.0%
`cfg-b048c444daf5c4b5` → `cfg-62d1d2eff6260a3d`	0	0	100.0%
`cfg-b048c444daf5c4b5` → `cfg-35f06609e41d9c7f`	0	0	100.0%
`cfg-b048c444daf5c4b5` → `cfg-6513d1cd2a9cdfdd`	0	0	100.0%
`cfg-b048c444daf5c4b5` → `cfg-3598c98d63b9564c`	0	0	100.0%
`cfg-b048c444daf5c4b5` → `cfg-a4e1e04930f69d0b`	0	0	100.0%
`cfg-3598c98d63b9564c` → `cfg-62d1d2eff6260a3d`	1	0	0.0%
`cfg-3598c98d63b9564c` → `cfg-35f06609e41d9c7f`	1	0	0.0%
`cfg-3598c98d63b9564c` → `cfg-6513d1cd2a9cdfdd`	1	1	100.0%
`cfg-3598c98d63b9564c` → `cfg-b048c444daf5c4b5`	1	0	0.0%
`cfg-3598c98d63b9564c` → `cfg-a4e1e04930f69d0b`	1	1	100.0%
`cfg-a4e1e04930f69d0b` → `cfg-62d1d2eff6260a3d`	1	0	0.0%
`cfg-a4e1e04930f69d0b` → `cfg-35f06609e41d9c7f`	1	0	0.0%
`cfg-a4e1e04930f69d0b` → `cfg-6513d1cd2a9cdfdd`	1	1	100.0%
`cfg-a4e1e04930f69d0b` → `cfg-b048c444daf5c4b5`	1	0	0.0%
`cfg-a4e1e04930f69d0b` → `cfg-3598c98d63b9564c`	1	1	100.0%
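A minimal sketch of the cell score, assuming each high-severity objection can be reduced to a hashable key such as a (bench item, attack angle) pair. The key shape, the function names, and the reading of Agreement as the mean over all other configurations are assumptions; the last of these is consistent with the numbers published on this page but is not confirmed by the harness source.

```python
def reproduction_score(targets, candidate):
    """Fraction of A's high-severity objections that B also flagged."""
    if not targets:
        return 1.0  # no targets: scores 1.0 by convention
    return len(targets & candidate) / len(targets)


def agreement(config, high_by_config):
    """Mean reproduction score of `config` against every other configuration."""
    others = [c for c in high_by_config if c != config]
    scores = [
        reproduction_score(high_by_config[config], high_by_config[c])
        for c in others
    ]
    return sum(scores) / len(scores)


# Placeholder objection key, not taken from the bench.
shared = ("item-x", "angle-y")
high_by_config = {
    "anthropic-only": set(),
    "openai-only": set(),
    "anthropic+openai": set(),
    "gemini-only": {shared},
    "all-providers": {shared},
    "all-providers/seeded-v2": {shared},
}
gemini_agreement = agreement("gemini-only", high_by_config)
anthropic_agreement = agreement("anthropic-only", high_by_config)
```

With one objection shared by three configurations, this layout gives gemini-only two full matches out of five comparisons, and gives any configuration with an empty target set a perfect score.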

Analysis — diversity vs monoculture

Do multi-provider configurations surface more high-severity objections per dollar, and a meaningfully different objection set, than single-provider monocultures?

Verdict. Mixed, and the split is the finding. On COVERAGE the firm's claim holds clearly: on average, the diverse swarms produce a far higher severity-weighted score (14.557 vs 4.479) and a wider set of distinct high-severity attack angles (0.667 vs 0.333) than the monocultures. On strict PER-DOLLAR high-severity yield the claim does NOT hold on this run (10.867 for multi-provider vs 23.673 for single-provider): per-dollar yield is dominated by token price, so the cheapest monoculture (gemini-only) wins almost tautologically, because one lucky high-severity draw on a $0.014 run beats everything. The honest reading of run #1: 'monoculture review is bad' is a coverage argument, not a per-dollar one, and the v1 bench is underpowered on the binary high-severity axis, since only two of ten items can structurally reach the high bracket. The leaderboard must be read on severity-weighted score, agreement, and cost jointly, never on a single ratio.

Metric	Single-provider	Multi-provider
Severity-weighted score (mean)	4.479	14.557
High-severity objections (mean)	0.333	1
High-severity objections per dollar (mean)	23.673	10.867
Severity-weighted score per dollar (mean)	179.887	164.257
Distinct high-severity attack angles (mean)	0.333	0.667
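The group means can be roughly reproduced from the rounded leaderboard values. The sketch below assumes the per-dollar rows are means of per-configuration ratios rather than ratios of group totals; the helper names are hypothetical, and small differences from the published figures are expected because the leaderboard shows rounded scores and costs.

```python
def mean(values):
    return sum(values) / len(values)


# Severity-weighted scores copied from the leaderboard (3 decimal places).
single_scores = [4.966, 4.523, 3.946]   # anthropic-only, openai-only, gemini-only
multi_scores = [9.490, 17.636, 16.545]  # anthropic+openai, all-providers, seeded-v2

single_mean = mean(single_scores)
multi_mean = mean(multi_scores)


def mean_per_dollar(counts, costs):
    # Mean of per-configuration ratios (an assumption about the harness).
    return mean([n / c for n, c in zip(counts, costs)])


# High-severity counts for the single-provider group. gemini-only uses the
# $0.014 cost quoted in the verdict; the other two use leaderboard costs,
# which do not affect the result because their counts are zero.
single_high_per_dollar = mean_per_dollar([0, 0, 1], [0.05, 0.03, 0.014])
```

Run on these inputs, the multi-provider mean comes out at about 14.557 and the single-provider mean at about 4.478, against the published 4.479. The single-provider high-severity yield comes out at about 23.8 per dollar, close to the published 23.673, and the whole figure comes from gemini-only's single objection.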

Objection-set divergence (anthropic-only vs all-providers): high-severity Jaccard 0. A Jaccard of 0 means the two configurations' high-severity objection sets do not overlap at all. Here the production default raised no high-severity objections, while the diverse swarm surfaced 1.
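For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below shows why an empty set compared with a one-element set gives 0. The objection key is a placeholder, and the value returned when both sets are empty is an assumption, since the page does not state the harness's convention for that case.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| for two sets of objection keys."""
    union = a | b
    if not union:
        return 1.0  # both empty: treated as identical (assumption)
    return len(a & b) / len(union)


anthropic_only_high = set()                   # no high-severity objections
all_providers_high = {("item-x", "angle-y")}  # placeholder key, not from the bench

divergence = jaccard(anthropic_only_high, all_providers_high)
```

The intersection is empty and the union has one element, so the index is 0 even though only one objection separates the two configurations.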

What this page is for

The bench is frozen. Selection criteria, license, and freezing date are documented in the bench card. New items ship as v2/; the firm does not get to retune the bench between runs to favour a specific configuration. Drift in this leaderboard across weekly runs is itself a signal: the method-drift detector treats a sudden change in the agreement column as worth a human look.
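A drift check on the agreement column could look roughly like the sketch below. It is not the firm's method-drift detector: the function name, the threshold of 0.25, and the example values are all invented for illustration.

```python
def agreement_drift(previous, current, threshold=0.25):
    """Return configurations whose agreement moved by more than `threshold`.

    `previous` and `current` map configuration id -> agreement in [0, 1].
    The threshold is a hypothetical value, not taken from the harness.
    """
    flagged = {}
    for cfg, now in current.items():
        before = previous.get(cfg)
        # Configurations new to this run have no baseline and are skipped.
        if before is not None and abs(now - before) > threshold:
            flagged[cfg] = (before, now)
    return flagged


# Invented values for two consecutive weekly runs.
last_week = {"cfg-example-a": 1.0, "cfg-example-b": 0.40}
this_week = {"cfg-example-a": 0.40, "cfg-example-b": 0.45, "cfg-example-c": 1.0}
needs_review = agreement_drift(last_week, this_week)
```

Only `cfg-example-a` is flagged here: its agreement fell by 0.6, while the 0.05 change on `cfg-example-b` stays under the threshold.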

Source: the harness lives at noosphere/peer_review/tournament.py; the recurring workflow at .github/workflows/redteam_tournament.yml.