Red-team tournament
Each week the firm runs the adversarial peer-review swarm against a frozen set of ten conclusions, rotating reviewer configurations — provider mix, prompt variant, temperature, seed. Each configuration's id is a content-addressable hash of those inputs, so the same row means the same thing across runs. The leaderboard below is the firm's honest answer to which configurations it trusts and why.
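A minimal sketch of how a content-addressable configuration id could be derived, so that identical inputs always produce the same row id. The page does not specify the actual hashing scheme: the canonical-JSON encoding, the choice of SHA-256, the 16-character truncation, and the function name `config_id` are all assumptions made for illustration.

```python
import hashlib
import json


def config_id(providers, prompt_variant, temperature, seed):
    """Derive a stable id from a reviewer configuration's inputs (hypothetical scheme)."""
    payload = {
        # Assumption: provider order does not matter, so the mix is sorted.
        "providers": sorted(providers),
        "prompt_variant": prompt_variant,
        "temperature": temperature,
        "seed": seed,
    }
    # Canonical JSON: sorted keys and fixed separators give a byte-stable encoding.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Assumption: ids look like "cfg-" plus 16 hex characters, as in the tables.
    return "cfg-" + digest[:16]
```

Under this scheme, changing any single input (for example the temperature) yields a different id, which is the property that lets a row mean the same thing across runs.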
The numbers that matter are not severity counts in isolation but those in the agreement column: for each configuration, the fraction of its high-severity objections that any other configuration in the field also produces. A configuration that draws blood no other configuration can reproduce is not promoted.
This snapshot was produced with run_kind: bootstrap-offline-deterministic and driver: offline_deterministic_driver. Provider-backed runs replace this snapshot once API keys are provisioned in CI; the seasonal review treats that driver change as a drift event, not a clean continuation.
Leaderboard
Tournament version redteam-tournament-v1 · bench sha256 a8c1bb2d3157… · envelope env-01ed7510995f7e3b · bootstrap-offline-deterministic · run at 2026-05-14T06:45:29.325181+00:00.
| Configuration | Severity-weighted score | High / Med / Low | Agreement | Cost | Latency | Reproducible |
|---|---|---|---|---|---|---|
| anthropic+openai · cfg-b048c444daf5c4b5 · Two-provider rotation: the closed-weights frontier pair. | 9.490 | 0 / 17 / 3 | 100.0% | $0.08 | 4.11 s | env-01ed7510995f7e3b |
| anthropic-only · cfg-62d1d2eff6260a3d · Production default. Single-provider Anthropic, low temperature. | 4.966 | 0 / 9 / 1 | 100.0% | $0.05 | 2.33 s | env-01ed7510995f7e3b |
| openai-only · cfg-35f06609e41d9c7f · Single-provider OpenAI, low temperature. Monoculture probe. | 4.523 | 0 / 8 / 2 | 100.0% | $0.03 | 1.79 s | env-01ed7510995f7e3b |
| all-providers (not reproducible) · cfg-3598c98d63b9564c · Four-provider rotation: every frontier vendor plus an open-weights voice. | 17.636 | 1 / 29 / 10 | 40.0% | $0.09 | 6.56 s | env-01ed7510995f7e3b · low agreement |
| all-providers/seeded-v2 (not reproducible) · cfg-a4e1e04930f69d0b · Four-provider rotation, seeded prompt variant, higher temperature. | 16.545 | 2 / 24 / 14 | 40.0% | $0.09 | 6.41 s | env-01ed7510995f7e3b · low agreement |
| gemini-only (not reproducible) · cfg-6513d1cd2a9cdfdd · Single-provider Gemini, low temperature. Monoculture probe. | 3.946 | 1 / 4 / 5 | 40.0% | $0.01 | 1.50 s | env-01ed7510995f7e3b · low agreement |
Rows are sorted by reproducibility, then severity-weighted score, then agreement. A row marked "not reproducible" is not promoted: either at least one bench item returned a partial swarm result, or the configuration's high-severity objections fell below the agreement floor. The envelope hash links back to the workflow run that produced it.
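The sort order and promotion rule described above can be sketched as follows. This is an illustration, not the harness's code: the `Row` type, the function names, and the descending direction for score and agreement are assumptions, and the agreement floor is left as a required parameter because the page does not state its value. The row values are copied from the leaderboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    name: str
    reproducible: bool
    score: float      # severity-weighted score
    agreement: float  # fraction in [0, 1]


def is_promotable(partial_items, agreement, floor):
    """A row is promotable only with no partial swarm results and agreement at or above the floor."""
    return partial_items == 0 and agreement >= floor


def leaderboard_order(rows):
    # Reproducible rows first, then higher score, then higher agreement (assumed descending).
    return sorted(rows, key=lambda r: (not r.reproducible, -r.score, -r.agreement))


ROWS = [
    Row("gemini-only", False, 3.946, 0.40),
    Row("all-providers", False, 17.636, 0.40),
    Row("openai-only", True, 4.523, 1.00),
    Row("anthropic+openai", True, 9.490, 1.00),
    Row("all-providers/seeded-v2", False, 16.545, 0.40),
    Row("anthropic-only", True, 4.966, 1.00),
]

ORDERED = [r.name for r in leaderboard_order(ROWS)]
```

Sorting this way reproduces the row order shown in the table, which is why the highest-scoring configuration (all-providers) still sits below the three reproducible rows.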
The cost column is load-bearing. A configuration that wins on severity at a multiple of another's cost is shown here with that multiple — it is not allowed to look free. Read severity-weighted score, agreement, and cost jointly; the leaderboard deliberately does not collapse them into a single ranking number.
Cross-validation
Each cell reads as: given configuration A's high-severity objections, what fraction did configuration B also flag at high severity on the same bench items? A diagonal cell is omitted (a configuration trivially reproduces itself). A cell with no targets — A flagged nothing — scores 1.0 by convention.
| A → B | Targets (A's high-severity) | Reproduced by B | Score |
|---|---|---|---|
| cfg-62d1d2eff6260a3d → cfg-35f06609e41d9c7f | 0 | 0 | 100.0% |
| cfg-62d1d2eff6260a3d → cfg-6513d1cd2a9cdfdd | 0 | 0 | 100.0% |
| cfg-62d1d2eff6260a3d → cfg-b048c444daf5c4b5 | 0 | 0 | 100.0% |
| cfg-62d1d2eff6260a3d → cfg-3598c98d63b9564c | 0 | 0 | 100.0% |
| cfg-62d1d2eff6260a3d → cfg-a4e1e04930f69d0b | 0 | 0 | 100.0% |
| cfg-35f06609e41d9c7f → cfg-62d1d2eff6260a3d | 0 | 0 | 100.0% |
| cfg-35f06609e41d9c7f → cfg-6513d1cd2a9cdfdd | 0 | 0 | 100.0% |
| cfg-35f06609e41d9c7f → cfg-b048c444daf5c4b5 | 0 | 0 | 100.0% |
| cfg-35f06609e41d9c7f → cfg-3598c98d63b9564c | 0 | 0 | 100.0% |
| cfg-35f06609e41d9c7f → cfg-a4e1e04930f69d0b | 0 | 0 | 100.0% |
| cfg-6513d1cd2a9cdfdd → cfg-62d1d2eff6260a3d | 1 | 0 | 0.0% |
| cfg-6513d1cd2a9cdfdd → cfg-35f06609e41d9c7f | 1 | 0 | 0.0% |
| cfg-6513d1cd2a9cdfdd → cfg-b048c444daf5c4b5 | 1 | 0 | 0.0% |
| cfg-6513d1cd2a9cdfdd → cfg-3598c98d63b9564c | 1 | 1 | 100.0% |
| cfg-6513d1cd2a9cdfdd → cfg-a4e1e04930f69d0b | 1 | 1 | 100.0% |
| cfg-b048c444daf5c4b5 → cfg-62d1d2eff6260a3d | 0 | 0 | 100.0% |
| cfg-b048c444daf5c4b5 → cfg-35f06609e41d9c7f | 0 | 0 | 100.0% |
| cfg-b048c444daf5c4b5 → cfg-6513d1cd2a9cdfdd | 0 | 0 | 100.0% |
| cfg-b048c444daf5c4b5 → cfg-3598c98d63b9564c | 0 | 0 | 100.0% |
| cfg-b048c444daf5c4b5 → cfg-a4e1e04930f69d0b | 0 | 0 | 100.0% |
| cfg-3598c98d63b9564c → cfg-62d1d2eff6260a3d | 1 | 0 | 0.0% |
| cfg-3598c98d63b9564c → cfg-35f06609e41d9c7f | 1 | 0 | 0.0% |
| cfg-3598c98d63b9564c → cfg-6513d1cd2a9cdfdd | 1 | 1 | 100.0% |
| cfg-3598c98d63b9564c → cfg-b048c444daf5c4b5 | 1 | 0 | 0.0% |
| cfg-3598c98d63b9564c → cfg-a4e1e04930f69d0b | 1 | 1 | 100.0% |
| cfg-a4e1e04930f69d0b → cfg-62d1d2eff6260a3d | 1 | 0 | 0.0% |
| cfg-a4e1e04930f69d0b → cfg-35f06609e41d9c7f | 1 | 0 | 0.0% |
| cfg-a4e1e04930f69d0b → cfg-6513d1cd2a9cdfdd | 1 | 1 | 100.0% |
| cfg-a4e1e04930f69d0b → cfg-b048c444daf5c4b5 | 1 | 0 | 0.0% |
| cfg-a4e1e04930f69d0b → cfg-3598c98d63b9564c | 1 | 1 | 100.0% |
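The cell definition given before the table can be sketched as a set operation. The representation of an objection as a `(bench item, objection key)` tuple, the placeholder values `"item-x"` and `"angle-y"`, and the function names are hypothetical; only the empty-target convention and the configuration ids come from this page.

```python
def reproduction_score(targets_a, high_b):
    """Fraction of A's high-severity objections that B also flagged at high severity."""
    if not targets_a:
        return 1.0  # convention stated on this page: no targets scores 1.0
    return len(targets_a & high_b) / len(targets_a)


def row_mean(config, high_by_config):
    """Mean of one configuration's off-diagonal cells (diagonal omitted)."""
    others = [c for c in high_by_config if c != config]
    scores = [reproduction_score(high_by_config[config], high_by_config[o]) for o in others]
    return sum(scores) / len(scores)


# Placeholder objection identity; the real keying is not documented here.
SHARED = ("item-x", "angle-y")

HIGH = {
    "cfg-62d1d2eff6260a3d": set(),     # anthropic-only
    "cfg-35f06609e41d9c7f": set(),     # openai-only
    "cfg-6513d1cd2a9cdfdd": {SHARED},  # gemini-only
    "cfg-b048c444daf5c4b5": set(),     # anthropic+openai
    "cfg-3598c98d63b9564c": {SHARED},  # all-providers
    "cfg-a4e1e04930f69d0b": {SHARED},  # all-providers/seeded-v2
}

GEMINI_MEAN = row_mean("cfg-6513d1cd2a9cdfdd", HIGH)
ANTHROPIC_MEAN = row_mean("cfg-62d1d2eff6260a3d", HIGH)
```

With these placeholder sets, the gemini-only row averages two reproduced cells out of five (0.4) and the anthropic-only row averages 1.0. Both figures match the leaderboard's agreement column on this snapshot, but whether the harness derives agreement as a row mean is an assumption, not something the page states.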
Analysis — diversity vs monoculture
Do multi-provider configurations surface more high-severity objections per dollar, and a meaningfully different objection set, than single-provider monocultures?
Verdict. Mixed — and the split is the finding. On COVERAGE the firm's claim holds clearly: the diverse swarms produce a far higher severity-weighted score (14.557 vs 4.479) and a wider distinct-attack-angle set than any monoculture. On strict PER-DOLLAR high-severity yield the claim does NOT hold on this run (10.867 vs 23.673): per-dollar is dominated by token price, so the cheapest monoculture (gemini-only) wins almost tautologically — one lucky high-severity draw on a $0.014 run beats everything. Run #1's honest reading: 'monoculture review is bad' is a coverage argument, not a per-dollar one, and the v1 bench is underpowered on the binary high-severity axis — only two of ten items can structurally reach the high bracket. The leaderboard must be read on severity-weighted score, agreement, and cost jointly, never on a single ratio.
| Metric | Single-provider | Multi-provider |
|---|---|---|
| Severity-weighted score (mean) | 4.479 | 14.557 |
| High-severity objections (mean) | 0.333 | 1.000 |
| High-severity objections per dollar (mean) | 23.673 | 10.867 |
| Severity-weighted score per dollar (mean) | 179.887 | 164.257 |
| Distinct high-severity attack angles (mean) | 0.333 | 0.667 |
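As a check on the first two rows of this table, the group means can be recomputed from the leaderboard's displayed values. This is a sketch under the assumption that each group mean is a plain arithmetic mean over its three configurations; the names `SINGLE`, `MULTI`, and `group_means` are illustrative. The per-dollar rows are not recomputed here because the leaderboard shows costs rounded to the cent.

```python
from statistics import mean

# (severity-weighted score, high-severity count), copied from the leaderboard.
SINGLE = {
    "anthropic-only": (4.966, 0),
    "openai-only": (4.523, 0),
    "gemini-only": (3.946, 1),
}
MULTI = {
    "anthropic+openai": (9.490, 0),
    "all-providers": (17.636, 1),
    "all-providers/seeded-v2": (16.545, 2),
}


def group_means(group):
    """Return (mean severity-weighted score, mean high-severity count) for a group."""
    scores = [score for score, _ in group.values()]
    highs = [high for _, high in group.values()]
    return mean(scores), mean(highs)


SINGLE_SCORE, SINGLE_HIGH = group_means(SINGLE)
MULTI_SCORE, MULTI_HIGH = group_means(MULTI)
```

The recomputed values agree with the table to within display rounding of the leaderboard scores.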
Objection-set divergence (anthropic-only vs all-providers): high-severity Jaccard 0. A Jaccard of 0 means the two configurations' high-severity objection sets do not overlap at all — the diverse swarm surfaced 1 high-severity objection the production default did not.
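The Jaccard figure above follows directly from the set definition: intersection size divided by union size. A minimal sketch, in which the objection tuple is a placeholder and the value returned for two empty sets is an assumed convention that this page does not specify:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(union)


# anthropic-only flagged no high-severity objections; all-providers flagged one.
anthropic_only_high = set()
all_providers_high = {("item-x", "angle-y")}  # placeholder objection identity

DIVERGENCE = jaccard(anthropic_only_high, all_providers_high)
```

Because the production default's high-severity set is empty, the intersection is empty and the score is 0 regardless of which objection the diverse swarm raised.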
What this page is for
The bench is frozen. Selection criteria, license, and freezing date are documented in the bench card. Adding items ships as v2/; the firm does not get to retune the bench between runs to favour a specific configuration. Drift in this leaderboard across weekly runs is itself a signal — the method-drift detector treats a sudden change in the agreement column as worth a human look.
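One way a check on the agreement column could work is to compare each configuration's agreement between two consecutive weekly runs and flag large moves for review. This is a sketch only: the function name, the 0.25 threshold, and the second week's values are invented for illustration, and the firm's actual method-drift detector is not described on this page.

```python
def agreement_drift(previous, current, threshold=0.25):
    """Return (config id, change) pairs whose agreement moved by at least `threshold`.

    `previous` and `current` map configuration ids to agreement fractions in [0, 1].
    The default threshold is a hypothetical value, not the firm's setting.
    """
    flagged = []
    # Only configurations present in both runs can be compared.
    for cfg_id in sorted(previous.keys() & current.keys()):
        delta = current[cfg_id] - previous[cfg_id]
        if abs(delta) >= threshold:
            flagged.append((cfg_id, delta))
    return flagged


# Agreement values from this snapshot for two configurations.
week_1 = {"cfg-62d1d2eff6260a3d": 1.0, "cfg-6513d1cd2a9cdfdd": 0.4}
# Hypothetical following week, made up to exercise the check.
week_2 = {"cfg-62d1d2eff6260a3d": 1.0, "cfg-6513d1cd2a9cdfdd": 0.8}

FLAGGED = agreement_drift(week_1, week_2)
```

Keying the comparison on the content-addressable id means a changed configuration shows up as a new row and is never silently compared against its predecessor.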
Source: the harness lives at noosphere/peer_review/tournament.py; the recurring workflow at .github/workflows/redteam_tournament.yml.