Calibration scorecard

How Theseus's published forecasts have actually fared. Lower Brier is better; a well-calibrated firm tracks the dashed diagonal in the reliability diagram below.

All-time Brier score

n = 0 — too few resolutions for a stable score

A Brier score computed over fewer than 25 resolved forecasts is dominated by noise. We publish the count, not a flattering point estimate, until the resolution set is large enough to defend a number. The reliability diagram, comparators and audit below still render; they just carry the same caveat.
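The suppression rule can be sketched in a few lines. This is a minimal illustration, not the site's code: the function names and the (probability, outcome) pair layout are assumptions, and only the 25-resolution floor comes from this page.

```python
from typing import Optional, Sequence, Tuple

MIN_RESOLVED = 25  # floor stated on this page


def brier_score(forecasts: Sequence[Tuple[float, int]]) -> Optional[float]:
    """Mean squared error between probability and binary outcome; None if empty."""
    if not forecasts:
        return None
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)


def headline(forecasts: Sequence[Tuple[float, int]]) -> str:
    """Show only the count until the resolved set reaches the floor (hypothetical helper)."""
    n = len(forecasts)
    if n < MIN_RESOLVED:
        return f"n = {n}: too few resolutions for a stable score"
    return f"Brier {brier_score(forecasts):.3f} (n = {n})"
```

With an empty resolved set, headline([]) returns the count message and no point estimate is ever formatted.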

n_resolved=0 · withdrawn rate=— · source=live · schema_v=1

No cherry-picking

The metrics on this page cover every resolved forecast Theseus has published, not a curated subset. The SHA-256 hash below pins the exact resolution set used: re-derive it from the public manifest to verify nothing was dropped.
resolution_set_hash: 4f53cda18c2baa0c0354bb5f9a3ecbe5ed12ab4d8e11ba873c2f11161202b945
pinned Jun 10, 2026
published_at: 2026-06-10T10:50:48.824Z · source: live · schema_v=1
alternative-method analysis available privately

all forecasts: n=0 resolved

No slices available yet — chips populate as forecasts resolve across domains, methods and venues.

Reliability diagram

No resolved forecasts yet — the reliability diagram will populate as predictions resolve.

Slope

OLS slope of outcome ~ probability. A perfectly calibrated firm has slope ≈ 1.0; below 1 means under-discrimination, above 1 means over-discrimination. Bootstrap CI at the published level.

slope = —
ci = [—, —]
n = 0
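As a rough sketch of how such a slope and interval could be computed: the code below fits OLS of outcome on probability and takes a percentile bootstrap over resampled forecasts. The 95% level, the resample count, the seed and both function names are assumptions; the page does not state which level or bootstrap variant it uses.

```python
import random
from typing import List, Optional, Tuple

Pair = Tuple[float, int]  # (forecast probability, binary outcome)


def ols_slope(pairs: List[Pair]) -> Optional[float]:
    """Slope of outcome ~ probability; None when it is undefined."""
    n = len(pairs)
    if n < 2:
        return None
    mean_p = sum(p for p, _ in pairs) / n
    mean_o = sum(o for _, o in pairs) / n
    sxx = sum((p - mean_p) ** 2 for p, _ in pairs)
    if sxx == 0:  # every forecast identical: no slope to estimate
        return None
    sxy = sum((p - mean_p) * (o - mean_o) for p, o in pairs)
    return sxy / sxx


def bootstrap_ci(pairs: List[Pair], level: float = 0.95,
                 n_boot: int = 2000, seed: int = 0) -> Optional[Tuple[float, float]]:
    """Percentile bootstrap interval for the slope (assumed method and level)."""
    if len(pairs) < 2:
        return None
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]  # resample with replacement
        s = ols_slope(sample)
        if s is not None:  # skip degenerate resamples
            slopes.append(s)
    if not slopes:
        return None
    slopes.sort()
    alpha = (1 - level) / 2
    last = len(slopes) - 1
    return slopes[int(alpha * last)], slopes[int((1 - alpha) * last)]
```

With n = 0 both functions return None, which corresponds to the blank slope and interval shown above.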

What this means

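A Brier score on its own is not interpretable. The number only earns meaning against what is easy to beat: a permanent-50% forecaster scores 0.25, a forecaster drawing probabilities uniformly at random scores 1/3 in expectation, and a forecaster who just repeats the base rate scores p̄·(1 − p̄), which cannot be computed until forecasts resolve. Read the firm's headline against these, not in isolation.

Random guessing: Brier 0.333
Uniform-random probabilities. The expected score is 1/3 whatever the outcomes, because E[(U − o)²] = 1/3 for U uniform on [0, 1] and o in {0, 1}. Guessing at random is worse than never committing.
Always forecast 50%: Brier 0.250
0.50 on every market. Exactly 0.25 on any outcome set: refusing to commit is not skill, just the floor every forecaster should beat.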
Climatology (historical base rate): Brier —
Forecast the historical YES base rate on every market. Needs resolved outcomes to compute.
Theseus (this scorecard): Brier —
Withheld — n = 0 is below the 25-resolution floor for a stable score. The comparators stand on their own; the firm's number does not yet.

Lower is better. Bars scaled to a 0.30 maximum. Climatology Brier = p̄·(1 − p̄) over the resolved set; a per-market prior comparator needs the market-price snapshot at publish time, which is not in the public manifest.
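The climatology formula can be checked directly: forecasting the base rate p̄ on every market gives a mean squared error of p̄·(1 − p̄). The sketch below computes it both ways; the function names are illustrative and not taken from the site.

```python
from typing import Optional, Sequence


def constant_forecast_brier(p: float, outcomes: Sequence[int]) -> Optional[float]:
    """Brier score of forecasting the same probability p on every market."""
    if not outcomes:
        return None
    return sum((p - o) ** 2 for o in outcomes) / len(outcomes)


def climatology_brier(outcomes: Sequence[int]) -> Optional[float]:
    """Closed form p_bar * (1 - p_bar); None until something has resolved."""
    if not outcomes:
        return None
    base_rate = sum(outcomes) / len(outcomes)
    return base_rate * (1 - base_rate)
```

For example, with one YES in four resolved markets the base rate is 0.25 and both routes give 0.1875, while a constant 0.50 forecast on the same markets gives 0.25.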

Aggregate Brier — rolling windows

all-time

—

n = 0 · log loss = —

30d

—

n = 0 · log loss = —

90d

—

n = 0 · log loss = —

365d

—

n = 0 · log loss = —
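One way the four windows could be computed is sketched below. It assumes windows are keyed on the resolution timestamp and that probabilities are clipped before taking logs; neither detail is stated on this page, and the row layout and names are hypothetical.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Tuple

Row = Tuple[float, int, datetime]  # (probability, outcome, resolved_at)

WINDOWS: Dict[str, Optional[int]] = {"all-time": None, "30d": 30, "90d": 90, "365d": 365}
EPS = 1e-15  # assumed clip so log(0) never occurs


def log_loss_term(p: float, o: int) -> float:
    """Negative log-likelihood of one binary outcome under forecast p."""
    p = min(max(p, EPS), 1 - EPS)
    return -math.log(p if o == 1 else 1 - p)


def window_metrics(rows: List[Row], now: datetime) -> Dict[str, dict]:
    """Brier and log loss per window; both are None when the window is empty."""
    out = {}
    for label, days in WINDOWS.items():
        kept = [(p, o) for p, o, t in rows
                if days is None or t >= now - timedelta(days=days)]
        n = len(kept)
        out[label] = {
            "n": n,
            "brier": sum((p - o) ** 2 for p, o in kept) / n if n else None,
            "log_loss": sum(log_loss_term(p, o) for p, o in kept) / n if n else None,
        }
    return out
```

An empty input yields n = 0 with no Brier or log loss in every window, matching the blanks shown above.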

Honesty constraints

Headline discipline: the hero Brier is suppressed below 25 resolved forecasts — we show the count, not a point estimate the sample cannot support.
Resolved: 0 forecasts (binary YES/NO) — full weight in the metrics above, and every one is listed in the resolution audit.
Stale, unresolved: 0 forecasts published more than 14 days ago and still pending. Flagged here, not silently dropped.
Withdrawn / revoked: 0 forecasts. Excluded from calibration metrics, but counted toward the published withdrawn rate of —. Pulling a bad call back is not free.
Sparse bins: bins with fewer than 5 resolved items are drawn grey and open with their n labelled. We refuse to draw a CI we cannot defend.
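The sparse-bin rule can be illustrated with a small binning routine. Ten equal-width bins and the field names are assumptions made for the sketch; only the threshold of 5 resolved items comes from the list above.

```python
from typing import Dict, List, Tuple

MIN_BIN_N = 5  # threshold stated in the constraints above


def reliability_bins(pairs: List[Tuple[float, int]], n_bins: int = 10) -> List[Dict]:
    """Group (probability, outcome) pairs into equal-width bins and flag sparse ones."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, o in pairs:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, o))
    rows = []
    for i, items in enumerate(bins):
        n = len(items)
        if n == 0:
            continue  # nothing to draw for an empty bin
        rows.append({
            "bin": (i / n_bins, (i + 1) / n_bins),
            "n": n,
            "mean_forecast": sum(p for p, _ in items) / n,
            "observed_rate": sum(o for _, o in items) / n,
            "sparse": n < MIN_BIN_N,  # drawn grey and open, no CI
        })
    return rows
```

A renderer would then skip the interval for any row where sparse is true and label the point with its n.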

Live fallback: the nightly calibration manifest is not on disk. The binned reliability curve, per-method attribution and continuous-market metric are not available until the scheduler runs; the headline Brier, its bootstrap CI and the resolution audit below are computed live from the database.

Best and worst calls

Top decile (lowest Brier)

No entries match this slice.

Bottom decile (highest Brier)

No entries match this slice.

Resolution audit

Every resolved forecast in the numerator above, one click from its underlying record. This is what makes the scorecard non-fakeable: the headline Brier is just the mean of the squared errors listed here (currently 0 of them), and each is independently checkable.

No resolved forecasts yet — the audit list populates as predictions resolve.

How to verify this hash

The resolution_set_hash is a SHA-256 over the canonicalized set of resolved forecasts — every (prediction_id, probability, outcome, resolved_at, brier) tuple, sorted and rounded to a fixed precision. It pins exactly which forecasts the headline Brier averages over. If the firm quietly dropped a bad call, the hash would change.

Auditors: /api/public/calibration/manifest returns the full data backing this page, including the resolution index. Re-derive the hash from the manifest's resolution set and compare: it must match the value above (4f53cda18c2baa0c0354bb5f9a3ecbe5ed12ab4d8e11ba873c2f11161202b945). A mismatch is a bug — please file it.
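The shape of the check can be sketched as follows. This is an illustration only: the tuple fields come from the description above, but the serialization (compact JSON), the 6-decimal rounding, the sort key and the dict-shaped rows are all assumptions, so this sketch is not expected to reproduce the published value unless the real canonicalization happens to match.

```python
import hashlib
import json
from typing import Dict, Iterable

PRECISION = 6  # assumed; the page says only "a fixed precision"


def resolution_set_hash(rows: Iterable[Dict]) -> str:
    """SHA-256 over a sorted, rounded view of the resolved forecasts (assumed format)."""
    canonical = sorted(
        (
            str(r["prediction_id"]),
            round(float(r["probability"]), PRECISION),
            int(r["outcome"]),
            str(r["resolved_at"]),
            round(float(r["brier"]), PRECISION),
        )
        for r in rows
    )
    # Compact separators keep the byte stream independent of whitespace choices.
    payload = json.dumps(canonical, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the rows are sorted before hashing, reordering the manifest leaves the digest unchanged, while dropping or altering any single forecast changes it.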
