TheseusCodex

Quintin Hypothesis Benchmark

The hypothesis is empirical or it is not. The firm has chosen empirical. This page is the leaderboard the firm itself can lose on. If a one-line cosine baseline beats the firm's contradiction-geometry probe on a slice, that is shown here in plain sight.

The Quintin Hypothesis predicts that the difference vector between embeddings of a premise and its logical contradiction is sparse (concentrated in few dimensions, per Hoyer sparsity), while the difference between a premise and a coherent continuation is dense. The benchmark tests this prediction against 1936 frozen items spanning physics, economics, and ethics.
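The sparsity measure named above can be illustrated with a minimal sketch. It uses the standard Hoyer definition, (sqrt(n) - L1/L2) / (sqrt(n) - 1), which is 1 for a vector with a single non-zero entry and 0 for a vector whose entries all have equal magnitude. The function names and the four-dimensional toy vectors are assumptions made for illustration; they are not taken from the firm's probe.

```python
import math


def hoyer_sparsity(vec):
    """Hoyer sparsity in [0, 1]: 1 for a one-hot vector, 0 for equal magnitudes."""
    n = len(vec)
    if n < 2:
        raise ValueError("Hoyer sparsity needs at least two dimensions")
    l1 = sum(abs(x) for x in vec)
    l2 = math.sqrt(sum(x * x for x in vec))
    if l2 == 0.0:
        raise ValueError("Hoyer sparsity is undefined for the zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


def difference_sparsity(premise_emb, other_emb):
    """Hoyer sparsity of the element-wise difference between two embeddings."""
    diff = [a - b for a, b in zip(premise_emb, other_emb)]
    return hoyer_sparsity(diff)


# Toy vectors (hypothetical, not real embeddings).
premise = [1.0, 1.0, 1.0, 1.0]
contradiction = [1.0, 1.0, 1.0, -3.0]  # differs from the premise in one dimension
continuation = [2.0, 0.0, 2.0, 0.0]    # differs from the premise in every dimension

print(difference_sparsity(premise, contradiction))  # sparse difference
print(difference_sparsity(premise, continuation))   # dense difference
```

On these toy vectors the first difference is maximally sparse and the second is maximally dense, which is the pattern the hypothesis predicts for real embeddings.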

Leaderboard

| Runner | n (of N) | Accuracy (3-way) | AUROC (contradicting vs coherent) | ECE | Latency p50 (ms) | Status |
|---|---|---|---|---|---|---|
| random | 1936 of 1936 | 0.3352 | 0.4964 | 0.2537 | 0.0022 | ok |
| cosine (best acc) | 1936 of 1936 | 0.3673 | 0.3987 | 0.3964 | 0.0035 | ok |
| contradiction_geometry (best AUROC) | 1936 of 1936 | 0.2877 | 0.5858 | 0.2752 | 0.0054 | ok |

Honest finding: on this run a non-firm baseline (cosine) currently leads the 3-way accuracy column.
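For readers unfamiliar with the AUROC column, the following is a minimal sketch of the pairwise (Mann-Whitney) formulation, restricted to contradicting and coherent items as the column header indicates. How each runner turns an item into a score is not described on this page, so the scores below are invented toy values and the function name is hypothetical.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Here "positive" stands for contradicting
    items and "negative" for coherent items (an assumed convention).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores (hypothetical): higher is meant to indicate "contradicting".
contradicting_scores = [0.9, 0.8, 0.4]
coherent_scores = [0.7, 0.3, 0.4]

value = auroc(contradicting_scores, coherent_scores)
flipped = auroc([-s for s in contradicting_scores], [-s for s in coherent_scores])
print(round(value, 4), round(flipped, 4))
```

A value of 0.5 corresponds to chance ranking. A value below 0.5, such as the cosine row's 0.3987, means the score ranks the two classes in reverse on average; negating the score yields 1 - AUROC, as the `flipped` value shows.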

Per-domain breakdown

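The hypothesis predicts that the geometric structure should be roughly domain-independent, so a runner's scores can be compared across the three domains below: similar values support that prediction, and large gaps count against it.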

Domain: economics

| Runner | n | Accuracy | AUROC |
|---|---|---|---|
| random | 574 | 0.2944 | 0.5306 |
| cosine | 574 | 0.4094 | 0.5044 |
| contradiction_geometry | 574 | 0.2596 | 0.5774 |

Domain: ethics

| Runner | n | Accuracy | AUROC |
|---|---|---|---|
| random | 344 | 0.3343 | 0.5286 |
| cosine | 344 | 0.1890 | 0.5124 |
| contradiction_geometry | 344 | 0.3285 | 0.4667 |

Domain: physics

| Runner | n | Accuracy | AUROC |
|---|---|---|---|
| random | 1018 | 0.3585 | 0.4666 |
| cosine | 1018 | 0.4037 | 0.3607 |
| contradiction_geometry | 1018 | 0.2898 | 0.6232 |
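As a consistency check, the per-domain accuracies above can be pooled back into the overall 3-way accuracy by weighting each domain by its item count. The sketch below copies the published numbers; recovering integer correct counts with `round(n * accuracy)` is an assumption about how the four-decimal values were produced. AUROC is left out because it does not pool as a simple weighted average.

```python
# (n, accuracy) per domain, copied from the per-domain tables above.
PER_DOMAIN = {
    "random": {
        "economics": (574, 0.2944), "ethics": (344, 0.3343), "physics": (1018, 0.3585),
    },
    "cosine": {
        "economics": (574, 0.4094), "ethics": (344, 0.1890), "physics": (1018, 0.4037),
    },
    "contradiction_geometry": {
        "economics": (574, 0.2596), "ethics": (344, 0.3285), "physics": (1018, 0.2898),
    },
}


def pooled_accuracy(rows):
    """Sample-weighted accuracy across domains.

    Correct counts are recovered by rounding n * accuracy to the nearest
    integer (an assumption; the page publishes only rounded rates).
    """
    total = sum(n for n, _ in rows.values())
    correct = sum(round(n * acc) for n, acc in rows.values())
    return correct / total


for runner, rows in PER_DOMAIN.items():
    print(runner, round(pooled_accuracy(rows), 4))
```

The pooled values come out at 0.3352, 0.3673, and 0.2877, matching the Accuracy (3-way) column of the leaderboard, and the domain counts sum to 574 + 344 + 1018 = 1936.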

Confusion matrix — contradiction_geometry

| gold \ predicted | coherent | contradicting | orthogonal |
|---|---|---|---|
| coherent | 0 | 634 | 0 |
| contradicting | 0 | 557 | 0 |
| orthogonal | 0 | 745 | 0 |
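The matrix can be read numerically with a short sketch. The counts are copied from the table above; the helper names are hypothetical.

```python
LABELS = ("coherent", "contradicting", "orthogonal")

# Rows are gold labels, columns are predicted labels.
CONFUSION = {
    "coherent":      {"coherent": 0, "contradicting": 634, "orthogonal": 0},
    "contradicting": {"coherent": 0, "contradicting": 557, "orthogonal": 0},
    "orthogonal":    {"coherent": 0, "contradicting": 745, "orthogonal": 0},
}


def accuracy(matrix):
    """Share of items on the diagonal (gold label equals predicted label)."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in matrix)
    return correct / total


def predicted_counts(matrix):
    """Column sums: how often each label was predicted."""
    return {p: sum(matrix[g][p] for g in matrix) for p in LABELS}


def recall(matrix, label):
    """Share of gold items of one class that were predicted as that class."""
    row_total = sum(matrix[label].values())
    return matrix[label][label] / row_total if row_total else 0.0


print(round(accuracy(CONFUSION), 4))
print(predicted_counts(CONFUSION))
print({label: recall(CONFUSION, label) for label in LABELS})
```

Every item in this run was predicted as contradicting, so the 3-way accuracy reduces to the share of contradicting items, 557 / 1936 ≈ 0.2877, which agrees with the leaderboard row.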

Dataset card & artifacts

The frozen v1 dataset lives at benchmarks/quintin_hypothesis/v1/dataset.jsonl in the firm's monorepo. It is firm-authored under the firm-internal-public waiver: free reuse, no warranty, no silent inclusion of copyrighted material.

Items are labelled coherent, contradicting, or orthogonal. They cover three domains: physics, economics, ethics. Templates parameterise numeric values and named entities, and are de-duplicated at curation time using both 5-gram Jaccard and embedding-cosine filters.
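The 5-gram Jaccard filter mentioned above could look roughly like the sketch below. Several details are assumptions, since the page does not specify them: the n-grams are taken over lowercased whitespace tokens (not characters), the threshold of 0.5 is arbitrary, and the first of two near-duplicates is the one kept. The embedding-cosine filter is not shown.

```python
def word_ngrams(text, n=5):
    """Set of word-level n-grams over lowercased whitespace tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def dedupe(texts, threshold=0.5, n=5):
    """Keep a text only if it is below the threshold against every kept text."""
    kept, kept_grams = [], []
    for text in texts:
        grams = word_ngrams(text, n)
        if all(jaccard(grams, other) < threshold for other in kept_grams):
            kept.append(text)
            kept_grams.append(grams)
    return kept


# Hypothetical items; the first two come from the same template with a changed value.
a = "a ball dropped from ten meters hits the ground after about one second"
b = "a ball dropped from ten meters hits the ground after about two seconds"
c = "raising the interest rate tends to reduce consumer borrowing over time"

print(round(jaccard(word_ngrams(a), word_ngrams(b)), 3))
print(dedupe([a, b, c]))
```

Because templates vary only numeric values and named entities, two instances of one template share most of their 5-grams, which is why a surface-level filter like this catches them before the embedding-based filter is applied.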

Versioning & drift

v1 is frozen on publication. Improvements ship as v2/ with a separate dataset and a separate leaderboard. The CI workflow re-runs all three baselines nightly against this frozen dataset and uploads the JSON here. Drift on this benchmark is the firm losing its own thesis — a louder alert than method-level drift.