Quintin Hypothesis Benchmark

The hypothesis is empirical or it is not. The firm has chosen empirical. This page is the leaderboard the firm itself can lose on. If a one-line cosine baseline beats the firm's contradiction-geometry probe on a slice, that is shown here in plain sight.

The Quintin Hypothesis predicts that the difference vector between embeddings of a premise and its logical contradiction is sparse (concentrated in few dimensions, per Hoyer sparsity), while the difference between a premise and a coherent continuation is dense. The benchmark tests this prediction against 1936 frozen items spanning physics, economics, and ethics.
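To make the prediction concrete, here is a minimal sketch of the Hoyer sparsity measure applied to difference vectors. It uses the standard Hoyer definition, (sqrt(n) - L1/L2) / (sqrt(n) - 1). The helper names and the 4-dimensional toy embeddings are invented for illustration; they are not taken from the benchmark or from the firm's probe.

```python
import math

def hoyer_sparsity(v):
    """Hoyer sparsity in [0, 1]: 0 when all magnitudes are equal, 1 for a one-hot vector."""
    n = len(v)
    if n < 2:
        raise ValueError("need at least two dimensions")
    l1 = sum(abs(x) for x in v)
    l2 = math.sqrt(sum(x * x for x in v))
    if l2 == 0.0:
        raise ValueError("undefined for the zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

def difference(a, b):
    """Element-wise difference between two embeddings of equal length."""
    return [x - y for x, y in zip(a, b)]

# Toy embeddings (hypothetical values, for illustration only).
premise = [0.5, 0.5, 0.5, 0.5]
contradiction = [0.5, 0.5, 0.5, -0.5]  # differs from the premise in one dimension
continuation = [0.3, 0.7, 0.3, 0.7]    # differs a little in every dimension

sparse_score = hoyer_sparsity(difference(premise, contradiction))
dense_score = hoyer_sparsity(difference(premise, continuation))
```

In this toy case the contradiction difference scores at the sparse end of the scale and the continuation difference at the dense end, which is the pattern the hypothesis predicts for real embeddings.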

Leaderboard

Runner	n (of N)	Accuracy (3-way)	AUROC (contradicting vs coherent)	ECE	Latency p50 (ms)	Status
`random`	1936 of 1936	0.3352	0.4964	0.2537	0.0022	ok
`cosine` (best accuracy)	1936 of 1936	0.3673	0.3987	0.3964	0.0035	ok
`contradiction_geometry` (best AUROC)	1936 of 1936	0.2877	0.5858	0.2752	0.0054	ok

Honest finding: on this run a non-firm baseline (cosine) currently leads the 3-way accuracy column.

Per-domain breakdown

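The hypothesis predicts that the geometric structure should be roughly domain-independent, so a runner's scores should look similar across the three domains below. Large gaps between domains count against that prediction; small gaps count for it.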

Domain: `economics`

Runner	n	Accuracy	AUROC
`random`	574	0.2944	0.5306
`cosine`	574	0.4094	0.5044
`contradiction_geometry`	574	0.2596	0.5774

Domain: `ethics`

Runner	n	Accuracy	AUROC
`random`	344	0.3343	0.5286
`cosine`	344	0.1890	0.5124
`contradiction_geometry`	344	0.3285	0.4667

Domain: `physics`

Runner	n	Accuracy	AUROC
`random`	1018	0.3585	0.4666
`cosine`	1018	0.4037	0.3607
`contradiction_geometry`	1018	0.2898	0.6232
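As a consistency check, the per-domain accuracies above can be pooled back into the 3-way accuracy reported on the leaderboard. The sketch below recovers integer correct counts from the rounded per-domain figures and divides by the total item count. The function name is an illustrative choice, and the rounding step assumes each published accuracy is a count divided by n, rounded to four decimals.

```python
# Per-domain (n, accuracy) pairs copied from the economics, ethics and physics tables.
per_domain = {
    "random": [(574, 0.2944), (344, 0.3343), (1018, 0.3585)],
    "cosine": [(574, 0.4094), (344, 0.1890), (1018, 0.4037)],
    "contradiction_geometry": [(574, 0.2596), (344, 0.3285), (1018, 0.2898)],
}

def pooled_accuracy(rows):
    """Return (total items, correct items, pooled accuracy rounded to 4 decimals)."""
    total = sum(n for n, _ in rows)
    # Recover whole-number correct counts from the rounded accuracies.
    correct = sum(round(n * acc) for n, acc in rows)
    return total, correct, round(correct / total, 4)

pooled = {runner: pooled_accuracy(rows) for runner, rows in per_domain.items()}
for runner, (total, correct, acc) in pooled.items():
    print(f"{runner}: {correct} of {total} correct, accuracy {acc}")
```

The pooled values agree with the leaderboard's accuracy column (0.3352, 0.3673, 0.2877). AUROC cannot be pooled this way, because it depends on how scores rank across the whole item set.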

Confusion matrix — `contradiction_geometry`

gold \ predicted	`coherent`	`contradicting`	`orthogonal`
`coherent`	0	634	0
`contradicting`	0	557	0
`orthogonal`	0	745	0
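The matrix above can be reduced to accuracy and per-class recall with a few lines of arithmetic. The following sketch copies the counts from the table; the variable names are illustrative.

```python
labels = ["coherent", "contradicting", "orthogonal"]

# Rows are gold labels, columns are predicted labels, both in the order of `labels`.
matrix = [
    [0, 634, 0],
    [0, 557, 0],
    [0, 745, 0],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall per gold class: diagonal entry divided by the row sum.
recall = {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(len(labels))}

# How often each label was predicted: column sums.
predicted_counts = {
    labels[j]: sum(row[j] for row in matrix) for j in range(len(labels))
}

print(f"accuracy = {correct}/{total} = {accuracy:.4f}")
print("recall:", recall)
print("predicted counts:", predicted_counts)
```

On this run every item is predicted `contradicting`, so the probe's 3-way accuracy of 0.2877 is simply the share of gold `contradicting` items (557 of 1936).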

Dataset card & artifacts

The frozen v1 dataset lives at `benchmarks/quintin_hypothesis/v1/dataset.jsonl` in the firm's monorepo. It is firm-authored under the firm-internal-public waiver: free reuse, no warranty, no silent inclusion of copyrighted material.

Items are labelled coherent, contradicting, or orthogonal. They cover three domains: physics, economics, ethics. Templates parameterise numeric values and named entities, and are de-duplicated at curation time using both 5-gram Jaccard and embedding-cosine filters.
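The lexical half of the de-duplication step can be sketched as follows. This is an assumption-laden illustration, not the firm's curation code: it treats "5-gram" as word-level 5-grams, uses a hypothetical similarity threshold of 0.8, and omits the embedding-cosine filter entirely. The function names are invented for the example.

```python
def word_ngrams(text, n=5):
    """Set of word-level n-grams; short texts fall back to a single tuple of all tokens."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedupe(texts, threshold=0.8):
    """Keep a text only if it is below the threshold against every text kept so far."""
    kept, kept_grams = [], []
    for text in texts:
        grams = word_ngrams(text)
        if all(jaccard(grams, g) < threshold for g in kept_grams):
            kept.append(text)
            kept_grams.append(grams)
    return kept

# Hypothetical items, written for this example only.
items = [
    "The ball falls at 9.8 metres per second squared near Earth",
    "The ball falls at 9.8 metres per second squared near Earth",
    "Raising interest rates tends to reduce consumer borrowing over time",
]
unique_items = dedupe(items)
```

A greedy pass like this is quadratic in the number of kept items, which is workable at the scale of 1936 items; templated items that differ only in a numeric value may still need the embedding-based filter to be caught.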

Versioning & drift

v1 is frozen on publication. Improvements ship as v2/ with a separate dataset and a separate leaderboard. The CI workflow re-runs all three baselines nightly against this frozen dataset and uploads the JSON here. Drift on this benchmark means the firm is losing its own thesis, so it raises a louder alert than method-level drift.
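A drift check over the nightly upload might look like the sketch below. The JSON shape (runner name mapped to metric name and value), the tolerance, and the function name are all assumptions made for illustration; the actual CI workflow and its output schema are not described on this page. The reference values are the accuracy and AUROC figures from the leaderboard.

```python
import json

# Reference metrics copied from the leaderboard (accuracy and AUROC only).
REFERENCE = {
    "random": {"accuracy": 0.3352, "auroc": 0.4964},
    "cosine": {"accuracy": 0.3673, "auroc": 0.3987},
    "contradiction_geometry": {"accuracy": 0.2877, "auroc": 0.5858},
}

def find_drift(nightly, reference=REFERENCE, tol=1e-4):
    """List (runner, metric, expected, actual) for every missing or out-of-tolerance value."""
    drift = []
    for runner, metrics in reference.items():
        for name, expected in metrics.items():
            actual = nightly.get(runner, {}).get(name)
            if actual is None or abs(actual - expected) > tol:
                drift.append((runner, name, expected, actual))
    return drift

# Hypothetical nightly payload in which one AUROC value has moved.
payload = json.dumps({
    "random": {"accuracy": 0.3352, "auroc": 0.4964},
    "cosine": {"accuracy": 0.3673, "auroc": 0.3987},
    "contradiction_geometry": {"accuracy": 0.2877, "auroc": 0.5900},
})
drift = find_drift(json.loads(payload))
```

Because the dataset is frozen and the baselines are re-run unchanged, any non-empty result from a check of this kind would point at a change in the code or environment rather than in the data.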
