Quintin Hypothesis Benchmark
The hypothesis is empirical or it is not. The firm has chosen empirical. This page is the leaderboard the firm itself can lose on. If a one-line cosine baseline beats the firm's contradiction-geometry probe on a slice, that result is shown here in plain sight.
The Quintin Hypothesis predicts that the difference vector between embeddings of a premise and its logical contradiction is sparse (concentrated in few dimensions, per Hoyer sparsity), while the difference between a premise and a coherent continuation is dense. The benchmark tests this prediction against 1936 frozen items spanning physics, economics, and ethics.
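For readers unfamiliar with the measure, here is a minimal sketch of Hoyer sparsity applied to an embedding difference vector. It uses the standard definition, (sqrt(n) - L1/L2) / (sqrt(n) - 1), which is 1 for a one-hot vector and 0 when every dimension has the same magnitude. The function names and the handling of the all-zero vector are assumptions made for illustration, not the firm's probe code.

```python
import math

def hoyer_sparsity(vec):
    """Hoyer sparsity in [0, 1]: 1.0 for a one-hot vector, 0.0 for equal magnitudes."""
    n = len(vec)
    if n < 2:
        raise ValueError("Hoyer sparsity needs at least two dimensions")
    l1 = sum(abs(v) for v in vec)
    l2 = math.sqrt(sum(v * v for v in vec))
    if l2 == 0.0:
        # Assumption: an all-zero difference is treated as maximally dense.
        return 0.0
    root_n = math.sqrt(n)
    return (root_n - l1 / l2) / (root_n - 1.0)

def difference_sparsity(premise_emb, other_emb):
    """Sparsity of the element-wise difference between two embeddings (hypothetical helper)."""
    diff = [a - b for a, b in zip(premise_emb, other_emb)]
    return hoyer_sparsity(diff)
```

Under the hypothesis, `difference_sparsity(premise, contradiction)` would tend to be higher than `difference_sparsity(premise, continuation)`; the leaderboard is what tests whether that holds.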
Leaderboard
| Runner | n (of N) | Accuracy (3-way) | AUROC (contradicting vs coherent) | ECE | Latency p50 (ms) | Status |
|---|---|---|---|---|---|---|
| random | 1936 of 1936 | 0.3352 | 0.4964 | 0.2537 | 0.0022 | ok |
| cosine (best acc) | 1936 of 1936 | 0.3673 | 0.3987 | 0.3964 | 0.0035 | ok |
| contradiction_geometry (best AUROC) | 1936 of 1936 | 0.2877 | 0.5858 | 0.2752 | 0.0054 | ok |
Honest finding: on this run a non-firm baseline (cosine) currently leads the 3-way accuracy column.
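As a reading aid for the AUROC column, the sketch below computes AUROC from its pairwise definition: the probability that a randomly chosen contradicting item receives a higher score than a randomly chosen coherent item, with ties counted as one half. It assumes each runner emits a scalar score where higher means "more likely contradicting"; that convention and the function name are assumptions, not something this page specifies.

```python
def auroc(contradicting_scores, coherent_scores):
    """Pairwise AUROC: P(score of contradicting item > score of coherent item), ties = 0.5."""
    if not contradicting_scores or not coherent_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for pos in contradicting_scores:
        for neg in coherent_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(contradicting_scores) * len(coherent_scores))
```

A value of 0.5 is chance. A value below 0.5 means the score ranks coherent items above contradicting ones more often than not.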
Per-domain breakdown
Domain: economics
| Runner | n | Accuracy | AUROC |
|---|---|---|---|
| random | 574 | 0.2944 | 0.5306 |
| cosine | 574 | 0.4094 | 0.5044 |
| contradiction_geometry | 574 | 0.2596 | 0.5774 |
Domain: ethics
| Runner | n | Accuracy | AUROC |
|---|---|---|---|
| random | 344 | 0.3343 | 0.5286 |
| cosine | 344 | 0.1890 | 0.5124 |
| contradiction_geometry | 344 | 0.3285 | 0.4667 |
Domain: physics
| Runner | n | Accuracy | AUROC |
|---|---|---|---|
| random | 1018 | 0.3585 | 0.4666 |
| cosine | 1018 | 0.4037 | 0.3607 |
| contradiction_geometry | 1018 | 0.2898 | 0.6232 |
Confusion matrix: contradiction_geometry
| gold \ predicted | coherent | contradicting | orthogonal |
|---|---|---|---|
| coherent | 0 | 634 | 0 |
| contradicting | 0 | 557 | 0 |
| orthogonal | 0 | 745 | 0 |
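
The matrix can be cross-checked against the leaderboard with a few lines of arithmetic. The sketch below copies the counts from the table above into a nested dict (the data structure and function names are illustrative) and recomputes the 3-way accuracy. Every prediction falls in the contradicting column, so accuracy reduces to the share of gold contradicting items.

```python
# Counts copied from the confusion matrix: confusion[gold][predicted].
confusion = {
    "coherent":      {"coherent": 0, "contradicting": 634, "orthogonal": 0},
    "contradicting": {"coherent": 0, "contradicting": 557, "orthogonal": 0},
    "orthogonal":    {"coherent": 0, "contradicting": 745, "orthogonal": 0},
}

def total_items(matrix):
    """Sum of every cell, i.e. the number of scored items."""
    return sum(sum(row.values()) for row in matrix.values())

def accuracy(matrix):
    """Diagonal mass divided by total mass."""
    correct = sum(matrix[label][label] for label in matrix)
    return correct / total_items(matrix)

def predicted_counts(matrix):
    """Column sums: how often each label was predicted."""
    labels = list(matrix)
    return {p: sum(matrix[g][p] for g in labels) for p in labels}
```

Here `total_items(confusion)` is 1936 and `round(accuracy(confusion), 4)` is 0.2877, matching the contradiction_geometry row of the leaderboard.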
Dataset card & artifacts
The frozen v1 dataset lives at benchmarks/quintin_hypothesis/v1/dataset.jsonl in the firm's monorepo. It is firm-authored under the firm-internal-public waiver: free reuse, no warranty, no silent inclusion of copyrighted material.
Items are labelled coherent, contradicting, or orthogonal. They cover three domains: physics, economics, ethics. Templates parameterise numeric values and named entities, and are de-duplicated at curation time using both 5-gram Jaccard and embedding-cosine filters.
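To make the n-gram half of the de-duplication step concrete, here is a minimal sketch of a greedy 5-gram Jaccard filter. It assumes word-level 5-grams on lowercased, whitespace-split text and a hypothetical threshold of 0.8; the page does not state the tokenisation or the threshold actually used, and the embedding-cosine filter is not shown.

```python
def word_ngrams(text, n=5):
    """Set of word-level n-grams; texts shorter than n words yield one short gram."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedupe(texts, threshold=0.8):
    """Keep a text only if it is below the threshold against every text kept so far."""
    kept, kept_grams = [], []
    for text in texts:
        grams = word_ngrams(text)
        if all(jaccard(grams, seen) < threshold for seen in kept_grams):
            kept.append(text)
            kept_grams.append(grams)
    return kept
```

The greedy pass is quadratic in the number of kept items, which is acceptable at the scale of a few thousand items; a larger corpus would usually call for MinHash or a similar approximation.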
Versioning & drift
v1 is frozen on publication. Improvements ship as v2/ with a separate dataset and a separate leaderboard. The CI workflow re-runs all three baselines nightly against this frozen dataset and uploads the JSON here. Drift on this benchmark means the firm is losing its own thesis, which makes it a louder alert than method-level drift.