TheseusCodex

Replicate the firm's empirical claims

The firm publishes its conclusions; replication is the corresponding public obligation. The Quintin Hypothesis is either an empirical claim or it is not — and the same is true for the cross-model geometry study and the Householder ablation. This page is how a researcher who has never spoken to the firm can clone the repo, run one command, and check.

Nightly replication status

The nightly replication badge reports the status of the firm's own replication CI. A red badge means the firm's most recent shard run did not reproduce within tolerance. That is the loudest alert in the codebase.

One command, three claims

# from the repo root
cd replication
make install      # one-time: installs the firm's editable package
make qh-benchmark # prompt 08: probe vs. cosine vs. random
make cross-model  # prompt 09: skips models without API keys
make ablation     # prompt 10: Householder reflection ablation
make all          # all three, in sequence

Each target writes a per-run directory under replication/runs/ containing a reproducibility envelope (git SHA, dataset hash, model identifiers, deterministic flag, OS, Python version) and a normalised metrics_summary.json. Two runs are "compatible" iff their envelopes match on the structural fields. make verify PRIOR_RUN=<dir> compares run directories and emits one of three verdicts: match, mismatch, incompatible.
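A minimal sketch of how the three verdicts could be derived from two run directories' contents. It assumes the structural fields are the dataset hash, runner set, and deterministic flag; the field names other than dataset_sha256, the function name compare_runs, and the flat metrics layout are hypothetical, and the harness's own verify module may work differently.

```python
# Hypothetical structural field names; only dataset_sha256 is named on this page.
STRUCTURAL_FIELDS = ("dataset_sha256", "runners", "deterministic")


def compare_runs(env_a, metrics_a, env_b, metrics_b, tol=5e-3):
    """Return 'incompatible', 'mismatch', or 'match' for two runs."""
    # Runs whose envelopes differ structurally cannot be compared at all.
    for field in STRUCTURAL_FIELDS:
        if env_a.get(field) != env_b.get(field):
            return "incompatible"
    # Comparable runs must report the same metrics...
    if metrics_a.keys() != metrics_b.keys():
        return "mismatch"
    # ...and every metric must agree within the absolute tolerance.
    for name in metrics_a:
        if abs(metrics_a[name] - metrics_b[name]) > tol:
            return "mismatch"
    return "match"
```

The ordering matters in this sketch: structural incompatibility is checked first, so a metric difference between runs on different datasets is never reported as a mismatch.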

Full instructions, prereqs, environment variables, and knobs: replication/README.md. Source for the harness: replication/.

Dataset card

The replication dataset is the public Quintin Hypothesis v1 set: 1,936 items across physics, economics, and ethics, every item either firm-authored or drawn from public-domain sources with explicit licensing. The schema is published at /methodology/benchmark/qh; the dataset card is in the repo at benchmarks/quintin_hypothesis/v1/dataset_card.md. The dataset is frozen — improvements ship as v2 with separate tracking.

Expected numbers (deterministic, hash-det embedder)

These are the firm's recorded numbers on the QH v1 dataset. make verify insists on bit-stability within deterministic mode on the same machine; across machines the absolute tolerance is 5×10⁻³ on a [0, 1] metric.
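The two regimes above can be sketched as a single check. The function name and the same_machine flag are hypothetical; the sketch only illustrates the stated rule (exact equality on the same machine in deterministic mode, absolute tolerance 5×10⁻³ across machines) and is not the harness's own code.

```python
CROSS_MACHINE_TOL = 5e-3  # absolute tolerance on a [0, 1] metric, as stated above


def metric_reproduces(recorded, observed, same_machine):
    """Apply the stricter rule on the same machine, the tolerance otherwise."""
    if same_machine:
        # Bit-stability: the floats must be exactly equal.
        return recorded == observed
    return abs(recorded - observed) <= CROSS_MACHINE_TOL
```

A difference of 10⁻⁷ therefore fails on the same machine but passes across machines.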

Runner                  Accuracy  AUROC    ECE      Notes
random                  ≈ 0.335   ≈ 0.50   ≈ 0.25   lower bound; uniform-random label
cosine                  ≈ 0.367   ≈ 0.40   ≈ 0.40   trivial baseline; beats the firm's probe on accuracy
contradiction_geometry  ≈ 0.288   ≈ 0.586  ≈ 0.275  the firm's probe; wins AUROC, loses accuracy at frozen v1 cuts

The honest reading: the firm's probe wins on AUROC and loses on accuracy at the frozen v1 thresholds. That is a finding, not a bug — the leaderboard is the kind the firm itself can lose on, and it does.
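A toy illustration of how a scorer can win on AUROC and still lose on accuracy at a frozen threshold. The data below is invented for the example and binary for simplicity; it is not drawn from QH v1 and makes no assumption about the benchmark's label structure or its actual cuts.

```python
def accuracy_at(scores, labels, threshold):
    """Accuracy when scores at or above the threshold are predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4]  # perfectly ranked, but every score is below 0.5

print(auroc(scores, labels))             # 1.0
print(accuracy_at(scores, labels, 0.5))  # 0.5
```

AUROC depends only on the ranking of scores, while accuracy depends on where the threshold was frozen, so the two metrics can disagree without either being miscomputed.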

Replication-success rubric

  1. make qh-benchmark produces an envelope structurally compatible with one of the firm's recorded runs (same dataset hash, runner set, deterministic flag).
  2. make verify returns match against that recorded run.
  3. make cross-model contains at least the hash-det adapter in its envelope. Remote-API adapters are bonus, not required.
  4. make ablation reproduces identical accuracy across all five variants in deterministic mode (the null result the harness is designed to detect, and which the firm publishes anyway).

A run that satisfies (1)–(4) is a successful replication. A remote-API adapter that has drifted is inconclusive, not a failure — that is a fact about the provider, not the firm.
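The rubric can be read as a conjunction of four checks. The sketch below assumes each check has already been reduced to a simple value; the function names, argument names, and variant labels are hypothetical and do not describe the harness's own data structures.

```python
def ablation_is_null(accuracies):
    """True when exactly five variants report one shared accuracy value."""
    return len(accuracies) == 5 and len(set(accuracies.values())) == 1


def replication_succeeded(envelope_compatible, verdict, adapters, ablation_accuracies):
    """Combine rubric items (1) to (4) into a single pass/fail answer."""
    return (
        envelope_compatible                           # item 1: structural compatibility
        and verdict == "match"                        # item 2: make verify verdict
        and "hash-det" in adapters                    # item 3: hash-det adapter present
        and ablation_is_null(ablation_accuracies)     # item 4: identical ablation accuracy
    )
```

Remote-API adapters do not appear in the check at all, which mirrors the rule that their drift is inconclusive rather than a failure.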

What to do if your numbers differ

make verify emits one of three verdicts. The triage order:

  1. Was the envelope git_dirty? A dirty SHA is not a fixed point. Re-run on a clean checkout.
  2. Same Python version? The firm pins 3.11. The envelope records yours.
  3. Same dataset hash? If the verdict is incompatible with dataset_sha256 in the structural diff, the dataset on disk is not the v1 frozen file.
  4. Hardware nondeterminism? BLAS variants can drift at the 10⁻⁷ level even with single-thread caps, which is far inside the 5×10⁻³ cross-machine tolerance. If you see drift beyond tolerance with threads capped, please report it.
  5. Model API drift. Cross-model numbers change when the provider revises a model. The envelope records the model identifier; that drift is a fact about the provider, not a failure of the firm's claim.
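For the hardware step, single-thread caps are commonly applied through environment variables read by OpenMP, OpenBLAS, and MKL. Whether the harness sets these itself is an assumption to check against replication/README.md; the sketch below only shows the usual pattern.

```python
import os

# Cap the BLAS/OpenMP thread pools at one thread. This has to run before the
# numerical library is first imported, because pools are sized at load time.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"
```

Capping threads removes one source of run-to-run variation (reduction order across threads) but does not make different BLAS builds agree bit for bit.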

If after that the deterministic targets still do not match within tolerance on the same OS + Python version, please open an issue with both replication_envelope.json files attached. A failed replication of the firm's own thesis is a louder alert than method drift, and the firm wants to hear about it.

Constraints the harness honours

  • No required services beyond Python and pip. No Docker required.
  • Targets that need API keys read them from environment variables and skip the affected model with a clear log line when the key is missing. They never error.
  • No proprietary data is bundled into the public repository. The replication dataset is QH v1 only.
  • A failed target prints both a stack trace and a one-paragraph human explanation of likely causes (missing env var, Python version off by one minor release, API rate limit).
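The skip-rather-than-error behaviour for missing API keys can be sketched as follows. The logger name, function name, and the mapping from model identifier to environment variable are hypothetical; a value of None stands for an adapter that needs no key, such as hash-det.

```python
import logging
import os

logger = logging.getLogger("replication.cross_model")  # hypothetical logger name


def runnable_models(required_keys):
    """Return the model ids whose API key is available, logging every skip.

    required_keys maps a model id to the environment variable holding its key,
    or to None when the adapter needs no key.
    """
    runnable = []
    for model_id, env_var in required_keys.items():
        if env_var is None or os.environ.get(env_var):
            runnable.append(model_id)
        else:
            # A missing key is logged and skipped; it never raises.
            logger.warning("skipping %s: %s is not set", model_id, env_var)
    return runnable
```

An empty environment variable is treated the same as a missing one in this sketch, so an accidentally blank key produces a skip line instead of a failed request.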

Researchers who have replicated us

Each row below is an outside researcher who ran this harness end-to-end and whose numbers matched the firm's recorded numbers within the published tolerance. Each row corresponds to a signed reproducibility certificate emitted by python -m replication.lib.verify ... --emit-certificate and counter-signed with the firm's publication key.

A certificate is narrow on purpose. It certifies, only, that the harness reproduced the firm's published numbers on the named researcher's hardware. It does not certify that the firm's numbers are correct, and it does not verify the replicator's identity — names and affiliations are claimed by the replicator. The firm gates the public row behind explicit consent but performs no identity check.

No certified replications have been published yet. The page is live and will populate as soon as the firm signs the first certificate. If you have already run the harness and seen a match verdict, follow the steps under "How a row gets here" to submit your run.

How a row gets here

  1. A researcher clones the repo and runs make all from replication/.
  2. make verify PRIOR_RUN=... returns verdict match against one of the firm's recorded baseline runs.
  3. The researcher (or the firm, on the researcher's behalf) calls python -m replication.lib.verify ... --emit-certificate, supplying the researcher's name, affiliation, and the --consent-public flag if they consent to be listed here.
  4. The firm reviews the unsigned certificate, signs it with the publication key, and commits it under replication/certificates/.
  5. The next build of this page picks it up. There is no moderation step beyond the signature: the firm either signs the certificate or doesn't.

Verifying a certificate yourself

Every row carries the canonical hash (truncated, full value in the JSON file) and the fingerprint of the signing key. To verify a certificate end-to-end without trusting this page:

# from the repo root
python -m replication.lib.certificate verify \
    replication/certificates/<row id>.json
# exits 0 iff the canonical bytes hash to the recorded hash and
# the Ed25519 signature verifies against the firm's verify key.

The firm's verify key for the active fingerprint is published alongside the keyring rotation log in the replication README.
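The hash half of that check can be sketched with the standard library alone. Everything specific here is an assumption: that the canonical bytes are UTF-8 JSON with sorted keys and compact separators, that the hash is SHA-256, and that the certificate stores it under a field named canonical_hash next to a signature field. The Ed25519 step is omitted because it needs a third-party library; replication.lib.certificate remains the authoritative implementation.

```python
import hashlib
import json


def canonical_hash(certificate, excluded=("canonical_hash", "signature")):
    """Hash the certificate body in an order-independent serialisation."""
    body = {k: v for k, v in certificate.items() if k not in excluded}
    canonical = json.dumps(
        body, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def hash_matches(certificate):
    """True when the recorded hash equals the hash of the canonical bytes."""
    return canonical_hash(certificate) == certificate.get("canonical_hash")
```

Sorting keys before hashing is what makes the result independent of how the JSON file happens to be laid out on disk; any change to a field value still changes the hash.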
