Two datasets. One standard of proof.
- 52,243 comment→program pairs; every program compiles — verified by running the compiler on each one at generation time, and re-verified end-to-end (all 52,243) before this page went live
- Bit-reproducible: generator + seed 1234 + oracle ⇒ byte-identical file (sha256 1a6941af…, see the signed ledger)
- This exact file trained a from-scratch 138M model to 16/16 compile@1 and 16/16 function-correctness on its frozen benchmark; the run's hash-chained ledger binds corpus→weights and ships in the box
- Synthetic + oracle-verified: zero scraped code, zero license ambiguity
free first: 200-pair sample · signed ledger · tarball sha256
- 25,541 documents / ~1.54B tokens, Merkle root 92e9af57…, Pi-witness-signed at public pulse 2516402
- Per-document provenance leaf: source, id, strand, provenance basis, year. 8,643 docs bright-line pre-1929 PD, 12,077 source-cleared (Gutenberg), 4,821 oracle-verified code (ours)
- Deduplicated (MinHash/LSH), provenance-gated, five independent verification checks pass — including a Merkle-root recompute by a second implementation
- We state the provenance split exactly — we will never sell this as “fully public-domain” because 58% is source-cleared rather than date-verified. That honesty is the product working as intended.
free first: request the full manifest + signed root
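For readers who want to see what a Merkle-root recompute involves, here is a minimal Python sketch. It is an illustration under stated assumptions, not the scheme used for this corpus: the real leaf encoding and tree layout are whatever the manifest documents. The sketch assumes each leaf is SHA-256 over a record's canonical JSON, that leaf and interior hashes are separated by 0x00 and 0x01 prefix bytes, and that an odd level duplicates its last node. The field values in `records` are made up.

```python
import hashlib
import json


def leaf_hash(record: dict) -> bytes:
    # Assumption: a leaf is SHA-256 over canonical JSON (sorted keys, compact
    # separators) with a 0x00 prefix. The manifest defines the real encoding.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(b"\x00" + payload).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    if not leaves:
        raise ValueError("cannot build a Merkle root over zero leaves")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: an odd level is padded by repeating its last node.
            level.append(level[-1])
        # Hash adjacent pairs with a 0x01 prefix to form the next level up.
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Hypothetical provenance records using the field names listed above.
records = [
    {"source": "example-archive", "id": "doc-0001", "strand": "text",
     "provenance_basis": "pre-1929", "year": 1901},
    {"source": "example-library", "id": "doc-0002", "strand": "text",
     "provenance_basis": "source-cleared", "year": 1925},
    {"source": "generator", "id": "doc-0003", "strand": "code",
     "provenance_basis": "oracle-verified", "year": 2024},
]

root = merkle_root([leaf_hash(r) for r in records])
print(root.hex())
```

Changing any field of any record changes the root. That is why recomputing it with a second, independent implementation is a meaningful check rather than a formality.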
need a custom shape — a bigger oracle-verified code corpus, a different language, your own substrate attested end-to-end? email us. models trained on these datasets are yours, commercially, no royalty.
Run the verification before you spend a dollar.
the sample, the signed ledger, and the hashes are public. this is what buying data should feel like.
# 1 — the sample: 200 real records, free
curl -O https://ledatic.org/data/rail-verified-pairs-v2-sample200.jsonl
# 2 — the signed ledger: corpus sha256 bound to a training run,
# witness-signed against public entropy pulse 2641877
curl -O https://ledatic.org/data/rail-verified-pairs-v2.ledger.sig.json
curl https://ledatic.org/entropy/pulse/2641877 # the pulse, live, hash-chained
# 3 — after purchase: re-verify all 52,243 programs against the public
# Rail compiler (github.com/zemo-g/rail) with the included verify_pairs.py
python3 verify_pairs.py pairs_v6.jsonl attested_ledger.jsonl
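Checking a downloaded file against a published sha256 needs nothing beyond the Python standard library. The sketch below is illustrative and is not part of `verify_pairs.py`. The helper names are hypothetical, and it assumes you have copied the hex digest out of the signed ledger yourself, since the ledger's JSON layout is not described here.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20) -> str:
    """Stream a file through SHA-256 so large tarballs never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex: str) -> bool:
    """Compare a local file's digest with a hex digest copied from the ledger."""
    return sha256_file(path) == published_hex.strip().lower()
```

Usage would look like `matches_published("rail-verified-pairs-v2-sample200.jsonl", digest_from_ledger)`, where `digest_from_ledger` is the full 64-character hex string, not a truncated prefix.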
What the chain proves — and what it doesn't.
- It proves exactly which bytes you're getting, where each record came from by our stated basis, that the code compiles under a public compiler, and that none of it was swapped after signing.
- It does not prove the data will make your model good. The included ledger shows what it did for ours (a 138M from-scratch model: 16/16 compile@1 on its frozen bench); your mileage is an empirical question — which is why the sample is free.
- Provenance basis is per-record and explicit. "pre-1929" is a bright legal line; "source-cleared" means the source declares it public-domain and we did not independently date-verify; "operated-substrate" means we generated and machine-verified it. We report the split; we don't blur it.
- If we disappear, verification still works: the Merkle scheme is documented in the manifest, the signatures are standard Ed25519, and the entropy pulse chain is archived by anyone who ever mirrored it.
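The "nothing was swapped after signing" property rests on hash chaining, which anyone can re-check offline. Below is a toy Python sketch of that idea. It is not the ledger's or the pulse chain's actual format: the `prev` and `payload` field names, the all-zero genesis value, and the canonical-JSON hashing are all assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumption: the first entry points at an all-zero hash


def entry_hash(entry: dict) -> str:
    # Assumption: an entry's hash is SHA-256 over its canonical JSON form.
    body = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(body).hexdigest()


def build_chain(payloads) -> list[dict]:
    """Link payloads so each entry commits to the hash of the one before it."""
    chain, prev = [], GENESIS
    for payload in payloads:
        entry = {"prev": prev, "payload": payload}
        chain.append(entry)
        prev = entry_hash(entry)
    return chain


def first_broken_link(entries: list[dict]):
    """Return the index of the first entry whose `prev` is wrong, or None."""
    expected = GENESIS
    for index, entry in enumerate(entries):
        if entry.get("prev") != expected:
            return index
        expected = entry_hash(entry)
    return None


chain = build_chain(["corpus digest", "weights digest", "benchmark result"])
print(first_broken_link(chain))   # None: every link matches

chain[0]["payload"] = "a different corpus digest"
print(first_broken_link(chain))   # 1: entry 1 no longer points at entry 0's hash
```

In this toy layout an edit to the final entry is only caught if the head hash is pinned somewhere outside the chain. That outside pin is the job the Ed25519 signature and the public pulse do.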