Datasets — provenance-attested

Training data you can audit before you buy.

Every record's origin: hashed → Merkle-rooted → Ed25519-signed against a public entropy beacon.

Training-data provenance stopped being optional — regulators ask, counsel asks, customers ask. Web-scraped corpora can't answer. Ours answer with a verification chain you can run yourself: the hashes, the signed ledger, and the sample are free downloads below, right now, before any money moves.

No scraped code · no license ambiguity · the chain is the product

Catalog

Two datasets. One standard of proof.

Text · provenance-gated
Attested Text Corpus v2
$499 · single-organization license · instant download on purchase
  • 25,541 documents / ~1.54B tokens, Merkle root 92e9af57…, Pi-witness-signed at public pulse 2516402
  • Per-document provenance leaf: source, id, strand, provenance basis, year. The split by basis: 8,643 docs bright-line pre-1929 public domain, 12,077 source-cleared (Gutenberg), 4,821 oracle-verified code (ours)
  • Deduplicated (MinHash/LSH), provenance-gated, five independent verification checks pass — including a Merkle-root recompute by a second implementation
  • We state the provenance split exactly — we will never sell this as “fully public-domain” because 58% is source-cleared rather than date-verified. That honesty is the product working as intended.
Buy — $499

free first: request the full manifest + signed root

Full provenance & FAQ →

need a custom shape — a bigger oracle-verified code corpus, a different language, your own substrate attested end-to-end? email us. models trained on these datasets are yours, commercially, no royalty.

Check first

Run the verification before you spend a dollar.

the sample, the signed ledger, and the hashes are public. this is what buying data should feel like.

# 1 — the sample: 200 real records, free
curl -O https://ledatic.org/data/rail-verified-pairs-v2-sample200.jsonl

# 2 — the signed ledger: corpus sha256 bound to a training run,
#     witness-signed against public entropy pulse 2641877
curl -O https://ledatic.org/data/rail-verified-pairs-v2.ledger.sig.json
curl https://ledatic.org/entropy/pulse/2641877   # the pulse, live, hash-chained

# 3 — after purchase: re-verify all 52,243 programs against the public
#     Rail compiler (github.com/zemo-g/rail) with the included verify_pairs.py
python3 verify_pairs.py pairs_v6.jsonl attested_ledger.jsonl

chain: generator(seed 1234) → oracle-verify each program → corpus sha256 → training ledger record 0 → hash-chained per-checkpoint → Ed25519 witness signature → public pulse 2641877
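
Two links in that chain, the corpus sha256 and the hash-chained ledger, can be checked with nothing but a standard library. The sketch below is illustrative only and is not the shipped verify_pairs.py. It assumes each ledger record is a JSON object that names its predecessor's hash in a field called prev_hash, and that records are hashed over sorted-key compact JSON. Both the field name and the encoding are hypothetical, so use the layout the ledger itself documents.

```python
import hashlib
import json


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_hash(record):
    """Hash a record's canonical JSON form.

    ASSUMPTION: sorted keys and compact separators. The real ledger may
    canonicalize differently.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def chain_is_intact(records, prev_field="prev_hash"):
    """Return True if every record after the first names its predecessor's hash.

    ASSUMPTION: prev_field is a hypothetical field name, not the published one.
    """
    for previous, current in zip(records, records[1:]):
        if current.get(prev_field) != record_hash(previous):
            return False
    return True
```

In use, the digest from sha256_file on the downloaded corpus would be compared against the corpus sha256 recorded in the ledger, and chain_is_intact would be run over the parsed lines of the ledger file. Editing any earlier record changes its hash, so the next record's link no longer matches.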

Honest limits

What the chain proves — and what it doesn't.

  • It proves exactly which bytes you're getting, where each record came from by our stated basis, that the code compiles under a public compiler, and that none of it was swapped after signing.
  • It does not prove the data will make your model good. The included ledger shows what it did for ours (a 138M-parameter model trained from scratch: 16/16 compile@1 on its frozen bench). Whether it does the same for yours is an empirical question, which is why the sample is free.
  • Provenance basis is per-record and explicit. "pre-1929" is a bright legal line; "source-cleared" means the source declares it public-domain and we did not independently date-verify; "operated-substrate" means we generated and machine-verified it. We report the split; we don't blur it.
  • If we disappear, verification still works: the Merkle scheme is documented in the manifest, the signatures are standard Ed25519, and the entropy pulse chain is archived by anyone who ever mirrored it.
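
To make the "second implementation" idea concrete, here is a minimal Merkle-root sketch. It is not the scheme from the manifest. It assumes SHA-256, a one-byte prefix separating leaf hashes from interior hashes, and duplication of the last node on odd-sized levels. All three are illustrative choices, so a root computed this way should not be expected to match the published one unless the manifest specifies the same rules.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    """Hash one provenance leaf (ASSUMED 0x00 prefix for leaves)."""
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """Hash two child digests (ASSUMED 0x01 prefix for interior nodes)."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves) -> str:
    """Return the hex Merkle root over an ordered, non-empty list of byte strings."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [leaf_hash(item) for item in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # ASSUMPTION: pair the odd node out with a copy of itself.
            level.append(level[-1])
        level = [node_hash(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0].hex()
```

Changing a single byte of any leaf, or reordering leaves, changes the root. This is the property that lets a signed root vouch for every record beneath it.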