Text training data · provenance-gated · Merkle-rooted

Attested Text Corpus v2

25,541 documents / ~1.54B tokens with per-document provenance — Merkle-rooted and witness-signed.

Single-organization license · instant download on purchase · provenance stated exactly

What it is

Training data you can audit before you buy.

  • 25,541 documents / ~1.54B tokens, Merkle root 92e9af57…, witness-signed at public pulse 2516402
  • Per-document provenance basis — source, id, year, license — bound into the root
  • Deduplicated (MinHash/LSH), provenance-gated; five independent verification checks pass
  • Provenance split stated exactly — never sold as “fully public-domain”
Records: 25,541 documents / ~1.54B tokens
Size: 2.45 GB (gzip)
SHA-256: 92e9af57335910a57e5f50d33564aa07542d6406cb0cde010f0e3c139f2e9e63
Witness pulse: 2516402 (public entropy beacon)
Provenance basis: per-document (pre-1929 PD / source-cleared / operated-substrate)
License: Single-organization commercial
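
For readers unfamiliar with the MinHash deduplication named above, here is a minimal sketch of the idea. It is an illustration only, not the pipeline used to build this corpus: the shingle size, the number of hash functions, the BLAKE2b-based hashing and all function names are assumptions, and the LSH banding step that makes the comparison scale is left out.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a document (k is an assumed value)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    """Keep the minimum hash per seeded hash function; each seed stands in for a permutation."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + s.encode("utf-8"), digest_size=8).digest(), "big"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank at dusk"
doc_c = "merkle trees commit to every leaf so any change alters the published root value"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("near-duplicate:", estimated_jaccard(sig_a, sig_b))
print("unrelated:", estimated_jaccard(sig_a, sig_c))
```

A deduplication pass would drop one document from any pair whose estimated similarity exceeds a chosen threshold. The threshold used for this corpus is not stated here.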

Free before you buy: request the full manifest + signed root

Check first

Run the verification before you spend a dollar.

Each document carries its own provenance basis, cryptographically bound into the Merkle root: 8,643 are bright-line pre-1929 public domain, 12,077 are source-cleared (declared PD by Project Gutenberg, not independently date-verified), and 4,821 are oracle-verified code, for 25,541 in total. We report the split exactly and never describe the corpus as fully public-domain.
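
To show what "bound into the Merkle root" means in practice, here is a minimal sketch. The exact scheme ships in the package. Everything below is an assumption made for illustration: the canonical-JSON leaf encoding, the `content_sha256` field, the 0x00/0x01 prefixes, the duplicate-last-node rule for odd levels, and the sample records.

```python
import hashlib
import json


def leaf_hash(record: dict) -> bytes:
    """Hash one document record, provenance fields included (hypothetical encoding)."""
    # Canonical JSON (sorted keys, fixed separators) so equal records hash equally.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(b"\x00" + payload).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise up to a single root."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed rule: duplicate the last node on odd levels
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Hypothetical records carrying the fields listed above: source, id, year, license.
records = [
    {"id": "doc-0001", "source": "example-archive", "year": 1912, "license": "PD",
     "content_sha256": hashlib.sha256(b"first text").hexdigest()},
    {"id": "doc-0002", "source": "example-archive", "year": 1899, "license": "PD",
     "content_sha256": hashlib.sha256(b"second text").hexdigest()},
    {"id": "doc-0003", "source": "example-archive", "year": 1925, "license": "PD",
     "content_sha256": hashlib.sha256(b"third text").hexdigest()},
]

root = merkle_root([leaf_hash(r) for r in records])

# Changing a single provenance field changes the root.
tampered = [dict(records[0], year=1950)] + records[1:]
tampered_root = merkle_root([leaf_hash(r) for r in tampered])
print(root.hex() != tampered_root.hex())
```

Because the provenance fields sit inside each leaf, editing a year or a license after signing produces a different root, which is what makes the stated split checkable rather than asserted.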

# the signed provenance chain is public; the data is licensed
curl https://ledatic.org/entropy/pulse/2516402   # the witness pulse, live
# after purchase, the included verifier re-checks the hash + every record

Questions

Straight answers.

Is this really public domain?

The provenance is stated exactly per document, not blurred. 8,643 documents are bright-line pre-1929 public domain; 12,077 are source-cleared (Project Gutenberg declares them PD, but we did not independently date-verify each one); 4,821 are code we generated and machine-verified. We never describe the whole corpus as fully public-domain.

What am I actually buying if the texts are public?

The assembly and the proof: 21k books resolved, cleaned, deduplicated and provenance-gated to 25,541 documents, plus a signed, independently re-verifiable provenance chain. Raw public archives cannot answer “prove where each document came from.” This can.

How do I verify it?

The signed manifest, the Merkle scheme, and a verifier ship in the package. A second, independent root recompute is provided in Rail. The witness signature is anchored to public entropy pulse 2516402, which you can fetch live.
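
If you want a check that does not depend on the included verifier, a digest comparison can be sketched in a few lines. This assumes the published value is compared against a SHA-256 digest you compute locally. Whether that digest is taken over the archive bytes or is the Merkle root itself is defined by the manifest, not by this example. The function names and the demo file are hypothetical.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so a multi-gigabyte archive never sits in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, published_hex: str) -> bool:
    """Compare the local digest with a published hex digest."""
    return hmac.compare_digest(sha256_file(path), published_hex.strip().lower())


# Demo on a throwaway file; substitute the downloaded archive and the published value.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"attested corpus sample")
    expected = hashlib.sha256(b"attested corpus sample").hexdigest()
    ok = matches_published(sample, expected)
    bad = matches_published(sample, "0" * 64)

print(ok, bad)
```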

What license do I get?

A single-organization commercial license: train and ship models freely, including commercially. You may not redistribute the corpus itself. See LICENSE.md.

Can I get the full manifest before buying?

Yes — email 31zemogyllier@gmail.com for the full document manifest and signed root.
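
Once you have the manifest, one quick sanity check is to tally the provenance basis of every entry and compare it with the split stated on this page. The sketch below assumes a JSON Lines manifest with a `basis` field and the three label strings shown. The real manifest format may differ, so treat those names as placeholders.

```python
import json
from collections import Counter

# Split and total as stated on this page; the label strings are hypothetical.
PUBLISHED_SPLIT = {
    "pre-1929-pd": 8643,
    "source-cleared": 12077,
    "oracle-verified-code": 4821,
}
PUBLISHED_TOTAL = 25541


def tally_basis(manifest_lines) -> Counter:
    """Count manifest entries per provenance basis (assumes one JSON object per line)."""
    counts = Counter()
    for line in manifest_lines:
        line = line.strip()
        if line:
            counts[json.loads(line)["basis"]] += 1
    return counts


def split_matches(counts: Counter) -> bool:
    """True only if every per-basis count and the total match the published figures."""
    return dict(counts) == PUBLISHED_SPLIT and sum(counts.values()) == PUBLISHED_TOTAL


# Tiny stand-in manifest; with the real file, pass an open file handle instead.
sample_manifest = [
    '{"id": "doc-0001", "basis": "pre-1929-pd"}',
    '{"id": "doc-0002", "basis": "source-cleared"}',
    '{"id": "doc-0003", "basis": "source-cleared"}',
]
sample_counts = tally_basis(sample_manifest)
print(dict(sample_counts))
print(sum(PUBLISHED_SPLIT.values()) == PUBLISHED_TOTAL)
```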