Training data you can audit before you buy.
- 25,541 documents / ~1.54B tokens, Merkle root 92e9af57…, witness-signed at public pulse 2516402
- Per-document provenance basis (source, id, year, license) bound into the root
- Deduplicated (MinHash/LSH), provenance-gated; five independent verification checks pass
- Provenance split stated exactly — never sold as “fully public-domain”
| Records | 25,541 documents / ~1.54B tokens |
| Size | 2.45 GB (gzip) |
| SHA-256 | 92e9af57335910a57e5f50d33564aa07542d6406cb0cde010f0e3c139f2e9e63 |
| Witness pulse | 2516402 (public entropy beacon) |
| Provenance basis | per-document: pre-1929 PD / source-cleared / operated-substrate |
| License | Single-organization commercial |
Free before you buy: request the full manifest + signed root
Run the verification before you spend a dollar.
Each document carries its own provenance basis, cryptographically bound into the Merkle root: 8,643 bright-line pre-1929 public domain, 12,077 source-cleared (Project Gutenberg PD, not independently date-verified), 4,821 oracle-verified code. We report the split exactly and never describe the corpus as fully public-domain.
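The split can be checked against the stated total with plain arithmetic. A minimal sketch in Python, using only the counts quoted above; the dictionary keys are illustrative labels, not field names taken from the manifest:

```python
# Counts as stated on this page; the keys are illustrative labels only.
split = {
    "pre_1929_public_domain": 8_643,
    "source_cleared": 12_077,
    "oracle_verified_code": 4_821,
}

total = sum(split.values())
assert total == 25_541  # the three categories account for every document

# Print each category with its share of the corpus.
for basis, count in split.items():
    print(f"{basis}: {count} ({count / total:.1%})")
```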
# the signed provenance chain is public; the data is licensed
curl https://ledatic.org/entropy/pulse/2516402 # the witness pulse, live
# after purchase, the included verifier re-checks the hash + every record
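For readers who want to see what a hash re-check involves, here is a rough sketch in Python. It assumes the check is a SHA-256 digest over the bytes of the delivered archive; this page does not say whether the published value covers the archive or is the Merkle root itself, so treat the included verifier as the authority. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so a multi-GB archive never sits in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, published_hex: str) -> bool:
    """Compare the computed digest with a published hex string, ignoring case and padding."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

A call such as `matches_published(Path("corpus.gz"), published_value)` would return `True` only if the bytes are identical; `corpus.gz` is a placeholder filename, not the name of the shipped file.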
Straight answers.
Is this really public domain?
The provenance is stated exactly per document, not blurred. 8,643 documents are bright-line pre-1929 public domain; 12,077 are source-cleared (Project Gutenberg declares them PD, but we did not independently date-verify each one); 4,821 are code we generated and machine-verified. We never describe the whole corpus as fully public-domain.
What am I actually buying if the texts are public?
The assembly and the proof: about 21k books resolved, cleaned, deduplicated and provenance-gated, combined with 4,821 verified code documents into a 25,541-document corpus, plus a signed, independently re-verifiable provenance chain. Raw public archives cannot answer “prove where each document came from.” This can.
How do I verify it?
The signed manifest, the Merkle scheme, and a verifier ship in the package. A second, independent root recompute is provided in Rail. The witness signature is anchored to public entropy pulse 2516402, which you can fetch live.
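To illustrate what a root recompute looks like in general, here is a minimal Python sketch of a binary Merkle tree over per-document provenance records. Everything specific in it is an assumption made for illustration: the canonical-JSON leaf encoding, the SHA-256 hash, the domain-separation prefixes, the duplicate-last-node rule for odd levels, and the sample records. The scheme shipped in the package defines the real rules, and this sketch will not reproduce the published root.

```python
import hashlib
import json


def leaf_hash(record: dict) -> bytes:
    # Assumed leaf encoding: canonical JSON (sorted keys, no extra whitespace).
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # The 0x00 prefix keeps leaf hashes distinct from interior-node hashes.
    return hashlib.sha256(b"\x00" + payload).digest()


def merkle_root(records: list[dict]) -> str:
    if not records:
        raise ValueError("cannot build a Merkle root over zero records")
    level = [leaf_hash(r) for r in records]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed rule: duplicate the last node on odd levels
        # Hash each adjacent pair into one parent node.
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Hypothetical records using the four provenance fields named on this page.
records = [
    {"source": "example-archive", "id": "doc-0001", "year": 1901, "license": "PD"},
    {"source": "example-archive", "id": "doc-0002", "year": 1915, "license": "PD"},
    {"source": "example-codegen", "id": "doc-0003", "year": 2024, "license": "commercial"},
]
print(merkle_root(records))
```

Because every record feeds into the root, changing any single field of any single document yields a different root, which is what lets a buyer compare a recomputed value against the signed one.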
What license do I get?
A single-organization commercial license: train and ship models freely, including commercially. You may not redistribute the corpus itself. See LICENSE.md.
Can I get the full manifest before buying?
Yes — email 31zemogyllier@gmail.com for the full document manifest and signed root.