Training data you can audit before you buy.
- 52,243 comment→program pairs; every one compiles under the public Rail compiler
- Bit-reproducible: generator + seed 1234 + oracle ⇒ byte-identical file
- This exact file trained a from-scratch 138M model to 16/16 compile@1 on a frozen benchmark
- Signed provenance chain (Ed25519 witness at public entropy pulse 2641877) ships in the box
| Records | 52,243 pairs |
| Size | 4.4 MB (JSONL) |
| SHA-256 | 1a6941afad3ad6259af6a0a48f0554108ba0b3197e8e05e53e487833a6cd046c |
| Witness pulse | 2641877 (public entropy beacon) |
| Provenance basis | operated-substrate (synthetic, machine-verified) |
| License | Single-organization commercial |
Free before you buy: 200-record sample · signed ledger
Run the verification before you spend a dollar.
Every program was generated by our own generator and verified by running the Rail compiler (the oracle) on it. No scraped code, no license ambiguity, no human-authored third-party code.
# the signed provenance chain is public; the data is licensed
curl https://ledatic.org/entropy/pulse/2641877 # the witness pulse, live
# after purchase, the included verifier re-checks the hash + every record
Straight answers.
Is this scraped code?
No. Every program is synthetically generated and then verified by running the Rail compiler on it. There is no scraped or third-party human-authored code, so there is no license ambiguity.
How do I verify the dataset before trusting it?
A 200-record sample, the signed ledger, and the tarball hash are free downloads. After purchase, the included verify_pairs.py re-checks that (1) the file's SHA-256 matches the hash bound in the signed training ledger and (2) every one of the 52,243 programs compiles under the public oracle.
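If you want to see what the hash half of that check amounts to, here is a minimal sketch using only the Python standard library. It is not the included verify_pairs.py, and it does not run the Rail compiler. It assumes the published SHA-256 covers the file you pass in; the filename `pairs.jsonl` in the usage note is a hypothetical placeholder.

```python
import hashlib
from pathlib import Path

# Values copied from the spec rows on this page.
EXPECTED_SHA256 = "1a6941afad3ad6259af6a0a48f0554108ba0b3197e8e05e53e487833a6cd046c"
EXPECTED_RECORDS = 52_243


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_records(path: Path) -> int:
    """Count non-empty lines; a JSONL file stores one record per line."""
    with path.open("rb") as fh:
        return sum(1 for line in fh if line.strip())
```

Usage, with a hypothetical filename: `sha256_of(Path("pairs.jsonl")) == EXPECTED_SHA256` should hold for an untampered copy, and `count_records(Path("pairs.jsonl")) == EXPECTED_RECORDS` is a quick sanity check on the record count. Confirming that every program compiles still requires the oracle.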
What does bit-reproducible mean here?
Running the included generator with seed 1234 against the same oracle produces a byte-identical corpus. We regenerated it twice to confirm. You can reproduce it yourself.
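Once you have regenerated the corpus, a byte-for-byte comparison against the shipped file is all the check requires. The sketch below assumes you already have both files on disk; how the generator is invoked is not covered here, and the filenames in the usage note are hypothetical.

```python
import filecmp
from pathlib import Path


def byte_identical(a: Path, b: Path) -> bool:
    """Return True only if both files contain exactly the same bytes."""
    # shallow=False forces a content comparison instead of trusting
    # os.stat() metadata such as size and modification time.
    return filecmp.cmp(a, b, shallow=False)
```

Usage, with hypothetical filenames: `byte_identical(Path("shipped.jsonl"), Path("regenerated.jsonl"))` returns `True` only when the regenerated corpus matches the shipped one byte for byte.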
What license do I get?
A single-organization commercial license: train and ship models freely, including commercially, with no royalty. You may not redistribute the dataset itself. See LICENSE.md in the package.
What model result backs this?
The included hash-chained ledger records a from-scratch 138M model trained on this exact corpus reaching 16/16 compile@1 and 16/16 function-correctness on its frozen benchmark. The full per-checkpoint ledger ships so you can see exactly what was measured.