Code training data · oracle-verified · bit-reproducible

Rail Verified-Pairs Corpus v2

52,243 oracle-verified comment→program pairs. Every program compiles, and you can re-verify that yourself.

Single-organization license · instant download on purchase · provenance stated exactly

What it is

Training data you can audit before you buy.

  • 52,243 comment→program pairs; every one compiles under the public Rail compiler
  • Bit-reproducible: generator + seed 1234 + oracle ⇒ byte-identical file
  • This exact file trained a from-scratch 138M model to 16/16 compile@1 on a frozen benchmark
  • Signed provenance chain (Ed25519 witness at public entropy pulse 2641877) ships in the box

Records: 52,243 pairs
Size: 4.4 MB (JSONL)
SHA-256: 1a6941afad3ad6259af6a0a48f0554108ba0b3197e8e05e53e487833a6cd046c
Witness pulse: 2641877 (public entropy beacon)
Provenance basis: operated-substrate (synthetic, machine-verified)
License: Single-organization commercial

Free before you buy: 200-record sample · signed ledger

Check first

Run the verification before you spend a dollar.

Every program was generated by our own generator and verified by running the Rail compiler (the oracle) on it. No scraped code, no license ambiguity, no human-authored third-party code.

# the signed provenance chain is public; the data is licensed
curl https://ledatic.org/entropy/pulse/2641877   # the witness pulse, live
# after purchase, the included verifier re-checks the hash + every record

Questions

Straight answers.

Is this scraped code?

No. Every program is synthetically generated and then verified by running the Rail compiler on it. There is no scraped or third-party human-authored code, so there is no license ambiguity.

How do I verify the dataset before trusting it?

A 200-pair sample, the signed ledger, and the tarball hash are free downloads. After purchase, the included verify_pairs.py re-checks that (1) the file's SHA-256 matches the hash bound in the signed training ledger and (2) every one of the 52,243 programs compiles under the public oracle.
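
For readers who want an independent check alongside verify_pairs.py, the hash and record-count part of that verification can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the shipped verifier. It assumes the SHA-256 published on this page refers to the JSONL file itself (if it refers to the tarball, hash the tarball instead), and the function names are illustrative. It does not run the Rail compiler, so the compile check still needs the included verifier.

```python
import hashlib
import json

# Values published on this page. Assumption: the hash covers the JSONL file.
EXPECTED_SHA256 = "1a6941afad3ad6259af6a0a48f0554108ba0b3197e8e05e53e487833a6cd046c"
EXPECTED_RECORDS = 52_243


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so it never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_jsonl_records(path):
    """Count non-empty lines, parsing each one to confirm it is valid JSON."""
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed record
                count += 1
    return count


def check_corpus(path, expected_sha256=EXPECTED_SHA256, expected_records=EXPECTED_RECORDS):
    """Return (hash_ok, count_ok) for the file at `path`."""
    hash_ok = sha256_of_file(path) == expected_sha256
    count_ok = count_jsonl_records(path) == expected_records
    return hash_ok, count_ok
```

Calling `check_corpus("rail_pairs.jsonl")` (a hypothetical file name) should return `(True, True)` for an untouched download under the assumptions above. Any other result means the file differs from what the ledger describes.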

What does bit-reproducible mean here?

Running the included generator with seed 1234 against the same oracle produces a byte-identical corpus. We regenerated it twice to confirm. You can reproduce it yourself.
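
If you regenerate the corpus yourself, "byte-identical" is easy to test directly: compare your output with the purchased file byte by byte. The sketch below is one way to do that with the standard library. The function name and the file names in the usage note are hypothetical, and nothing here depends on how the generator works.

```python
def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                # Locate the exact differing byte inside this chunk.
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # No differing byte in the shared prefix: one file is shorter.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                # Both files hit end-of-file at the same point.
                return None
            offset += len(chunk_a)
```

For example, `first_difference("purchased.jsonl", "regenerated.jsonl")` returning `None` means the two files match byte for byte. An integer result tells you where to start looking.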

What license do I get?

A single-organization commercial license: train and ship models freely, including commercially, with no royalty. You may not redistribute the dataset itself. See LICENSE.md in the package.

What model result backs this?

The included hash-chained ledger records a from-scratch 138M model trained on this exact corpus reaching 16/16 compile@1 and 16/16 function-correctness on its frozen benchmark. The full per-checkpoint ledger ships so you can see exactly what was measured.
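
To make "hash-chained" concrete: in a hash chain, each entry's hash covers the previous entry's hash, so editing any earlier entry breaks every link after it. The sketch below illustrates that general idea only. The entry layout (`prev_hash`, `payload`, `hash`), the all-zero starting value, and the JSON encoding are assumptions made for illustration and are not the format of the shipped ledger, which also carries an Ed25519 signature that this sketch does not check.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the first link


def entry_hash(prev_hash, payload):
    """Hash the previous link together with a canonical encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain, payload):
    """Append a new entry whose hash commits to everything before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": entry_hash(prev_hash, payload),
    })


def verify_chain(chain):
    """Return the index of the first broken entry, or None if every link holds."""
    prev_hash = GENESIS
    for index, entry in enumerate(chain):
        if entry["prev_hash"] != prev_hash:
            return index
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return index
        prev_hash = entry["hash"]
    return None
```

The design point is that a verifier needs only the entries themselves: recomputing each hash in order either reproduces the recorded values or pinpoints the first entry that was altered.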