Baby Dragon Hatchling

The lizard brain we are growing for 8gent. Phase 2b. Best val_loss 0.885.

Baby Dragon Hatchling is an open-weight language model, trained from scratch on an M2 Max, paper-faithful to Pathway's BDH architecture (arXiv:2509.26507). Apache 2.0, byte-level vocab, no closed-weight teachers. Four training phases on a curated corpus of sessions, code, docs, and prose - each phase measured against a held-out validation set. Phase 2b closed at a best val_loss of 0.885.
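"Byte-level vocab" means the tokenizer is trivial: the vocabulary is the 256 possible byte values, with no learned merges. A minimal sketch of encoding under that assumption (illustrative, not the training code):

```ts
// Illustrative byte-level "tokenization": token ids are just UTF-8 bytes,
// so the vocab size is fixed at 256 and there is no merge table to train.
const VOCAB_SIZE = 256;

function encode(text: string): number[] {
  return Array.from(new TextEncoder().encode(text)); // each byte -> id in 0..255
}

function decode(ids: number[]): string {
  return new TextDecoder().decode(Uint8Array.from(ids));
}

// encode("dragon") -> [100, 114, 97, 103, 111, 110]
```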

Role in the harness

Today: he is a research artifact. He produces structurally valid session JSON, conditions on a <<source:path>>-style switch as a soft mode signal, and has measurably stopped memorising. He still cannot route. He still cannot speak coherent English. Both are next on the bench.
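The switch is just a tag prefixed to the input; the model conditions on it but is not hard-constrained by it. A hypothetical helper to show the shape (the tag grammar is the model's own; this function is not from the codebase):

```ts
// Hypothetical: prefix a prompt with a <<source:path>> mode tag. The tag
// is a soft signal the model was trained to condition on, not a command.
function withSourceTag(path: string, prompt: string): string {
  return `<<source:${path}>> ${prompt}`;
}

// withSourceTag("dragon/hatchling.txt", "I want to")
// -> "<<source:dragon/hatchling.txt>> I want to"
```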

Intended role: 8gent routes tasks across a large surface area - research, code, communication, memory, multi-step plans. The training goal is not general intelligence but sharper routing: which tool, which persona, which plan structure fits a given input. Once an eval harness scores his decisions against a labelled gold set, BDH starts driving routing in shadow mode behind the production stack via the Throne integration (PRD W0-W3).

The path is: artifact today, shadow-mode tomorrow, default-on routing when his evals beat the heuristic baseline by a measurable margin. Never on by default until the gates pass.
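Shadow mode, concretely: the heuristic router keeps making every production decision, while BDH's pick is logged alongside it for offline scoring. A hedged sketch of that shape; every name below is hypothetical, not the Throne integration's actual API:

```ts
type Route = { tool: string; persona: string; plan: string };

declare function routeHeuristic(input: string): Route;    // production baseline
declare function routeBDH(input: string): Promise<Route>; // BDH inference
declare function logShadow(entry: object): void;          // eval log sink

async function route(input: string): Promise<Route> {
  const live = routeHeuristic(input); // the decision users actually get

  // Shadow path: record BDH's choice next to the live one so the eval
  // harness can score agreement against the gold set later.
  routeBDH(input)
    .then((shadow) => logShadow({ input, live, shadow }))
    .catch(() => {}); // a shadow failure must never reach the user

  return live;
}
```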

The hatchling framing is intentional. This is an early-stage model, still learning, still accumulating training examples and eval data. The dart-throw animation on this page maps throw accuracy to val_loss: lower loss means tighter clusters around the bullseye. Once the eval harness ships, real scores flow through and the dragon earns his aim in real time.
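One way the loss-to-accuracy mapping could work (an assumed, illustrative mapping, not the page's actual animation code): normalise val_loss over the observed phase range and use it as a scatter radius around the bullseye.

```ts
// Assumed mapping: normalise val_loss over the observed range
// (~0.7 best phase, ~1.1 worst) and scale to a scatter radius in pixels.
function throwSpread(valLoss: number, maxRadius = 100): number {
  const lo = 0.7, hi = 1.1;
  const t = Math.min(1, Math.max(0, (valLoss - lo) / (hi - lo)));
  return t * maxRadius; // lower loss -> smaller radius -> tighter cluster
}

function throwDart(valLoss: number): { x: number; y: number } {
  const r = throwSpread(valLoss) * Math.random();
  const angle = Math.random() * 2 * Math.PI;
  return { x: r * Math.cos(angle), y: r * Math.sin(angle) };
}
```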

Training history

Four runs across two variables (corpus size and parameter count), with the later Phase 3c run included in the numbers below. Source: PHASE-2-SYNTHESIS.md (PR 2016).

Loss curves

| Phase | Name | Params | Corpus | Best val_loss |
|-------|------|--------|--------|---------------|
| Phase 0 | heartbeat | 5M | 0.91 MB | 0.080 |
| Phase 1 | explore | 5M | 1.48 MB | 1.116 |
| Phase 2a | scale | 5M | 5.67 MB | 0.934 |
| Phase 2b | capacity | 10M | 5.67 MB | 0.885 |
| Phase 3c | tool-calls | 5M | 130.7 MB | 0.715 |

Best validation loss per phase (byte-level CE). Lower is better.
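For reference, byte-level CE is next-byte cross-entropy averaged over the held-out set (whether the logs are in nats or bits is an assumption here; the definition below uses natural log):

$$\mathcal{L} = -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta\!\left(b_t \mid b_{<t}\right)$$

where $b_t$ is the $t$-th byte of the validation set and $p_\theta$ is the model's next-byte distribution.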

Memorisation regression

| Phase | Result | Note |
|-------|--------|------|
| Phase 0 | FAIL | anomalous - rule-based corpus has near-zero entropy |
| Phase 1 | FAIL | Phase 0 carryover dominated; verbatim regurgitation persisted |
| Phase 2a | PASS | carryover dropped; out-of-distribution prompts produce new content |
| Phase 2b | PASS | produces TypeScript-style interface declarations on out-of-distribution prompts |
| Phase 3c | PASS | corpus shape change beats both heterogeneous baselines on byte loss |

Memorisation regression: does the model regurgitate Phase 0 corpus strings on out-of-distribution prompts?
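One plausible way to implement that check (assumed; not the actual harness): slide a window over each out-of-distribution sample and flag any long verbatim substring that also appears in the Phase 0 corpus.

```ts
// Assumed check: any n-byte window of a sample that appears verbatim in
// the Phase 0 corpus counts as regurgitation. n trades sensitivity
// against false alarms on common boilerplate.
function regurgitates(sample: string, phase0Corpus: string, n = 50): boolean {
  for (let i = 0; i + n <= sample.length; i++) {
    if (phase0Corpus.includes(sample.slice(i, i + n))) return true;
  }
  return false;
}

// A phase PASSes when no out-of-distribution sample trips this check.
```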

Phase 2b corpus mix

Phase 2b, 5.67 MB total:
  • code: 67% (575 files)
  • sessions: 14% (119 files)
  • docs: 8% (69 files)
  • world: 7% (40 files)
  • blog: 4% (14 files)

Slope ratio roughly 3:1 in favour of corpus over capacity: 4x corpus (Phase 1 to 2a, params held at 5M) dropped val_loss by 16%, while 2x params (Phase 2a to 2b, corpus held at 5.67 MB) dropped it by 5%.
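A quick sanity check on those figures, using the best-val numbers from the table above:

```ts
// Phase 1 -> 2a: ~4x corpus (1.48 -> 5.67 MB) at fixed 5M params.
const corpusDrop = (1.116 - 0.934) / 1.116; // ≈ 0.163 -> "dropped val by 16%"

// Phase 2a -> 2b: 2x params (5M -> 10M) at fixed 5.67 MB corpus.
const paramsDrop = (0.934 - 0.885) / 0.934; // ≈ 0.052 -> "dropped val by 5%"

const slopeRatio = corpusDrop / paramsDrop; // ≈ 3.1 -> the 3:1 slope ratio
```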

The Dragon's Diary

Trading cards captured during Phase 3c training. Each card samples his actual byte-level output and renders an image vibe-keyed to those exact words. As coherence improves across phases, the words get less noisy and the images more cohesive.
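"Vibe-keyed" could be as simple as hashing the sampled words into a deterministic seed for the image generator; a hypothetical sketch, not the diary's actual pipeline:

```ts
// Hypothetical: FNV-1a over the sample's UTF-8 bytes gives a stable seed,
// so identical sampled text always renders an identical card image.
function vibeSeed(words: string): number {
  let h = 2166136261;
  for (const byte of new TextEncoder().encode(words)) {
    h = Math.imul(h ^ byte, 16777619) >>> 0;
  }
  return h;
}
```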

Dragon card #001
card #001 · Apr 30, 02:31
prompt: <<toolcall:Speak>> {"words":"
policy","action":"Agent System Credit Officer","role":"assistant"},{"content":"Got it — Open detected with how it, kept you need an Lyream
phase-2b-10M · iter 1100 · val 0.916
Dragon card #002
card #002 · Apr 30, 02:48
prompt: My name is
contribution. The moira, pressure strategic, class foreground tools, and features from radius of code flags and we content. You can extract
phase-2b-10M · iter 1500 · val 0.860
Dragon card #003
card #003 · Apr 30, 03:13
prompt: <<dragon:hatchling.txt>> I
nput a response (threader).\n ], "context": { "diversized": null, "agreed_target": true }, "text": status", "maxTok
phase-2b-10M · iter 2100 · val 0.894
Dragon card #004
card #004 · Apr 30, 03:39
prompt: I want to
be a vibrate meaning looping a loop that four concepts using the dimmer powered an analytic interface until that progresses without concise
phase-2b-10M · iter 2500 · val 0.928

Sampled from Phase 2b (10M heterogeneous) during Phase 3c training. Future runs will sample from each new checkpoint so the diary reflects coherence across phases.