Baby Dragon Hatchling

The lizard brain we are growing for 8gent. Phase 2b. Best val_loss 0.885.

Baby Dragon Hatchling is an open-weight language model, trained from scratch on an M2 Max, paper-faithful to Pathway's BDH architecture (arXiv:2509.26507). Apache 2.0, byte-level vocab, no closed-weight teachers. Four training phases on a curated corpus of sessions, code, docs, and prose - each phase measured against a held-out validation set. Phase 2b closed at a best val_loss of 0.885.
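"Byte-level vocab" means the tokenizer is trivial: the vocabulary is the 256 possible byte values, with no learned merges. A minimal sketch of encoding under that assumption (illustrative, not the training code):

```ts
// Illustrative byte-level "tokenization": token ids are just UTF-8 bytes,
// so the vocab size is fixed at 256 and there is no merge table to train.
const VOCAB_SIZE = 256;

function encode(text: string): number[] {
  return Array.from(new TextEncoder().encode(text)); // each byte -> id in 0..255
}

function decode(ids: number[]): string {
  return new TextDecoder().decode(Uint8Array.from(ids));
}

// encode("dragon") -> [100, 114, 97, 103, 111, 110]
```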

Role in the harness

Today: he is a research artifact. He produces structurally valid session JSON, conditions on a <<source:path>>-style switch as a soft mode signal, and has measurably stopped memorising. He still cannot route. He still cannot speak coherent English. Both are next on the bench.
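The switch is just a tag prefixed to the input; the model conditions on it but is not hard-constrained by it. A hypothetical helper to show the shape (the tag grammar is the model's own; this function is not from the codebase):

```ts
// Hypothetical: prefix a prompt with a <<source:path>> mode tag. The tag
// is a soft signal the model was trained to condition on, not a command.
function withSourceTag(path: string, prompt: string): string {
  return `<<source:${path}>> ${prompt}`;
}

// withSourceTag("dragon/hatchling.txt", "I want to")
// -> "<<source:dragon/hatchling.txt>> I want to"
```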

Intended role: 8gent routes tasks across a large surface area - research, code, communication, memory, multi-step plans. The training goal is not general intelligence but sharper routing: which tool, which persona, which plan structure fits a given input. Once an eval harness scores his decisions against a labelled gold set, BDH starts driving routing in shadow mode behind the production stack via the Throne integration (PRD W0-W3).

The path is: artifact today, shadow-mode tomorrow, default-on routing when his evals beat the heuristic baseline by a measurable margin. Never on by default until the gates pass.
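Shadow mode, concretely: the heuristic router keeps making every production decision, while BDH's pick is logged alongside it for offline scoring. A hedged sketch of that shape; every name below is hypothetical, not the Throne integration's actual API:

```ts
type Route = { tool: string; persona: string; plan: string };

declare function routeHeuristic(input: string): Route;    // production baseline
declare function routeBDH(input: string): Promise<Route>; // BDH inference
declare function logShadow(entry: object): void;          // eval log sink

async function route(input: string): Promise<Route> {
  const live = routeHeuristic(input); // the decision users actually get

  // Shadow path: record BDH's choice next to the live one so the eval
  // harness can score agreement against the gold set later.
  routeBDH(input)
    .then((shadow) => logShadow({ input, live, shadow }))
    .catch(() => {}); // a shadow failure must never reach the user

  return live;
}
```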

The hatchling framing is intentional. This is an early-stage model, still learning, still accumulating training examples and eval data. The dart-throw animation on this page maps throw accuracy to val_loss: lower loss means tighter clusters around the bullseye. Once the eval harness ships, real scores flow through and the dragon earns his aim in real time.
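One way the loss-to-accuracy mapping could work (an assumed, illustrative mapping, not the page's actual animation code): normalise val_loss over the observed phase range and use it as a scatter radius around the bullseye.

```ts
// Assumed mapping: normalise val_loss over the observed range
// (~0.7 best phase, ~1.1 worst) and scale to a scatter radius in pixels.
function throwSpread(valLoss: number, maxRadius = 100): number {
  const lo = 0.7, hi = 1.1;
  const t = Math.min(1, Math.max(0, (valLoss - lo) / (hi - lo)));
  return t * maxRadius; // lower loss -> smaller radius -> tighter cluster
}

function throwDart(valLoss: number): { x: number; y: number } {
  const r = throwSpread(valLoss) * Math.random();
  const angle = Math.random() * 2 * Math.PI;
  return { x: r * Math.cos(angle), y: r * Math.sin(angle) };
}
```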

Training history

Four runs across two variables (corpus size and parameter count), with the later Phase 3c run included in the numbers below. Source: PHASE-2-SYNTHESIS.md (PR 2016).

Loss curves

| Phase | Name | Params | Corpus | Best val_loss |
|-------|------|--------|--------|---------------|
| Phase 0 | heartbeat | 5M | 0.91 MB | 0.080 |
| Phase 1 | explore | 5M | 1.48 MB | 1.116 |
| Phase 2a | scale | 5M | 5.67 MB | 0.934 |
| Phase 2b | capacity | 10M | 5.67 MB | 0.885 |
| Phase 3c | tool-calls | 5M | 130.7 MB | 0.715 |

Best validation loss per phase (byte-level CE). Lower is better.
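For reference, byte-level CE is next-byte cross-entropy averaged over the held-out set (whether the logs are in nats or bits is an assumption here; the definition below uses natural log):

$$\mathcal{L} = -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta\!\left(b_t \mid b_{<t}\right)$$

where $b_t$ is the $t$-th byte of the validation set and $p_\theta$ is the model's next-byte distribution.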

Memorisation regression

| Phase | Result | Note |
|-------|--------|------|
| Phase 0 | FAIL | anomalous - rule-based corpus has near-zero entropy |
| Phase 1 | FAIL | Phase 0 carryover dominated; verbatim regurgitation persisted |
| Phase 2a | PASS | carryover dropped; out-of-distribution prompts produce new content |
| Phase 2b | PASS | produces TypeScript-style interface declarations on out-of-distribution prompts |
| Phase 3c | PASS | corpus shape change beats both heterogeneous baselines on byte loss |

Memorisation regression: does the model regurgitate Phase 0 corpus strings on out-of-distribution prompts?
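One plausible way to implement that check (assumed; not the actual harness): slide a window over each out-of-distribution sample and flag any long verbatim substring that also appears in the Phase 0 corpus.

```ts
// Assumed check: any n-byte window of a sample that appears verbatim in
// the Phase 0 corpus counts as regurgitation. n trades sensitivity
// against false alarms on common boilerplate.
function regurgitates(sample: string, phase0Corpus: string, n = 50): boolean {
  for (let i = 0; i + n <= sample.length; i++) {
    if (phase0Corpus.includes(sample.slice(i, i + n))) return true;
  }
  return false;
}

// A phase PASSes when no out-of-distribution sample trips this check.
```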

Phase 2b corpus mix

Phase 2b, 5.67 MB total:
  • code: 67% (575 files)
  • sessions: 14% (119 files)
  • docs: 8% (69 files)
  • world: 7% (40 files)
  • blog: 4% (14 files)

Slope ratio roughly 3:1 in favour of corpus over capacity: 4x corpus (Phase 1 to 2a, params held at 5M) dropped val_loss by 16%, while 2x params (Phase 2a to 2b, corpus held at 5.67 MB) dropped it by 5%.
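A quick sanity check on those figures, using the best-val numbers from the table above:

```ts
// Phase 1 -> 2a: ~4x corpus (1.48 -> 5.67 MB) at fixed 5M params.
const corpusDrop = (1.116 - 0.934) / 1.116; // ≈ 0.163 -> "dropped val by 16%"

// Phase 2a -> 2b: 2x params (5M -> 10M) at fixed 5.67 MB corpus.
const paramsDrop = (0.934 - 0.885) / 0.934; // ≈ 0.052 -> "dropped val by 5%"

const slopeRatio = corpusDrop / paramsDrop; // ≈ 3.1 -> the 3:1 slope ratio
```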

The Dragon's Diary

Trading cards captured during Phase 3c training. Each card samples his actual byte-level output and renders an image vibe-keyed to those exact words. As coherence improves across phases, the words get less noisy and the images more cohesive.
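"Vibe-keyed" could be as simple as hashing the sampled words into a deterministic seed for the image generator; a hypothetical sketch, not the diary's actual pipeline:

```ts
// Hypothetical: FNV-1a over the sample's UTF-8 bytes gives a stable seed,
// so identical sampled text always renders an identical card image.
function vibeSeed(words: string): number {
  let h = 2166136261;
  for (const byte of new TextEncoder().encode(words)) {
    h = Math.imul(h ^ byte, 16777619) >>> 0;
  }
  return h;
}
```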

Dragon card #001
card #001 · Apr 30, 02:31
prompt: <<toolcall:Speak>> {"words":"
policy","action":"Agent System Credit Officer","role":"assistant"},{"content":"Got it — Open detected with how it, kept you need an Lyream
phase-2b-10M · iter 1100 · val 0.916
Dragon card #002
card #002 · Apr 30, 02:48
prompt: My name is
contribution. The moira, pressure strategic, class foreground tools, and features from radius of code flags and we content. You can extract
phase-2b-10M · iter 1500 · val 0.860
Dragon card #003
card #003 · Apr 30, 03:13
prompt: <<dragon:hatchling.txt>> I
nput a response (threader).\n ], "context": { "diversized": null, "agreed_target": true }, "text": status", "maxTok
phase-2b-10M · iter 2100 · val 0.894
Dragon card #004
card #004 · Apr 30, 03:39
prompt: I want to
be a vibrate meaning looping a loop that four concepts using the dimmer powered an analytic interface until that progresses without concise
phase-2b-10M · iter 2500 · val 0.928

Sampled from Phase 2b (10M heterogeneous) during Phase 3c training. Future runs will sample from each new checkpoint so the diary reflects coherence across phases.