Motivation
8gent normally routes requests to models with static weights, so the models never learn from your sessions. The kernel fine-tuning pipeline closes this loop: every coding session becomes training data, and GRPO (Group Relative Policy Optimization) continuously evolves a LoRA adapter on top of your base model. The model gets better at your workflows over time.
Architecture
```
+-------------+      +------------------+      +--------------+
|  8gent TUI  |----->|  Training Proxy  |----->|    Ollama    |
|  (Bun/Ink)  |<-----|      :30000      |<-----|    :11434    |
+-------------+      +--------+---------+      +--------------+
                              |
                     +--------v---------+
                     | Judge LLM (PRM)  | <-- scores responses
                     | gemini-2.5-flash |     asynchronously
                     +--------+---------+
                              |
                     +--------v---------+
                     |   GRPO Trainer   | <-- LoRA fine-tuning
                     |  (MinT backend)  |     during idle/sleep
                     +--------+---------+
                              |
                     +--------v---------+
                     |  Hot-swap LoRA   | <-- adapter merged
                     |  back to Ollama  |     without restart
                     +------------------+
```
Three-Layer Model Architecture
8gent models stack three layers at inference time:
| Layer | What | Source | Notes |
|---|---|---|---|
| Layer 1: Base Model | Upstream weights (e.g., qwen3:14b) | Ollama registry | Never modified locally |
| Layer 2: Eight LoRA | Centralized fine-tune from autoresearch benchmarks | Shipped with each Eight release | Validated by the Gemini Flash judge |
| Layer 3: Personal LoRA | User's local fine-tune on their coding patterns (Preview, Q2 2026) | Kernel pipeline | Stored at ~/.8gent/personal-lora/ |
When a new Eight version releases (Layer 2 update), users are prompted to retrain their Personal LoRA (Layer 3) so it aligns with the updated adapter weights.
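To make the layering concrete, here is a minimal TypeScript sketch of how the stack could be represented at routing time. The ModelStack type and resolveModelStack helper are illustrative assumptions, not the actual @8gent/kernel API:

```typescript
// Illustrative sketch only; these names are assumptions, not the real API.
interface ModelStack {
  base: string;          // Layer 1: upstream weights, never modified
  eightLora: string;     // Layer 2: shipped with each Eight release
  personalLora?: string; // Layer 3: optional local fine-tune
}

function resolveModelStack(eightModel: string): ModelStack {
  return {
    base: "qwen3:14b",
    eightLora: eightModel, // e.g. "eight-1.0.42-q3:14b"
    // Layer 3 is attached only if the user has trained a personal adapter.
    personalLora: process.env.HOME
      ? `${process.env.HOME}/.8gent/personal-lora/adapter`
      : undefined,
  };
}
```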
Model Versioning
Eight models follow a strict naming convention: eight-{major.minor.patch}-q{gen}:{params}
| Segment | Meaning | Bumps when... |
|---|---|---|
| major | Base model change | Switching upstream weights (e.g., Qwen 3 to Qwen 3.5) |
| minor | Judge-validated improvement | Gemini Flash confirms a score gain on the autoresearch suite |
| patch | Nightly build | Every GRPO training batch produces a new patch |
| q{gen} | Quantization generation | The quantization method changes |
| {params} | Parameter count | The model size changes |
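As an illustration, a parser for this convention might look like the following sketch. The parseEightVersion helper is hypothetical and not part of the shipped tooling:

```typescript
// Hypothetical helper: parses names like "eight-1.0.42-q3:14b".
interface EightVersion {
  major: number;
  minor: number;
  patch: number;
  quantGen: number; // q{gen}: quantization generation
  params: string;   // {params}: parameter count, e.g. "14b"
}

function parseEightVersion(name: string): EightVersion {
  const match = name.match(/^eight-(\d+)\.(\d+)\.(\d+)-q(\d+):(\w+)$/);
  if (!match) throw new Error(`Invalid Eight model name: ${name}`);
  const [, major, minor, patch, quantGen, params] = match;
  return {
    major: Number(major),
    minor: Number(minor),
    patch: Number(patch),
    quantGen: Number(quantGen),
    params,
  };
}

parseEightVersion("eight-1.0.42-q3:14b");
// => { major: 1, minor: 0, patch: 42, quantGen: 3, params: "14b" }
```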
Promotion Flow
- Nightly training produces a new patch (e.g., eight-1.0.42-q3:14b)
- The Gemini Flash judge scores the checkpoint against the autoresearch benchmark suite
- If the checkpoint outperforms the current release, version-manager.ts promotes it to a new minor version (e.g., eight-1.1-q3:14b)
- If it regresses, the checkpoint is rolled back automatically
The version-manager.ts module in packages/eight/ manages this lifecycle. The Gemini Flash judge (google/gemini-2.5-flash:free via OpenRouter) provides zero-cost semantic evaluation.
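A sketch of that promotion gate, using assumed names and a plain score comparison that stand in for the real version-manager.ts internals:

```typescript
// Sketch of the promotion decision; names are assumptions,
// not the actual version-manager.ts implementation.
interface CheckpointResult {
  name: string;       // e.g. "eight-1.0.42-q3:14b"
  judgeScore: number; // autoresearch suite score from the Gemini Flash judge
}

function decidePromotion(
  candidate: CheckpointResult,
  currentRelease: CheckpointResult,
): "promote-minor" | "rollback" | "keep-as-patch" {
  if (candidate.judgeScore > currentRelease.judgeScore) {
    return "promote-minor"; // judge-validated improvement bumps minor
  }
  if (candidate.judgeScore < currentRelease.judgeScore) {
    return "rollback"; // regression: checkpoint is rolled back automatically
  }
  return "keep-as-patch"; // no measurable change: stays a nightly patch
}
```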
The Four Phases
The @8gent/kernel package implements the full pipeline in four phases.
Phase 1: Proxy Management
File: packages/kernel/proxy.ts
Manages the training proxy process that sits between 8gent and Ollama. The proxy intercepts requests to collect conversation traces for training.
- Start/stop training proxy process
- Health checks with configurable timeout
- Latency overhead monitoring (direct vs proxied requests)
- Configurable latency threshold with alerting
```typescript
const proxy = new TrainingProxy(config);
await proxy.start();
const acceptable = await proxy.isLatencyAcceptable(); // compare direct vs proxied
```
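For illustration, the direct-vs-proxied comparison could be measured along these lines. This sketch is an assumption about the approach, not the actual proxy.ts implementation; it uses Ollama's /api/version endpoint as a cheap probe:

```typescript
// Illustrative latency probe; not the actual proxy.ts implementation.
async function measureLatencyMs(baseUrl: string): Promise<number> {
  const start = performance.now();
  await fetch(`${baseUrl}/api/version`); // cheap Ollama endpoint
  return performance.now() - start;
}

const direct = await measureLatencyMs("http://localhost:11434");  // Ollama
const proxied = await measureLatencyMs("http://localhost:30000"); // proxy
const overheadMs = proxied - direct; // alert if this exceeds the threshold
console.log(`proxy overhead: ${overheadMs.toFixed(1)} ms`);
```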
Phase 2: Judge Scoring
File: packages/kernel/judge.ts
Scores every agent response using a Process Reward Model (PRM): Gemini Flash, accessed through OpenRouter. The judge evaluates four criteria:
| Criterion | Weight | What it measures |
|---|---|---|
| Execution success | 40% | Did the code work? |
| Code quality | 20% | Clean, readable, idiomatic? |
| Tool efficiency | 20% | Minimal tool calls, no wasted reads? |
| Directness | 20% | Did the agent get to the point? |
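For illustration, the weighted composite could be computed as below. The field names and the 0-to-1 score scale are assumptions, not the actual judge.ts schema; only the weights come from the table above:

```typescript
// Assumed rubric shape; the real judge.ts schema may differ.
interface RubricScores {
  executionSuccess: number; // 0..1: did the code work?
  codeQuality: number;      // 0..1: clean, readable, idiomatic?
  toolEfficiency: number;   // 0..1: minimal tool calls, no wasted reads?
  directness: number;       // 0..1: did the agent get to the point?
}

function compositeScore(s: RubricScores): number {
  return (
    0.4 * s.executionSuccess +
    0.2 * s.codeQuality +
    0.2 * s.toolEfficiency +
    0.2 * s.directness
  );
}
```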
```typescript
const scorer = new JudgeScorer(config);
const score = await scorer.score(sessionId, turn, model, prompt, response);
const trend = scorer.getScoreTrend(7); // 7-day rolling window
```
Score history is persisted to .8gent/kernel/score-history.json.
Phase 3: Training Orchestration
File: packages/kernel/training.ts
Collects scored responses into GRPO training batches. Trivial responses (perfect scores) and very poor responses are filtered out, because the model learns most from challenging-but-achievable tasks.
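A minimal sketch of that filtering band, assuming scores normalized to 0-1 and using illustrative thresholds not taken from training.ts:

```typescript
// Illustrative thresholds; the actual band in training.ts may differ.
const MIN_TRAINABLE_SCORE = 0.2;  // below this: too poor to learn from
const MAX_TRAINABLE_SCORE = 0.95; // above this: trivial, no training signal

function isTrainable(score: number): boolean {
  return score >= MIN_TRAINABLE_SCORE && score <= MAX_TRAINABLE_SCORE;
}
```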
- Automatic training trigger when batch is full
- Checkpoint creation and lifecycle tracking
- Benchmark validation gate via the autoresearch suite
- Auto-rollback on regression
```typescript
const trainer = new TrainingOrchestrator(config);
trainer.addSample(scoreRecord); // buffers, auto-triggers when batch full
const checkpoints = trainer.getCheckpoints(); // list all with status
```
Training state is persisted to .8gent/kernel/training/state.json.
Phase 4: Production Loop
File: packages/kernel/loop.ts
Ties everything together. Handles MadMax scheduling (training only during idle/sleep windows), auto-promotion of improved checkpoints into the model router, and health monitoring.
```typescript
const loop = new ProductionLoop(config);
await loop.processTurn(sessionId, turnIndex, model, prompt, response);
const active = loop.getActiveModel(); // base or fine-tuned
const health = loop.getHealthStatus(); // improving/stable/declining
```
MadMax scheduling: Weight updates are deferred to idle periods and sleep hours (default: 23:00 to 07:00) so they never interrupt active coding sessions.
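A sketch of the sleep-window check, assuming the default 23:00-to-07:00 window. The isInSleepWindow helper is illustrative, not the loop.ts implementation:

```typescript
// Illustrative helper; loop.ts may implement this differently.
function isInSleepWindow(now: Date, startHour = 23, endHour = 7): boolean {
  const hour = now.getHours();
  // The window wraps past midnight: 23:00 -> 07:00.
  return startHour > endHour
    ? hour >= startHour || hour < endHour
    : hour >= startHour && hour < endHour;
}

if (isInSleepWindow(new Date())) {
  // safe to apply deferred weight updates
}
```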
Unified Entry Point
The KernelManager class (@8gent/kernel) provides start(), processTurn(), getHealth(), getActiveModel(), and stop() methods. It reads from .8gent/config.json.
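A usage sketch of that surface; the constructor arguments and the processTurn signature (mirrored from ProductionLoop above) are assumptions:

```typescript
import { KernelManager } from "@8gent/kernel";

// Constructor arguments and the processTurn signature are assumptions,
// mirrored from ProductionLoop.processTurn above.
const kernel = new KernelManager();
await kernel.start(); // reads .8gent/config.json
await kernel.processTurn("session-1", 0, "eight-1.0.42-q3:14b", "prompt", "response");
console.log(kernel.getActiveModel()); // base or fine-tuned
console.log(kernel.getHealth());      // improving / stable / declining
await kernel.stop();
```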
The pipeline is off by default. Enable it in .8gent/config.json:
```json
{
  "trainingProxy": {
    "enabled": true,
    "proxyUrl": "http://localhost:30000",
    "autoStart": false
  }
}
```
How to Enable
```bash
# 1. Install the training proxy
pip install -e ".[rl,evolve,scheduler]"

# 2. Point 8gent through the proxy
export TRAINING_PROXY_URL=http://localhost:30000

# 3. Start the training proxy
8gent-proxy start

# 4. Run 8gent normally - sessions now generate training signal
8gent

# 5. Validate a checkpoint against benchmarks
bun run benchmarks/autoresearch/validate-checkpoint.ts
```
Safety Rails
- Checkpoint before every LoRA swap - always rollback-able
- Benchmark gate - new weights must match or beat baseline on the autoresearch suite
- MadMax scheduling - training never happens during active sessions
- LoRA isolation - base model weights are never modified, only adapter layers
- A/B routing - the model router can split traffic between the base and fine-tuned models to measure real impact (sketched below)
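A minimal sketch of such a traffic split, using a hypothetical splitRatio parameter rather than the router's real configuration:

```typescript
// Hypothetical A/B split; the real model router's API may differ.
function pickModel(
  baseModel: string,
  fineTuned: string,
  splitRatio = 0.5, // fraction of traffic sent to the fine-tuned model
): string {
  return Math.random() < splitRatio ? fineTuned : baseModel;
}

// Route 20% of traffic to the fine-tuned model while measuring impact.
const model = pickModel("qwen3:14b", "eight-1.0.42-q3:14b", 0.2);
```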
Configuration
The training configuration lives at config/training.yaml. Key settings: MadMax scheduling mode, Gemini Flash judge via OpenRouter, MinT backend (local, no cloud dependency), LoRA rank 32.
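A hedged example of what that file could look like. The key names below are invented for illustration; only the settings themselves (MadMax mode, the Gemini Flash judge via OpenRouter, the MinT backend, LoRA rank 32) come from the text above:

```yaml
# Illustrative layout only; the actual key names in config/training.yaml
# may differ.
scheduling:
  mode: madmax              # train only during idle/sleep windows
  sleep_window: "23:00-07:00"
judge:
  provider: openrouter
  model: google/gemini-2.5-flash:free
backend:
  name: mint                # local MinT backend, no cloud dependency
lora:
  rank: 32
```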