Overview
8gent uses Karpathy's autoresearch methodology to iteratively improve its system prompts. The harness runs benchmarks in a loop, identifies weaknesses, generates enhanced prompt patterns, and re-runs the suite; scores improve automatically, without human intervention.
How It Works
```
+-------------------------------------------+
|             AUTORESEARCH LOOP             |
|                                           |
|  1. Run all benchmarks with 8gent         |
|  2. Compare scores to Claude baselines    |
|  3. Identify weak benchmarks              |
|  4. Generate enhanced prompt patterns     |
|  5. Append patterns to system prompt      |
|  6. Repeat from step 1                    |
|                                           |
|  Loop runs forever until interrupted      |
+-------------------------------------------+
```
The Harness
Located at benchmarks/autoresearch/harness.ts, the harness (sketched below):
- Extracts the system prompt from packages/eight/prompts/system-prompt.ts
- Sends benchmark tasks to the local Ollama model with the system prompt
- Grades responses using execution-based scoring (code runs against bun:test suites)
- Compares to Claude baselines established by running the same tasks through Claude Code
- Generates enhanced patterns for categories where 8gent underperforms
- Appends patterns to the system prompt file with deduplication (exact match + 70% word overlap)
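A minimal sketch of one iteration follows. The prompt path and the deduplication rule (exact match + 70% word overlap) come from this document; the BenchmarkResult shape and the runBenchmarks/generatePattern helpers are assumed stand-ins for the real harness internals.

```ts
// Sketch of one autoresearch iteration (assumed helpers, real prompt path).
import { readFile, appendFile } from "node:fs/promises";

const PROMPT_PATH = "packages/eight/prompts/system-prompt.ts";

interface BenchmarkResult {
  id: string;
  score: number;          // 8gent's score on this benchmark
  claudeBaseline: number; // score from running the same task through Claude Code
}

// Dedup rule: skip a pattern that matches an existing one exactly or shares
// at least 70% of its words with one.
function isDuplicate(candidate: string, existing: string[]): boolean {
  const words = new Set(candidate.toLowerCase().split(/\s+/));
  return existing.some((prev) => {
    if (prev === candidate) return true;
    const other = new Set(prev.toLowerCase().split(/\s+/));
    const overlap = [...words].filter((w) => other.has(w)).length;
    return overlap / Math.max(words.size, 1) >= 0.7;
  });
}

async function autoresearchIteration(
  runBenchmarks: (systemPrompt: string) => Promise<BenchmarkResult[]>,
  generatePattern: (weak: BenchmarkResult) => Promise<string>,
  existingPatterns: string[],
): Promise<void> {
  const systemPrompt = await readFile(PROMPT_PATH, "utf8");
  const results = await runBenchmarks(systemPrompt);

  // Only benchmarks that underperform the Claude baseline get new patterns.
  const weak = results.filter((r) => r.score < r.claudeBaseline);

  for (const benchmark of weak) {
    const pattern = await generatePattern(benchmark);
    if (!isDuplicate(pattern, existingPatterns)) {
      await appendFile(PROMPT_PATH, `\n${pattern}\n`);
      existingPatterns.push(pattern);
    }
  }
}
```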
Grading Methodology
All benchmarks use a two-part grading system:
| Component | Weight | Method |
|---|---|---|
| Execution | 70% | Code is compiled and run against test assertions. Score = tests passed / total tests |
| Keywords | 30% | Checks for domain-specific patterns (JWT, topological sort, NPV, etc.) |
A temperature sweep runs each benchmark at temperatures 0.3, 0.5, and 0.7; the best result is kept.
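The scoring math reduces to a weighted sum plus a best-of sweep. The sketch below illustrates it; the GradedRun shape and helper names are assumptions, not the harness's real types.

```ts
// Illustrative scoring for the 70/30 split and the temperature sweep.
interface GradedRun {
  testsPassed: number;
  testsTotal: number;
  keywordsMatched: number;
  keywordsTotal: number;
}

// 70% execution (tests passed / total) + 30% keyword coverage, scaled to 0-100.
function grade(run: GradedRun): number {
  const execution = run.testsTotal > 0 ? run.testsPassed / run.testsTotal : 0;
  const keywords = run.keywordsTotal > 0 ? run.keywordsMatched / run.keywordsTotal : 0;
  return Math.round(100 * (0.7 * execution + 0.3 * keywords));
}

// Temperature sweep: run the benchmark at each temperature and keep the best score.
async function sweepTemperatures(
  runAt: (temperature: number) => Promise<GradedRun>,
): Promise<number> {
  const scores = await Promise.all([0.3, 0.5, 0.7].map(async (t) => grade(await runAt(t))));
  return Math.max(...scores);
}
```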
Important: 8gent uses AI-based judging, never string matching. A judge model, called through the Vercel AI SDK, evaluates output semantically and handles ambiguity, synonyms, and edge cases that literal matching would miss.
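As a hedged illustration, semantic judging through the AI SDK can look like the following; the verdict schema, prompt wording, and judge() helper are assumptions, not the harness's actual code.

```ts
// Illustration only: semantic grading via the Vercel AI SDK's generateObject.
import { generateObject, type LanguageModel } from "ai";
import { z } from "zod";

const verdictSchema = z.object({
  score: z.number().min(0).max(100),
  reasoning: z.string(),
});

async function judge(model: LanguageModel, task: string, candidate: string) {
  const { object } = await generateObject({
    model,
    schema: verdictSchema,
    prompt:
      "Grade this solution semantically, not by literal string match.\n\n" +
      `Task:\n${task}\n\nSolution:\n${candidate}\n\nReturn a 0-100 score with reasoning.`,
  });
  return object; // { score, reasoning }
}
```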
Enhanced Patterns
When 8gent loses a benchmark, the harness generates category-specific enhanced patterns that are injected into the system prompt:
Bug Fixing
- Race condition patterns (mutex, lock, finally blocks)
- Memory leak patterns (cleanup, WeakMap, listener removal)
- Null reference patterns (optional chaining, nullish coalescing)
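For illustration, here is what two of these pattern categories look like in code; the Lock interface and the User/getDisplayName shapes are hypothetical.

```ts
// Race-condition pattern: release the lock in finally so it is freed even if
// the critical section throws.
interface Lock {
  acquire(): Promise<void>;
  release(): void;
}

async function withLock<T>(lock: Lock, fn: () => Promise<T>): Promise<T> {
  await lock.acquire();
  try {
    return await fn();
  } finally {
    lock.release();
  }
}

// Null-reference pattern: optional chaining plus nullish coalescing instead of
// assuming nested properties exist.
interface User {
  profile?: { displayName?: string };
}

function getDisplayName(user: User): string {
  return user.profile?.displayName?.trim() ?? "anonymous";
}
```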
File Manipulation
- Input validation patterns (typeof, instanceof, Array.isArray)
- Error message patterns (expected type, actual type, parameter name)
- Code organization patterns (validate at entry, extract helpers)
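An illustrative instance of the validation and error-message patterns above; parseRecords and its messages are hypothetical.

```ts
function parseRecords(input: unknown): string[] {
  // Validate at the entry point, naming the parameter, expected type, and actual type.
  if (!Array.isArray(input)) {
    throw new TypeError(`parseRecords: expected "input" to be an array, got ${typeof input}`);
  }
  return input.map((item, index) => {
    if (typeof item !== "string") {
      throw new TypeError(`parseRecords: expected input[${index}] to be a string, got ${typeof item}`);
    }
    return item.trim();
  });
}
```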
Feature Implementation
- LRU caching patterns (Map-based, TTL, eviction, stats)
- Complete implementation examples for complex features
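A compact sketch of the Map-based LRU pattern with TTL, eviction, and stats; the API of the actual FI001 target may differ.

```ts
class LRUCache<K, V> {
  private store = new Map<K, { value: V; expires: number }>();
  stats = { hits: 0, misses: 0, evictions: 0 };

  constructor(private maxSize: number, private ttlMs: number) {}

  get(key: K): V | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expires < Date.now()) {
      if (entry) this.store.delete(key); // drop expired entries lazily
      this.stats.misses++;
      return undefined;
    }
    // Re-insert to mark as most recently used (Map preserves insertion order).
    this.store.delete(key);
    this.store.set(key, entry);
    this.stats.hits++;
    return entry.value;
  }

  set(key: K, value: V): void {
    if (this.store.has(key)) this.store.delete(key);
    if (this.store.size >= this.maxSize) {
      // Evict the least recently used entry: the first key in insertion order.
      const oldest = this.store.keys().next().value as K;
      this.store.delete(oldest);
      this.stats.evictions++;
    }
    this.store.set(key, { value, expires: Date.now() + this.ttlMs });
  }
}
```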
Results
After 15+ iterations of autoresearch:
| Benchmark | Before | After | Improvement |
|---|---|---|---|
| BF001 Race Conditions | 50 | 100 | +50 |
| BF003 Null References | 50 | 100 | +50 |
| FM001 Validation | 50 | 100 | +50 |
| FI001 LRU Caching | 50 | 100 | +50 |
| BF002 Memory Leaks | 50 | 85 | +35 |
Key Insights
- Variance is inherent. Local LLMs have high variance between runs: the same benchmark may score 50 or 100 on consecutive iterations. Running many iterations and tracking the best single-iteration performance gives the most accurate picture of capability.
- Pattern injection works. Adding enhanced patterns to the system prompt reliably improves scores on targeted benchmarks by 15-50 points.
- Diminishing returns. After the core patterns are added, further improvement comes from model variance rather than prompt changes.
- Temperature matters. The same model scores 43 at temp=0.3 and 92 at temp=0.7 on the same benchmark.
- Knowledge vs. execution gap. Models can score 100% on keywords but 0% on execution for complex tasks: they know every pattern but cannot always produce coordinated code that compiles and runs.
- Mutations compound. BT001 went from 85 to 94 after one round of mutations. The system learns from its own failures.
Running Autoresearch
```bash
# Single pass (all benchmarks)
bun run benchmark:v2

# Autoresearch loop (iterative improvement)
CATEGORY=battle-test MAX_ITERATIONS=5 bun run benchmark:loop

# Overnight continuous runner (all categories)
bash benchmarks/autoresearch/overnight-runner.sh

# Monitor progress
tail -f benchmarks/autoresearch/run.log

# Check results
cat benchmarks/results.tsv
```
Integration with Kernel Fine-Tuning
The autoresearch benchmark suite doubles as the regression gate for RL fine-tuning checkpoints. When the @8gent/kernel package trains a new LoRA checkpoint via the training proxy, it validates the checkpoint against the benchmark suite before promotion.
```bash
bun run benchmarks/autoresearch/validate-checkpoint.ts
```
If the fine-tuned model outperforms the baseline on the benchmark suite, the checkpoint is promoted. If it regresses, it is rolled back automatically. This creates a feedback loop: autoresearch improves prompts, kernel fine-tuning improves weights, and both are validated against the same benchmark suite.
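A hedged sketch of the promotion gate; the runSuite helper and the return values are placeholders for the real validate-checkpoint.ts logic.

```ts
async function validateCheckpoint(
  runSuite: (model: string) => Promise<number>, // average benchmark score for a model
  checkpointModel: string,
  baselineModel: string,
): Promise<"promote" | "rollback"> {
  const checkpointScore = await runSuite(checkpointModel);
  const baselineScore = await runSuite(baselineModel);
  // Promote only if the fine-tuned checkpoint beats the baseline on the same suite.
  return checkpointScore > baselineScore ? "promote" : "rollback";
}
```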