Overview
8gent uses Karpathy's autoresearch methodology to iteratively improve its system prompts. The harness runs benchmarks in a loop, identifies weaknesses, generates enhanced prompt patterns, and re-runs the suite; scores improve automatically, without human intervention.
How It Works
```
+-------------------------------------------+
|             AUTORESEARCH LOOP             |
|                                           |
|  1. Run all benchmarks with 8gent         |
|  2. Compare scores to Claude baselines    |
|  3. Identify weak benchmarks              |
|  4. Generate enhanced prompt patterns     |
|  5. Append patterns to system prompt      |
|  6. Repeat from step 1                    |
|                                           |
|  Loop runs forever until interrupted      |
+-------------------------------------------+
```
The Harness
Located at benchmarks/autoresearch/harness.ts, the harness (sketched below):
- Extracts the system prompt from packages/eight/prompts/system-prompt.ts
- Sends benchmark tasks to the local Ollama model with the system prompt
- Grades responses using execution-based scoring (code runs against bun:test suites)
- Compares to Claude baselines established by running the same tasks through Claude Code
- Generates enhanced patterns for categories where 8gent underperforms
- Appends patterns to the system prompt file with deduplication (exact match + 70% word overlap)
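A minimal sketch of one iteration follows. The prompt path and the deduplication rule (exact match + 70% word overlap) come from this document; the BenchmarkResult shape and the runBenchmarks/generatePattern helpers are assumed stand-ins for the real harness internals.

```ts
// Sketch of one autoresearch iteration (assumed helpers, real prompt path).
import { readFile, appendFile } from "node:fs/promises";

const PROMPT_PATH = "packages/eight/prompts/system-prompt.ts";

interface BenchmarkResult {
  id: string;
  score: number;          // 8gent's score on this benchmark
  claudeBaseline: number; // score from running the same task through Claude Code
}

// Dedup rule: skip a pattern that matches an existing one exactly or shares
// at least 70% of its words with one.
function isDuplicate(candidate: string, existing: string[]): boolean {
  const words = new Set(candidate.toLowerCase().split(/\s+/));
  return existing.some((prev) => {
    if (prev === candidate) return true;
    const other = new Set(prev.toLowerCase().split(/\s+/));
    const overlap = [...words].filter((w) => other.has(w)).length;
    return overlap / Math.max(words.size, 1) >= 0.7;
  });
}

async function autoresearchIteration(
  runBenchmarks: (systemPrompt: string) => Promise<BenchmarkResult[]>,
  generatePattern: (weak: BenchmarkResult) => Promise<string>,
  existingPatterns: string[],
): Promise<void> {
  const systemPrompt = await readFile(PROMPT_PATH, "utf8");
  const results = await runBenchmarks(systemPrompt);

  // Only benchmarks that underperform the Claude baseline get new patterns.
  const weak = results.filter((r) => r.score < r.claudeBaseline);

  for (const benchmark of weak) {
    const pattern = await generatePattern(benchmark);
    if (!isDuplicate(pattern, existingPatterns)) {
      await appendFile(PROMPT_PATH, `\n${pattern}\n`);
      existingPatterns.push(pattern);
    }
  }
}
```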
Grading Methodology
All benchmarks use a two-part grading system:
| Component | Weight | Method |
|---|---|---|
| Execution | 70% | Code is compiled and run against test assertions. Score = tests passed / total tests |
| Keywords | 30% | Checks for domain-specific patterns (JWT, topological sort, NPV, etc.) |
A temperature sweep runs each benchmark at temperatures 0.3, 0.5, and 0.7; the best result is kept.
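The scoring math reduces to a weighted sum plus a best-of sweep. The sketch below illustrates it; the GradedRun shape and helper names are assumptions, not the harness's real types.

```ts
// Illustrative scoring for the 70/30 split and the temperature sweep.
interface GradedRun {
  testsPassed: number;
  testsTotal: number;
  keywordsMatched: number;
  keywordsTotal: number;
}

// 70% execution (tests passed / total) + 30% keyword coverage, scaled to 0-100.
function grade(run: GradedRun): number {
  const execution = run.testsTotal > 0 ? run.testsPassed / run.testsTotal : 0;
  const keywords = run.keywordsTotal > 0 ? run.keywordsMatched / run.keywordsTotal : 0;
  return Math.round(100 * (0.7 * execution + 0.3 * keywords));
}

// Temperature sweep: run the benchmark at each temperature and keep the best score.
async function sweepTemperatures(
  runAt: (temperature: number) => Promise<GradedRun>,
): Promise<number> {
  const scores = await Promise.all([0.3, 0.5, 0.7].map(async (t) => grade(await runAt(t))));
  return Math.max(...scores);
}
```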
Important: 8gent uses AI-based judging, never string matching. A judge model, called through the Vercel AI SDK, evaluates output semantically and handles ambiguity, synonyms, and edge cases that literal matching would miss.
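As a hedged illustration, semantic judging through the AI SDK can look like the following; the verdict schema, prompt wording, and judge() helper are assumptions, not the harness's actual code.

```ts
// Illustration only: semantic grading via the Vercel AI SDK's generateObject.
import { generateObject, type LanguageModel } from "ai";
import { z } from "zod";

const verdictSchema = z.object({
  score: z.number().min(0).max(100),
  reasoning: z.string(),
});

async function judge(model: LanguageModel, task: string, candidate: string) {
  const { object } = await generateObject({
    model,
    schema: verdictSchema,
    prompt:
      "Grade this solution semantically, not by literal string match.\n\n" +
      `Task:\n${task}\n\nSolution:\n${candidate}\n\nReturn a 0-100 score with reasoning.`,
  });
  return object; // { score, reasoning }
}
```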
Enhanced Patterns
When 8gent loses a benchmark, the harness generates category-specific enhanced patterns that are injected into the system prompt:
Bug Fixing
- Race condition patterns (mutex, lock, finally blocks)
- Memory leak patterns (cleanup, WeakMap, listener removal)
- Null reference patterns (optional chaining, nullish coalescing)
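For illustration, here is what two of these pattern categories look like in code; the Lock interface and the User/getDisplayName shapes are hypothetical.

```ts
// Race-condition pattern: release the lock in finally so it is freed even if
// the critical section throws.
interface Lock {
  acquire(): Promise<void>;
  release(): void;
}

async function withLock<T>(lock: Lock, fn: () => Promise<T>): Promise<T> {
  await lock.acquire();
  try {
    return await fn();
  } finally {
    lock.release();
  }
}

// Null-reference pattern: optional chaining plus nullish coalescing instead of
// assuming nested properties exist.
interface User {
  profile?: { displayName?: string };
}

function getDisplayName(user: User): string {
  return user.profile?.displayName?.trim() ?? "anonymous";
}
```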
File Manipulation
- Input validation patterns (typeof, instanceof, Array.isArray)
- Error message patterns (expected type, actual type, parameter name)
- Code organization patterns (validate at entry, extract helpers)
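An illustrative instance of the validation and error-message patterns above; parseRecords and its messages are hypothetical.

```ts
function parseRecords(input: unknown): string[] {
  // Validate at the entry point, naming the parameter, expected type, and actual type.
  if (!Array.isArray(input)) {
    throw new TypeError(`parseRecords: expected "input" to be an array, got ${typeof input}`);
  }
  return input.map((item, index) => {
    if (typeof item !== "string") {
      throw new TypeError(`parseRecords: expected input[${index}] to be a string, got ${typeof item}`);
    }
    return item.trim();
  });
}
```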
Feature Implementation
- LRU caching patterns (Map-based, TTL, eviction, stats)
- Complete implementation examples for complex features
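A compact sketch of the Map-based LRU pattern with TTL, eviction, and stats; the API of the actual FI001 target may differ.

```ts
class LRUCache<K, V> {
  private store = new Map<K, { value: V; expires: number }>();
  stats = { hits: 0, misses: 0, evictions: 0 };

  constructor(private maxSize: number, private ttlMs: number) {}

  get(key: K): V | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expires < Date.now()) {
      if (entry) this.store.delete(key); // drop expired entries lazily
      this.stats.misses++;
      return undefined;
    }
    // Re-insert to mark as most recently used (Map preserves insertion order).
    this.store.delete(key);
    this.store.set(key, entry);
    this.stats.hits++;
    return entry.value;
  }

  set(key: K, value: V): void {
    if (this.store.has(key)) this.store.delete(key);
    if (this.store.size >= this.maxSize) {
      // Evict the least recently used entry: the first key in insertion order.
      const oldest = this.store.keys().next().value as K;
      this.store.delete(oldest);
      this.stats.evictions++;
    }
    this.store.set(key, { value, expires: Date.now() + this.ttlMs });
  }
}
```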
Results
After 15+ iterations of autoresearch:
| Benchmark | Before | After | Improvement |
|---|---|---|---|
| BF001 Race Conditions | 50 | 100 | +50 |
| BF003 Null References | 50 | 100 | +50 |
| FM001 Validation | 50 | 100 | +50 |
| FI001 LRU Caching | 50 | 100 | +50 |
| BF002 Memory Leaks | 50 | 85 | +35 |
Key Insights
- Variance is inherent. Local LLMs have high variance between runs: the same benchmark may score 50 or 100 on consecutive iterations. Running many iterations and tracking the best single-iteration performance gives the most accurate picture of capability.
- Pattern injection works. Adding enhanced patterns to the system prompt reliably improves scores on targeted benchmarks by 15-50 points.
- Diminishing returns. After the core patterns are added, further improvement comes from model variance rather than prompt changes.
- Temperature matters. The same model scores 43 at temp=0.3 and 92 at temp=0.7 on the same benchmark.
- Knowledge vs. execution gap. Models can score 100% on keywords but 0% on execution for complex tasks: they know every pattern but cannot always produce coordinated code that compiles and runs.
- Mutations compound. BT001 went from 85 to 94 after one round of mutations. The system learns from its own failures.
Running Autoresearch
```bash
# Single pass (all benchmarks)
bun run benchmark:v2

# Autoresearch loop (iterative improvement)
CATEGORY=battle-test MAX_ITERATIONS=5 bun run benchmark:loop

# Overnight continuous runner (all categories)
bash benchmarks/autoresearch/overnight-runner.sh

# Monitor progress
tail -f benchmarks/autoresearch/run.log

# Check results
cat benchmarks/results.tsv
```
Integration with Kernel Fine-Tuning
The autoresearch benchmark suite doubles as the regression gate for RL fine-tuning checkpoints. When the @8gent/kernel package trains a new LoRA checkpoint via the training proxy, it validates the checkpoint against the benchmark suite before promotion.
```bash
bun run benchmarks/autoresearch/validate-checkpoint.ts
```
If the fine-tuned model outperforms the baseline on the benchmark suite, the checkpoint is promoted. If it regresses, it is rolled back automatically. This creates a feedback loop: autoresearch improves prompts, kernel fine-tuning improves weights, and both are validated against the same benchmark suite.
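A hedged sketch of the promotion gate; the runSuite helper and the return values are placeholders for the real validate-checkpoint.ts logic.

```ts
async function validateCheckpoint(
  runSuite: (model: string) => Promise<number>, // average benchmark score for a model
  checkpointModel: string,
  baselineModel: string,
): Promise<"promote" | "rollback"> {
  const checkpointScore = await runSuite(checkpointModel);
  const baselineScore = await runSuite(baselineModel);
  // Promote only if the fine-tuned checkpoint beats the baseline on the same suite.
  return checkpointScore > baselineScore ? "promote" : "rollback";
}
```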