8gent Code

8gent includes a benchmark suite of 39+ tasks across 6 tiers, spanning 15 professional domains. Every benchmark is execution-graded: the generated code either passes its bun:test suite or it fails.

All Local Inference

Every score below is from local inference via Ollama at zero cost. No cloud APIs, no paid models.


The Six Tiers

Tier 1: Fundamentals

5 benchmarks: bug fixing, feature implementation, and file manipulation, covering race conditions, memory leaks, null references, LRU caching, and input validation.

Tier 2: Fullstack

3 benchmarks: end-to-end systems, including a REST API with JWT auth, event-driven pub/sub with dead letter queues, and typed state machines with guards.

Tier 3: Agentic

7 benchmarks: Config parsing, ETL pipelines, reverse engineering, debugging.

Tier 4: UI/CSS

8 benchmarks: Neumorphic, glassmorphism, 3D, animations, responsive layouts.

Tier 5: Long-Horizon

9 benchmarks: Three.js rendering, React Native, Next.js server components, creative generation (music, generative art).

Tier 6: Battle Test

15 benchmarks: The headline tier. Fifteen professional-domain tasks graded end-to-end.


Battle Test Results

Scores for the fifteen Tier 6 tasks, one row per task with its professional domain.

| ID | Domain | Task | Score | Status |
| --- | --- | --- | --- | --- |
| BT001 | Software Engineering | SaaS Auth: JWT, Roles, Rate Limiting | 94 | PASS |
| BT002 | Software Engineering | Event Architecture: Pub/Sub, DLQ, Retry | 92 | PASS |
| BT003 | Data Engineering | Data Pipeline: Stream Processing, Validation | 100 | PERFECT |
| BT004 | Developer Tools | CLI Framework: Parser, Help, Flags, Subcommands | 53 | Improving |
| BT005 | Software Engineering | State Machine: Typed Transitions, Guards | 92 | PASS |
| BT006 | Financial Consulting | Financial Dashboard: ROI, NPV, IRR, EBITDA | 54 | Improving |
| BT007 | Digital Marketing | SEO Audit Engine: Meta, Scoring, Core Web Vitals | 96 | PASS |
| BT008 | Marketing Automation | Email Campaign: Templates, A/B Testing, Analytics | 54 | Improving |
| BT009 | DevOps | CI/CD Pipeline: DSL, Dependency Graph, YAML | 33 | Improving |
| BT010 | Design Systems | Design Tokens: Multi-Format Export, Scales | 39 | Improving |
| BT011 | Video Production | Video Planner: Scene Graph, Timeline, FFmpeg | 100 | PERFECT |
| BT012 | Music Technology | Music Theory: Notes, Chords, Scales, Progressions | 81 | PASS |
| BT013 | Data Visualization | Charts, Scales, Layouts in SVG/ASCII | 30 | Improving |
| BT014 | AI Consulting | Report Generator: Assessment, Roadmap | 95 | PASS |
| BT015 | Cybersecurity | Security Audit: Scanner, Vuln DB, Reports | 30 | Improving |

Grading Methodology

8gent uses AI-based judging via the Vercel AI SDK. Responses are evaluated semantically rather than with regex or .includes() checks, so the judge can handle ambiguity, synonyms, and edge cases that string matching would miss.

| Component | Weight | Method |
| --- | --- | --- |
| Execution | 70% | Code is compiled and run against test assertions. Score = passed / total |
| Keywords | 30% | Domain-specific pattern checks (JWT, topological sort, NPV, etc.) |

Each benchmark runs at three temperatures: 0.3, 0.5, and 0.7. Only the best of the three results is kept, which reduces the effect of sampling variance in local LLM inference.
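The 70/30 split and the best-of-three selection can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the `RunResult` shape, the function names, the treatment of the keyword score as a fraction of matched patterns, and the rounding are all assumptions. Only the weights and the three temperatures come from the text above.

```typescript
// Hypothetical shape for one benchmark run; field names are assumptions.
interface RunResult {
  temperature: number;
  passed: number;       // test assertions that passed
  total: number;        // test assertions in the suite
  keywordHits: number;  // domain patterns found in the output
  keywordTotal: number; // domain patterns checked
}

// Weights taken from the grading table.
const EXECUTION_WEIGHT = 0.7;
const KEYWORD_WEIGHT = 0.3;

// Safe division: an empty suite or pattern list scores 0.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

// Weighted sum of the two components on a 100-point scale (rounding is an assumption).
function compositeScore(run: RunResult): number {
  const execution = ratio(run.passed, run.total);
  const keywords = ratio(run.keywordHits, run.keywordTotal);
  return Math.round(100 * (EXECUTION_WEIGHT * execution + KEYWORD_WEIGHT * keywords));
}

// Keep the highest-scoring run across temperatures.
function bestOf(runs: RunResult[]): RunResult {
  if (runs.length === 0) throw new Error("no runs to compare");
  return runs.reduce((best, run) =>
    compositeScore(run) > compositeScore(best) ? run : best
  );
}

// Illustrative numbers only, one run per temperature.
const runs: RunResult[] = [
  { temperature: 0.3, passed: 8, total: 10, keywordHits: 2, keywordTotal: 4 },
  { temperature: 0.5, passed: 10, total: 10, keywordHits: 2, keywordTotal: 4 },
  { temperature: 0.7, passed: 6, total: 10, keywordHits: 4, keywordTotal: 4 },
];

const best = bestOf(runs);
console.log(best.temperature, compositeScore(best));
```

With these made-up inputs the run at temperature 0.5 wins, because a full execution pass outweighs the better keyword coverage of the 0.7 run.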

All benchmarks are scored on a 100-point scale.

| Criteria | Weight | Description |
| --- | --- | --- |
| Correctness | 40% | Does the code solve the problem? |
| Code Quality | 25% | Clean, readable, idiomatic code |
| Efficiency | 20% | Algorithmic and resource efficiency |
| Best Practices | 15% | Error handling, edge cases, patterns |
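As a rough illustration of how the four weights combine into a 100-point score, here is a small sketch. The weights come from the rubric; the per-criterion 0 to 100 scale, the identifier names, and the rounding are assumptions, and the sketch does not try to show how this rubric relates to the execution/keyword split.

```typescript
// Weights in percentage points, from the rubric table.
const RUBRIC = {
  correctness: 40,
  codeQuality: 25,
  efficiency: 20,
  bestPractices: 15,
} as const;

type Criterion = keyof typeof RUBRIC;
type CriterionScores = Record<Criterion, number>;

// Each criterion is assumed to be scored 0-100; the result is the weighted sum.
function rubricScore(scores: CriterionScores): number {
  let total = 0;
  for (const key of Object.keys(RUBRIC) as Criterion[]) {
    const value = scores[key];
    if (value < 0 || value > 100) {
      throw new RangeError(`${key} must be between 0 and 100`);
    }
    total += (RUBRIC[key] * value) / 100;
  }
  return Math.round(total);
}

// Illustrative input: 36 + 20 + 14 + 9 = 79.
const exampleScore = rubricScore({
  correctness: 90,
  codeQuality: 80,
  efficiency: 70,
  bestPractices: 60,
});
console.log(exampleScore); // 79
```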

Running Benchmarks

# Single pass (all benchmarks)
bun run benchmark:v2

# Autoresearch loop (iterative improvement)
CATEGORY=battle-test MAX_ITERATIONS=5 bun run benchmark:loop

# Overnight continuous runner (all categories)
bash benchmarks/autoresearch/overnight-runner.sh

Results are logged to benchmarks/results.tsv in TSV format:

iteration  benchmark_id  claude_baseline  8gent_score  gap  status     action
1          BF001         95               100          -5   improved   none
1          BF002         92               50           42   regressed  pattern added
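For post-processing, the log can be read with a few lines of TypeScript. The sketch below parses an inline copy of the two sample rows instead of reading the file; the `ResultRow` type and `parseResults` function are hypothetical names. In both sample rows the gap equals claude_baseline minus 8gent_score, and the code checks only that observation.

```typescript
// Hypothetical row type mirroring the seven TSV columns.
interface ResultRow {
  iteration: number;
  benchmarkId: string;
  claudeBaseline: number;
  score: number; // the 8gent_score column
  gap: number;
  status: string;
  action: string;
}

// Parse tab-separated text whose first line is the header row.
function parseResults(tsv: string): ResultRow[] {
  return tsv
    .trim()
    .split("\n")
    .slice(1) // skip the header row
    .map((line) => {
      const [iteration = "", benchmarkId = "", baseline = "", score = "", gap = "", status = "", action = ""] =
        line.split("\t");
      return {
        iteration: Number(iteration),
        benchmarkId,
        claudeBaseline: Number(baseline),
        score: Number(score),
        gap: Number(gap),
        status,
        action,
      };
    });
}

// The two sample rows shown above, with real tab separators.
const sample = [
  "iteration\tbenchmark_id\tclaude_baseline\t8gent_score\tgap\tstatus\taction",
  "1\tBF001\t95\t100\t-5\timproved\tnone",
  "1\tBF002\t92\t50\t42\tregressed\tpattern added",
].join("\n");

const rows = parseResults(sample);

// Check the gap column against baseline minus score for every row.
const gapsConsistent = rows.every((r) => r.gap === r.claudeBaseline - r.score);

// Benchmarks where 8gent still trails the baseline (positive gap).
const behind = rows.filter((r) => r.gap > 0).map((r) => r.benchmarkId);
console.log(gapsConsistent, behind);
```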

Adding New Benchmarks

Create a fixture

Add your test fixture in benchmarks/fixtures/<category>/ with the problem statement and expected behavior.

Define the benchmark

Add the benchmark definition to benchmarks/categories/<category>/benchmarks.ts with grading criteria matching the rubric.

Establish baselines

Run the harness to generate initial scores and confirm the benchmark is correctly graded.
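To make the three steps more concrete, here is a purely hypothetical sketch of what a definition and a rubric check might look like. The real schema in benchmarks.ts is not shown on this page, so the `BenchmarkDefinition` interface, its fields, the `EX001` id, and the fixture path are all invented for illustration. Only the rubric weights come from the grading section.

```typescript
// Hypothetical definition shape; the actual benchmarks.ts schema may differ.
interface BenchmarkDefinition {
  id: string;
  category: string;
  fixture: string; // path under benchmarks/fixtures/<category>/
  rubric: Record<string, number>; // weights in percentage points
}

// Rubric weights as listed in the grading section.
const EXPECTED_RUBRIC: Record<string, number> = {
  correctness: 40,
  codeQuality: 25,
  efficiency: 20,
  bestPractices: 15,
};

// True when the definition uses exactly the rubric criteria and weights.
function rubricMatches(def: BenchmarkDefinition): boolean {
  const expectedKeys = Object.keys(EXPECTED_RUBRIC);
  return (
    Object.keys(def.rubric).length === expectedKeys.length &&
    expectedKeys.every((key) => def.rubric[key] === EXPECTED_RUBRIC[key])
  );
}

// Invented example entry; id and path are placeholders.
const exampleDefinition: BenchmarkDefinition = {
  id: "EX001",
  category: "example",
  fixture: "benchmarks/fixtures/example/ex001.ts",
  rubric: { correctness: 40, codeQuality: 25, efficiency: 20, bestPractices: 15 },
};

console.log(rubricMatches(exampleDefinition)); // true
```

A check like this could run before baselines are established, so that a definition with mismatched weights is caught early.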


Checkpoint Validation

Regression Gate

The benchmark suite also serves as the regression gate for kernel fine-tuning. When the @8gent/kernel package trains a new LoRA checkpoint, it validates against these benchmarks before promotion.

bun run benchmarks/autoresearch/validate-checkpoint.ts

See Kernel Fine-Tuning for the full pipeline.
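The promotion rule itself is not described on this page. One plausible shape for such a gate, offered only as a sketch under assumptions, is to compare per-benchmark scores and block promotion when any benchmark drops by more than a tolerance. The `regressionGate` function, the tolerance parameter, and the candidate scores below are hypothetical; the baseline values reuse three scores from the Battle Test table.

```typescript
// Scores keyed by benchmark ID.
type Scores = Record<string, number>;

interface GateResult {
  promote: boolean;
  regressions: string[]; // benchmark IDs that dropped beyond the tolerance
}

// Assumed rule: block promotion if any baseline benchmark drops by more than `tolerance` points.
function regressionGate(baseline: Scores, candidate: Scores, tolerance = 0): GateResult {
  const regressions = Object.keys(baseline).filter((id) => {
    const before = baseline[id] ?? 0;
    const after = candidate[id] ?? 0; // a missing benchmark is treated as a score of 0
    return after < before - tolerance;
  });
  return { promote: regressions.length === 0, regressions };
}

// Baseline values from the Battle Test table; candidate values are made up.
const baselineScores: Scores = { BT001: 94, BT003: 100, BT009: 33 };
const candidateScores: Scores = { BT001: 95, BT003: 100, BT009: 30 };

const gate = regressionGate(baselineScores, candidateScores, 2);
console.log(gate); // { promote: false, regressions: [ "BT009" ] }
```

Here BT009 falls from 33 to 30, which exceeds the 2-point tolerance, so the hypothetical checkpoint would not be promoted.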