8gent includes a comprehensive benchmark suite with 39+ tasks across 6 tiers spanning 15 professional domains. Every benchmark is execution-graded: code runs against bun:test suites or it fails.
All Local Inference
Every score below is from local inference via Ollama at zero cost. No cloud APIs, no paid models.
The Six Tiers
Tier 1: Fundamentals
5 benchmarks: Bug fixing, feature implementation, and file manipulation, covering race conditions, memory leaks, null references, LRU caching, and input validation.
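As an illustration of the kind of task in this tier, here is a minimal LRU cache sketch (the class name and API are ours, not the benchmark's fixture), exploiting the fact that a JavaScript `Map` iterates in insertion order:

```typescript
// Minimal LRU cache: Map preserves insertion order, so the first key
// in iteration order is always the least recently used.
class LRUCache<K, V> {
  private map = new Map<K, V>();
  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key)!;
    // Re-insert to mark as most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  put(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    else if (this.map.size >= this.capacity) {
      // Evict the least recently used entry (first in iteration order).
      this.map.delete(this.map.keys().next().value!);
    }
    this.map.set(key, value);
  }
}

const cache = new LRUCache<string, number>(2);
cache.put("a", 1);
cache.put("b", 2);
cache.get("a");    // touch "a" so "b" becomes least recently used
cache.put("c", 3); // evicts "b"
console.log(cache.get("b")); // undefined
console.log(cache.get("a")); // 1
```

A grading harness would run a test suite against exactly this kind of behavior: eviction order, capacity limits, and recency updates on reads.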
Tier 2: Fullstack
3 benchmarks: End-to-end systems, including a REST API with JWT auth, event-driven pub/sub with dead-letter queues, and typed state machines with guards.
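A typed state machine with guards, as named above, can be sketched like this (the states, events, and context shape are illustrative assumptions, not the benchmark's actual fixture):

```typescript
// Transitions map (state, event) to a target state plus an optional
// guard predicate; a failing guard leaves the machine where it is.
type State = "idle" | "running" | "done";
type Event = "start" | "finish" | "reset";

interface Ctx { authorized: boolean }

interface Transition {
  from: State;
  event: Event;
  to: State;
  guard?: (ctx: Ctx) => boolean;
}

const transitions: Transition[] = [
  { from: "idle", event: "start", to: "running", guard: (c) => c.authorized },
  { from: "running", event: "finish", to: "done" },
  { from: "done", event: "reset", to: "idle" },
];

function step(state: State, event: Event, ctx: Ctx): State {
  const t = transitions.find((t) => t.from === state && t.event === event);
  if (!t || (t.guard && !t.guard(ctx))) return state; // invalid or guarded: no-op
  return t.to;
}

console.log(step("idle", "start", { authorized: false })); // "idle" (guard blocks)
console.log(step("idle", "start", { authorized: true }));  // "running"
```

The type unions make illegal states unrepresentable at compile time, which is what "typed transitions" buys over a stringly-typed switch.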
Tier 3: Agentic
7 benchmarks: Config parsing, ETL pipelines, reverse engineering, debugging.
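To give a flavor of the config-parsing tasks in this tier, here is a toy extract-and-validate stage (the format and error strategy are our assumptions, not the benchmark's fixture):

```typescript
// Parse key=value config lines, skipping blanks and comments, and
// collect errors instead of throwing on the first bad line.
interface ParseResult {
  config: Record<string, string>;
  errors: string[];
}

function parseConfig(text: string): ParseResult {
  const config: Record<string, string> = {};
  const errors: string[] = [];
  for (const [i, raw] of text.split("\n").entries()) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue; // blank or comment
    const eq = line.indexOf("=");
    if (eq <= 0) {
      errors.push(`line ${i + 1}: expected key=value, got "${line}"`);
      continue;
    }
    config[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return { config, errors };
}

const { config, errors } = parseConfig("# demo\nhost = localhost\nport=8080\nbogus");
console.log(config); // { host: "localhost", port: "8080" }
console.log(errors); // [ 'line 4: expected key=value, got "bogus"' ]
```

Execution grading rewards exactly this sort of edge-case handling: comments, whitespace, and malformed lines all need deterministic behavior.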
Tier 4: UI/CSS
8 benchmarks: Neumorphic, glassmorphism, 3D, animations, responsive layouts.
Tier 5: Long-Horizon
9 benchmarks: Three.js rendering, React Native, Next.js server components, creative generation (music, generative art).
Tier 6: Battle Test
15 benchmarks: The headline tier, with professional-domain tasks graded end-to-end.
Battle Test Results
Scores for all fifteen professional-domain tasks:
| ID | Domain | Task | Score | Status |
|---|---|---|---|---|
| BT001 | Software Engineering | SaaS Auth: JWT, Roles, Rate Limiting | 94 | PASS |
| BT002 | Software Engineering | Event Architecture: Pub/Sub, DLQ, Retry | 92 | PASS |
| BT003 | Data Engineering | Data Pipeline: Stream Processing, Validation | 100 | PERFECT |
| BT004 | Developer Tools | CLI Framework: Parser, Help, Flags, Subcommands | 53 | Improving |
| BT005 | Software Engineering | State Machine: Typed Transitions, Guards | 92 | PASS |
| BT006 | Financial Consulting | Financial Dashboard: ROI, NPV, IRR, EBITDA | 54 | Improving |
| BT007 | Digital Marketing | SEO Audit Engine: Meta, Scoring, Core Web Vitals | 96 | PASS |
| BT008 | Marketing Automation | Email Campaign: Templates, A/B Testing, Analytics | 54 | Improving |
| BT009 | DevOps | CI/CD Pipeline: DSL, Dependency Graph, YAML | 33 | Improving |
| BT010 | Design Systems | Design Tokens: Multi-Format Export, Scales | 39 | Improving |
| BT011 | Video Production | Video Planner: Scene Graph, Timeline, FFmpeg | 100 | PERFECT |
| BT012 | Music Technology | Music Theory: Notes, Chords, Scales, Progressions | 81 | PASS |
| BT013 | Data Visualization | Charts, Scales, Layouts in SVG/ASCII | 30 | Improving |
| BT014 | AI Consulting | Report Generator: Assessment, Roadmap | 95 | PASS |
| BT015 | Cybersecurity | Security Audit: Scanner, Vuln DB, Reports | 30 | Improving |
Grading Methodology
8gent uses AI-based judging via the Vercel AI SDK: responses are evaluated semantically, which handles ambiguity, synonyms, and edge cases that brittle regex or .includes() checks miss. The final score combines two weighted components:
| Component | Weight | Method |
|---|---|---|
| Execution | 70% | Code is compiled and run against test assertions. Score = passed / total |
| Keywords | 30% | Domain-specific pattern checks (JWT, topological sort, NPV, etc.) |
Each benchmark runs at three temperatures: 0.3, 0.5, and 0.7. The best result is kept. This accounts for the inherent variance in local LLM inference.
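The weighting and best-of-three selection described above can be sketched as follows (the numbers fed in are illustrative, not real benchmark data):

```typescript
// Composite score per the weights above: 70% execution pass ratio,
// 30% keyword/pattern hit ratio, scaled to 100 points.
function compositeScore(
  passed: number,
  total: number,
  hits: number,
  patterns: number,
): number {
  const execution = total === 0 ? 0 : passed / total;
  const keywords = patterns === 0 ? 0 : hits / patterns;
  return Math.round((0.7 * execution + 0.3 * keywords) * 100);
}

// One run per temperature; the best result is kept.
const runs = [0.3, 0.5, 0.7].map((temp) => ({
  temp,
  // Illustrative pass counts: the 0.5 run happens to do best here.
  score: compositeScore(temp === 0.5 ? 9 : 7, 10, 4, 5),
}));
const best = runs.reduce((a, b) => (b.score > a.score ? b : a));
console.log(best); // { temp: 0.5, score: 87 }
```

Keeping the maximum over three temperatures trades extra compute for a lower-variance estimate of what the model can do.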
All benchmarks are scored on a 100-point scale.
| Criteria | Weight | Description |
|---|---|---|
| Correctness | 40% | Does the code solve the problem? |
| Code Quality | 25% | Clean, readable, idiomatic code |
| Efficiency | 20% | Algorithmic and resource efficiency |
| Best Practices | 15% | Error handling, edge cases, patterns |
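The four-criteria rubric above reduces to a weighted sum on the 100-point scale; a minimal sketch (each criterion assumed to be scored 0 to 100 before weighting):

```typescript
// Weights mirror the rubric table: 40/25/20/15.
const weights = {
  correctness: 0.40,
  codeQuality: 0.25,
  efficiency: 0.20,
  bestPractices: 0.15,
} as const;

function rubricScore(scores: Record<keyof typeof weights, number>): number {
  return Math.round(
    (Object.keys(weights) as (keyof typeof weights)[]).reduce(
      (sum, k) => sum + weights[k] * scores[k],
      0,
    ),
  );
}

console.log(rubricScore({ correctness: 90, codeQuality: 80, efficiency: 70, bestPractices: 60 }));
// 0.4*90 + 0.25*80 + 0.2*70 + 0.15*60 = 36 + 20 + 14 + 9 = 79
```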
Running Benchmarks
```bash
# Single pass (all benchmarks)
bun run benchmark:v2

# Autoresearch loop (iterative improvement)
CATEGORY=battle-test MAX_ITERATIONS=5 bun run benchmark:loop

# Overnight continuous runner (all categories)
bash benchmarks/autoresearch/overnight-runner.sh
```

Results are logged to benchmarks/results.tsv in TSV format:
```
iteration  benchmark_id  claude_baseline  8gent_score  gap  status     action
1          BF001         95               100          -5   improved   none
1          BF002         92               50           42   regressed  pattern added
```

Adding New Benchmarks
Create a fixture
Add your test fixture in benchmarks/fixtures/<category>/ with the problem statement and expected behavior.
Define the benchmark
Add the benchmark definition to benchmarks/categories/<category>/benchmarks.ts with grading criteria matching the rubric.
Establish baselines
Run the harness to generate initial scores and confirm the benchmark is correctly graded.
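The three steps above end with a benchmark definition; a hypothetical sketch of what one might look like (the real schema in benchmarks/categories/&lt;category&gt;/benchmarks.ts is not shown here, so every field name below is an assumption for illustration):

```typescript
// Hypothetical benchmark definition shape -- field names are assumed,
// not taken from the actual benchmarks.ts schema.
interface BenchmarkDef {
  id: string;
  domain: string;
  fixture: string;     // path under benchmarks/fixtures/
  keywords: string[];  // domain-specific patterns (the 30% component)
  testCommand: string; // execution-graded entry point (the 70% component)
}

const example: BenchmarkDef = {
  id: "BT016",
  domain: "Example Domain",
  fixture: "benchmarks/fixtures/battle-test/bt016-example",
  keywords: ["topological sort"],
  testCommand: "bun test",
};

console.log(example.id); // "BT016"
```

Whatever the real schema, the grading criteria must line up with the 70/30 split so new benchmarks score consistently with existing ones.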
Checkpoint Validation
Regression Gate
The benchmark suite also serves as the regression gate for kernel fine-tuning. When the @8gent/kernel package trains a new LoRA checkpoint, it validates against these benchmarks before promotion.
```bash
bun run benchmarks/autoresearch/validate-checkpoint.ts
```

See Kernel Fine-Tuning for the full pipeline.
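A regression gate of this kind might look like the sketch below. The actual validate-checkpoint.ts logic is not shown in this document, so the promotion rule here (no per-benchmark drop beyond a tolerance, and no drop in the mean) is an assumption:

```typescript
// Promote a LoRA checkpoint only if no benchmark regresses beyond a
// tolerance and the mean score does not drop versus the baseline.
interface Scores { [benchmarkId: string]: number }

function shouldPromote(baseline: Scores, candidate: Scores, tolerance = 2): boolean {
  const ids = Object.keys(baseline);
  const regressed = ids.some((id) => (candidate[id] ?? 0) < baseline[id] - tolerance);
  const mean = (s: Scores) =>
    ids.reduce((sum, id) => sum + (s[id] ?? 0), 0) / ids.length;
  return !regressed && mean(candidate) >= mean(baseline);
}

console.log(shouldPromote({ BT001: 94, BT003: 100 }, { BT001: 95, BT003: 99 })); // true
console.log(shouldPromote({ BT001: 94, BT003: 100 }, { BT001: 80, BT003: 100 })); // false
```

Gating on both the worst case and the mean prevents a checkpoint from buying headline gains at the cost of a silent regression in one domain.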