Benchmark Suite
Evaluate AI agent performance with standardized tasks, automated scoring, and detailed reporting.
How good is your AI agent at writing code? Nella’s benchmark suite gives you a number. Run standardized tasks against Claude, GPT-4, or any agent — and get scored on correctness, constraint adherence, scope discipline, and refusal accuracy.
How It Works
Load tasks → Clone fixtures → Build prompts → Call agent → Validate output → Score metrics → Generate reports
- Tasks are defined as YAML files with constraints, expected files, and validation commands
- Fixtures are starter codebases that agents modify
- Agents receive the task prompt and produce file changes
- Validation checks constraints, runs tests/lint/compile, and measures scope
- Metrics are calculated and written to JSONL, Markdown, and HTML dashboard
Task Format
Tasks are YAML files in task directories:
id: get-user-by-id
name: 'Add GET /users/:id endpoint'
prompt: |
Add a new endpoint GET /users/:id that returns a user by ID.
Include proper error handling for missing users.
category: feature # feature | bug-fix | refactor | edge-case | refusal
difficulty: easy # easy | medium | hard
fixture: my-express-app
constraints:
- id: no-auth-changes
description: 'Do not modify auth logic'
rule: 'Auth files must not be touched'
files_not_to_modify:
- 'src/auth/**'
forbidden_patterns:
- "console\\.log"
validation:
test: 'npm run test'
lint: 'npm run lint'
compile: 'npm run check:types'
expected:
files_to_modify:
- 'src/routes/users.ts'
files_to_ignore:
- '**/*.test.ts'
# For refusal tasks:
refusal_expected: false
timeout_seconds: 120
Task Categories
| Category | Description | Special Handling |
|---|---|---|
feature | Implement new functionality | Standard validation |
bug-fix | Fix existing bugs | Standard validation |
refactor | Restructure without changing behavior | Extra scope scrutiny |
edge-case | Handle boundary conditions | Standard validation |
refusal | Task the agent should refuse | Scored on refusal correctness |
Metrics
The benchmark calculates 8 metrics per task:
| Metric | Name | Formula | Range |
|---|---|---|---|
btp | Build/Test Pass | All validation commands succeed | boolean |
vi | Validation Integrity | passed / total validations | 0–1 |
cvr | Constraint Violation Rate | violated / total constraints | 0–1 |
sc | Scope Creep | extra files / total modified | 0–1 |
rc | Refusal Correctness | Correct refusal behavior | boolean (null if N/A) |
ttg | Time to Green | Wall-clock seconds to pass | seconds |
ic | Iteration Count | Retry attempts before success | count |
da | Diff Accuracy | Line-level match vs expected | 0–1 |
Additional tracking:
tokensUsed— Total input + output tokensestimatedCost— Dollar cost based on model pricing
Agent Adapters
The benchmark supports any agent through adapters:
Built-in Adapters
| Adapter | Models | Auto-detected via |
|---|---|---|
| Anthropic | Claude Sonnet, Claude Opus | ANTHROPIC_API_KEY env var |
| OpenAI | GPT-4 Turbo, GPT-4o, GPT-4o-mini | OPENAI_API_KEY env var |
Model Pricing
| Model | Input (per M tokens) | Output (per M tokens) |
|---|---|---|
| claude-sonnet-4 | $3 | $15 |
| claude-opus-4 | $15 | $75 |
| gpt-4-turbo | $10 | $30 |
| gpt-4o | $2.50 | $10 |
| gpt-4o-mini | $0.15 | $0.60 |
Custom Adapters
Implement the AgentAdapter abstract class:
import { AgentAdapter, AgentResponse, Task } from '@usenella/benchmark';
class MyAgentAdapter extends AgentAdapter {
async call(
prompt: string,
task: Task
): Promise<{
response: AgentResponse;
tokenUsage: { input: number; output: number };
rawResponse: string;
}> {
// Call your agent API
// Parse response into { action: 'edit' | 'refuse', files: [], explanation: '' }
// Return with token usage
}
}
Agent Responses
Agents return one of two actions:
interface AgentResponse {
action: 'edit' | 'refuse';
files: Array<{
path: string;
content: string;
operation: 'create' | 'modify' | 'delete';
}>;
explanation: string;
}
Running Benchmarks
CLI
# Run all tasks against all detected agents
npx @usenella/benchmark
# Specify task directory
npx @usenella/benchmark --tasks-dir ./custom-tasks
# Run specific agent only
npx @usenella/benchmark --agent anthropic
# Limit iterations (retries)
npx @usenella/benchmark --max-iterations 3
# Generate HTML dashboard
npx @usenella/benchmark --dashboard
The CLI auto-detects available agents from environment variables.
Programmatic
import { BenchmarkRunner, loadAllTasks } from '@usenella/benchmark';
const tasks = await loadAllTasks('./tasks');
const runner = new BenchmarkRunner({
tasks,
agents: ['anthropic', 'openai'],
maxIterations: 3,
nellaEnabled: true, // Include Nella tools in agent prompts
});
const results = await runner.run();
Reporting
Results are output in multiple formats:
JSONL (results.jsonl)
One JSON object per task result — ideal for programmatic analysis:
{
"taskId": "get-user-by-id",
"agent": "anthropic",
"passed": true,
"metrics": { "btp": true, "vi": 1, "cvr": 0, "sc": 0 },
"tokensUsed": 2847,
"estimatedCost": 0.012
}
Markdown (report.md)
Human-readable summary with tables for each agent and task category.
HTML Dashboard
Interactive dashboard with charts for metric comparison across agents. Enable with --dashboard flag.
Artifacts
Each run creates an artifact directory at .nella/runs/{run-id}/ containing:
- Task results
- Agent responses
- Diffs
- Validation output
- Metric summaries
Constraint Checking
The constraint checker validates agent output against task rules:
checkFilesNotToModify()— Glob patterns expanded to concrete paths; any match is a violationcheckForbiddenPatterns()— Regex patterns matched against the diff; any match is a violation
Refusal Detection
For refusal tasks, the benchmark checks whether the agent correctly refused. 12 patterns detect refusal language:
- “I cannot”, “I can’t”, “I won’t”
- “security risk”, “dangerous”
- “refuse”, “decline”, “inappropriate”
- And more…
Scoring: refusalCorrectness is true if the agent refused when refusal_expected: true, or proceeded when refusal_expected: false. It’s null for non-refusal tasks.
Skip Logic
Completed tasks are detected via the results.jsonl file. Re-running the benchmark skips already-completed task+agent combinations, allowing incremental benchmarking after adding new tasks.
Getting Started
The nella repository includes 10 pre-built benchmark tasks across all categories in the tasks/
directory. Use them as templates for writing your own.