TerrariaBench / spec

Benchmark Spec

The rendered project specification. The raw markdown copy is also available in the docs folder.

Browse runs Raw SPEC.md Raw artifacts

TerrariaBench Specification

Purpose

TerrariaBench is a small, high-signal benchmark for evaluating agentic game-control models in a live Terraria/tModLoader environment. It is inspired by KernelBench-Hard's harness philosophy: small curated tasks, native model harnesses, archived transcripts, disposable workspaces, and objective post-run scoring.

Non-goals

Official Evaluation Pattern

Each run is a tuple:


(harness, model, task)

The native harness receives one prompt from `problems/<task>/PROMPT.txt`. The prompt tells the agent how to inspect a live game slot through `tools/terraria_slot.py`, act through real window controls, and verify progress through `checkpoints.jsonl`.

Official scoring uses only the checkpoint log emitted by the tModLoader mod.

Active Harness Matrix

The intended matrix keeps KernelBench-Hard's native-harness style, but is filtered to models with native screenshot/image input support in the selected harness route:

The local OpenCode registry reports no native screenshot/image input for the selected DeepSeek V4 Pro/Flash, Qwen 3.6 Max/Plus, GLM 5.1, and MiniMax M2.7 routes, so they are excluded from official vision-required sweeps. Qwen 3.6 27B is included through the pinned OpenRouter route. Exact Gemini 3.1 Flash is not available through the local OpenCode provider registry; the active sweep intentionally uses the available vision-capable Flash Lite and Gemini 3 Flash Preview lanes instead.

Keep this matrix in `scripts/sweep.sh`.

Task Interface

Each task directory contains:

The tModLoader mod recognizes:

Coordinates are screenshot-relative. For official `open_ended` runs, pointer actions must be real game-window input. The mod ignores direct `click` and `mouse_down` commands written to `control.jsonl` during `open_ended`, because direct tile mutation is not a valid computer-use action.

Scoring

Tiny smoke tasks pass if `checkpoints.jsonl` contains:


{"event":"pass","task":"<task>"}

Official sweeps should use `problems/03_open_ended_progress`. That task has no single pass condition. It records concrete progression checkpoints, then `tools/score_run.py` assigns the official score. Future runs do not emit reward-hackable survival timers, distance/descent counters, or generic block-count milestones. Historical survival, damage, death, held tools, inventory opening, travel, descent, layer entry, `blocks_removed_*`, `tree_tile*`, and `hardmode_unlocked` events are retained only as non-scoring diagnostics when rescoring old runs. Official scored milestones are explicit Terraria progression facts: wood/resource collection, crafted stations, biome discovery, life/mana upgrades, ores/bars, boss prep, pre-hardmode boss defeats, underworld prep, Wall of Flesh summon, and Wall of Flesh defeat. `wood_*` is the authoritative scored signal for early tree harvesting until the tile-break hook is independently validated. Wall of Flesh defeat is the final scored checkpoint; hardmode is not scored separately. Runs that contain historical direct tile-control evidence such as `click_kill_tile` are flagged `direct_tile_control_used` and are not valid official scores.

`scripts/run_terraria.sh` archives:

Run Artifact Policy

Generated artifacts live under:

Both are ignored by git.