TerrariaBench Specification
Purpose
TerrariaBench is a small, high-signal benchmark for evaluating agentic game-control models in a live Terraria/tModLoader environment. It is inspired by KernelBench-Hard's harness philosophy: small curated tasks, native model harnesses, archived transcripts, disposable workspaces, and objective post-run scoring.
Non-goals
- It is not a generic OpenRouter vision benchmark.
- It is not a leaderboard based on model self-reported success.
- It is not intended to preserve every generated run artifact in git.
- It does not require the agent to edit Terraria source code.
Official Evaluation Pattern
Each run is a tuple:
(harness, model, task)
The native harness receives one prompt from `problems/<task>/PROMPT.txt`. The prompt tells the agent how to inspect a live game slot through `tools/terraria_slot.py`, act through real window controls, and verify progress through `checkpoints.jsonl`.
Official scoring uses only the checkpoint log emitted by the tModLoader mod.
Active Harness Matrix
The intended matrix keeps KernelBench-Hard's native-harness style, but is filtered to models with native screenshot/image input support in the selected harness route:
- `claude claude-opus-4-7 max`
- `codex gpt-5.5 xhigh`
- `opencode openrouter-moonshot/moonshotai/kimi-k2.6`
- `opencode openrouter/google/gemini-3.1-pro-preview`
- `opencode openrouter/google/gemini-3.1-flash-lite-preview`
- `opencode openrouter/google/gemini-3-flash-preview`
- `opencode openrouter-pinned/qwen/qwen3.6-27b`
The local OpenCode registry reports no native screenshot/image input for the selected DeepSeek V4 Pro/Flash, Qwen 3.6 Max/Plus, GLM 5.1, and MiniMax M2.7 routes, so they are excluded from official vision-required sweeps. Qwen 3.6 27B is included through the pinned OpenRouter route. Gemini 3.1 Flash itself is not available through the local OpenCode provider registry, so the active sweep intentionally uses the available vision-capable Flash Lite and Gemini 3 Flash Preview lanes instead.
Keep this matrix in `scripts/sweep.sh`.
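
As an illustration of how the matrix entries combine with tasks into `(harness, model, task)` run tuples, here is a minimal Python sketch. The names `Lane`, `parse_lane`, and `expand_runs` are hypothetical and are not part of `scripts/sweep.sh`. Treating the optional third field (`max`, `xhigh`) as a per-harness setting is an assumption; this specification does not define it.

```python
from typing import List, NamedTuple, Optional, Tuple


class Lane(NamedTuple):
    """One matrix entry (hypothetical helper type)."""
    harness: str
    model: str
    setting: Optional[str]  # ASSUMPTION: third field is a harness setting; None if absent


def parse_lane(entry: str) -> Lane:
    """Split a whitespace-separated matrix entry into its fields."""
    parts = entry.split()
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 2 or 3 fields, got {len(parts)}: {entry!r}")
    setting = parts[2] if len(parts) == 3 else None
    return Lane(parts[0], parts[1], setting)


# Entries copied verbatim from the matrix above.
MATRIX = [
    "claude claude-opus-4-7 max",
    "codex gpt-5.5 xhigh",
    "opencode openrouter-moonshot/moonshotai/kimi-k2.6",
    "opencode openrouter/google/gemini-3.1-pro-preview",
    "opencode openrouter/google/gemini-3.1-flash-lite-preview",
    "opencode openrouter/google/gemini-3-flash-preview",
    "opencode openrouter-pinned/qwen/qwen3.6-27b",
]


def expand_runs(entries: List[str], tasks: List[str]) -> List[Tuple[str, str, str]]:
    """Build one (harness, model, task) tuple per lane and task."""
    lanes = [parse_lane(entry) for entry in entries]
    return [(lane.harness, lane.model, task) for lane in lanes for task in tasks]


if __name__ == "__main__":
    for run in expand_runs(MATRIX, ["03_open_ended_progress"]):
        print(run)
```
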
Task Interface
Each task directory contains:
- `PROMPT.txt`: one human-style prompt sent directly to the native agent.
- `task.json`: machine-readable task metadata.
The tModLoader mod recognizes:
- `{"command":"start_task","task":"inventory"}`
- `{"command":"start_task","task":"break_block"}`
- `{"command":"start_task","task":"open_ended"}`
- `{"command":"key","key":"esc"}`
- `{"command":"click","x":320,"y":360}`
Coordinates are screenshot-relative. For official `open_ended` runs, pointer actions must be real game-window input. The mod ignores direct `click` and `mouse_down` commands written to `control.jsonl` during `open_ended`, because direct tile mutation is not a valid computer-use action.
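
The command format and the `open_ended` restriction can be sketched in Python as follows. This only mirrors the rule stated above; the actual filtering happens inside the tModLoader mod. The helper names `encode_command` and `is_ignored_in_open_ended` are hypothetical, and the one-object-per-line layout is an assumption based on the `.jsonl` extension.

```python
import json

# Pointer commands that the spec says the mod ignores during open_ended runs.
POINTER_COMMANDS = {"click", "mouse_down"}


def encode_command(command: dict) -> str:
    """Serialize one command as a compact single-line JSON object."""
    return json.dumps(command, separators=(",", ":"))


def is_ignored_in_open_ended(command: dict, active_task: str) -> bool:
    """Return True if the mod would ignore this command under the stated rule."""
    return active_task == "open_ended" and command.get("command") in POINTER_COMMANDS


if __name__ == "__main__":
    commands = [
        {"command": "start_task", "task": "open_ended"},
        {"command": "key", "key": "esc"},
        {"command": "click", "x": 320, "y": 360},
    ]
    for command in commands:
        status = "ignored" if is_ignored_in_open_ended(command, "open_ended") else "accepted"
        print(status, encode_command(command))
```
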
Scoring
Tiny smoke tasks pass if `checkpoints.jsonl` contains:
{"event":"pass","task":"<task>"}
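
A minimal sketch of that check, assuming `checkpoints.jsonl` holds one JSON object per line and that malformed or blank lines can be skipped. The function names `has_pass_event` and `smoke_task_passed` are hypothetical and are not taken from the repository.

```python
import json
from typing import Iterable


def has_pass_event(lines: Iterable[str], task: str) -> bool:
    """Return True if any line is a pass event for the given task."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ASSUMPTION: tolerate a partially written trailing line
        if (
            isinstance(record, dict)
            and record.get("event") == "pass"
            and record.get("task") == task
        ):
            return True
    return False


def smoke_task_passed(checkpoint_path: str, task: str) -> bool:
    """Read a checkpoint log from disk and look for the pass event."""
    with open(checkpoint_path, encoding="utf-8") as handle:
        return has_pass_event(handle, task)
```

For example, `smoke_task_passed("checkpoints.jsonl", "inventory")` would report whether the `inventory` smoke task passed; the path passed in depends on where the run wrote its log.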
Official sweeps should use `problems/03_open_ended_progress`. That task has no single pass condition. It records concrete progression checkpoints, then `tools/score_run.py` assigns the official score.
Future runs do not emit reward-hackable survival timers, distance/descent counters, or generic block-count milestones. Historical survival, damage, death, held tools, inventory opening, travel, descent, layer entry, `blocks_removed_*`, `tree_tile*`, and `hardmode_unlocked` events are retained only as non-scoring diagnostics when rescoring old runs.
Official scored milestones are explicit Terraria progression facts: wood/resource collection, crafted stations, biome discovery, life/mana upgrades, ores/bars, boss prep, pre-hardmode boss defeats, underworld prep, Wall of Flesh summon, and Wall of Flesh defeat. `wood_*` is the authoritative scored signal for early tree harvesting until the tile-break hook is independently validated. Wall of Flesh defeat is the final scored checkpoint; hardmode is not scored separately.
Runs that contain historical direct tile-control evidence such as `click_kill_tile` are flagged `direct_tile_control_used` and are not valid official scores.
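
The separation of diagnostics from scored signals, and the direct-control flag, can be illustrated with a short Python sketch. This is not the logic of `tools/score_run.py`. It only covers the event names and patterns quoted above, the function `triage_events` and the `valid_official_score` and `candidate_events` keys are hypothetical, and the other historical categories (survival, damage, death, and so on) are omitted because their event names are not given here.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Patterns named in the spec as non-scoring diagnostics.
DIAGNOSTIC_PATTERNS = ("blocks_removed_*", "tree_tile*", "hardmode_unlocked")

# Evidence of direct tile control named in the spec.
DIRECT_CONTROL_EVENTS = {"click_kill_tile"}


def is_diagnostic(event: str) -> bool:
    """Return True if the event name matches a non-scoring diagnostic pattern."""
    return any(fnmatchcase(event, pattern) for pattern in DIAGNOSTIC_PATTERNS)


def triage_events(events: List[str]) -> Dict[str, object]:
    """Split event names into diagnostics and scoring candidates, and flag direct control."""
    flagged = any(event in DIRECT_CONTROL_EVENTS for event in events)
    diagnostics = [event for event in events if is_diagnostic(event)]
    candidates = [
        event
        for event in events
        if not is_diagnostic(event) and event not in DIRECT_CONTROL_EVENTS
    ]
    return {
        "direct_tile_control_used": flagged,
        "valid_official_score": not flagged,  # hypothetical field name
        "diagnostic_events": diagnostics,
        "candidate_events": candidates,  # still subject to the real scorer's rules
    }
```

Keeping the flag separate from the event lists lets an old run be rescored for diagnostics while still being excluded from official results.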
`scripts/run_terraria.sh` archives:
- native harness transcript
- stderr
- tModLoader setup summary
- checkpoint log
- normalized screenshot frames in `frames/frame_00001.png`, `frames/frame_00002.png`, ...
- static visual report in `index.html`
- `result.json`
- agent workspace
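
The frame names in the list above suggest a one-based, five-digit, zero-padded index. A small Python sketch of that naming, assuming the pattern continues as shown; `frame_path` is a hypothetical helper, not a function from `scripts/run_terraria.sh`.

```python
def frame_path(index: int) -> str:
    """Return the archive path for the frame at a one-based index."""
    if index < 1:
        raise ValueError("frame indices start at 1")
    return f"frames/frame_{index:05d}.png"


if __name__ == "__main__":
    print(frame_path(1))
    print(frame_path(2))
```
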
Run Artifact Policy
Generated artifacts live under:
- `runs/` for direct `orchestrator.py` smoke runs
- `outputs/runs/` for official native harness runs
Both are ignored by git.