TerrariaBench Specification

Purpose

TerrariaBench is a small, high-signal benchmark for evaluating agentic game-control models in a live Terraria/tModLoader environment. It is inspired by KernelBench-Hard's harness philosophy: small curated tasks, native model harnesses, archived transcripts, disposable workspaces, and objective post-run scoring.

Non-goals

It is not a generic OpenRouter vision benchmark.
It is not a leaderboard based on model self-reported success.
It is not intended to preserve every generated run artifact in git.
It does not require the agent to edit Terraria source code.

Official Evaluation Pattern

Each run is a tuple:

`(harness, model, task)`

The native harness receives one prompt from `problems/<task>/PROMPT.txt`. The prompt tells the agent how to inspect a live game slot through `tools/terraria_slot.py`, act through real window controls, and verify progress through `checkpoints.jsonl`.

Official scoring uses only the checkpoint log emitted by the tModLoader mod.
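As a minimal sketch of the run tuple and prompt lookup described above: `RunSpec` and `prompt_path` are hypothetical names used only for illustration, and the repository-root default is an assumption. Only the `problems/<task>/PROMPT.txt` layout comes from this specification.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RunSpec:
    """One benchmark run: (harness, model, task). Hypothetical helper type."""

    harness: str
    model: str
    task: str

    def prompt_path(self, repo_root: Path = Path(".")) -> Path:
        # The spec places each task's single prompt at problems/<task>/PROMPT.txt.
        return repo_root / "problems" / self.task / "PROMPT.txt"


run = RunSpec("codex", "gpt-5.5", "03_open_ended_progress")
print(run.prompt_path().as_posix())
```

Freezing the dataclass keeps a run identity hashable, which may be convenient if runs are used as dictionary keys when collecting results.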

Active Harness Matrix

The intended matrix keeps KernelBench-Hard's native-harness style, but is filtered to models whose selected harness route supports native screenshot/image input:

`claude claude-opus-4-7 max`
`codex gpt-5.5 xhigh`
`opencode openrouter-moonshot/moonshotai/kimi-k2.6`
`opencode openrouter/google/gemini-3.1-pro-preview`
`opencode openrouter/google/gemini-3.1-flash-lite-preview`
`opencode openrouter/google/gemini-3-flash-preview`
`opencode openrouter-pinned/qwen/qwen3.6-27b`

The local OpenCode registry reports no native screenshot/image input for the selected DeepSeek V4 Pro/Flash, Qwen 3.6 Max/Plus, GLM 5.1, and MiniMax M2.7 routes, so they are excluded from official vision-required sweeps. Qwen 3.6 27B is included through the pinned OpenRouter route. Gemini 3.1 Flash itself is not available through the local OpenCode provider registry, so the active sweep intentionally uses the available vision-capable Gemini 3.1 Flash Lite and Gemini 3 Flash Preview lanes instead.

Keep this matrix in `scripts/sweep.sh`.

Task Interface

Each task directory contains:

`PROMPT.txt`: one human-style prompt sent directly to the native agent.
`task.json`: machine-readable task metadata.

The tModLoader mod recognizes JSON commands such as:

`{"command":"start_task","task":"inventory"}`
`{"command":"start_task","task":"break_block"}`
`{"command":"start_task","task":"open_ended"}`
`{"command":"key","key":"esc"}`
`{"command":"click","x":320,"y":360}`

Coordinates are screenshot-relative. For official `open_ended` runs, pointer actions must be real game-window input. The mod ignores direct `click` and `mouse_down` commands written to `control.jsonl` during `open_ended`, because direct tile mutation is not a valid computer-use action.
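The command format above could be driven by a small helper like the following. This is a sketch under stated assumptions: that `control.jsonl` is an append-only file of one JSON object per line, and that the caller knows the active task. `send_command` and its parameters are hypothetical names; the client-side guard merely mirrors the rule that the mod ignores direct pointer commands during `open_ended`.

```python
import json
from pathlib import Path
from typing import Optional

# Pointer commands the spec says the mod ignores during open_ended runs.
DIRECT_POINTER_COMMANDS = {"click", "mouse_down"}


def send_command(control_path: Path, command: dict,
                 active_task: Optional[str] = None) -> bool:
    """Append one command as a JSON line; return False if it would be ignored."""
    if active_task == "open_ended" and command.get("command") in DIRECT_POINTER_COMMANDS:
        # Skip the write: the mod would drop this command anyway.
        return False
    line = json.dumps(command, separators=(",", ":"))
    with control_path.open("a", encoding="utf-8") as f:
        f.write(line + "\n")
    return True
```

For example, `send_command(Path("control.jsonl"), {"command": "key", "key": "esc"})` appends one line, while a `click` command with `active_task="open_ended"` is refused without touching the file.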

Scoring

Tiny smoke tasks pass if `checkpoints.jsonl` contains:

`{"event":"pass","task":"<task>"}`
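The smoke-task pass condition can be sketched as a scan over the checkpoint log. This is an illustrative check, not the official scorer: `smoke_task_passed` is a hypothetical name, and skipping blank or unparseable lines is an assumption about how a partially written log should be handled.

```python
import json
from pathlib import Path


def smoke_task_passed(checkpoints_path: Path, task: str) -> bool:
    """Return True if any line is a pass event for the given task."""
    if not checkpoints_path.exists():
        return False
    with checkpoints_path.open(encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if not raw:
                continue
            try:
                record = json.loads(raw)
            except json.JSONDecodeError:
                continue  # Assumption: tolerate a truncated trailing line.
            if (isinstance(record, dict)
                    and record.get("event") == "pass"
                    and record.get("task") == task):
                return True
    return False
```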

Official sweeps should use `problems/03_open_ended_progress`. That task has no single pass condition. It records concrete progression checkpoints, then `tools/score_run.py` assigns the official score.

Future runs do not emit reward-hackable survival timers, distance/descent counters, or generic block-count milestones. Historical survival, damage, death, held tools, inventory opening, travel, descent, layer entry, `blocks_removed_*`, `tree_tile*`, and `hardmode_unlocked` events are retained only as non-scoring diagnostics when rescoring old runs.

Official scored milestones are explicit Terraria progression facts: wood/resource collection, crafted stations, biome discovery, life/mana upgrades, ores/bars, boss prep, pre-hardmode boss defeats, underworld prep, Wall of Flesh summon, and Wall of Flesh defeat. `wood_*` is the authoritative scored signal for early tree harvesting until the tile-break hook is independently validated. Wall of Flesh defeat is the final scored checkpoint; hardmode is not scored separately.

Runs that contain historical direct tile-control evidence such as `click_kill_tile` are flagged `direct_tile_control_used` and are not valid official scores.
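The triage rules above might look like the following when rescoring an old run. This is a rough sketch, not the logic of `tools/score_run.py`: `triage_events` is a hypothetical helper, the pattern list covers only the event names spelled out in this section, and the example event names such as `wood_10` are made up to match the `wood_*` pattern.

```python
from fnmatch import fnmatchcase

# Non-scoring diagnostic patterns named in the spec. The other historical
# categories (survival, damage, death, ...) have no event names given there,
# so they are deliberately left out of this sketch.
DIAGNOSTIC_PATTERNS = ("blocks_removed_*", "tree_tile*", "hardmode_unlocked")
# Direct tile-control evidence named in the spec.
DIRECT_CONTROL_EVENTS = {"click_kill_tile"}


def triage_events(events):
    """Split event names into score candidates and diagnostics; flag direct control."""
    candidates, diagnostics = [], []
    direct = False
    for name in events:
        if name in DIRECT_CONTROL_EVENTS:
            direct = True
        elif any(fnmatchcase(name, pattern) for pattern in DIAGNOSTIC_PATTERNS):
            diagnostics.append(name)
        else:
            candidates.append(name)
    return {
        "candidates": candidates,
        "diagnostics": diagnostics,
        "direct_tile_control_used": direct,
        "valid_official_score": not direct,
    }
```

Events left in `candidates` would still need to be matched against the official milestone list before contributing to a score; this sketch only separates out what the section says must never count.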

`scripts/run_terraria.sh` archives:

native harness transcript
stderr
tModLoader setup summary
checkpoint log
normalized screenshot frames in `frames/frame_00001.png`, `frames/frame_00002.png`, ...
static visual report in `index.html`
`result.json`
agent workspace
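The normalized frame names in the list above follow a five-digit, one-based pattern. A sketch of generating such paths, assuming the archive directory is passed in by the caller (`frame_path` and `run_dir` are hypothetical names):

```python
from pathlib import Path


def frame_path(run_dir: Path, index: int) -> Path:
    """Return the archive path for the index-th frame (1-based)."""
    if index < 1:
        raise ValueError("frame indices start at 1")
    # Matches the frames/frame_00001.png naming shown in the archive list.
    return run_dir / "frames" / f"frame_{index:05d}.png"
```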

Run Artifact Policy

Generated artifacts live under:

`runs/` for direct `orchestrator.py` smoke runs
`outputs/runs/` for official native harness runs

Both are ignored by git.