# TerrariaBench Specification

## Purpose

TerrariaBench is a small, high-signal benchmark for evaluating agentic game-control models in a live Terraria/tModLoader environment. It is inspired by KernelBench-Hard's harness philosophy: small curated tasks, native model harnesses, archived transcripts, disposable workspaces, and objective post-run scoring.

## Non-goals

- It is not a generic OpenRouter vision benchmark.
- It is not a leaderboard based on model self-reported success.
- It is not intended to preserve every generated run artifact in git.
- It does not require the agent to edit Terraria source code.

## Official Evaluation Pattern

Each run is a tuple:

```text
(harness, model, task)
```

The native harness receives one prompt from `problems//PROMPT.txt`. The prompt tells the agent how to inspect a live game slot through `tools/terraria_slot.py`, act through real window controls, and verify progress through `checkpoints.jsonl`. Official scoring uses only the checkpoint log emitted by the tModLoader mod.

## Active Harness Matrix

The intended matrix keeps KernelBench-Hard's native-harness style, but is filtered to models with native screenshot/image input support in the selected harness route:

- `claude claude-opus-4-7 max`
- `codex gpt-5.5 xhigh`
- `opencode openrouter-moonshot/moonshotai/kimi-k2.6`
- `opencode openrouter/google/gemini-3.1-pro-preview`
- `opencode openrouter/google/gemini-3.1-flash-lite-preview`
- `opencode openrouter/google/gemini-3-flash-preview`
- `opencode openrouter-pinned/qwen/qwen3.6-27b`

The local OpenCode registry reports no native screenshot/image input for the selected DeepSeek V4 Pro/Flash, Qwen 3.6 Max/Plus, GLM 5.1, and MiniMax M2.7 routes, so they are excluded from official vision-required sweeps. Qwen 3.6 27B is included through the pinned OpenRouter route.
Exact Gemini 3.1 Flash is not available through the local OpenCode provider registry; the active sweep intentionally uses the available vision-capable Flash Lite and Gemini 3 Flash Preview lanes instead.

Keep this matrix in `scripts/sweep.sh`.

## Task Interface

Each task directory contains:

- `PROMPT.txt`: one human-style prompt sent directly to the native agent.
- `task.json`: machine-readable task metadata.

The tModLoader mod recognizes:

- `{"command":"start_task","task":"inventory"}`
- `{"command":"start_task","task":"break_block"}`
- `{"command":"start_task","task":"open_ended"}`
- `{"command":"key","key":"esc"}`
- `{"command":"click","x":320,"y":360}`

Coordinates are screenshot-relative.

For official `open_ended` runs, pointer actions must be real game-window input. The mod ignores direct `click` and `mouse_down` commands written to `control.jsonl` during `open_ended`, because direct tile mutation is not a valid computer-use action.

## Scoring

Tiny smoke tasks pass if `checkpoints.jsonl` contains:

```json
{"event":"pass","task":""}
```

Official sweeps should use `problems/03_open_ended_progress`. That task has no single pass condition. It records concrete progression checkpoints, then `tools/score_run.py` assigns the official score.

Future runs do not emit reward-hackable survival timers, distance/descent counters, or generic block-count milestones. Historical survival, damage, death, held tools, inventory opening, travel, descent, layer entry, `blocks_removed_*`, `tree_tile*`, and `hardmode_unlocked` events are retained only as non-scoring diagnostics when rescoring old runs.

Official scored milestones are explicit Terraria progression facts: wood/resource collection, crafted stations, biome discovery, life/mana upgrades, ores/bars, boss prep, pre-hardmode boss defeats, underworld prep, Wall of Flesh summon, and Wall of Flesh defeat. `wood_*` is the authoritative scored signal for early tree harvesting until the tile-break hook is independently validated.
Wall of Flesh defeat is the final scored checkpoint; hardmode is not scored separately. Runs that contain historical direct tile-control evidence such as `click_kill_tile` are flagged `direct_tile_control_used` and are not valid official scores.

`scripts/run_terraria.sh` archives:

- native harness transcript
- stderr
- tModLoader setup summary
- checkpoint log
- normalized screenshot frames in `frames/frame_00001.png`, `frames/frame_00002.png`, ...
- static visual report in `index.html`
- `result.json`
- agent workspace

## Run Artifact Policy

Generated artifacts live under:

- `runs/` for direct `orchestrator.py` smoke runs
- `outputs/runs/` for official native harness runs

Both are ignored by git.
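
Because these directories are not versioned, an archived checkpoint log is typically inspected offline. The following is a minimal sketch of the two simplest checks described under Scoring: the smoke-task pass condition and the `direct_tile_control_used` flag. It is not the logic of `tools/score_run.py`. The helper names are hypothetical, and it assumes that `click_kill_tile` appears as the value of the `event` field and that a pass record carries the same task name used in `start_task`.

```python
import json


def read_events(lines):
    """Parse JSONL lines into dicts, skipping blank or malformed lines."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a partially written trailing line
        if isinstance(record, dict):
            events.append(record)
    return events


def smoke_task_passed(events, task):
    """True if any record is a pass event for the given task (assumed field layout)."""
    return any(e.get("event") == "pass" and e.get("task") == task for e in events)


def direct_tile_control_used(events):
    """True if the log holds direct tile-control evidence (assumed to be an event name)."""
    return any(e.get("event") == "click_kill_tile" for e in events)
```

To apply it to an archived run, pass an open file handle for that run's `checkpoints.jsonl` to `read_events`, since iterating a text file yields its lines.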