# TerrariaBench Specification

## Purpose

TerrariaBench is a small, high-signal benchmark for evaluating agentic game-control models in a live Terraria/tModLoader environment. It is inspired by KernelBench-Hard's harness philosophy: small curated tasks, native model harnesses, archived transcripts, disposable workspaces, and objective post-run scoring.

## Non-goals

- It is not a generic OpenRouter vision benchmark.
- It is not a leaderboard based on model self-reported success.
- It is not intended to preserve every generated run artifact in git.
- It does not require the agent to edit Terraria source code.

## Official Evaluation Pattern

Each run is a tuple:

```text
(harness, model, task)
```

The native harness receives one prompt from `problems//PROMPT.txt`. The prompt tells the agent how to inspect a live game slot through `tools/terraria_slot.py`, act through `control.jsonl` or direct window controls, and verify progress through `checkpoints.jsonl`. Official scoring uses only the checkpoint log emitted by the tModLoader mod.

## Active Harness Matrix

The intended matrix keeps KernelBench-Hard's native-harness style, but is filtered to models with native screenshot/image input support in the selected harness route:

- `claude claude-opus-4-7 max`
- `codex gpt-5.5 xhigh`
- `kimi kimi-k2.6`
- `opencode openrouter/google/gemini-3.1-pro-preview`
- `opencode openrouter/google/gemini-3.1-flash-lite-preview`
- `opencode openrouter/google/gemini-3-flash-preview`

The local OpenCode registry reports `input.image=false` for the selected DeepSeek V4 Pro/Flash, Qwen 3.6 Max/Plus/27B, GLM 5.1, and MiniMax M2.7 routes, so they are excluded from official vision-required sweeps. Exact Gemini 3.1 Flash is not available through the local OpenCode provider registry; the active sweep intentionally uses the available vision-capable Flash Lite and Gemini 3 Flash Preview lanes instead.

Keep this matrix in `scripts/sweep.sh`.
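
To illustrate how the matrix above expands into `(harness, model, task)` run tuples, here is a minimal Python sketch. It is not the contents of `scripts/sweep.sh`. The helper names `parse_lane` and `expand_runs` are hypothetical, and reading the optional third token (`max`, `xhigh`) as an effort setting is an assumption.

```python
from itertools import product

# Matrix entries copied from the list above.
# Assumed layout: "<harness> <model> [effort]".
MATRIX = [
    "claude claude-opus-4-7 max",
    "codex gpt-5.5 xhigh",
    "kimi kimi-k2.6",
    "opencode openrouter/google/gemini-3.1-pro-preview",
    "opencode openrouter/google/gemini-3.1-flash-lite-preview",
    "opencode openrouter/google/gemini-3-flash-preview",
]


def parse_lane(entry):
    """Split a matrix entry into (harness, model, effort); effort may be None."""
    parts = entry.split()
    harness, model = parts[0], parts[1]
    effort = parts[2] if len(parts) > 2 else None  # assumption: third token is an effort level
    return harness, model, effort


def expand_runs(matrix, tasks):
    """Yield one (harness, model, task) tuple per lane and task."""
    for entry, task in product(matrix, tasks):
        harness, model, _effort = parse_lane(entry)
        yield harness, model, task


if __name__ == "__main__":
    for run in expand_runs(MATRIX, ["03_open_ended_progress"]):
        print(run)
```

With the single official task `03_open_ended_progress`, this yields one run per lane, six in total.
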

## Task Interface

Each task directory contains:

- `PROMPT.txt`: one human-style prompt sent directly to the native agent.
- `task.json`: machine-readable task metadata.

The tModLoader mod recognizes:

- `{"command":"start_task","task":"inventory"}`
- `{"command":"start_task","task":"break_block"}`
- `{"command":"start_task","task":"open_ended"}`
- `{"command":"key","key":"esc"}`
- `{"command":"click","x":320,"y":360}`

Coordinates are screenshot-relative.

## Scoring

Tiny smoke tasks pass if `checkpoints.jsonl` contains:

```json
{"event":"pass","task":""}
```

Official sweeps should use `problems/03_open_ended_progress`. That task has no single pass condition. It records `checkpoint` events for open-ended early-game progress such as opening inventory, moving away from spawn, surviving time thresholds, taking damage, dying, and breaking blocks. The score is the checkpoint sequence and count, joined later with harness token/timing metadata.

`scripts/run_terraria.sh` archives:

- native harness transcript
- stderr
- tModLoader setup summary
- checkpoint log
- normalized screenshot frames in `frames/frame_00001.png`, `frames/frame_00002.png`, ...
- static visual report in `index.html`
- `result.json`
- agent workspace

## Run Artifact Policy

Generated artifacts live under:

- `runs/` for direct `orchestrator.py` smoke runs
- `outputs/runs/` for official native harness runs

Both are ignored by git.
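
The scoring rules in the Scoring section amount to two small checks over `checkpoints.jsonl`. The sketch below is one possible post-run scorer, not the project's actual implementation. The function names are hypothetical. It assumes one JSON object per line and that open-ended progress entries carry `"event":"checkpoint"`. The `name` field in the sample data is invented for illustration only.

```python
import json


def load_events(lines):
    """Parse JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def smoke_passed(events, task):
    """A smoke task passes if any event is a `pass` for that task."""
    return any(
        e.get("event") == "pass" and e.get("task") == task for e in events
    )


def open_ended_score(events):
    """Return the checkpoint sequence (in log order) and its count.

    Assumption: checkpoint entries are marked with "event": "checkpoint".
    """
    checkpoints = [e for e in events if e.get("event") == "checkpoint"]
    return {"count": len(checkpoints), "sequence": checkpoints}


if __name__ == "__main__":
    # Hypothetical log content; real field names may differ.
    sample = [
        '{"event":"checkpoint","name":"inventory_opened"}',
        '{"event":"pass","task":"inventory"}',
    ]
    events = load_events(sample)
    print(smoke_passed(events, "inventory"))
    print(open_ended_score(events)["count"])
```

Reading only the checkpoint log keeps the score independent of anything the agent claims in its transcript, which matches the rule that official scoring uses only the log emitted by the tModLoader mod.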