mindcraft-ce Local Implementation Insights
Target: mindcraft-ce
mindcraft-ce is a massive benchmark and evaluation framework. Our /probe is a tiny, headless NPC dialogue loop. Therefore, we must aggressively filter out the benchmarking infrastructure, command surface, and code generation, keeping only the strict operational rules for tool execution and multi-agent conversation.
Core Takeaways
ActionManager: Timeouts & Interrupts
All long-running tasks are wrapped with timeouts and explicit interrupt checks (requestInterrupt(), bot.interrupt_code). Concurrent actions are blocked by an executing gate.
- For /probe: Implement an ActionRunner that strictly enforces one active tool at a time. Long tools must check an AbortController or a shared flag. Every tool must return a structured { ok, status, message, durationMs } result to the transcript.
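A minimal sketch of such a runner follows. It is not mindcraft-ce's ActionManager: the Tool signature, the status names other than blocked_busy, and the choice to keep the gate closed until a timed-out tool has actually stopped are all assumptions made for illustration.

```typescript
// Hypothetical status set; only "blocked_busy" is named in these notes.
type ToolStatus = "done" | "failed" | "timeout" | "blocked_busy";

interface ToolResult {
  ok: boolean;
  status: ToolStatus;
  message: string;
  durationMs: number;
}

// Assumed tool shape: long tools are expected to check the signal between steps.
type Tool = (signal: AbortSignal) => Promise<string>;

class ActionRunner {
  private executing = false; // the executing gate: one active tool at a time

  async run(name: string, tool: Tool, timeoutMs: number): Promise<ToolResult> {
    const startedAt = Date.now();
    const finish = (ok: boolean, status: ToolStatus, message: string): ToolResult => ({
      ok,
      status,
      message,
      durationMs: Date.now() - startedAt,
    });

    if (this.executing) {
      return finish(false, "blocked_busy", `${name}: another tool is still running`);
    }
    this.executing = true;

    const controller = new AbortController();
    // The async wrapper turns synchronous throws into rejections as well.
    const work = (async () => tool(controller.signal))().then(
      (message) => ({ kind: "done" as const, message }),
      (err: unknown) => ({
        kind: "failed" as const,
        message: err instanceof Error ? err.message : String(err),
      }),
    );
    // The gate reopens only once the tool has really settled, even after a timeout.
    void work.then(() => {
      this.executing = false;
    });

    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<{ kind: "timeout" }>((resolve) => {
      timer = setTimeout(() => {
        resolve({ kind: "timeout" });
        controller.abort(); // cooperative interrupt; the tool must honor it
      }, timeoutMs);
    });

    const outcome = await Promise.race([work, timeout]);
    clearTimeout(timer);
    if (outcome.kind === "timeout") {
      return finish(false, "timeout", `${name} exceeded ${timeoutMs} ms`);
    }
    return finish(outcome.kind === "done", outcome.kind, outcome.message);
  }
}
```

One consequence of this design: a tool that ignores its signal keeps the gate closed after its timeout is reported, so later calls come back as blocked_busy rather than overlapping with it.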
Separation of World vs. Skills
The repo cleanly divides read-only world observation helpers (world.js) from mutation/action helpers (skills.js).
- For /probe: Adopt this boundary. observeWorld(bot) returns pure state (position, nearby actors, inventory). tools/moveToActor, tools/say, etc., handle state changes. Avoid porting the massive 2k+ line skill files.
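The boundary can be sketched as below. The Bot interface is a hypothetical stand-in containing only the fields this example needs, and the 16-block radius is an arbitrary assumption; neither is taken from world.js or skills.js.

```typescript
interface Vec3 {
  readonly x: number;
  readonly y: number;
  readonly z: number;
}

// Hypothetical stand-in for the real bot object.
interface Bot {
  name: string;
  position: Vec3;
  inventory: string[];
  others: Bot[]; // other actors this bot knows about
  chatLog: string[];
}

interface Observation {
  readonly position: Vec3;
  readonly nearbyActors: readonly string[];
  readonly inventory: readonly string[];
}

const distance = (a: Vec3, b: Vec3): number =>
  Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);

// Read side: returns a frozen snapshot built from copies and never mutates the bot.
function observeWorld(bot: Bot, radius = 16): Observation {
  return Object.freeze({
    position: { ...bot.position },
    nearbyActors: bot.others
      .filter((other) => distance(other.position, bot.position) <= radius)
      .map((other) => other.name),
    inventory: [...bot.inventory],
  });
}

// Write side: every state change goes through a named tool handler.
const tools = {
  moveToActor(bot: Bot, target: string): string {
    const actor = bot.others.find((other) => other.name === target);
    if (!actor) return `unknown actor: ${target}`;
    bot.position = { ...actor.position };
    return `moved to ${target}`;
  },
  say(bot: Bot, text: string): string {
    bot.chatLog.push(text);
    return `said: ${text}`;
  },
};
```

Because observeWorld hands back copies, an observation captured before a tool call stays valid as a record of the earlier state.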
History Compaction
Old chat context is chunked, summarized, and moved to a ledger while keeping the active prompt small.
- For /probe: We don't need LLM-driven summarization yet. Just keep the prompt context limited to the last N events, write the full uncompressed transcript to a file (data/evidence), and use a simple remember(note) array for memory.
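As a rough sketch of that simplified ledger: the class below keeps only the last N events in memory and forwards every event to a sink. The LedgerEvent fields and the sink callback are assumptions; in /probe the sink would presumably append one JSON line per event to a file under data/evidence, which is left out here to keep the example free of file I/O.

```typescript
// Hypothetical event shape for illustration.
interface LedgerEvent {
  turn: number;
  actor: string;
  text: string;
}

class EventLedger {
  private readonly recent: LedgerEvent[] = [];
  private readonly notes: string[] = [];
  private readonly windowSize: number;
  private readonly sink: (line: string) => void;

  constructor(windowSize: number, sink: (line: string) => void) {
    this.windowSize = windowSize;
    this.sink = sink;
  }

  // Every event reaches the sink uncompressed; memory keeps only the window.
  record(event: LedgerEvent): void {
    this.sink(JSON.stringify(event));
    this.recent.push(event);
    while (this.recent.length > this.windowSize) {
      this.recent.shift();
    }
  }

  // Plain note array standing in for summarization-based memory.
  remember(note: string): void {
    this.notes.push(note);
  }

  // What the prompt builder is allowed to see.
  promptContext(): { notes: string[]; events: LedgerEvent[] } {
    return { notes: [...this.notes], events: [...this.recent] };
  }
}
```

The notes array grows without bound in this sketch; a cap would be a natural next step if memory notes start to dominate the prompt.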
NPC & Multi-Agent Collaboration
This is the most valuable part. mindcraft-ce enforces single active conversations and manages response latency based on busy states.
- For /probe:
  - Inject busyUntil or currentAction into the observation.
  - A say action targeting a busy bot must fail with a blocked_busy status.
  - Force the LLM to handle the blocked result by choosing wait or a shorter retry.
  - Reject concurrent conversation requests.
  - Tag incoming messages with (FROM OTHER BOT) to prevent hallucination.
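The rules above could be enforced by a single gate in the say tool, sketched here. The Actor fields inConversationWith and inbox, the blocked_conversation status, and the retryAfterMs hint are assumptions added for illustration; only busyUntil, currentAction, blocked_busy, and the (FROM OTHER BOT) tag come from the notes.

```typescript
// Hypothetical actor record.
interface Actor {
  name: string;
  busyUntil: number; // timestamp in ms; 0 means free
  currentAction: string | null;
  inConversationWith: string | null;
  inbox: string[];
}

type SayStatus = "delivered" | "blocked_busy" | "blocked_conversation";

interface SayResult {
  ok: boolean;
  status: SayStatus;
  message: string;
  retryAfterMs?: number; // hint so the LLM can choose wait or a shorter retry
}

// What gets injected into the observation about another actor.
function observeActor(actor: Actor, now: number) {
  return {
    name: actor.name,
    busy: now < actor.busyUntil,
    busyUntil: actor.busyUntil,
    currentAction: actor.currentAction,
  };
}

function say(from: Actor, to: Actor, text: string, now: number): SayResult {
  if (now < to.busyUntil) {
    return {
      ok: false,
      status: "blocked_busy",
      message: `${to.name} is busy with ${to.currentAction ?? "an action"}`,
      retryAfterMs: to.busyUntil - now,
    };
  }
  // Single active conversation: either side already talking to a third party blocks.
  const engaged = (a: Actor, partner: Actor): boolean =>
    a.inConversationWith !== null && a.inConversationWith !== partner.name;
  if (engaged(to, from) || engaged(from, to)) {
    return {
      ok: false,
      status: "blocked_conversation",
      message: "a conversation with another actor is already active",
    };
  }
  from.inConversationWith = to.name;
  to.inConversationWith = from.name;
  // Tag the origin so the receiving model does not mistake it for player input.
  to.inbox.push(`(FROM OTHER BOT) ${from.name}: ${text}`);
  return { ok: true, status: "delivered", message: `delivered to ${to.name}` };
}
```

Passing now explicitly keeps the gate deterministic, which makes busy scenarios easy to replay in tests.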
Task System as Scenario Spec
Tasks define initial conditions, blocked actions, and completion criteria.
- For /probe: Use this pattern to define dialogue scenarios (e.g., "npc_a wants X, npc_b starts busy"). Validation should be simple runtime checks (e.g., "3+ turns, both spoke"), not complex voxel blueprint matching.
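One way to express such a spec is as plain data plus predicate checks, sketched below. The ScenarioSpec fields are hypothetical and do not mirror mindcraft-ce's task file format; the example scenario reuses the "npc_a wants X, npc_b starts busy" case with the "3+ turns, both spoke" checks.

```typescript
interface Turn {
  speaker: string;
  text: string;
}

// Hypothetical spec shape: initial conditions, blocked actions, completion checks.
interface ScenarioSpec {
  id: string;
  actors: Record<string, { goal: string; startsBusy: boolean }>;
  blockedTools: string[];
  checks: {
    label: string;
    passes: (turns: readonly Turn[], actors: readonly string[]) => boolean;
  }[];
}

const npcScenario: ScenarioSpec = {
  id: "a_wants_x_b_busy",
  actors: {
    npc_a: { goal: "wants X", startsBusy: false },
    npc_b: { goal: "finish current action", startsBusy: true },
  },
  blockedTools: ["moveToActor"],
  checks: [
    { label: "3+ turns", passes: (turns) => turns.length >= 3 },
    {
      label: "both spoke",
      passes: (turns, actors) =>
        actors.every((name) => turns.some((turn) => turn.speaker === name)),
    },
  ],
};

// Runs every check against the transcript and reports which ones failed.
function validateScenario(
  spec: ScenarioSpec,
  turns: readonly Turn[],
): { passed: boolean; failures: string[] } {
  const actors = Object.keys(spec.actors);
  const failures = spec.checks
    .filter((check) => !check.passes(turns, actors))
    .map((check) => check.label);
  return { passed: failures.length === 0, failures };
}
```

Returning the labels of failed checks, rather than a bare boolean, gives the evidence file something readable to record.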
What to Discard
- Generated Code (coder.js): Absolutely no dynamic JS execution or SES compartments. LLM output must be strictly typed JSON matching our tool schema.
- Large Action Surface: Ignore cooking, crafting, combat, and building.
- Cheat Orchestration: Do not intertwine /give, /tp, /fill commands with the main agent loop (use only for test fixtures).
- Heavy Infrastructure: Ignore the tmux wrappers, embedding-based skill retrieval, and benchmarking workflows.
Action Items for /probe
- ActionRunner: Enforce single-execution limits and strict timeouts.
- Read/Write Separation: Keep observeWorld read-only; use distinct tool handlers for mutations.
- Event Ledger: Store full history on disk, but feed only the last 5-10 events to the prompt.
- Busy/Available Gates: Implement busyUntil state and blocked_busy tool results for social interactions.
- Scenario Specs: Define task success via simple runtime invariants, not external evaluators.
- No Code Gen: Restrict the LLM to returning AgentProposal JSON.
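The last item can be enforced at the parsing boundary. The sketch below assumes an AgentProposal of the form { tool, args } with string-only arguments and a four-tool table; the real AgentProposal shape is not specified in these notes, so treat both as placeholders. The model's output is only ever passed to JSON.parse, never evaluated.

```typescript
// Hypothetical tool table: tool name -> required string arguments.
const TOOL_ARGS = new Map<string, readonly string[]>([
  ["say", ["target", "text"]],
  ["moveToActor", ["target"]],
  ["remember", ["note"]],
  ["wait", []],
]);

interface AgentProposal {
  tool: string;
  args: Record<string, string>;
}

type ParseResult =
  | { ok: true; proposal: AgentProposal }
  | { ok: false; error: string };

function parseAgentProposal(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw); // data only; nothing from the model is executed
  } catch {
    return { ok: false, error: "not valid JSON" };
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return { ok: false, error: "expected a JSON object" };
  }
  const { tool, args } = data as { tool?: unknown; args?: unknown };
  const expected = typeof tool === "string" ? TOOL_ARGS.get(tool) : undefined;
  if (typeof tool !== "string" || expected === undefined) {
    return { ok: false, error: "unknown tool" };
  }
  if (typeof args !== "object" || args === null || Array.isArray(args)) {
    return { ok: false, error: "args must be an object" };
  }
  const record = args as Record<string, unknown>;
  const clean: Record<string, string> = {};
  for (const key of expected) {
    const value = record[key];
    if (typeof value !== "string") {
      return { ok: false, error: `missing or non-string arg: ${key}` };
    }
    clean[key] = value;
  }
  // All required keys are present, so a different count means extras.
  if (Object.keys(record).length !== expected.length) {
    return { ok: false, error: "unexpected extra args" };
  }
  return { ok: true, proposal: { tool, args: clean } };
}
```

A rejected proposal can be written to the transcript like any other failed tool result, so the model sees the error and can retry with valid JSON.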