Mutual NPC Interaction Probe Review
Date: 2026-05-19
Repo: minecraft-llm-agent-community
Bottom line
mutual_npc_interaction_probe_v1 now runs against a real local Docker-backed
Minecraft server and finishes with all three interaction categories marked
passed.
The successful live artifact is:
data/evidence/mutual_npc_interaction_probe_v1-1779166592708.json
A preserved copy is stored at:
docs/superpowers/reports/artifacts/2026-05-19-mutual-npc-interaction-probe-transcript.json
What this probe adds beyond v0
The first probe proved that one bot could run a bounded
observe -> move_to -> say -> wait -> say -> remember loop while the runtime
owned the busy -> available transition.
This second probe keeps the provider deterministic, but expands the runtime in three concrete ways:
- npc_b now acts instead of only existing as gated state.
- movement and attention are part of the proof, not just chat delivery.
- one small world action, dropping a paper marker, changes a later NPC response.
Final verdict from the live transcript
From data/evidence/mutual_npc_interaction_probe_v1-1779166592708.json:
- conversationTurnState: passed
- spatialAttentionApproach: passed
- materialEnvironmentHandoff: passed
- final status: success
- final reason: both NPCs responded to each other's dialogue and world actions
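
The relationship between the three category results and the final status can be sketched as follows. This is a minimal illustration, assuming the final status is "success" only when every category passed; the category names come from the transcript, while the types and the summarize() helper are hypothetical and not taken from the probe source.

```typescript
// Category names are copied from the transcript. Everything else here
// (types, summarize) is a hypothetical illustration.
const CATEGORIES = [
  "conversationTurnState",
  "spatialAttentionApproach",
  "materialEnvironmentHandoff",
] as const;

type CategoryName = (typeof CATEGORIES)[number];
type CategoryResult = "passed" | "failed";
type CategoryVerdicts = Record<CategoryName, CategoryResult>;

interface ProbeSummary {
  status: "success" | "failure";
  failedCategories: CategoryName[];
}

// Assumption: the run succeeds only if no category failed.
function summarize(verdicts: CategoryVerdicts): ProbeSummary {
  const failedCategories = CATEGORIES.filter(
    (name) => verdicts[name] !== "passed",
  );
  return {
    status: failedCategories.length === 0 ? "success" : "failure",
    failedCategories,
  };
}
```

Under this reading, a run where only the material category fails would be reported as a failure with that single category listed, which makes it easy to see which slice of the proof regressed.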
What actually happened
1. Conversation and turn state
npc_a approached npc_b and said:
"Jun, can you confirm the marker?"
npc_b heard that line in its observation state and answered with a
runtime-owned busy reply before becoming available later in the sequence.
Relevant transcript evidence:
- step 3: npc_a used say
- step 4: npc_b used reply_to with result: "busy_reply"
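
One way to picture a runtime-owned busy reply is the sketch below. Only the "busy_reply" result string appears in the transcript; the TurnState class, the "reply" result, and the method names are assumptions made for illustration. The point is that the provider can ask to reply, but only the runtime decides whether the NPC is busy.

```typescript
// Hypothetical sketch. Only "busy_reply" is taken from the transcript.
type Availability = "busy" | "available";
type ReplyResult = "busy_reply" | "reply";

class TurnState {
  private availability: Availability = "busy";

  // Called by the runtime only. No provider-facing tool flips this state.
  markAvailable(): void {
    this.availability = "available";
  }

  // Handler behind a reply_to style tool: the result is derived from
  // runtime state, not from anything the provider claims.
  replyTo(): ReplyResult {
    return this.availability === "busy" ? "busy_reply" : "reply";
  }
}
```

With this shape, a reply_to call made while the NPC is still busy yields "busy_reply", and the same call after the runtime marks the NPC available yields a normal reply.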
2. Spatial attention and approach
The movement part is no longer hand-wavy. In the successful run:
- step 2: npc_a used move_to
- distance changed from 12.73 to 0.97
- arrived: true
- step 6: npc_b used look_at_actor
That is enough to show that the probe reacts to distance and facing, not only to chat text.
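
The distance and arrived fields suggest a simple arrival check. The sketch below is an assumption about how such a check could look: the Vec3Like shape, the helper names, and the 1.5 block arrival radius are hypothetical, and the probe's real threshold is not stated in this report.

```typescript
// Hypothetical position shape, similar in spirit to a 3D vector.
interface Vec3Like {
  x: number;
  y: number;
  z: number;
}

// Straight-line (Euclidean) distance between two positions.
function distanceBetween(a: Vec3Like, b: Vec3Like): number {
  return Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
}

// Assumed arrival radius in blocks. The real value may differ.
const ARRIVAL_RADIUS = 1.5;

// An actor counts as arrived once it is within the radius of its target.
function hasArrived(distance: number, radius: number = ARRIVAL_RADIUS): boolean {
  return distance <= radius;
}
```

Under that assumed radius, the starting distance of 12.73 would not count as arrived, while the final distance of 0.97 would, which is consistent with arrived: true in the transcript.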
3. Material handoff
The first live attempt on this slice still failed the material category. The important part is why.
The dropped paper entity did exist, but the runtime observed too early and also looked at the wrong fields. Mineflayer exposed the dropped entity as:
- name: "item"
- displayName: "Item"
- stack data in entity metadata, not in a paper-specific display name
The fix had two parts:
- observeWorld() now checks item metadata, not only name and displayName.
- dropItem() now waits until the dropped item entity is visible before it returns, so the next observation does not race the server update.
After that change, the successful transcript shows:
- step 7: npc_a used drop_item
- step 9: npc_b used reply_to
- step 9 observation includes markerEntitySeen: true
That is the key evidence that the world action changed the later response.
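
The two-part fix described above can be sketched roughly as follows. This is not the code in observeWorld.ts or dropItem.ts. The EntityLike shape is a hypothetical stand-in for a Mineflayer entity, and because the metadata layout of a dropped item depends on the Minecraft version, the stack check is passed in as a matcher. The isPaperStack example assumes an itemName field purely for illustration.

```typescript
// Hypothetical minimal entity shape. Real Mineflayer entities carry more
// fields, and the metadata layout depends on the Minecraft version.
interface EntityLike {
  name?: string;
  displayName?: string;
  metadata?: unknown[];
}

// True when the entity is a generic dropped item ("item") and at least one
// metadata entry is accepted by the caller-supplied stack matcher.
function isDroppedItem(
  entity: EntityLike,
  matchesStack: (entry: unknown) => boolean,
): boolean {
  if (entity.name !== "item") return false;
  return (entity.metadata ?? []).some((entry) => matchesStack(entry));
}

// Example matcher. The itemName field is an assumed layout, not a real one.
const isPaperStack = (entry: unknown): boolean =>
  typeof entry === "object" &&
  entry !== null &&
  (entry as { itemName?: unknown }).itemName === "paper";

// Polls until the predicate holds or the timeout elapses, so a caller such
// as a drop-item tool can return only once the entity is observable.
async function waitUntil(
  predicate: () => boolean,
  timeoutMs: number,
  intervalMs: number = 50,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (predicate()) return true;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  // One last check at the deadline before giving up.
  return predicate();
}
```

A drop-item tool built this way would toss the item, then await waitUntil over the visible entities with isDroppedItem as the predicate, and report a timeout instead of returning early. Matching on name or displayName alone would miss the marker, since both are the generic "item" and "Item".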
Notable implementation detail
The proof still uses deterministic providers. This is intentional. The runtime, tool validation, movement, transcript shape, and item timing had to be made reliable before introducing a live model provider.
So the correct reading is:
- the interaction is real
- the server run is real
- the bounded tool loop is real
- the provider is still staged and deterministic
Known runtime caveat
The successful run printed the transcript path and exited 0, but Docker
cleanup still produced the known bounded-timeout warning during
docker compose down -v.
That warning does not invalidate the transcript. The runtime keeps the success
result and transcript path even if teardown is noisy later, which matches the
behavior already used for v0.
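
The "keep the result even if teardown is noisy" behavior can be sketched as below. The function names, the ProbeResult shape, and the warning text are hypothetical; teardown stands in for whatever wraps docker compose down -v in the real runtime.

```typescript
// Hypothetical result shape. Field names are assumptions.
interface ProbeResult {
  status: "success" | "failure";
  transcriptPath: string;
}

// Runs teardown with an upper time bound. Failures and timeouts are turned
// into warnings and never thrown to the caller.
async function boundedTeardown(
  teardown: () => Promise<void>,
  timeoutMs: number,
  warn: (message: string) => void,
): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("teardown timed out")), timeoutMs);
  });
  try {
    await Promise.race([teardown(), timeout]);
  } catch (error) {
    warn(`cleanup warning: ${String(error)}`);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// The probe result is captured first; teardown runs afterwards and cannot
// overwrite it, because boundedTeardown never rejects.
async function runWithTeardown(
  runProbe: () => Promise<ProbeResult>,
  teardown: () => Promise<void>,
  timeoutMs: number,
  warn: (message: string) => void = console.warn,
): Promise<ProbeResult> {
  try {
    return await runProbe();
  } finally {
    await boundedTeardown(teardown, timeoutMs, warn);
  }
}
```

The design choice is that cleanup problems are surfaced as warnings rather than errors, so a slow or failing teardown leaves the success status and transcript path untouched.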
Files that matter most for review
- probe/src/mutual/runMutualProbe.ts
- probe/src/mutual/tools/index.ts
- probe/src/mutual/tools/observeWorld.ts
- probe/src/mutual/tools/dropItem.ts
- probe/src/mutual/mutualLoop.ts
- data/evidence/mutual_npc_interaction_probe_v1-1779166592708.json
Next questions
The next useful questions are narrower now:
- Should npc_b get a richer memory/action loop, or is the next priority still movement quality?
- Should the transcript record the actual reply text in addition to the tool result status?
- Should the Docker management timeout be raised so cleanup warnings are less common in OrbStack?