The evidence loop that turns retrieval, reads, tests, and runtime signals into next actions.
This page is generated from packages/workspace/decision.md. Edit the source Markdown, then run bun run --cwd packages/consuelo-docs generate-os-source-docs to refresh the public docs.
What this file controls
| Field | Value |
|---|---|
| Source file | packages/workspace/decision.md |
| Runtime role | Decision-engine doctrine for explore, evidence, confidence, and confirmation. |
| Controls | How agents choose what to inspect next and how they decide whether a path is true. |
| Generated route | /os/agent-context/decision |
Source document
decision process
mandatory workspace app transport
you are working inside the workspace mcp app. the app exposes exactly two tools:
- workspace.get_steering()
- workspace.call({ tool, input, taskSession, timeout })
use typed workspace.call inputs. do not construct nested shell strings or JSON-in-shell command strings for normal workspace operations.
use only get_steering and call. if a call fails, inspect the returned envelope and fix the typed input or implementation.
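a minimal sketch of a typed call next to a failed one. only the { tool, input, taskSession, timeout } request shape comes from this document; the CallEnvelope result type, the fakeCall transport, and the path input field are assumptions made so the example runs on its own.

```typescript
// Request shape taken from the doc: workspace.call({ tool, input, taskSession, timeout }).
interface WorkspaceCallRequest<I> {
  tool: string;
  input: I;
  taskSession: string;
  timeout?: number;
}

// Hypothetical result envelope; the real envelope fields are not specified here.
type CallEnvelope<O> =
  | { ok: true; output: O }
  | { ok: false; error: string };

// Stand-in transport so the sketch runs without the workspace app.
function fakeCall(req: WorkspaceCallRequest<{ path?: string }>): CallEnvelope<string> {
  if (req.tool === "fs.read" && typeof req.input.path !== "string") {
    return { ok: false, error: "input.path must be a string" };
  }
  return { ok: true, output: `${req.tool} accepted` };
}

// Typed input: a plain object, not a command string assembled by hand.
const good = fakeCall({ tool: "fs.read", input: { path: "README.md" }, taskSession: "task-1" });

// A failed call returns an envelope; inspect it and fix the typed input.
const bad = fakeCall({ tool: "fs.read", input: {}, taskSession: "task-1" });
```

the point of the shape is that a bad input fails as data the agent can read, instead of as a quoting error buried inside a shell string.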
alignment is the number one thing this system protects.
if an agent does not know what to inspect next, it should use the decision process.
if an agent does not know whether a path is right, it should use the decision process.
if the evidence conflicts, it should update its belief instead of pretending the first answer was correct.
this file is the why, judgment, and operating doctrine for the workspace explore, evidence, confidence, and confirmation system.
procedural command details belong in packages/workspace/SCRIPTS.md.
coding standards belong in AGENTS.md and CODING-STANDARDS.md.
task-specific evidence belongs in .task/evidence-log.json, .task/explore-state.json, and the task workpad.
handoffs belong in memory or tmp/context files.
do not turn this file into a script reference.
this file should teach agents how to think with the system, what the signals mean, when to trust them, and when to stop.
worker delegation naming decision
Use neutral workspace tool names to reduce safety-filter collisions and keep provider names abbreviated. worker.call is the provider-agnostic delegation contract; cdx, pi, and opc are provider codes. mini is a legacy/profile name normalized to pi with profile: mini. Codex App Server, cloud sessions, MCP integration, and A2A integration are intentionally deferred.
1. what this system is
the decision process is not “better search.” it is a repo-aware decision loop for coding agents. it answers two questions:
- what should i do next?
- how do i know this path is right?
- git-aware repository indexing
- tree-sitter structural chunking
- local qwen embeddings
- graph expansion across imports, tests, callers, and siblings
- evidence events
- belief updates
- next-action policy
- confirmation through tests, verify, and runtime truth
2. mental model
think of the system as a markov-style decision process over agent work. current state:
- current explore results
- files already read
- tests inspected or run
- verify results
- runtime checks
- edits made
- contradictions found
- current hypotheses
- belief scores for likely paths
- explore a question
- read a file
- inspect a test
- mark a file relevant or irrelevant
- run verify
- run a targeted test
- check runtime logs
- exploit the best path
- confirm the fix
- every action changes the state
- every useful observation becomes an evidence event
- every evidence event updates beliefs
- decideNext chooses the highest-value next action from the current state
- good policy balances relevance, uncertainty, graph reach, and confirmation value
- confirm closes the loop
- tests, verify, runtime logs, and production behavior matter more than retrieval scores
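the state transition can be sketched as a belief update. this is an illustration under assumptions, not the engine's actual math: the multiplicative weights, the evidence kind names other than file.read, and the file names are all hypothetical.

```typescript
// Beliefs: path -> probability that this path explains the task. Sums to 1.
type Beliefs = Record<string, number>;

// Hypothetical evidence kinds and likelihood weights (>1 supports, <1 weakens).
type EvidenceKind = "file.read" | "marked.relevant" | "marked.irrelevant" | "test.failed";
const WEIGHT: Record<EvidenceKind, number> = {
  "file.read": 1.2,
  "marked.relevant": 2.0,
  "marked.irrelevant": 0.25,
  "test.failed": 3.0,
};

// One transition: scale the touched path by the evidence weight, then renormalize.
function updateBeliefs(beliefs: Beliefs, path: string, kind: EvidenceKind): Beliefs {
  const next: Beliefs = { ...beliefs };
  next[path] = (next[path] ?? 0) * WEIGHT[kind];
  const total = Object.keys(next).reduce((sum, key) => sum + (next[key] ?? 0), 0);
  if (total === 0) return beliefs; // nothing to normalize; keep the old state
  for (const key of Object.keys(next)) next[key] = (next[key] ?? 0) / total;
  return next;
}

// Retrieval supplies the prior; evidence moves it.
const prior: Beliefs = { "a.ts": 0.5, "b.ts": 0.3, "c.ts": 0.2 };
const afterIrrelevant = updateBeliefs(prior, "a.ts", "marked.irrelevant");
const afterRelevant = updateBeliefs(afterIrrelevant, "b.ts", "marked.relevant");
```

in this toy run the top retrieval hit (a.ts) stops being the leading path once it is read and marked irrelevant, which is the behavior the mental model asks for.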
3. retrieval is a prior
retrieval is the prior distribution over where to look. it is not a conclusion. it is not proof. it is not enough to raise confidence by itself. high semantic similarity means:
- use explore to find candidate files
- use decideNext to pick the next action
- read or test the recommended target
- let the evidence event update beliefs
- repeat until the system has enough evidence to exploit
- confirm with runtime or validation truth
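the steps above can be sketched as a small loop. the stub functions only borrow the command names from this document; their bodies, return shapes, the step budget, and the file names are hypothetical stand-ins so the control flow can run on its own.

```typescript
type NextAction =
  | { kind: "read"; target: string }
  | { kind: "exploit"; target: string };

interface LoopState {
  candidates: string[]; // from explore (the prior)
  read: Set<string>;    // evidence: files actually read
}

// Stand-ins for the real commands; names mirror the doc, bodies are hypothetical.
function exploreStub(_question: string): string[] {
  return ["src/dialer.ts", "src/dialer.test.ts"];
}

function decideNextStub(state: LoopState): NextAction {
  const unread = state.candidates.find((file) => !state.read.has(file));
  return unread
    ? { kind: "read", target: unread }
    : { kind: "exploit", target: state.candidates[0] ?? "" };
}

function confirmStub(_target: string): "pass" | "fail" {
  return "pass";
}

function runLoop(question: string): { log: string[]; verdict: "pass" | "fail" } {
  const state: LoopState = { candidates: exploreStub(question), read: new Set<string>() };
  const log: string[] = [];
  for (let step = 0; step < 10; step++) {
    const action = decideNextStub(state);
    log.push(`${action.kind}:${action.target}`);
    if (action.kind === "exploit") {
      // Exploit only after reads; then confirm with validation truth.
      return { log, verdict: confirmStub(action.target) };
    }
    state.read.add(action.target); // reading produces evidence, which changes the state
  }
  return { log, verdict: "fail" }; // never exploited within the step budget
}

const loopResult = runLoop("why does the dialer drop calls after transfer?");
```

note that the loop never jumps from explore straight to exploit: every candidate is read first, and the verdict comes from the confirm step, not from retrieval.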
4. evidence is the source of confidence
confidence comes from observations, not from search. evidence includes:
- a file was read
- a file was marked relevant
- a file was marked irrelevant
- a test passed
- a test failed
- verify passed
- verify failed
- runtime logs were clean
- runtime logs showed an error
- an edit was made
- a contradiction was found
- a hypothesis was confirmed
- a hypothesis was weakened
5. command family
the system has six public commands. each command should read or write decision state.
explore
use explore when the agent needs to know what files or paths are likely relevant.
- ensures the local index exists
- embeds the query with qwen
- searches structural chunks
- expands through graph connections
- ranks results
- writes explore state
- writes explore.result evidence
- explore is not confirmation
- graph connections matter
- structural reasons matter
- results should include implementation files, tests, and connected context
decideNext
use decideNext when the agent has evidence and needs the next best action.
- reads explore state
- reads evidence events
- updates beliefs
- estimates information value
- recommends one action
- read the highest-value unread file
- inspect a connected test
- mark a file relevant or irrelevant
- exploit a concentrated belief
- run confirm after validation evidence exists
confidenceScore
use confidenceScore when the agent needs to know how justified the current path is.
- reads evidence
- updates beliefs
- separates evidence for, evidence against, and uncertainty
- reports a score
- qwen candidate count is context, not evidence for correctness
- high top posterior is useful, but not final proof
- contradictory evidence should lower confidence
- confidence should start low on a cold task
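a rough sketch of a score that follows these rules. the formula, the weight field, and the observation kinds are assumptions chosen for illustration; the real confidenceScore may compute this differently.

```typescript
interface Observation {
  kind: string;                  // e.g. "test.passed" or "verify.failed" (hypothetical names)
  direction: "for" | "against";
  weight: number;                // hypothetical strength in (0, 1]
}

interface ConfidenceReport {
  evidenceFor: number;
  evidenceAgainst: number;
  uncertainty: number;
  score: number;
  candidateCount: number;        // reported as context only
}

// The retrieval candidate count is passed through but deliberately not used in the score.
function confidenceScoreSketch(observations: Observation[], candidateCount: number): ConfidenceReport {
  let evidenceFor = 0;
  let evidenceAgainst = 0;
  for (const o of observations) {
    if (o.direction === "for") evidenceFor += o.weight;
    else evidenceAgainst += o.weight;
  }
  const total = evidenceFor + evidenceAgainst;
  // One pseudo-observation of doubt keeps a cold task low.
  const uncertainty = 1 / (1 + total);
  const score = Math.max(0, evidenceFor - evidenceAgainst) / (1 + total);
  return { evidenceFor, evidenceAgainst, uncertainty, score, candidateCount };
}

const cold = confidenceScoreSketch([], 40);
const supported = confidenceScoreSketch(
  [
    { kind: "file.read", direction: "for", weight: 0.5 },
    { kind: "test.passed", direction: "for", weight: 0.5 },
  ],
  40,
);
const contradicted = confidenceScoreSketch(
  [
    { kind: "file.read", direction: "for", weight: 0.5 },
    { kind: "test.passed", direction: "for", weight: 0.5 },
    { kind: "runtime.error", direction: "against", weight: 0.5 },
  ],
  40,
);
```

forty retrieval candidates leave the cold score at zero, and one contradicting observation pulls the supported score down instead of being averaged away.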
exploit
use exploit when the system has enough evidence to stop exploring and commit to a path.
- selects the current strongest target
- records exploitation state
- names the target file and related context files
- narrows the editing surface
- exploit is a transition from investigation into editing
- exploit should not happen just because one result scored high
- exploit should happen when evidence is concentrated enough to act
confirm
use confirm when the agent needs validation truth.
- piggybacks on packages/workspace/scripts/verify.js
- parses verify/test/runtime output
- writes confirmation evidence
- updates the explore state
- gives a verdict
- confirm is where belief meets reality
- parse failures are failures
- a passing command that cannot be interpreted is not positive evidence
- runtime checks matter for production behavior
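one way to read the parsing rules, as a sketch. the output contract here (a single json object with a boolean passed field) is an assumption for the example, not the documented format of verify.js.

```typescript
type ConfirmVerdict = "pass" | "fail";

interface ConfirmResult {
  verdict: ConfirmVerdict;
  reason: string;
}

// Hypothetical output contract: the command prints one JSON object with a boolean `passed`.
function interpretConfirmOutput(exitCode: number, stdout: string): ConfirmResult {
  if (exitCode !== 0) return { verdict: "fail", reason: `exit code ${exitCode}` };

  let parsed: unknown;
  try {
    parsed = JSON.parse(stdout);
  } catch {
    // Parse failures are failures, even when the exit code was 0.
    return { verdict: "fail", reason: "output is not valid json" };
  }

  if (
    typeof parsed !== "object" ||
    parsed === null ||
    typeof (parsed as { passed?: unknown }).passed !== "boolean"
  ) {
    // A passing command that cannot be interpreted is not positive evidence.
    return { verdict: "fail", reason: "output has no boolean passed field" };
  }

  return (parsed as { passed: boolean }).passed
    ? { verdict: "pass", reason: "verify reported success" }
    : { verdict: "fail", reason: "verify reported failure" };
}
```

the only route to a pass verdict is output that was both produced by a successful command and positively interpreted.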
audit
use audit when the agent needs to know whether the script and doc surface is truthful.
- compares documented scripts against package scripts
- detects undocumented or missing commands
- can check docs and index freshness
- undocumented scripts are drift
- missing docs are part of the bug
- script behavior and script docs should change together
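the core comparison is a two-way set difference. a minimal sketch, assuming both sides have already been reduced to lists of script names; how the real audit extracts those names is not shown here, and the names in the example call are placeholders.

```typescript
interface AuditReport {
  undocumented: string[]; // present in package scripts, absent from the docs
  missing: string[];      // present in the docs, absent from package scripts
}

function auditScripts(packageScripts: string[], documentedScripts: string[]): AuditReport {
  const pkg = new Set(packageScripts);
  const docs = new Set(documentedScripts);
  return {
    undocumented: packageScripts.filter((s) => !docs.has(s)).sort(),
    missing: documentedScripts.filter((s) => !pkg.has(s)).sort(),
  };
}

// Placeholder script names for illustration only.
const auditReport = auditScripts(["verify", "explore", "confirm"], ["verify", "explore", "audit"]);
```

both lists count as drift: an empty report is the only clean result.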
6. default workflow
use this flow when starting from a question or bug:
7. what each signal means
embedding similarity
embedding similarity means the query and chunk are close in meaning. use it to decide where to look first. do not use it as proof. good:
structural chunk type
structural chunk type tells the agent what kind of code matched. stronger structural reasons:
- class
- method
- function
- type
- export
- block
- large fallback chunk
graph connections
graph connections tell the agent what code sits near the candidate in the actual program. use graph connections to find:
- imports
- imported-by files
- tests
- callers
- callees
- same-directory siblings
recency
recency tells the agent where the repo has changed recently. use it as a weak signal. recent code is not automatically relevant. old code is not automatically irrelevant.
current diff relevance
current diff relevance tells the agent whether a file is active in the current task branch. use it to prioritize live work. do not let it hide untouched root-cause files.
belief posterior
posterior belief is the current system estimate after retrieval and evidence. use it to decide when to keep exploring versus exploit. do not confuse it with verification.
information value
information value estimates how much a next action could teach the agent. a lower-belief file can be worth reading if it reduces uncertainty or touches many connected paths. this is why decideNext should not always pick the top score.
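a small sketch of why the top score can lose. the value function (binary entropy of the belief, scaled by graph reach) is a hypothetical choice made for illustration; the file names and numbers are invented.

```typescript
interface CandidateAction {
  target: string;
  belief: number;       // posterior probability for this path, 0..1
  graphReach: number;   // number of connected files this read would touch
  alreadyRead: boolean;
}

// Hypothetical value: binary entropy of the belief (peaks at 0.5), scaled by graph reach.
function informationValue(c: CandidateAction): number {
  if (c.alreadyRead) return 0; // re-reading teaches nothing new in this sketch
  const p = Math.min(Math.max(c.belief, 1e-9), 1 - 1e-9);
  const entropy = -(p * Math.log2(p) + (1 - p) * Math.log2(1 - p));
  return entropy * (1 + c.graphReach);
}

function pickNext(candidates: CandidateAction[]): CandidateAction | undefined {
  return [...candidates].sort((a, b) => informationValue(b) - informationValue(a))[0];
}

const picked = pickNext([
  { target: "src/types.ts", belief: 0.9, graphReach: 0, alreadyRead: false },
  { target: "src/router.ts", belief: 0.4, graphReach: 5, alreadyRead: false },
]);
```

here the lower-belief file wins because it is both more uncertain and connected to more paths, so reading it is expected to change the picture more.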
8. cold start behavior
on a new task, there may be no evidence yet. in that state:
- qwen embeddings provide the prior
- graph expansion provides connected context
- confidence should stay modest
- decideNext should usually recommend reading
9. when to exploit
exploit when the system has enough evidence to act. good exploit signals:
- top belief is high
- gap between top path and alternatives is meaningful
- important connected files have been read
- tests or callers support the path
- contradictions are resolved or understood
- the edit surface is clear
- only one explore was run
- no files were read
- tests exist but were ignored
- confidence came mostly from retrieval score
- graph connections are missing
- runtime evidence conflicts with the hypothesis
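these signals can be read as a gate that lists what still blocks exploitation. a sketch only: the thresholds and field names are hypothetical and would need tuning against real tasks.

```typescript
interface ExploitInputs {
  topBelief: number;
  runnerUpBelief: number;
  filesRead: number;
  connectedTestsInspected: boolean;
  unresolvedContradictions: number;
}

// Hypothetical thresholds; not values taken from the engine.
const MIN_TOP_BELIEF = 0.6;
const MIN_GAP = 0.2;

// Returns the reasons not to exploit yet; an empty list means the gate is open.
function exploitBlockers(s: ExploitInputs): string[] {
  const blockers: string[] = [];
  if (s.topBelief < MIN_TOP_BELIEF) blockers.push("top belief is not high");
  if (s.topBelief - s.runnerUpBelief < MIN_GAP) blockers.push("gap to alternatives is too small");
  if (s.filesRead === 0) blockers.push("no files were read");
  if (!s.connectedTestsInspected) blockers.push("tests exist but were ignored");
  if (s.unresolvedContradictions > 0) blockers.push("contradictions are unresolved");
  return blockers;
}

// High retrieval-driven belief alone does not open the gate.
const retrievalOnly = exploitBlockers({
  topBelief: 0.8, runnerUpBelief: 0.1, filesRead: 0,
  connectedTestsInspected: false, unresolvedContradictions: 0,
});

const ready = exploitBlockers({
  topBelief: 0.8, runnerUpBelief: 0.1, filesRead: 3,
  connectedTestsInspected: true, unresolvedContradictions: 0,
});
```

returning the blockers instead of a boolean keeps the decision explainable: the agent can report exactly which observation is still missing.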
10. when to confirm
confirm after the system has something real to validate. confirmation can include:
- workspace.call({ tool: "verify", taskSession, input: {} })
- targeted tests
- runtime logs
- production or browser verification
- api responses
- deployment health
- script change: run the script and audit docs
- repo workflow change: run the full command chain
- dialer runtime change: check railway logs and the real call path
- ui change: browser verification
- api change: call the endpoint
11. evidence ledger rules
the evidence ledger is durable state for the task. primary task file:
- specific
- timestamped
- tied to a real action
- connected to files or commands when possible
- honest about pass, fail, relevance, or uncertainty
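the qualities listed above could be checked mechanically before an entry is written. the entry shape below is hypothetical; the actual schema of the evidence log is not described in this document.

```typescript
// Hypothetical ledger entry shape, for illustration only.
interface LedgerEntry {
  kind: string;        // e.g. "file.read" or "explore.result"
  timestamp: string;   // ISO 8601
  action: string;      // the real action that produced the observation
  files?: string[];
  command?: string;
  outcome: "pass" | "fail" | "relevant" | "irrelevant" | "uncertain";
}

interface LedgerCheck {
  errors: string[];    // the entry should not be written
  warnings: string[];  // allowed, but weaker evidence
}

function checkLedgerEntry(e: LedgerEntry): LedgerCheck {
  const errors: string[] = [];
  const warnings: string[] = [];
  if (e.kind.trim() === "") errors.push("not specific: empty kind");
  if (Number.isNaN(Date.parse(e.timestamp))) errors.push("not timestamped");
  if (e.action.trim() === "") errors.push("not tied to a real action");
  // "when possible" in the rules above: missing links are a warning, not an error.
  if (!(e.files && e.files.length > 0) && !e.command) {
    warnings.push("not connected to a file or command");
  }
  return { errors, warnings };
}

const solidEntry = checkLedgerEntry({
  kind: "file.read",
  timestamp: "2024-05-01T12:00:00Z",
  action: "read file",
  files: ["src/dialer.ts"],
  outcome: "relevant",
});

const vagueEntry = checkLedgerEntry({ kind: "", timestamp: "", action: "", outcome: "uncertain" });
```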
12. read tracking
read tracking protects against fake confidence. an agent should not claim it understands a path because explore returned a file. it understands more only after reading evidence-producing files. preferred behavior:
- reads through workspace.call({ tool: "fs.read", ... }) should create file.read evidence automatically
- direct reads outside the wrapper should be manually marked
13. graph expansion
graph expansion follows code relationships after retrieval finds a candidate. the graph should include:
- relative imports
- workspace imports
- @consuelo/* imports
- imported-by edges
- test and tested-by edges
- best-effort caller and called-by edges
- sibling edges
- check graph_edges row count
- check import resolution
- check retriever expansion
- check output serialization
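a one-hop expansion can be sketched as follows. the edge kind names, the in-memory edge list, and the file names are assumptions for the example; the real graph lives in the index and may expand further than one hop.

```typescript
// Hypothetical edge kinds, loosely following the relationships listed above.
type EdgeKind = "imports" | "imported-by" | "tests" | "calls" | "sibling";

interface GraphEdge {
  from: string;
  to: string;
  kind: EdgeKind;
}

// For each retrieved seed, collect directly connected files and the reasons they connect.
function expandOneHop(seeds: string[], edges: GraphEdge[]): Map<string, EdgeKind[]> {
  const seedSet = new Set(seeds);
  const connected = new Map<string, EdgeKind[]>();
  for (const edge of edges) {
    // Only follow edges that start at a seed and lead outside the seed set.
    if (!seedSet.has(edge.from) || seedSet.has(edge.to)) continue;
    const reasons = connected.get(edge.to) ?? [];
    if (reasons.indexOf(edge.kind) === -1) reasons.push(edge.kind);
    connected.set(edge.to, reasons);
  }
  return connected;
}

const expanded = expandOneHop(
  ["src/dialer.ts"],
  [
    { from: "src/dialer.ts", to: "src/twilio.ts", kind: "imports" },
    { from: "src/dialer.ts", to: "src/dialer.test.ts", kind: "tests" },
    { from: "src/billing.ts", to: "src/stripe.ts", kind: "imports" },
  ],
);
```

if this kind of expansion returns nothing for a file that clearly has imports, the empty result points at the edge data or import resolution, not at the retriever's ranking.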
14. indexing doctrine
the index is the foundation. first run can be slow. that is acceptable. accuracy is more important than speed. do not skip embeddings. do not lazy-embed only the top hits. do not reduce chunk quality to make indexing feel faster. qwen local embeddings are the chosen path unless ko explicitly changes the decision. important rules:
- every chunk should be embedded before vector search is considered complete
- missing vectors mean the index is not done
- tree-sitter chunks should preserve functions, classes, methods, exports, and types
- oversized files can be bounded, but structural chunking should not be destroyed
- worktree overlays should avoid full reindexing for task changes
- content hashes should prevent redundant embedding work
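two of these rules (missing vectors mean the index is not done, content hashes prevent redundant embedding work) can be sketched together. the chunk shape is hypothetical, and FNV-1a over UTF-16 code units is only a dependency-free stand-in for whatever hash the index actually uses.

```typescript
// Hypothetical chunk record, for illustration only.
interface IndexedChunk {
  id: string;
  content: string;
  contentHash: string; // hash of the content at the time it was embedded
  vector?: number[];
}

// FNV-1a 32-bit over UTF-16 code units: a small stand-in hash with no dependencies.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Chunks that need embedding: no vector yet, or content changed since it was hashed.
function chunksNeedingEmbedding(chunks: IndexedChunk[]): IndexedChunk[] {
  return chunks.filter((c) => c.vector === undefined || c.contentHash !== fnv1a(c.content));
}

// Missing vectors mean the index is not done.
function indexIsComplete(chunks: IndexedChunk[]): boolean {
  return chunksNeedingEmbedding(chunks).length === 0;
}

const fresh: IndexedChunk = {
  id: "fresh", content: "function a() {}", contentHash: fnv1a("function a() {}"), vector: [0.1, 0.2],
};
const stale: IndexedChunk = {
  id: "stale", content: "function a() { return 1 }", contentHash: fnv1a("function a() {}"), vector: [0.1, 0.2],
};
const unembedded: IndexedChunk = {
  id: "unembedded", content: "type B = string", contentHash: fnv1a("type B = string"),
};

const pending = chunksNeedingEmbedding([fresh, stale, unembedded]).map((c) => c.id);
```

unchanged chunks are skipped on re-index, which is how a worktree overlay can avoid a full rebuild without ever leaving a chunk without a vector.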
15. common failure modes
treating search as truth
symptom:
editing before evidence
symptom:
graph connections are empty
symptom:
stale evidence wins
symptom:
type files outrank implementation files
symptom:
utility hubs dominate
symptom:
command output is not valid json
symptom:
evidence is not mirrored
symptom:
16. how to reason with contradictions
contradictions are not embarrassing. contradictions are useful. examples:
- a file looked semantically relevant, but reading it showed it was only a type barrel
- a test passed, but runtime logs still show the error
- verify passed, but a targeted command failed
- the top three results are from unrelated subsystems
- a file in the index was deleted or moved
- write the contradiction as evidence
- lower confidence
- ask what observation would resolve it
- let decideNext pick or inform the next action
17. how to write good questions
good explore questions name the behavior, not just a keyword. weak:
18. how agents should use this during coding
before reading random files:
- inspect the evidence
- inspect the graph connections
- mark relevant or irrelevant files explicitly
- rerun decideNext
19. when to stop and ask ko
stop and ask ko when:
- the evidence conflicts with the stated goal
- the system recommends a path that would change product architecture
- the next action would be destructive
- production or customer behavior is affected and the correct tradeoff is unclear
- the index appears corrupted and a full rebuild would be expensive or disruptive
- the user’s instruction conflicts with observed repo truth
- running explore
- reading recommended files
- checking evidence
- running audit
- running confirm
- verifying the current state
20. what good looks like
good agent behavior:
- starts with a clear question
- explores the repo semantically
- follows graph connections
- reads the highest-value files
- records evidence automatically
- uses confidence as a live belief, not a badge
- exploits only when evidence is concentrated
- confirms with validation truth
- reports what was proven and what remains uncertain
- searches keywords manually and ignores the index
- treats top retrieval result as root cause
- edits before reading
- ignores tests and callers
- claims confidence without evidence
- hides contradictions
- bypasses confirm
- leaves docs and scripts drifting
21. response pattern after using the system
when reporting back, lead with the result and evidence. use:
22. default behavior summary
retrieval is a prior. evidence updates belief. confidence comes from observations. graph expansion follows the program. decideNext is the policy. exploit is the commit point. confirm is truth. audit protects against drift. do not optimize only for better search. do not optimize only for fewer reads. do not optimize only for fast indexing. optimize for better decisions from accumulated evidence. the system is working when an agent can say:
worker.call as the neutral facade entrypoint and expose bun run worker -- call ... as the human/Codex CLI wrapper. Both paths must call the same worker runtime module; executor-owned provider implementations should be refactored into dedicated runtime modules.