The evidence loop that turns retrieval, reads, tests, and runtime signals into next actions.
This page is generated from packages/workspace/decision.md. Edit the source Markdown, then run bun run --cwd packages/consuelo-docs generate-os-source-docs to refresh the public docs.

What this file controls

Field | Value
Source file | packages/workspace/decision.md
Runtime role | Decision-engine doctrine for explore, evidence, confidence, and confirmation.
Controls | How agents choose what to inspect next and how they decide whether a path is true.
Generated route | /os/agent-context/decision

Source document

decision process

mandatory workspace app transport

you are working inside the workspace mcp app. the app exposes exactly two tools:
  • workspace.get_steering()
  • workspace.call({ tool, input, taskSession, timeout })
normal decision-engine workflow uses structured workspace.call inputs. do not construct nested shell strings or JSON-in-shell command strings for normal workspace operations.
await workspace.call({
  tool: "explore",
  input: { query: "how does auth work" },
  timeout: 120,
})
there are no per-operation mcp tools beyond get_steering and call. if a call fails, inspect the returned envelope and fix the typed input or implementation.
alignment is the number one thing this system protects. if an agent does not know what to inspect next, it should use the decision process. if an agent does not know whether a path is right, it should use the decision process. if the evidence conflicts, it should update its belief instead of pretending the first answer was correct.
this file is the why, judgment, and operating doctrine for the workspace explore, evidence, confidence, and confirmation system.
  • procedural command details belong in packages/workspace/SCRIPTS.md
  • coding standards belong in AGENTS.md and CODING-STANDARDS.md
  • task-specific evidence belongs in .task/evidence-log.json, .task/explore-state.json, and the task workpad
  • handoffs belong in memory or tmp/context files
do not turn this file into a script reference. this file should teach agents how to think with the system, what the signals mean, when to trust them, and when to stop.

worker delegation naming decision

Use neutral workspace tool names to reduce safety-filter collisions and keep provider names abbreviated. worker.call is the provider-agnostic delegation contract; cdx, pi, and opc are provider codes. mini is a legacy/profile name normalized to pi with profile: mini. Codex App Server, cloud sessions, MCP integration, and A2A integration are intentionally deferred.

1. what this system is

the decision process is not “better search.” it is a repo-aware decision loop for coding agents. it answers two questions:
  • what should i do next?
  • how do i know this path is right?
the system combines:
  • git-aware repository indexing
  • tree-sitter structural chunking
  • local qwen embeddings
  • graph expansion across imports, tests, callers, and siblings
  • evidence events
  • belief updates
  • next-action policy
  • confirmation through tests, verify, and runtime truth
the important idea: retrieval starts the investigation. evidence decides the investigation. qwen tells the agent where to look first. the evidence ledger tells the agent whether that direction is becoming more or less true.

2. mental model

think of the system as a markov-style decision process over agent work. current state:
  • current explore results
  • files already read
  • tests inspected or run
  • verify results
  • runtime checks
  • edits made
  • contradictions found
  • current hypotheses
  • belief scores for likely paths
actions:
  • explore a question
  • read a file
  • inspect a test
  • mark a file relevant or irrelevant
  • run verify
  • run a targeted test
  • check runtime logs
  • exploit the best path
  • confirm the fix
transitions:
  • every action changes the state
  • every useful observation becomes an evidence event
  • every evidence event updates beliefs
policy:
  • decideNext chooses the highest-value next action from the current state
  • good policy balances relevance, uncertainty, graph reach, and confirmation value
terminal truth:
  • confirm closes the loop
  • tests, verify, runtime logs, and production behavior matter more than retrieval scores
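the state, action, and transition vocabulary above can be sketched as types. this is a minimal illustration only: every name here (DecisionState, EvidenceEvent, applyEvent) is hypothetical and not taken from the workspace implementation, and belief updates are left out to keep the transition itself visible.

```typescript
// hypothetical shapes for illustration; the real workspace state may differ.
type EvidenceKind = "file.read" | "test.pass" | "test.fail" | "verify.pass" | "verify.fail";

interface EvidenceEvent {
  kind: EvidenceKind;
  target: string; // file path or command
  at: string; // ISO timestamp
}

interface DecisionState {
  filesRead: string[];
  events: EvidenceEvent[];
  beliefs: Record<string, number>; // path -> belief score in [0, 1]
}

// transition: an observation becomes an evidence event and yields a new state.
// the previous state is never mutated, so earlier snapshots stay inspectable.
function applyEvent(state: DecisionState, event: EvidenceEvent): DecisionState {
  const isNewRead = event.kind === "file.read" && state.filesRead.indexOf(event.target) === -1;
  return {
    ...state,
    filesRead: isNewRead ? [...state.filesRead, event.target] : state.filesRead,
    events: [...state.events, event],
  };
}
```

a policy such as decideNext would then be a function from DecisionState to one recommended action.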

3. retrieval is a prior

retrieval is the prior distribution over where to look. it is not a conclusion. it is not proof. it is not enough to raise confidence by itself. high semantic similarity means:
this file is probably relevant enough to inspect.
it does not mean:
this file is the root cause.
this path is correct.
the fix is proven.
the failure mode is subtle: an agent sees a high score, treats the result as proof, edits too early, and then uses passing syntax checks as a fake confirmation. the correct behavior:
  1. use explore to find candidate files
  2. use decideNext to pick the next action
  3. read or test the recommended target
  4. let the evidence event update beliefs
  5. repeat until the system has enough evidence to exploit
  6. confirm with runtime or validation truth

4. evidence is the source of confidence

confidence comes from observations, not from search. evidence includes:
  • a file was read
  • a file was marked relevant
  • a file was marked irrelevant
  • a test passed
  • a test failed
  • verify passed
  • verify failed
  • runtime logs were clean
  • runtime logs showed an error
  • an edit was made
  • a contradiction was found
  • a hypothesis was confirmed
  • a hypothesis was weakened
evidence should update beliefs. evidence should not permanently lock beliefs. the right posture is:
given what i have seen so far, this path is more or less likely now.
not:
the first plausible file is correct because it looked relevant.
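the "update, but never lock" posture can be sketched as a bayes update in odds form. the likelihood ratios, the "file.relevant" style event names, and the clamp bounds are illustrative assumptions, not values from the workspace implementation.

```typescript
// hypothetical likelihood ratios: above 1 supports the path, below 1 weakens it.
const LIKELIHOOD_RATIO: Record<string, number> = {
  "file.relevant": 3,
  "file.irrelevant": 0.2,
  "hypothesis.confirmed": 4,
  "hypothesis.weakened": 0.5,
};

// keep beliefs away from exactly 0 or 1 so later evidence can still move them.
function clamp(p: number): number {
  return Math.min(0.999, Math.max(0.001, p));
}

// bayes update in odds form: posterior odds = prior odds * likelihood ratio.
function updateBelief(prior: number, eventKind: string): number {
  const ratio = LIKELIHOOD_RATIO[eventKind] ?? 1; // unknown events leave the belief unchanged
  const p = clamp(prior);
  const odds = (p / (1 - p)) * ratio;
  return odds / (1 + odds);
}
```

with these assumed ratios, a belief of 0.5 moves to 0.75 after one "file.relevant" event and drops again after a "file.irrelevant" event, which is the "more or less likely now" behavior described above.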

5. command family

the system has six public commands. each command should read or write decision state.

explore

use explore when the agent needs to know what files or paths are likely relevant.
await workspace.call({ tool: "explore", input: { query: "how does the dialer queue work?" }, timeout: 120 })
what it does:
  • ensures the local index exists
  • embeds the query with qwen
  • searches structural chunks
  • expands through graph connections
  • ranks results
  • writes explore state
  • writes explore.result evidence
what to remember:
  • explore is not confirmation
  • graph connections matter
  • structural reasons matter
  • results should include implementation files, tests, and connected context

decideNext

use decideNext when the agent has evidence and needs the next best action.
await workspace.call({ tool: "decideNext", input: {}, timeout: 120 })
what it does:
  • reads explore state
  • reads evidence events
  • updates beliefs
  • estimates information value
  • recommends one action
examples of correct actions:
  • read the highest-value unread file
  • inspect a connected test
  • mark a file relevant or irrelevant
  • exploit a concentrated belief
  • run confirm after validation evidence exists
manual fallback:
await workspace.call({ tool: "decideNext", input: { markRead: "packages/dialer/src/dialer.ts" }, timeout: 120 })
await workspace.call({ tool: "decideNext", input: { markRelevant: "packages/dialer/src/dialer.ts" }, timeout: 120 })
await workspace.call({ tool: "decideNext", input: { markIrrelevant: "packages/dialer/src/types.ts" }, timeout: 120 })
manual marking exists because not every read happens through the normal workspace scripts. automatic read tracking is still preferred.

confidenceScore

use confidenceScore when the agent needs to know how justified the current path is.
await workspace.call({ tool: "confidenceScore", input: {}, timeout: 120 })
what it does:
  • reads evidence
  • updates beliefs
  • separates evidence for, evidence against, and uncertainty
  • reports a score
what to remember:
  • qwen candidate count is context, not evidence for correctness
  • high top posterior is useful, but not final proof
  • contradictory evidence should lower confidence
  • confidence should start low on a cold task

exploit

use exploit when the system has enough evidence to stop exploring and commit to a path.
await workspace.call({ tool: "exploit", input: {}, timeout: 120 })
what it does:
  • selects the current strongest target
  • records exploitation state
  • names the target file and related context files
  • narrows the editing surface
what to remember:
  • exploit is a transition from investigation into editing
  • exploit should not happen just because one result scored high
  • exploit should happen when evidence is concentrated enough to act

confirm

use confirm when the agent needs validation truth.
await workspace.call({ tool: "confirm", input: { verify: true }, timeout: 120 })
what it does:
  • piggybacks on packages/workspace/scripts/verify.js
  • parses verify/test/runtime output
  • writes confirmation evidence
  • updates the explore state
  • gives a verdict
what to remember:
  • confirm is where belief meets reality
  • parse failures are failures
  • a passing command that cannot be interpreted is not positive evidence
  • runtime checks matter for production behavior
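the rule that parse failures are failures can be sketched as a small interpreter for command output. the { ok: boolean } output shape and the interpretOutput name are assumptions for illustration; the real verify output format is not described in this file.

```typescript
type Verdict = "pass" | "fail";

// assumption: the command prints a single JSON object with a boolean `ok` field.
function interpretOutput(exitCode: number, stdout: string): Verdict {
  if (exitCode !== 0) return "fail";
  try {
    const parsed: unknown = JSON.parse(stdout);
    const ok = typeof parsed === "object" && parsed !== null && (parsed as { ok?: unknown }).ok === true;
    return ok ? "pass" : "fail";
  } catch {
    // a zero exit code with unreadable output is not positive evidence.
    return "fail";
  }
}
```

the only path to "pass" is output that was both produced and understood.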

audit

use audit when the agent needs to know whether the script and doc surface is truthful.
await workspace.call({ tool: "audit", input: { scripts: true }, timeout: 120 })
what it does:
  • compares documented scripts against package scripts
  • detects undocumented or missing commands
  • can check docs and index freshness
what to remember:
  • undocumented scripts are drift
  • missing docs are part of the bug
  • script behavior and script docs should change together

6. default workflow

use this flow when starting from a question or bug:
await workspace.call({ tool: "explore", input: { query: "{question or goal}" }, timeout: 120 })
await workspace.call({ tool: "decideNext", input: {}, timeout: 120 })
then do the recommended action. if it says to read a file, read the file through the workspace script so the read can be tracked:
await workspace.call({ tool: "fs.read", taskSession, input: { path: "{path}" }, timeout: 120 })
if the read happened outside normal tracking, mark it:
await workspace.call({ tool: "decideNext", input: { markRead: "{path}" }, timeout: 120 })
then check confidence:
await workspace.call({ tool: "confidenceScore", input: {}, timeout: 120 })
repeat:
decideNext -> action -> evidence -> confidenceScore
when confidence is concentrated enough:
await workspace.call({ tool: "exploit", input: {}, timeout: 120 })
after editing and validation:
await workspace.call({ tool: "confirm", input: { verify: true }, timeout: 120 })
before pushing script/workflow changes:
await workspace.call({ tool: "audit", input: { scripts: true }, timeout: 120 })

7. what each signal means

embedding similarity

embedding similarity means the query and chunk are close in meaning. use it to decide where to look first. do not use it as proof. good:
the dialer service matched the dialer queue question, so inspect it.
bad:
the dialer service matched, so the bug must be there.

structural chunk type

structural chunk type tells the agent what kind of code matched. stronger structural reasons:
  • class
  • method
  • function
  • type
  • export
weaker structural reasons:
  • block
  • large fallback chunk
structural chunks matter because code search should preserve code units. cutting functions in the middle makes search less useful.

graph connections

graph connections tell the agent what code sits near the candidate in the actual program. use graph connections to find:
  • imports
  • imported-by files
  • tests
  • callers
  • callees
  • same-directory siblings
graph expansion is the difference between semantic search and repo reasoning. if graph connections are empty for everything, the system is degraded. do not accept that as normal.

recency

recency tells the agent where the repo has changed recently. use it as a weak signal. recent code is not automatically relevant. old code is not automatically irrelevant.

current diff relevance

current diff relevance tells the agent whether a file is active in the current task branch. use it to prioritize live work. do not let it hide untouched root-cause files.

belief posterior

posterior belief is the current system estimate after retrieval and evidence. use it to decide when to keep exploring versus exploit. do not confuse it with verification.

information value

information value estimates how much a next action could teach the agent. a lower-belief file can be worth reading if it reduces uncertainty or touches many connected paths. this is why decideNext should not always pick the top score.
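a rough sketch of why the top score does not always win: score each unread candidate by its uncertainty, scaled by graph reach. the entropy-times-reach heuristic and all names below are assumptions for illustration, not the formula decideNext uses.

```typescript
interface Candidate {
  path: string;
  belief: number; // current belief that this path matters, in [0, 1]
  connections: number; // number of graph connections
  read: boolean;
}

// binary entropy in bits: highest near 0.5, zero when the belief is certain.
function entropy(p: number): number {
  if (p <= 0 || p >= 1) return 0;
  return -(p * Math.log2(p) + (1 - p) * Math.log2(1 - p));
}

// hypothetical heuristic: uncertainty scaled by graph reach. already-read files teach nothing new.
function informationValue(c: Candidate): number {
  if (c.read) return 0;
  return entropy(c.belief) * Math.log2(2 + c.connections);
}

function pickNext(candidates: Candidate[]): Candidate | undefined {
  return [...candidates].sort((a, b) => informationValue(b) - informationValue(a))[0];
}
```

under this heuristic an unread file at belief 0.5 with six connections outranks an unread file at belief 0.9 with one connection, because reading it resolves more uncertainty.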

8. cold start behavior

on a new task, there may be no evidence yet. in that state:
  • qwen embeddings provide the prior
  • graph expansion provides connected context
  • confidence should stay modest
  • decideNext should usually recommend reading
cold start confidence should not begin high just because retrieval looks good. the standard:
before evidence: probably relevant
after evidence: increasingly justified or weakened
after confirm: proven or rejected

9. when to exploit

exploit when the system has enough evidence to act. good exploit signals:
  • top belief is high
  • gap between top path and alternatives is meaningful
  • important connected files have been read
  • tests or callers support the path
  • contradictions are resolved or understood
  • the edit surface is clear
bad exploit signals:
  • only one explore was run
  • no files were read
  • tests exist but were ignored
  • confidence came mostly from retrieval score
  • graph connections are missing
  • runtime evidence conflicts with the hypothesis
exploit is not “start coding because impatient.” exploit is “the evidence is strong enough to stop wandering.”
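the good and bad signals above can be read as a gate that returns reasons not to exploit yet. the thresholds (0.7 and 0.2) and the field names are illustrative assumptions, not values from the workspace implementation.

```typescript
interface ExploitInputs {
  topBelief: number;
  runnerUpBelief: number;
  filesRead: number;
  unresolvedContradictions: number;
}

// returns the reasons exploit should wait. an empty list means the gate is open.
// thresholds are assumptions chosen only to make the example concrete.
function exploitBlockers(s: ExploitInputs, minTop = 0.7, minGap = 0.2): string[] {
  const blockers: string[] = [];
  if (s.filesRead === 0) blockers.push("no files were read");
  if (s.topBelief < minTop) blockers.push("top belief is not high");
  if (s.topBelief - s.runnerUpBelief < minGap) blockers.push("gap to alternatives is too small");
  if (s.unresolvedContradictions > 0) blockers.push("contradictions are unresolved");
  return blockers;
}
```

returning reasons instead of a boolean keeps the decision inspectable: the agent can report exactly which evidence is still missing.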

10. when to confirm

confirm after the system has something real to validate. confirmation can include:
  • workspace.call({ tool: "verify", taskSession, input: {} })
  • targeted tests
  • runtime logs
  • production or browser verification
  • api responses
  • deployment health
use the most relevant truth source. examples:
  • script change: run the script and audit docs
  • repo workflow change: run the full command chain
  • dialer runtime change: check railway logs and the real call path
  • ui change: browser verification
  • api change: call the endpoint
confirmation should write evidence. confirmation should not be a private mental note.

11. evidence ledger rules

the evidence ledger is durable state for the task. primary task file:
.task/evidence-log.json
queryable mirror:
~/.cache/workspace-index/{repo-hash}/index.db
the json file exists so agents and humans can inspect task evidence easily. the sqlite mirror exists so the system can rank, query, and improve later. every event should be:
  • specific
  • timestamped
  • tied to a real action
  • connected to files or commands when possible
  • honest about pass, fail, relevance, or uncertainty
bad evidence:
looked good
probably fixed
seems related
good evidence:
file.read packages/dialer/src/dialer.ts
test.fail packages/dialer/src/dialer.test.ts
verify.pass workspace.call({ tool: "verify", taskSession, input: {} })
runtime.clean railway errors query returned no new errors
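the difference between good and bad evidence can be checked mechanically. the event shape, the category.outcome naming pattern, and validateEvent are assumptions inferred from the examples above, not the actual ledger schema.

```typescript
interface LedgerEvent {
  kind: string; // e.g. "file.read", "test.fail"
  target: string; // file path or command the event is tied to
  timestamp: string; // ISO 8601
}

// assumption: kinds follow a `category.outcome` pattern like the good examples.
const KIND_PATTERN = /^[a-z]+\.[a-z]+$/;

// returns a list of problems. an empty list means the event is specific enough to record.
function validateEvent(e: LedgerEvent): string[] {
  const problems: string[] = [];
  if (!KIND_PATTERN.test(e.kind)) problems.push("kind must look like category.outcome");
  if (e.target.trim() === "") problems.push("target file or command is missing");
  if (isNaN(Date.parse(e.timestamp))) problems.push("timestamp is not parseable");
  return problems;
}
```

a note like "looked good" fails all three checks: it has no typed kind, no target, and no time.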

12. read tracking

read tracking protects against fake confidence. an agent should not claim it understands a path because explore returned a file. it understands more only after reading evidence-producing files. preferred behavior:
  • reads through workspace.call({ tool: "fs.read", ... }) should create file.read evidence automatically
  • direct reads outside the wrapper should be manually marked
manual fallback:
await workspace.call({ tool: "decideNext", input: { markRead: "{path}" }, timeout: 120 })
if a file was read and the system does not know it, the next decision will be weaker. if a file was not read and the system thinks it was, confidence becomes fake. protect the read log.

13. graph expansion

graph expansion follows code relationships after retrieval finds a candidate. the graph should include:
  • relative imports
  • workspace imports
  • @consuelo/* imports
  • imported-by edges
  • test and tested-by edges
  • best-effort caller and called-by edges
  • sibling edges
the graph is intentionally best-effort. false positives are acceptable when they only add exploration candidates. empty graph connections are not acceptable. if graph connections disappear:
  1. check graph_edges row count
  2. check import resolution
  3. check retriever expansion
  4. check output serialization
without graph expansion, the system falls back toward vector search. that is degraded behavior.
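a one-hop expansion and a degraded-state check can be sketched as follows. the Edge shape, the edge kind names, and both function names are hypothetical; the real graph_edges schema is not described in this file.

```typescript
type EdgeKind = "imports" | "imported-by" | "tests" | "tested-by" | "calls" | "called-by" | "sibling";

interface Edge {
  from: string;
  to: string;
  kind: EdgeKind;
}

// one-hop expansion: for each retrieved file, collect the edges that start at it.
function expand(candidates: string[], edges: Edge[]): Map<string, Edge[]> {
  const result = new Map<string, Edge[]>();
  for (const path of candidates) {
    result.set(path, edges.filter((e) => e.from === path));
  }
  return result;
}

// degraded: there are candidates, but every one of them has zero connections.
function isDegraded(expanded: Map<string, Edge[]>): boolean {
  if (expanded.size === 0) return false;
  return Array.from(expanded.values()).every((list) => list.length === 0);
}
```

a single candidate with no edges is plausible. every candidate with no edges is the signal to start the four checks above.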

14. indexing doctrine

the index is the foundation. first run can be slow. that is acceptable. accuracy is more important than speed.
do not skip embeddings. do not lazy-embed only the top hits. do not reduce chunk quality to make indexing feel faster.
qwen local embeddings are the chosen path unless ko explicitly changes the decision. important rules:
  • every chunk should be embedded before vector search is considered complete
  • missing vectors mean the index is not done
  • tree-sitter chunks should preserve functions, classes, methods, exports, and types
  • oversized files can be bounded, but structural chunking should not be destroyed
  • worktree overlays should avoid full reindexing for task changes
  • content hashes should prevent redundant embedding work
if the model takes a long time, wait. do not silently downgrade the system.
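two of the rules above, content hashes preventing redundant embedding work and missing vectors meaning the index is not done, can be sketched together. everything here is hypothetical: the Chunk shape and function names are invented for illustration, and the FNV-1a hash is a dependency-free stand-in for whatever hash the real index uses.

```typescript
interface Chunk {
  id: string;
  text: string;
}

// FNV-1a (32-bit) over UTF-16 code units. a stand-in only; a real index would likely use a cryptographic hash.
function contentHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

// chunks whose content hash is not yet in the embedded set still need an embedding.
function chunksNeedingEmbedding(chunks: Chunk[], embeddedHashes: Set<string>): Chunk[] {
  return chunks.filter((c) => !embeddedHashes.has(contentHash(c.text)));
}

// the index counts as complete only when no chunk is missing a vector.
function isIndexComplete(chunks: Chunk[], embeddedHashes: Set<string>): boolean {
  return chunksNeedingEmbedding(chunks, embeddedHashes).length === 0;
}
```

unchanged chunks are skipped by hash, so re-indexing after a small edit embeds only what changed while still refusing to call a partial index complete.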

15. common failure modes

treating search as truth

symptom:
confidence is high before any file is read.
fix:
lower confidence, read files, collect evidence.

editing before evidence

symptom:
agent runs explore once, edits the first result, and calls syntax checks confirmation.
fix:
run decideNext, inspect connected files/tests, then exploit.

graph connections are empty

symptom:
every explore result has graph_connections: []
fix:
debug graph_edges, import resolution, and retriever expansion before trusting output.

stale evidence wins

symptom:
confidence reports both verify passed and verify failed.
fix:
latest event wins for repeated validation categories unless both are relevant to different scopes.
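the "latest event wins, unless scopes differ" rule can be sketched as a reduction keyed by category and scope. the ValidationEvent shape and latestPerScope are hypothetical names for illustration.

```typescript
interface ValidationEvent {
  category: string; // e.g. "verify", "test"
  scope: string; // e.g. a package or test file the result applies to
  outcome: "pass" | "fail";
  timestamp: string; // ISO 8601
}

// keep only the newest event for each (category, scope) pair.
// events with different scopes never overwrite each other.
function latestPerScope(events: ValidationEvent[]): ValidationEvent[] {
  const latest = new Map<string, ValidationEvent>();
  for (const e of events) {
    const key = `${e.category}::${e.scope}`;
    const current = latest.get(key);
    if (!current || Date.parse(e.timestamp) >= Date.parse(current.timestamp)) {
      latest.set(key, e);
    }
  }
  return Array.from(latest.values());
}
```

a verify failure followed by a verify pass on the same scope collapses to the pass, while a failure on a different scope stays visible.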

type files outrank implementation files

symptom:
types.ts is ranked above the service/class that actually runs the workflow.
fix:
prefer implementation chunks when similarity is close. types are context, not usually the root behavior.

utility hubs dominate

symptom:
logger, config, or generic helpers rank too high because many files connect to them.
fix:
centrality should be weighted by useful graph relationships and query relevance.

command output is not valid json

symptom:
--json output contains progress logs, warnings, or human text.
fix:
keep machine output parseable. route human progress to stderr or suppress it.

evidence is not mirrored

symptom:
.task/evidence-log.json updates, but sqlite evidence tables do not.
fix:
repair store mirroring and close db handles in finally blocks.

16. how to reason with contradictions

contradictions are not embarrassing. contradictions are useful. examples:
  • a file looked semantically relevant, but reading it showed it was only a type barrel
  • a test passed, but runtime logs still show the error
  • verify passed, but a targeted command failed
  • the top three results are from unrelated subsystems
  • a file in the index was deleted or moved
the correct response is not to hide the contradiction. the correct response:
  1. write the contradiction as evidence
  2. lower confidence
  3. ask what observation would resolve it
  4. let decideNext pick or inform the next action
beliefs should move when reality pushes them.

17. how to write good questions

good explore questions name the behavior, not just a keyword. weak:
await workspace.call({ tool: "explore", input: { query: "queue" }, timeout: 120 })
better:
await workspace.call({ tool: "explore", input: { query: "how does the dialer queue choose the next call?" }, timeout: 120 })
weak:
await workspace.call({ tool: "explore", input: { query: "auth" }, timeout: 120 })
better:
await workspace.call({ tool: "explore", input: { query: "how does authentication create and refresh workspace tokens?" }, timeout: 120 })
the query should describe the job the code performs. qwen can bridge synonyms, but it still needs a meaningful task.

18. how agents should use this during coding

before reading random files:
await workspace.call({ tool: "explore", input: { query: "{goal}" }, timeout: 120 })
await workspace.call({ tool: "decideNext", input: {}, timeout: 120 })
before editing:
await workspace.call({ tool: "confidenceScore", input: {}, timeout: 120 })
await workspace.call({ tool: "exploit", input: {}, timeout: 120 })
after editing:
await workspace.call({ tool: "confirm", input: { verify: true }, timeout: 120 })
before claiming done:
await workspace.call({ tool: "audit", input: { scripts: true }, timeout: 120 })
if the recommendation feels wrong:
  • inspect the evidence
  • inspect the graph connections
  • mark relevant or irrelevant files explicitly
  • rerun decideNext
do not ignore the system silently. either follow it or correct its state.

19. when to stop and ask ko

stop and ask ko when:
  • the evidence conflicts with the stated goal
  • the system recommends a path that would change product architecture
  • the next action would be destructive
  • production or customer behavior is affected and the correct tradeoff is unclear
  • the index appears corrupted and a full rebuild would be expensive or disruptive
  • the user’s instruction conflicts with observed repo truth
do not ask ko before:
  • running explore
  • reading recommended files
  • checking evidence
  • running audit
  • running confirm
  • verifying the current state
investigate first. ask only when the remaining question is judgment, not missing information.

20. what good looks like

good agent behavior:
  • starts with a clear question
  • explores the repo semantically
  • follows graph connections
  • reads the highest-value files
  • records evidence automatically
  • uses confidence as a live belief, not a badge
  • exploits only when evidence is concentrated
  • confirms with validation truth
  • reports what was proven and what remains uncertain
bad agent behavior:
  • searches keywords manually and ignores the index
  • treats top retrieval result as root cause
  • edits before reading
  • ignores tests and callers
  • claims confidence without evidence
  • hides contradictions
  • bypasses confirm
  • leaves docs and scripts drifting
the goal is not to make agents sound more certain. the goal is to make agents become more correct.

21. response pattern after using the system

when reporting back, lead with the result and evidence. use:
tl;dr: current best path and status.

evidence:
- explore found ...
- read evidence says ...
- tests/verify/runtime say ...
- confidence is ...

action:
- next action, exploit target, or confirmed done.
do not say:
i think it is probably fine.
say:
confidence is 0.64. the top path is dialer.ts, supported by one read and one connected test. remaining uncertainty is the queue worker path, which has not been read.
precision beats vibes.
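the report template can be generated from structured fields so that the confidence number and evidence lines are never left out. the Report shape and formatReport are hypothetical and only mirror the template above.

```typescript
interface Report {
  summary: string; // current best path and status
  evidence: string[]; // one line per observation
  confidence: number; // 0..1
  action: string; // next action, exploit target, or confirmed done
}

// renders the tl;dr / evidence / action layout used in this section.
function formatReport(r: Report): string {
  return [
    `tl;dr: ${r.summary}`,
    "",
    "evidence:",
    ...r.evidence.map((line) => `- ${line}`),
    `- confidence is ${r.confidence.toFixed(2)}`,
    "",
    "action:",
    `- ${r.action}`,
  ].join("\n");
}
```

because confidence is a required numeric field, a report cannot fall back to "probably fine".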

22. default behavior summary

retrieval is a prior. evidence updates belief. confidence comes from observations. graph expansion follows the program. decideNext is the policy. exploit is the commit point. confirm is truth. audit protects against drift.
do not optimize only for better search. do not optimize only for fewer reads. do not optimize only for fast indexing. optimize for better decisions from accumulated evidence.
the system is working when an agent can say:
i know what to do next because the current evidence makes that action highest value.
i know whether it was right because validation and runtime truth updated the belief.
Keep worker.call as the neutral facade entrypoint and expose bun run worker -- call ... as the human/Codex CLI wrapper. Both paths must call the same worker runtime module; executor-owned provider implementations should be refactored into dedicated runtime modules.