The evidence loop that turns retrieval, reads, tests, and runtime signals into next actions.

This page is generated from `packages/workspace/decision.md`. Edit the source Markdown, then run `bun run --cwd packages/consuelo-docs generate-os-source-docs` to refresh the public docs.

What this file controls

| Field | Value |
| --- | --- |
| Source file | `packages/workspace/decision.md` |
| Runtime role | Decision-engine doctrine for explore, evidence, confidence, and confirmation. |
| Controls | How agents choose what to inspect next and how they decide whether a path is true. |
| Generated route | `/os/agent-context/decision` |

Source document

decision process

mandatory workspace app transport

you are working inside the workspace mcp app. the app exposes exactly two tools:

workspace.get_steering()
workspace.call({ tool, input, taskSession, timeout })

normal decision-engine workflow uses structured workspace.call inputs. do not construct nested shell strings or JSON-in-shell command strings for normal workspace operations.

await workspace.call({
  tool: "explore",
  input: { query: "how does auth work" },
  timeout: 120,
})
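
a minimal sketch of keeping calls structured and reading a failed result. the request fields mirror the call above; the CallEnvelope shape (ok, result, error), the string type for taskSession, and the helper names buildExplore and describeFailure are assumptions for illustration, not the app's documented contract.

```typescript
// Request fields follow workspace.call({ tool, input, taskSession, timeout }).
interface CallRequest {
  tool: string;
  input: Record<string, unknown>;
  taskSession?: string; // assumed to be a string identifier
  timeout?: number;
}

// Hypothetical envelope shape; the real envelope may differ.
interface CallEnvelope {
  ok: boolean;
  result?: unknown;
  error?: { message: string; field?: string };
}

// Build a structured request instead of a shell or JSON-in-shell string.
function buildExplore(query: string, timeout = 120): CallRequest {
  return { tool: "explore", input: { query }, timeout };
}

// Turn a failed envelope into a concrete thing to fix; null means success.
function describeFailure(req: CallRequest, env: CallEnvelope): string | null {
  if (env.ok) return null;
  const where = env.error?.field ? `input.${env.error.field}` : "input";
  const message = env.error?.message ?? "unknown error";
  return `${req.tool} failed: ${message} (check ${where})`;
}
```

the point of the sketch is that a failure leads back to the typed input, not to a retry with a hand-built command string.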

there are no per-operation mcp tools beyond get_steering and call. if a call fails, inspect the returned envelope and fix the typed input or implementation.

alignment is the number one thing this system protects. if an agent does not know what to inspect next, it should use the decision process. if an agent does not know whether a path is right, it should use the decision process. if the evidence conflicts, it should update its belief instead of pretending the first answer was correct.

this file is the why, judgment, and operating doctrine for the workspace explore, evidence, confidence, and confirmation system. procedural command details belong in packages/workspace/SCRIPTS.md. coding standards belong in AGENTS.md and CODING-STANDARDS.md. task-specific evidence belongs in .task/evidence-log.json, .task/explore-state.json, and the task workpad. handoffs belong in memory or tmp/context files.

do not turn this file into a script reference. this file should teach agents how to think with the system, what the signals mean, when to trust them, and when to stop.

worker delegation naming decision

Use neutral workspace tool names to reduce safety-filter collisions and keep provider names abbreviated. worker.call is the provider-agnostic delegation contract; cdx, pi, and opc are provider codes. mini is a legacy/profile name normalized to pi with profile: mini. Codex App Server, cloud sessions, MCP integration, and A2A integration are intentionally deferred.

1. what this system is

the decision process is not “better search.” it is a repo-aware decision loop for coding agents. it answers two questions:

what should i do next?
how do i know this path is right?

the system combines:

git-aware repository indexing
tree-sitter structural chunking
local qwen embeddings
graph expansion across imports, tests, callers, and siblings
evidence events
belief updates
next-action policy
confirmation through tests, verify, and runtime truth

the important idea: retrieval starts the investigation. evidence decides the investigation. qwen tells the agent where to look first. the evidence ledger tells the agent whether that direction is becoming more or less true.

2. mental model

think of the system as a markov-style decision process over agent work. current state:

current explore results
files already read
tests inspected or run
verify results
runtime checks
edits made
contradictions found
current hypotheses
belief scores for likely paths

actions:

explore a question
read a file
inspect a test
mark a file relevant or irrelevant
run verify
run a targeted test
check runtime logs
exploit the best path
confirm the fix

transitions:

every action changes the state
every useful observation becomes an evidence event
every evidence event updates beliefs

policy:

decideNext chooses the highest-value next action from the current state
good policy balances relevance, uncertainty, graph reach, and confirmation value

terminal truth:

confirm closes the loop
tests, verify, runtime logs, and production behavior matter more than retrieval scores
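
the state, action, and transition vocabulary above can be sketched as plain types and a pure reducer. every name here (DecisionState, EvidenceEvent, transition, the "contradiction" kind) is illustrative; the real state files may be shaped differently, and the belief update itself is left out to keep the transition step visible.

```typescript
type EvidenceKind =
  | "file.read"
  | "test.pass"
  | "test.fail"
  | "verify.pass"
  | "verify.fail"
  | "contradiction"; // hypothetical kind name

interface EvidenceEvent {
  kind: EvidenceKind;
  target: string; // file path or command
}

interface DecisionState {
  filesRead: string[];
  contradictions: string[];
  beliefs: Record<string, number>; // path -> belief score in [0, 1]
  events: EvidenceEvent[];
}

// Transition: every observation becomes an event and produces a new state.
// The input state is never mutated, so earlier states stay inspectable.
function transition(state: DecisionState, event: EvidenceEvent): DecisionState {
  const next: DecisionState = {
    filesRead: [...state.filesRead],
    contradictions: [...state.contradictions],
    beliefs: { ...state.beliefs },
    events: [...state.events, event],
  };
  if (event.kind === "file.read" && !next.filesRead.includes(event.target)) {
    next.filesRead.push(event.target);
  }
  if (event.kind === "contradiction") {
    next.contradictions.push(event.target);
  }
  return next;
}
```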

3. retrieval is a prior

retrieval is the prior distribution over where to look. it is not a conclusion. it is not proof. it is not enough to raise confidence by itself. high semantic similarity means:

this file is probably relevant enough to inspect.

it does not mean:

this file is the root cause.
this path is correct.
the fix is proven.

the failure mode is subtle: an agent sees a high score, treats the result as proof, edits too early, and then uses passing syntax checks as a fake confirmation. the correct behavior:

use explore to find candidate files
use decideNext to pick the next action
read or test the recommended target
let the evidence event update beliefs
repeat until the system has enough evidence to exploit
confirm with runtime or validation truth
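
one way to picture "retrieval is a prior" is to normalize retrieval scores into a distribution over candidate paths. this is a sketch under assumptions: priorFromScores is a hypothetical helper, and simple sum-normalization stands in for whatever the real ranking does.

```typescript
// Normalize non-negative retrieval scores into a prior that sums to 1.
function priorFromScores(scores: Record<string, number>): Record<string, number> {
  const entries = Object.entries(scores).map(
    ([path, score]) => [path, Math.max(score, 0)] as const,
  );
  const total = entries.reduce((sum, [, score]) => sum + score, 0);
  const prior: Record<string, number> = {};
  for (const [path, score] of entries) {
    // With no usable scores, fall back to a uniform prior.
    prior[path] = total > 0 ? score / total : 1 / entries.length;
  }
  return prior;
}
```

even a top similarity of 0.9 becomes a prior below 0.5 once two other plausible files are in play, which is why a high score alone is a reason to inspect, not a conclusion.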

4. evidence is the source of confidence

confidence comes from observations, not from search. evidence includes:

a file was read
a file was marked relevant
a file was marked irrelevant
a test passed
a test failed
verify passed
verify failed
runtime logs were clean
runtime logs showed an error
an edit was made
a contradiction was found
a hypothesis was confirmed
a hypothesis was weakened

evidence should update beliefs. evidence should not permanently lock beliefs. the right posture is:

given what i have seen so far, this path is more or less likely now.

not:

the first plausible file is correct because it looked relevant.
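
a minimal sketch of "evidence should update beliefs, not lock them", using an odds-form bayes update. the likelihood ratios, the event names file.relevant and file.irrelevant, and the 0.01 to 0.99 clamp are all assumptions chosen for illustration, not values from the workspace implementation.

```typescript
// Illustrative likelihood ratios: above 1 supports the path, below 1 weakens it.
const LIKELIHOOD_RATIO: Record<string, number> = {
  "file.relevant": 2.0,
  "file.irrelevant": 0.3,
  "test.pass": 1.5,
  "test.fail": 0.5,
  "verify.pass": 3.0,
  "verify.fail": 0.2,
};

// Keep beliefs away from 0 and 1 so later evidence can always move them.
const clampBelief = (x: number): number => Math.min(0.99, Math.max(0.01, x));

// Odds-form Bayes update: posterior odds = prior odds * likelihood ratio.
function updateBelief(belief: number, kind: string): number {
  const ratio = LIKELIHOOD_RATIO[kind] ?? 1; // unknown evidence leaves belief unchanged
  const b = clampBelief(belief);
  const odds = (b / (1 - b)) * ratio;
  return clampBelief(odds / (1 + odds));
}
```

with these placeholder ratios, a belief of 0.5 rises to about 0.67 after a file is marked relevant and falls to about 0.17 after a failed verify. a belief that starts at 1 is still pulled down by contradicting evidence because of the clamp.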

5. command family

the system has six public commands. each command should read or write decision state.

explore

use explore when the agent needs to know what files or paths are likely relevant.

await workspace.call({ tool: "explore", input: { query: "how does the dialer queue work?" }, timeout: 120 })

what it does:

ensures the local index exists
embeds the query with qwen
searches structural chunks
expands through graph connections
ranks results
writes explore state
writes explore.result evidence

what to remember:

explore is not confirmation
graph connections matter
structural reasons matter
results should include implementation files, tests, and connected context

decideNext

use decideNext when the agent has evidence and needs the next best action.

await workspace.call({ tool: "decideNext", input: {}, timeout: 120 })

what it does:

reads explore state
reads evidence events
updates beliefs
estimates information value
recommends one action

examples of correct actions:

read the highest-value unread file
inspect a connected test
mark a file relevant or irrelevant
exploit a concentrated belief
run confirm after validation evidence exists

manual fallback:

await workspace.call({ tool: "decideNext", input: { markRead: "packages/dialer/src/dialer.ts" }, timeout: 120 })
await workspace.call({ tool: "decideNext", input: { markRelevant: "packages/dialer/src/dialer.ts" }, timeout: 120 })
await workspace.call({ tool: "decideNext", input: { markIrrelevant: "packages/dialer/src/types.ts" }, timeout: 120 })

manual marking exists because not every read happens through the normal workspace scripts. automatic read tracking is still preferred.

confidenceScore

use confidenceScore when the agent needs to know how justified the current path is.

await workspace.call({ tool: "confidenceScore", input: {}, timeout: 120 })

what it does:

reads evidence
updates beliefs
separates evidence for, evidence against, and uncertainty
reports a score

what to remember:

qwen candidate count is context, not evidence for correctness
high top posterior is useful, but not final proof
contradictory evidence should lower confidence
confidence should start low on a cold task

exploit

use exploit when the system has enough evidence to stop exploring and commit to a path.

await workspace.call({ tool: "exploit", input: {}, timeout: 120 })

what it does:

selects the current strongest target
records exploitation state
names the target file and related context files
narrows the editing surface

what to remember:

exploit is a transition from investigation into editing
exploit should not happen just because one result scored high
exploit should happen when evidence is concentrated enough to act

confirm

use confirm when the agent needs validation truth.

await workspace.call({ tool: "confirm", input: { verify: true }, timeout: 120 })

what it does:

piggybacks on packages/workspace/scripts/verify.js
parses verify/test/runtime output
writes confirmation evidence
updates the explore state
gives a verdict

what to remember:

confirm is where belief meets reality
parse failures are failures
a passing command that cannot be interpreted is not positive evidence
runtime checks matter for production behavior

audit

use audit when the agent needs to know whether the script and doc surface is truthful.

await workspace.call({ tool: "audit", input: { scripts: true }, timeout: 120 })

what it does:

compares documented scripts against package scripts
detects undocumented or missing commands
can check docs and index freshness

what to remember:

undocumented scripts are drift
missing docs are part of the bug
script behavior and script docs should change together

6. default workflow

use this flow when starting from a question or bug:

await workspace.call({ tool: "explore", input: { query: "{question or goal}" }, timeout: 120 })
await workspace.call({ tool: "decideNext", input: {}, timeout: 120 })

then do the recommended action. if it says to read a file, read the file through the workspace script so the read can be tracked:

await workspace.call({ tool: "fs.read", taskSession, input: { path: "{path}" }, timeout: 120 })

if the read happened outside normal tracking, mark it:

await workspace.call({ tool: "decideNext", input: { markRead: "{path}" }, timeout: 120 })

then check confidence:

await workspace.call({ tool: "confidenceScore", input: {}, timeout: 120 })

repeat:

decideNext -> action -> evidence -> confidenceScore

when confidence is concentrated enough:

await workspace.call({ tool: "exploit", input: {}, timeout: 120 })

after editing and validation:

await workspace.call({ tool: "confirm", input: { verify: true }, timeout: 120 })

before pushing script/workflow changes:

await workspace.call({ tool: "audit", input: { scripts: true }, timeout: 120 })
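
the explore -> decideNext -> action -> confidenceScore loop above can be sketched as a small driver. this is a sketch, not the real policy: the response shapes ({ action, target } from decideNext, { score } from confidenceScore), the 0.7 threshold, the step budget, and the injected call function are assumptions, and taskSession is omitted for brevity.

```typescript
type Call = (req: {
  tool: string;
  input: Record<string, unknown>;
  timeout: number;
}) => Promise<unknown>;

// Returns true once evidence is concentrated enough to exploit.
async function investigate(
  call: Call,
  question: string,
  threshold = 0.7, // placeholder threshold
  maxSteps = 10,   // placeholder budget
): Promise<boolean> {
  await call({ tool: "explore", input: { query: question }, timeout: 120 });
  for (let step = 0; step < maxSteps; step++) {
    // Assumed response shape for decideNext.
    const next = (await call({ tool: "decideNext", input: {}, timeout: 120 })) as {
      action: string;
      target?: string;
    };
    if (next.action === "read" && next.target) {
      // Reading through the tracked tool lets the read become evidence.
      await call({ tool: "fs.read", input: { path: next.target }, timeout: 120 });
    }
    // Assumed response shape for confidenceScore.
    const conf = (await call({ tool: "confidenceScore", input: {}, timeout: 120 })) as {
      score: number;
    };
    if (conf.score >= threshold) {
      await call({ tool: "exploit", input: {}, timeout: 120 });
      return true;
    }
  }
  return false; // budget exhausted without concentrated evidence
}
```

confirm and audit are left outside the loop on purpose, since they come after editing and validation.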

7. what each signal means

embedding similarity

embedding similarity means the query and chunk are close in meaning. use it to decide where to look first. do not use it as proof. good:

the dialer service matched the dialer queue question, so inspect it.

bad:

the dialer service matched, so the bug must be there.

structural chunk type

structural chunk type tells the agent what kind of code matched. stronger structural reasons:

class
method
function
type
export

weaker structural reasons:

block
large fallback chunk

structural chunks matter because code search should preserve code units. cutting functions in the middle makes search less useful.

graph connections

graph connections tell the agent what code sits near the candidate in the actual program. use graph connections to find:

imports
imported-by files
tests
callers
callees
same-directory siblings

graph expansion is the difference between semantic search and repo reasoning. if graph connections are empty for everything, the system is degraded. do not accept that as normal.

recency

recency tells the agent where the repo has changed recently. use it as a weak signal. recent code is not automatically relevant. old code is not automatically irrelevant.

current diff relevance

current diff relevance tells the agent whether a file is active in the current task branch. use it to prioritize live work. do not let it hide untouched root-cause files.

belief posterior

posterior belief is the current system estimate after retrieval and evidence. use it to decide when to keep exploring versus exploit. do not confuse it with verification.

information value

information value estimates how much a next action could teach the agent. a lower-belief file can be worth reading if it reduces uncertainty or touches many connected paths. this is why decideNext should not always pick the top score.
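
a hedged sketch of why a lower-belief file can be the better next read. the scoring formula (binary entropy scaled by graph reach) and the names Candidate, informationValue, and pickNext are assumptions for illustration; the real decideNext policy may weigh things differently.

```typescript
interface Candidate {
  path: string;
  belief: number;     // current posterior in [0, 1]
  graphReach: number; // number of connected files (imports, tests, callers)
  read: boolean;
}

// Binary entropy in bits: highest at belief 0.5, zero at 0 or 1.
function entropy(p: number): number {
  if (p <= 0 || p >= 1) return 0;
  return -(p * Math.log2(p) + (1 - p) * Math.log2(1 - p));
}

// Hypothetical score: uncertainty scaled by how many paths a read could inform.
function informationValue(c: Candidate): number {
  if (c.read) return 0; // reading the same file again teaches nothing new
  return entropy(c.belief) * Math.log2(2 + c.graphReach);
}

// Pick the unread candidate with the highest information value.
function pickNext(candidates: Candidate[]): Candidate | undefined {
  return [...candidates].sort((a, b) => informationValue(b) - informationValue(a))[0];
}
```

under this scoring, a file at belief 0.5 with six connections outranks a file at belief 0.9 with one connection, because reading it resolves more uncertainty.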

8. cold start behavior

on a new task, there may be no evidence yet. in that state:

qwen embeddings provide the prior
graph expansion provides connected context
confidence should stay modest
decideNext should usually recommend reading

cold start confidence should not begin high just because retrieval looks good. the standard:

before evidence: probably relevant
after evidence: increasingly justified or weakened
after confirm: proven or rejected

9. when to exploit

exploit when the system has enough evidence to act. good exploit signals:

top belief is high
gap between top path and alternatives is meaningful
important connected files have been read
tests or callers support the path
contradictions are resolved or understood
the edit surface is clear

bad exploit signals:

only one explore was run
no files were read
tests exist but were ignored
confidence came mostly from retrieval score
graph connections are missing
runtime evidence conflicts with the hypothesis

exploit is not “start coding because impatient.” exploit is “the evidence is strong enough to stop wandering.”
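
the good and bad exploit signals above can be sketched as a gate that explains itself. the thresholds (0.6 top belief, 0.2 gap) and the ExploitInputs shape are placeholders, not values taken from the workspace implementation.

```typescript
interface ExploitInputs {
  beliefs: number[];               // posterior per candidate path
  filesRead: number;
  unresolvedContradictions: number;
  graphConnectionsPresent: boolean;
}

// Returns ok only when no blocking reason remains; reasons name what is missing.
function shouldExploit(
  s: ExploitInputs,
  minTop = 0.6, // placeholder threshold
  minGap = 0.2, // placeholder threshold
): { ok: boolean; reasons: string[] } {
  const sorted = [...s.beliefs].sort((a, b) => b - a);
  const top = sorted[0] ?? 0;
  const gap = top - (sorted[1] ?? 0);
  const reasons: string[] = [];
  if (top < minTop) reasons.push("top belief is low");
  if (gap < minGap) reasons.push("top path is not clearly ahead");
  if (s.filesRead === 0) reasons.push("no files were read");
  if (s.unresolvedContradictions > 0) reasons.push("contradictions are unresolved");
  if (!s.graphConnectionsPresent) reasons.push("graph connections are missing");
  return { ok: reasons.length === 0, reasons };
}
```

returning reasons instead of a bare boolean keeps the decision reportable: an agent can say what evidence is still missing.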

10. when to confirm

confirm after the system has something real to validate. confirmation can include:

workspace.call({ tool: "verify", taskSession, input: {} })
targeted tests
runtime logs
production or browser verification
api responses
deployment health

use the most relevant truth source. examples:

script change: run the script and audit docs
repo workflow change: run the full command chain
dialer runtime change: check railway logs and the real call path
ui change: browser verification
api change: call the endpoint

confirmation should write evidence. confirmation should not be a private mental note.

11. evidence ledger rules

the evidence ledger is durable state for the task. primary task file:

.task/evidence-log.json

queryable mirror:

~/.cache/workspace-index/{repo-hash}/index.db

the json file exists so agents and humans can inspect task evidence easily. the sqlite mirror exists so the system can rank, query, and improve later. every event should be:

specific
timestamped
tied to a real action
connected to files or commands when possible
honest about pass, fail, relevance, or uncertainty

bad evidence:

looked good
probably fixed
seems related

good evidence:

file.read packages/dialer/src/dialer.ts
test.fail packages/dialer/src/dialer.test.ts
verify.pass workspace.call({ tool: "verify", taskSession, input: {} })
runtime.clean railway errors query returned no new errors
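
a small sketch of checking that an event is specific, timestamped, and tied to a real file or command before it is written. the LedgerEvent field names and the category.outcome pattern are assumptions; the actual schema of .task/evidence-log.json is not shown in this document.

```typescript
interface LedgerEvent {
  kind: string;      // e.g. "file.read", "test.fail", "verify.pass", "runtime.clean"
  subject: string;   // file path or command the event is tied to
  timestamp: string; // expected to be an ISO 8601 string
}

// Assumed convention: lowercase category, a dot, lowercase outcome.
const KIND_PATTERN = /^[a-z]+\.[a-z]+$/;

// Returns a list of problems; an empty list means the event is acceptable.
function validateEvent(e: LedgerEvent): string[] {
  const problems: string[] = [];
  if (!KIND_PATTERN.test(e.kind)) problems.push("kind must look like category.outcome");
  if (e.subject.trim() === "") problems.push("subject must name a file or command");
  if (Number.isNaN(Date.parse(e.timestamp))) problems.push("timestamp must be parseable");
  return problems;
}
```

an entry like "looked good" fails on all three checks, while file.read with a path and a timestamp passes.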

12. read tracking

read tracking protects against fake confidence. an agent should not claim it understands a path because explore returned a file. it understands more only after reading evidence-producing files. preferred behavior:

reads through workspace.call({ tool: "fs.read", ... }) should create file.read evidence automatically
direct reads outside the wrapper should be manually marked

manual fallback:

await workspace.call({ tool: "decideNext", input: { markRead: "{path}" }, timeout: 120 })

if a file was read and the system does not know it, the next decision will be weaker. if a file was not read and the system thinks it was, confidence becomes fake. protect the read log.

13. graph expansion

graph expansion follows code relationships after retrieval finds a candidate. the graph should include:

relative imports
workspace imports
@consuelo/* imports
imported-by edges
test and tested-by edges
best-effort caller and called-by edges
sibling edges

the graph is intentionally best-effort. false positives are acceptable when they only add exploration candidates. empty graph connections are not acceptable. if graph connections disappear:

check graph_edges row count
check import resolution
check retriever expansion
check output serialization

without graph expansion, the system falls back toward vector search. that is degraded behavior.
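
a minimal sketch of one-hop graph expansion over an edge list. the Edge shape, the edge kind names, and the expand helper are assumptions made for illustration; the real graph_edges table and retriever are not described here.

```typescript
type EdgeKind =
  | "imports"
  | "imported-by"
  | "tests"
  | "tested-by"
  | "calls"
  | "called-by"
  | "sibling";

interface Edge {
  from: string;
  to: string;
  kind: EdgeKind;
}

// One-hop expansion: collect every file connected to a retrieved candidate,
// along with the relationship kinds that reached it.
function expand(candidates: string[], edges: Edge[]): Map<string, EdgeKind[]> {
  const seeds = new Set(candidates);
  const reached = new Map<string, EdgeKind[]>();
  for (const edge of edges) {
    // Skip edges that do not start at a candidate or that point back at one.
    if (!seeds.has(edge.from) || seeds.has(edge.to)) continue;
    const kinds = reached.get(edge.to) ?? [];
    kinds.push(edge.kind);
    reached.set(edge.to, kinds);
  }
  return reached;
}

// An empty expansion for every candidate is the degraded state described above.
function isDegraded(reached: Map<string, EdgeKind[]>): boolean {
  return reached.size === 0;
}
```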

14. indexing doctrine

the index is the foundation. first run can be slow. that is acceptable. accuracy is more important than speed. do not skip embeddings. do not lazy-embed only the top hits. do not reduce chunk quality to make indexing feel faster. qwen local embeddings are the chosen path unless ko explicitly changes the decision. important rules:

every chunk should be embedded before vector search is considered complete
missing vectors mean the index is not done
tree-sitter chunks should preserve functions, classes, methods, exports, and types
oversized files can be bounded, but structural chunking should not be destroyed
worktree overlays should avoid full reindexing for task changes
content hashes should prevent redundant embedding work

if the model takes a long time, wait. do not silently downgrade the system.
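
a sketch of how content hashes can prevent redundant embedding work while still treating missing vectors as "not done". the FNV-1a hash is a dependency-free stand-in, and the Chunk shape and helper names are hypothetical; a real index would more likely use a cryptographic hash.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units, used here only as a stand-in.
function contentHash(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

interface Chunk {
  id: string;
  text: string;
}

// Return only the chunks whose content changed since the last embedding pass.
// `embedded` maps chunk id -> hash of the text that was embedded.
function chunksToEmbed(chunks: Chunk[], embedded: Map<string, string>): Chunk[] {
  return chunks.filter((c) => embedded.get(c.id) !== contentHash(c.text));
}

// The index is complete only when every chunk has an up-to-date vector.
function indexComplete(chunks: Chunk[], embedded: Map<string, string>): boolean {
  return chunksToEmbed(chunks, embedded).length === 0;
}
```

unchanged chunks are skipped, but a single stale or missing hash keeps indexComplete false, which matches the rule that missing vectors mean the index is not done.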

15. common failure modes

treating search as truth

symptom:

confidence is high before any file is read.

fix:

lower confidence, read files, collect evidence.

editing before evidence

symptom:

agent runs explore once, edits the first result, and calls syntax checks confirmation.

fix:

run decideNext, inspect connected files/tests, then exploit.

graph connections are empty

symptom:

every explore result has graph_connections: []

fix:

debug graph_edges, import resolution, and retriever expansion before trusting output.

stale evidence wins

symptom:

confidence reports both verify passed and verify failed.

fix:

for repeated validation categories, the latest event wins, unless the events apply to different scopes, in which case both stay relevant.
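
a minimal sketch of that rule: keep the newest event per category and scope. the ValidationEvent shape and the latestPerScope helper are assumptions for illustration.

```typescript
interface ValidationEvent {
  category: string; // e.g. "verify" or "test"
  scope: string;    // e.g. a package or a test file
  outcome: "pass" | "fail";
  at: number;       // epoch milliseconds
}

// Keep only the newest event for each (category, scope) pair.
function latestPerScope(events: ValidationEvent[]): ValidationEvent[] {
  const latest = new Map<string, ValidationEvent>();
  for (const e of events) {
    const key = `${e.category}::${e.scope}`;
    const seen = latest.get(key);
    if (!seen || e.at >= seen.at) latest.set(key, e);
  }
  return Array.from(latest.values());
}
```

an older verify.fail is dropped once a newer verify.pass exists for the same scope, while a failure in a different scope survives.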

type files outrank implementation files

symptom:

types.ts is ranked above the service/class that actually runs the workflow.

fix:

prefer implementation chunks when similarity is close. types are context, not usually the root behavior.
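
one hedged way to express "prefer implementation chunks when similarity is close" is a small additive bonus. the 0.05 bonus, the Ranked shape, and the helper names are assumptions; the real ranker may use a different mechanism.

```typescript
interface Ranked {
  path: string;
  similarity: number;
  chunkType: "class" | "method" | "function" | "type" | "export" | "block";
}

// Hypothetical bonus: small enough that a clearly better type match still wins.
const IMPLEMENTATION_BONUS = 0.05;

function isImplementation(r: Ranked): boolean {
  return r.chunkType === "class" || r.chunkType === "method" || r.chunkType === "function";
}

function adjustedScore(r: Ranked): number {
  return r.similarity + (isImplementation(r) ? IMPLEMENTATION_BONUS : 0);
}

// Sort a copy of the results by adjusted score, highest first.
function rank(results: Ranked[]): Ranked[] {
  return [...results].sort((a, b) => adjustedScore(b) - adjustedScore(a));
}
```

an additive bonus keeps the ordering consistent across the whole list, which a pairwise "close enough" comparison would not guarantee.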

utility hubs dominate

symptom:

logger, config, or generic helpers rank too high because many files connect to them.

fix:

centrality should be weighted by useful graph relationships and query relevance.

command output is not valid json

symptom:

--json output contains progress logs, warnings, or human text.

fix:

keep machine output parseable. route human progress to stderr or suppress it.

evidence is not mirrored

symptom:

.task/evidence-log.json updates, but sqlite evidence tables do not.

fix:

repair store mirroring and close db handles in finally blocks.

16. how to reason with contradictions

contradictions are not embarrassing. contradictions are useful. examples:

a file looked semantically relevant, but reading it showed it was only a type barrel
a test passed, but runtime logs still show the error
verify passed, but a targeted command failed
the top three results are from unrelated subsystems
a file in the index was deleted or moved

the correct response is not to hide the contradiction. the correct response:

write the contradiction as evidence
lower confidence
ask what observation would resolve it
let decideNext pick or inform the next action

beliefs should move when reality pushes them.

17. how to write good questions

good explore questions name the behavior, not just a keyword. weak:

await workspace.call({ tool: "explore", input: { query: "queue" }, timeout: 120 })

better:

await workspace.call({ tool: "explore", input: { query: "how does the dialer queue choose the next call?" }, timeout: 120 })

weak:

await workspace.call({ tool: "explore", input: { query: "auth" }, timeout: 120 })

better:

await workspace.call({ tool: "explore", input: { query: "how does authentication create and refresh workspace tokens?" }, timeout: 120 })

the query should describe the job the code performs. qwen can bridge synonyms, but it still needs a meaningful task.

18. how agents should use this during coding

before reading random files:

await workspace.call({ tool: "explore", input: { query: "{goal}" }, timeout: 120 })
await workspace.call({ tool: "decideNext", input: {}, timeout: 120 })

before editing:

await workspace.call({ tool: "confidenceScore", input: {}, timeout: 120 })
await workspace.call({ tool: "exploit", input: {}, timeout: 120 })

after editing:

await workspace.call({ tool: "confirm", input: { verify: true }, timeout: 120 })

before claiming done:

await workspace.call({ tool: "audit", input: { scripts: true }, timeout: 120 })

if the recommendation feels wrong:

inspect the evidence
inspect the graph connections
mark relevant or irrelevant files explicitly
rerun decideNext

do not ignore the system silently. either follow it or correct its state.

19. when to stop and ask ko

stop and ask ko when:

the evidence conflicts with the stated goal
the system recommends a path that would change product architecture
the next action would be destructive
production or customer behavior is affected and the correct tradeoff is unclear
the index appears corrupted and a full rebuild would be expensive or disruptive
the user’s instruction conflicts with observed repo truth

do not ask ko before:

running explore
reading recommended files
checking evidence
running audit
running confirm
verifying the current state

investigate first. ask only when the remaining question is judgment, not missing information.

20. what good looks like

good agent behavior:

starts with a clear question
explores the repo semantically
follows graph connections
reads the highest-value files
records evidence automatically
uses confidence as a live belief, not a badge
exploits only when evidence is concentrated
confirms with validation truth
reports what was proven and what remains uncertain

bad agent behavior:

searches keywords manually and ignores the index
treats top retrieval result as root cause
edits before reading
ignores tests and callers
claims confidence without evidence
hides contradictions
bypasses confirm
leaves docs and scripts drifting

the goal is not to make agents sound more certain. the goal is to make agents become more correct.

21. response pattern after using the system

when reporting back, lead with the result and evidence. use:

tl;dr: current best path and status.

evidence:
- explore found ...
- read evidence says ...
- tests/verify/runtime say ...
- confidence is ...

action:
- next action, exploit target, or confirmed done.

do not say:

i think it is probably fine.

say:

confidence is 0.64. the top path is dialer.ts, supported by one read and one connected test. remaining uncertainty is the queue worker path, which has not been read.

precision beats vibes.
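
a small sketch of producing that kind of report from structured fields instead of free-form impressions. the Report shape and formatReport helper are hypothetical.

```typescript
interface Report {
  confidence: number;
  topPath: string;
  support: string[];   // evidence for the top path
  remaining: string[]; // what is still uncertain
}

// Build a one-line status that names the score, the path, and the open questions.
function formatReport(r: Report): string {
  const support = r.support.length > 0 ? r.support.join(" and ") : "no direct evidence yet";
  const remaining = r.remaining.length > 0 ? r.remaining.join("; ") : "none recorded";
  return (
    `confidence is ${r.confidence.toFixed(2)}. ` +
    `the top path is ${r.topPath}, supported by ${support}. ` +
    `remaining uncertainty: ${remaining}.`
  );
}
```

forcing the report through fields like support and remaining makes a missing read or test visible instead of hiding it behind "probably fine".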

22. default behavior summary

retrieval is a prior. evidence updates belief. confidence comes from observations. graph expansion follows the program. decideNext is the policy. exploit is the commit point. confirm is truth. audit protects drift.

do not optimize only for better search. do not optimize only for fewer reads. do not optimize only for fast indexing. optimize for better decisions from accumulated evidence.

the system is working when an agent can say:

i know what to do next because the current evidence makes that action highest value.
i know whether it was right because validation and runtime truth updated the belief.

Keep worker.call as the neutral facade entrypoint and expose bun run worker -- call ... as the human/Codex CLI wrapper. Both paths must call the same worker runtime module; executor-owned provider implementations should be refactored into dedicated runtime modules.
