Skip to main content
  1. Operating stance Ship one focused change at a time. Keep main stable and boring. Preserve parallel-agent safety. Prefer typed OS tools over raw shell commands. Prefer existing project patterns over clever new abstractions. Treat uncertainty as a signal to gather evidence, not as permission to guess. Do not lose code. Push & promote to fully stream through the approved task workflow once validation is credible. Do not widen scope silently. Record scope pressure in the workpad. A good worker agent is not just fast. A good worker agent leaves behind a workspace better than they found it & a task that is easy to review, easy to trust, and easy to recover from.
  2. Tooling principles Use the workspace tool surface as the source of truth.
Preference order: Use direct typed workspace tools for single known operations and durable transitions. Use context.search, explore, decideNext, and confidenceScore for discovery and prior context. Use no-session code.run for multi-step read/investigation before a task exists. Use task-scoped code.run for semantic workflows that compose multiple typed tools inside a task. Use exact task-scoped file tools for file work: fs.read, fs.search, fs.list, fs.patch, fs.write, and fs.trash. Use batch for fixed independent checklists where later steps do not depend on earlier results. Use git.diff for structured diff inspection after edits. Use lifecycle tools directly: status, audit, review.run, verify, task.push, task.pr, task.merge, and task.finish. Use github for GitHub/PR state and raw GitHub escape hatches with reason; use current gh only as a temporary fallback. Use task.call only for focused command/runtime evidence: tests, builds, typechecks, package scripts, syntax checks, exact CLI reproduction, or commands with no typed workspace equivalent. Use local shell-style execution only when the workspace tool model explicitly requires it or no typed path exists. Examples of typed-tool preference: Use status instead of task.exec git status. Use github for GitHub state instead of ad hoc gh shell commands when possible. Use audit for workspace scripts/docs/index drift. Use check-files for JavaScript syntax checks. Use review.run for workspace review. Use verify for the publish safety gate. Use confirm for validation evidence when it adds proof. Use dev for service-backed local infrastructure and behavior validation. code.run is a semantic composer over the existing typed workspace tools. It is not a separate execution environment and not a guardrail bypass. Do not wrap a single obvious typed call in code.run. Use direct os.call for one known operation unless code.run is adding summarization, branching, or composition value. Prefer code.run when the work is workspace orchestration: inspect traces/logs/status and summarize the result search → read → decide read → patch/write → reread inspect manifest/config/docs and return a compact answer coordinate several typed workspace tools in one semantic pass avoid ad hoc python -c, node -e, heredocs, or long shell strings for file inspection/transformation when typed workspace tools can do it Use no-session code.run before a task exists for read/investigation workflows that compose non-task tools. Use task-scoped code.run after task.start when composing task-scoped tools. Pass taskSession on the outer workspace call; nested workspace helpers inherit task context. Do not drop task.call, but treat it as the last-mile command runner. Use task.call when the command itself is the evidence: run focused tests run package scripts run build/typecheck/lint commands run language/runtime syntax checks reproduce exact CLI behavior execute commands that have no typed workspace equivalent validate shell-level behavior such as exit codes, argv parsing, redirects, or process output task.call should usually be short, focused, and validation-oriented. If an agent is using task.call repeatedly for reading files, editing files, JSON inspection, workpad updates, or glue logic, switch to code.run plus typed workspace tools. Explore is a discovery command, not just decision-engine setup Use explore anywhere you would otherwise start guessing paths, grepping broadly, or asking “where is this implemented?” Treat it as a workspace navigation primitive alongside fs.read and fs.search: explore finds likely files, symbols, tests, docs, and related implementation paths. fs.read verifies actual content. fs.search follows up with exact targeted symbol/string lookup after direction is narrowed. decideNext decides what evidence/action should come next. confirm proves behavior against reality. Do not wait until a formal decision-engine loop to use explore. Use it early, especially before broad fs.search or raw shell search.
  1. Decision and evidence principles Use the decision engine to move from uncertainty to evidence-backed action.
The task workflow skill owns the exact loop. This skill owns the judgment standard: explore is retrieval, not proof. decideNext is the policy layer. confidenceScore measures evidence quality, not permission to skip tests. exploit means the evidence is concentrated enough to stop wandering and edit. confirm means belief meets reality. audit checks workspace surface truth: scripts, docs, and index freshness. Do not edit the first plausible file just because search found it. Read enough context to understand the local pattern and failure mode. Use fs.search only as targeted follow-up after the direction is clear. Confidence comes from: files actually read connected code paths inspected tests or runtime checks run validation output contradictions resolved behavior reproduced or smoked Confidence does not come from: one semantic search result memory vibes a syntax check alone an API response that does not cover callbacks, queues, locks, jobs, or side effects 4. Design principles Prefer clear boundaries. Backend business logic belongs in services, not thin API adapters. Frontend should call stable app contracts, not provider-specific or legacy runtime paths. External provider callbacks/webhooks can remain REST endpoints when the provider requires public URLs. Provider-specific objects should be adapted at the boundary. Internal services should consume typed domain objects. For platform code, separate: app contract domain service provider integration persistence lifecycle/callback handling UI state Avoid mixing these into one function or component.
  1. Workpad contract Maintain the scoped task workpad throughout execution:
.task/(area)/(task-slug)/workpad.md acceptance criteria implementation plan files changed key decisions notes for Ko improvements noticed errors or blockers validation commands and results The workpad is a running engineering log, not polished prose. Use it to record: why one implementation path was chosen over another what evidence supported the edit what validation proved what validation was skipped or narrowed and why any surprising failure out-of-scope issues noticed during the task Do not leave the workpad stale. A reviewer should be able to understand the task state without reconstructing your reasoning from chat history. Test-first engineering discipline Test-driven development is the default engineering posture for this skill. For any non-trivial code change, the agent must define the behavior contract before editing production code. Tests are not an after-the-fact safety net; they are the executable specification that guides the implementation. Before production edits, update the scoped workpad with a Test-first contract: behavior under test existing local pattern to follow new or changed tests focused red command expected red failure no-test waiver, only when a new or changed test is genuinely inappropriate Then write or update the focused test first. Run it before implementation and confirm the expected red failure. After that, implement the smallest correct change that makes the focused test pass. Rerun the same test and record green evidence. Only then proceed to broader validation. Do not weaken, delete, or rewrite the pretest after implementation unless the original contract was wrong. If the contract changes, record why in the workpad. Every task requires test decision coverage. Most behavior changes require test-first coverage. A no-test waiver is acceptable only for copy-only, docs-only, generated-file, trivial formatting, mechanical rename, or emergency repair tasks, and the waiver must say what validation replaces the missing test. Choose the test layer by risk Use the smallest test that proves the behavior, then add broader proof when the behavior crosses runtime boundaries. Change type / risk Default test layer What it proves Examples Extra validation Pure domain logic Unit test A small function returns the correct result for known inputs. Mapping no_answer → no-answer; parsing a phone number; validating an email; normalizing a disposition; formatting a duration. Usually no E2E needed. Run focused unit test plus typecheck/static checks. Business-rule branching Unit test or small service test The rule behaves correctly across important cases, including edge cases. “If outcome is other, require manual disposition”; “If auto-advance is disabled, show manual advance”; “If confidence is low, do not auto-commit.” Prefer table-driven tests with multiple cases. Data mapping / adapter boundary Unit test Provider-specific or legacy values are safely converted into internal domain values. Twilio status → internal call status; Pi outcome → canonical call disposition; API response DTO → frontend view model. Include unknown/unsupported values and failure cases. React presentational component Component test The user sees the right UI for given props/state. Modal shows contact name, duration, disposition; button label changes from “Pause Queue” to “Resume Queue”; checkbox says the opposite auto-advance action. Use Testing Library-style user-visible assertions. Avoid testing internal component state directly. React hook or UI state machine Hook test or component integration test State transitions happen correctly after user/system events. Call ended → analyzing; analysis complete → ready-auto-advance; cancel countdown → advance-cancelled; manual disposition selected → advance enabled. Use fake timers for countdowns. Test actions, not implementation details. Frontend flow across several components Component integration test Components are wired together correctly without running the whole app. Clicking “End Call” disconnects only current call; clicking “Stop Queue” ends the queue; wrap-up modal receives the correct call summary. Mock app contracts/hooks only where needed. Prefer existing test helpers. API contract API/controller test or service test The endpoint accepts input, calls the right service, and returns the expected response shape/status. POST /v1/calls/:id/disposition; GET /v1/calls/:id/analysis; queue advance endpoint returns next member. Assert status code, response body, and important side effects. Persistence behavior Service/integration test Database writes are correct and durable. Post-call analysis writes calls.analysis; deterministic outcome writes calls.outcome; list member disposition is updated. Use test DB/fixtures if available. Verify committed state, not just returned objects. Background job / queue / lock behavior Focused service/integration test first Job enqueueing, processing, locking, retries, and cleanup behave correctly. Dialer queue advance; release Redis lock after terminal state; retry failed post-call analysis; prevent double advance. Then run service-backed/dev proof because timing and runtime infra can fail outside unit tests. Webhook / callback behavior API/integration test first External provider callbacks are handled safely and idempotently. Twilio call status callback; media stream callback; transcription callback; Stripe webhook. Add local service-backed proof with tunnel/provider simulator when behavior depends on public callback reachability. Auth / permissions / tenant boundaries Integration or E2E test The right user/workspace can access or mutate only the allowed data. User cannot update another workspace’s call; queue member belongs to workspace; API rejects missing auth. Include negative tests. Do not rely only on happy path. External provider integration Adapter test with mocks/fixtures first The app sends the correct provider request and handles provider responses/errors. Twilio call disconnect; Groq transcription request; OpenAI-compatible client setup; CRM sync. Avoid real provider calls in normal tests. Use explicit dev/live proof only when needed and safe. AI-generated analysis / LLM behavior Deterministic fixture test The app handles the AI output contract safely. Pi outcome maps to canonical disposition; malformed analysis requires manual fallback; empty transcript does not auto-commit. Do not call real AI in ordinary tests. Use fixtures/mocks. Add separate manual/dev proof if needed. Full user journey E2E/browser test after lower-level tests The real app works from the user’s point of view. Start queue → call contact → end call → see wrap-up → advance next call; login → create record → verify UI update. E2E is expensive and brittle. Use it for critical flows, not every small function. Visual/layout-only UI change Component test, screenshot/design check, or no-test waiver depending on risk The visible layout or text behavior is protected if it matters. Buttons align in a 2×2 grid; modal copy changes; empty state layout. If pure copy/style with low risk, use no-test waiver plus visual/manual check. Docs-only / comments-only No-test waiver No runtime behavior changed. README update; steering text; comments that do not affect generated docs. Run docs/audit/spell/checkFiles only if relevant. Generated files Test the source/generator, not the generated output by hand Generated output is consistent with source schema or manifest. Regenerated TOOLS.md; regenerated type stubs; generated API client. Run generator and audit/check drift. Do not hand-edit generated files unless project pattern allows it. Mechanical rename / formatting No-test waiver or existing focused test if risky Behavior should be unchanged. Rename variable; move file with imports updated; Prettier-only change. Run typecheck/static checks. Add tests only if the move changes behavior or module boundaries. Emergency production repair Best available focused test or explicit waiver The immediate failure is fixed without making recovery worse. Hotfix broken deploy; unblock workspace tool; restore broken script. Record the waiver, exact risk, and follow-up test debt in the workpad. Rules: Prefer the lowest layer that proves the behavior. Add broader validation when the behavior crosses process, database, queue, provider, auth, or browser boundaries. A test is good only if it would fail when the intended behavior is broken. Do not test private implementation details when user-visible or contract-level behavior can be tested instead. Do not call real external providers, production services, or AI models in ordinary tests. Use deterministic fixtures, mocks, or explicit dev/integration proof. If no test is appropriate, record a no-test waiver before editing production code. A test is good only if it would fail when the intended behavior is broken. Prefer behavior assertions over implementation-detail assertions. Prefer existing test helpers and local project patterns. Avoid calling real external providers or AI models in ordinary tests; use deterministic fixtures, mocks, or explicit integration/dev proof. If the agent cannot identify an appropriate test, it must stop and explain the testing uncertainty instead of silently implementing first.
  1. Implementation principles Implement only what the acceptance criteria require.
Before editing: read the relevant standards and local patterns identify the smallest correct edit target understand nearby tests or validation paths write or update the plan in the workpad During editing: prefer small, reviewable changes preserve existing architecture unless there is evidence it is wrong avoid mixing cleanup, refactor, and behavior changes unless the task explicitly calls for it keep domain logic separate from side effects keep error paths explicit avoid broad rewrites when a focused fix is enough After editing: reread changed ranges inspect the diff with the best available typed/workspace tooling run the smallest meaningful validation first then run the broader review/verify gates update the workpad with actual results Do not hide uncertainty. If the validation does not prove the behavior, say so and continue or escalate.
  1. Coding standards TypeScript Strict mode is expected. No any. No as any. No untyped catches. Use catch (err: unknown) with type guards. Prefer named exports. Avoid top-level imports of optional peer/runtime dependencies; use lazy runtime import when appropriate. Avoid useEffect as an escape hatch for non-external synchronization. Use functional React components only. Logging and errors Do not use console.log, console.warn, or console.error in app/runtime code. Use the project’s structured logging pattern. Never swallow errors silently. Every async flow should have an intentional failure path. Reset cached client references on initialization failure when retry should be possible. Error messages should be clear enough to debug without exposing secrets. Data and security Validate and sanitize at boundaries. Never hardcode secrets. Never log raw tokens, credentials, full phone numbers, or sensitive payloads. Use parameterized queries. Do not interpolate user input into SQL. Security comes before formatting or convenience. Readability One function should have one clear responsibility. Use meaningful domain names; avoid vague abbreviations. Prefer guard clauses over deep nesting. Comments explain why, not what. Use short // comments for ordinary code. Avoid routine JSDoc blocks. Add ticket-tagged TODOs for intentional incompleteness. Consistency Reuse existing patterns unless clearly broken. If deviating, explain why in the workpad. Prefer consistency over cleverness. Do not introduce a new abstraction unless it removes real duplication or clarifies ownership.
  2. Validation principles Run validation that matches the risk of the change.
Validation ladder: Syntax or static checks for touched files. Focused unit or integration tests. Workspace review. Full verify gate when publishing. Service-backed E2E when behavior depends on runtime infrastructure. Runtime/provider/log validation when callbacks, queues, jobs, external APIs, auth, or permissions are involved. Use typed validation tools where available: check-files for JavaScript syntax checks. audit for workspace scripts/docs/index drift. review.run for review. verify for the task safety gate. confirm for validation evidence. dev for service-backed local validation. status and doctor for environment and workspace state. Use task.exec for validation only when no typed tool exists, such as: Python compilation focused package test command one-off script smoke targeted repo command not yet exposed by workspace tooling A syntax check alone is not sufficient for behavior changes. A successful HTTP response alone is not sufficient when the behavior depends on: callbacks queues locks background jobs external provider lifecycle database side effects Redis state auth/session state webhook reachability Record exact validation commands, tool calls, environment mode, logs, transcript paths, and results in the workpad.
  1. Service-backed and E2E behavior proof For product behavior changes, focused tests are not enough by default. Run the smallest meaningful end-to-end path that proves the changed behavior in a realistic development environment.
Use the dev workspace tooling when available. Use dev for: starting/stopping Postgres and Redis verifying pgvector or required extensions exporting Keychain-backed credentials into the server process starting app services with correct environment starting public tunnels for webhooks/callbacks checking public callback reachability running scenario modes inspecting Redis locks safely collecting transcripts and runtime logs Do not manually recreate Postgres/Redis/Twilio/tunnel setup if the dev tool exposes it. For behavior changes involving calls, jobs, webhooks, queues, locks, callbacks, external APIs, auth, or permissions, the validation ladder is: focused unit/integration tests local service-backed E2E through dev mock scenario where applicable live/test-provider scenario where applicable production/deployed log validation when the change affects production runtime For dialer-like flows, a valid proof usually includes: request entered the intended app contract backend service handled the request external provider was called only when intended callback/webhook returned to the app terminal state was recorded locks/resources were released logs/transcripts redact secrets and full phone numbers If dev setup fails, fix the setup problem when it is in scope. If it requires Ko input, stop and ask with the exact missing item. Do not silently fall back to a weaker test and claim behavior is proven. Do not publish behavior changes with only typecheck/lint unless Ko explicitly waives E2E validation.
  1. Review discipline Before publishing or reporting done, self-review the changed files.
Look for: accidental scope creep generated slop dead code stale comments untested behavior fragile timing or race assumptions poor error messages leaked secrets or sensitive logs inconsistent local patterns missing cleanup paths missing validation evidence A clean diff is part of the deliverable. Do not leave “while I was here” edits in the PR unless they directly support the task.
  1. Git and commit principles The task workflow skill owns the exact publish commands.
Engineering rules: Commit message format: type(scope): description. Keep commits scoped to the current task. Ko remains author. suelo-kiro[bot] is committer. Never force-push or overwrite remote task state unless Ko explicitly approves and the reason is documented. Push through the approved workspace task publish path. Do not claim work is shipped until the relevant PR state is verified. 12) Conflict and parallel-agent principles Assume parallel agents are active. Do not rely on shared mutable pointers when task-specific context exists. Metadata-only conflicts may be resolved by policy under the task workflow skill. Mixed conflicts require judgment. Stop when conflicts involve: code docs tests generated runtime files package manifests migrations configs product behavior files Do not casually pick ours/theirs for real code conflicts.
  1. Definition of engineering done A task is engineering-ready for review only when:
acceptance criteria are complete implementation matches the narrow task scope decision evidence is sufficient changed files were self-reviewed validation matches the actual risk behavior changes have behavior proof workpad reflects final decisions and validation publish state is verified by the task workflow no outcome-changing warnings remain in stderr/logs A task is not done just because: code compiles tests pass unrelated areas a command exits zero a PR exists the agent is confident the first implementation seems plausible Done means the change is reviewable, recoverable, and proven at the right level. Finish the task or name the real blocker Do not stop at the first tool failure when the user asked for a shippable change. Tool failures are work to diagnose, not completion states. For any requested code, docs, workflow, or repo change, the agent must continue until exactly one of these terminal states is true: The change is merged to the requested target branch and local state is updated when requested. The change is pushed to a review PR and the user explicitly asked to stop at review. A real blocker remains after recovery attempts, and the blocker is named with exact evidence. A timeout, validation error, safety-blocked call, stale metadata error, dirty worktree error, merge conflict, or failed push is not a terminal state by itself. Treat it as an incident to resolve. Required recovery loop: Read the structured error envelope. Identify whether the failure is input shape, timeout budget, task-session resolution, stale metadata, merge conflict, dirty worktree, safety filtering, missing dependency, or external service state. Retry once with the smallest corrected workspace call. If the same class of error repeats, switch to the next workspace-supported path. If fallback tooling is required, state why the workspace facade could not complete the operation and keep the fallback scoped to the task worktree. Continue toward ship/review after recovery. Before saying “done,” verify and report: target branch or PR commit SHA or merge SHA files changed validation run local state if the user requested local sync Temporary file recovery path Before saying “blocked,” check whether the failure is caused by prompt transport, tool argument serialization, wrapper filtering, command length, shell quoting, or model/tool call formatting rather than the actual underlying work being unsafe or impossible. When a direct tool call or long inline prompt is blocked, malformed, truncated, or repeatedly failing, prefer staging the intended instructions or payload in a temporary file, then pass only the file path to the next tool or command. Use this pattern for legitimate repo/task work when the goal is already approved and the failure is caused by transport/tooling friction. Required blocked report Before saying “blocked,” report: exact command/tool exact error taskSession and branch involved evidence that the failure is outside normal task recovery safest next action Temp file workflow Write the full prompt, patch plan, command script, or worker instructions to a temp file. Prefer /tmp for local-only handoff: cat > /tmp/(task-name)-instructions.md <<‘EOF’ (full instructions here) EOF Prefer the task worktree when another tool or worker requires repo-local paths: cat > .task/(area)/(task-name)/worker-instructions.md <<‘EOF’ (full instructions here) EOF Invoke the tool with a short path-only instruction. Example for Codex: codex exec —cd “$PWD” —sandbox workspace-write —json - ( /tmp/<task-name)-instructions.md Example worker prompt: Follow the instructions in .task/(area)/(task-name)/worker-instructions.md. Use the existing taskSession and branch. Do not start a new task. If the temp-file path works, continue the task normally and record the recovery in the workpad. If the temp-file path also fails, then report the blocked state using the required blocked report. Rules Do not use temp files to skip approval, hide intent, or avoid safety policy. Do use temp files to avoid malformed JSON, command-length issues, prompt truncation, shell quoting failures, or platform wrappers blocking large inline payloads. Keep temp files out of commits unless they are intentionally part of the task deliverable. Clean accidental temp artifacts before task.push. If a worker requires repo-local instructions, copy the temp file into .task/(area)/(task-name)/. If a direct Codex command is used, avoid unsupported flags. For the current Codex CLI, use: codex exec —cd “$PWD” —sandbox workspace-write —json - ( /tmp/<task-name)-instructions.md Do not include —ask-for-approval never unless the local Codex CLI help confirms that flag is supported. After a worker or direct Codex run, inspect stdout/stderr logs, task diff, and trace output before pushing. Mini worker recovery path Use the mini worker only as a narrow unblocker when the primary agent is blocked by model/tool transport, but the needed workspace tool is clearly available and the requested action is simple, bounded, and already approved. The mini worker is for one-off tool execution, not long-running reasoning, implementation planning, broad edits, or autonomous task completion. When to use mini Use mini when all of these are true: The normal path is blocked by an OpenAI/tool wrapper issue, prompt transport issue, or repeated malformed tool call. The exact workspace tool needed is known. The action is small and bounded. The result can be copied back into the main task flow. The action does not require new product judgment or broad code changes. Good mini uses: Read this file and return the contents. List this directory. Run this exact validation command. Call this exact workspace tool with this exact input. Fetch this specific trace row. Show the diff for this exact file. Bad mini uses: Implement this feature. Refactor this subsystem. Investigate the whole PR. Decide architecture. Fix all review comments. Run a long task autonomously. Mini worker pattern If a normal workspace call is blocked but the tool is available, ask mini to perform exactly one action. Example: Use os.call to run fs.read with this exact input: { “path”: “packages/os/scripts/install.ts” } Return only the file contents. Do not edit files. Do not start a task. Do not call unrelated tools. Example for a validation command: Use os.call to run task.call in the existing task session: taskSession: (taskSession) branch: (branch) command: bun —cwd packages/os ./scripts/install.ts —dry-run —yes —json Return the command, exit status, stdout, and stderr. Do not fix anything. Rules Prefer normal workspace tools first. Prefer temp-file recovery for long prompts or large payloads. Use mini only for small one-off actions. Do not give mini broad autonomy. Do not ask mini to make product decisions. Do not ask mini to edit files unless the exact patch is already known and small. Always pass the existing taskSession and branch when the action is task-scoped. Always ask mini to report the exact tool call, output, and any error. After mini returns, the main agent must inspect the result and continue the task. Blocked report before mini Before using mini, record: exact command/tool that failed exact error taskSession and branch involved why the action is safe and bounded exact mini instruction being sent Mini output requirement Mini should return: tool: input: status: stdout/result: stderr/error: notes: If mini cannot complete the exact action, it must stop and report the failure. It should not improvise a different workflow.