Execution engine
The execution engine is the gateway subsystem responsible for turning a plan or workflow into resilient, auditable execution. It is where reliability guarantees live: retries, idempotency, budgets/timeouts, pause/resume, and evidence capture.
Why it exists
LLMs are good at planning, but they are a poor place to host the control plane for long-running, side-effecting work. The execution engine moves orchestration into a typed runtime so that:
- Side effects can be paused behind approvals and resumed safely without repeating completed work.
- Runs can be retried deterministically without duplicating actions.
- “Done” is backed by postconditions + artifacts, not narrative.
- Operator UIs can observe progress in real time via events.
Responsibilities
- Queueing and scheduling: accept work from interactive sessions, cron jobs, hooks, and external triggers.
- Run state machine: track run lifecycle (`queued → running → paused|succeeded|failed|cancelled`) with durable persistence (see the sketch after this list).
- Step execution: execute steps via the tool runtime and capability providers (nodes, MCP).
- Idempotency + safe retries: enforce `idempotency_key` semantics for side-effecting steps and define retry policies.
- Approvals and pause/resume: pause runs when an approval is required and resume using a durable resume token.
- Budgets and timeouts: enforce cost/time ceilings per run and per step (including model budgets where applicable).
- Concurrency limits: limit parallelism per agent, per lane, per capability provider, and globally.
- Evidence and verification: capture artifacts and validate postconditions (required for state-changing steps when feasible).
- Rollback metadata: store human-readable rollback hints and optional structured compensation actions (always approval-gated).
- Auditability: emit events for run/step lifecycle and persist a run log suitable for troubleshooting and export.
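The run lifecycle above can be made explicit as a small transition table. A minimal sketch in TypeScript, using illustrative names rather than the exported `@tyrum/schemas` contracts:

```ts
type RunStatus =
  | "queued"
  | "running"
  | "paused"
  | "succeeded"
  | "failed"
  | "cancelled";

// Legal transitions; anything else is rejected as an invariant violation.
const TRANSITIONS: Record<RunStatus, RunStatus[]> = {
  queued: ["running", "cancelled"],
  running: ["paused", "succeeded", "failed", "cancelled"],
  paused: ["running", "cancelled"],
  succeeded: [],
  failed: [],
  cancelled: [],
};

function assertTransition(from: RunStatus, to: RunStatus): void {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal run transition: ${from} -> ${to}`);
  }
}
```

Encoding legal transitions as data lets the engine reject illegal moves (for example `succeeded → running`) loudly rather than corrupting persisted state.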
Distributed execution (workers)
The execution engine can run co-located with the gateway edge (even in the same OS process) or be split into separate processes/hosts. To minimize surprises when scaling up, the same execution semantics apply in all deployments: workers claim/lease work in the StateStore and publish lifecycle events through the backplane abstraction (see Scaling and High Availability).
Cluster-safe execution typically requires:
- Claim/lease: workers claim work with a time-bounded lease recorded in the StateStore so only one worker executes a given attempt at a time.
- Idempotency: side-effecting steps define `idempotency_key` semantics so retries are safe under at-least-once execution.
- Lane serialization: workers acquire a distributed lock/lease keyed by `(session_key, lane)` before executing steps that must be serialized.
- Durable outcomes: attempt results, artifacts, and postcondition evaluations are persisted before emitting “completed” events.
Claimable work items carry explicit lease fields (for example `lease_owner` and `lease_expires_at`). Claims are atomic updates, leases are renewed periodically, and takeover occurs safely on expiry.
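A minimal sketch of an atomic claim, assuming a `StateStore` with a conditional (compare-and-swap) update; everything beyond `lease_owner` and `lease_expires_at` is an illustrative name, not the real interface:

```ts
interface ClaimableItem {
  id: string;
  lease_owner: string | null;
  lease_expires_at: number | null; // epoch ms
}

interface StateStore {
  // Applies `patch` only if the row still matches `expect`; returns false
  // when another worker won the race (this is the atomicity requirement).
  compareAndUpdate(
    id: string,
    expect: Partial<ClaimableItem>,
    patch: Partial<ClaimableItem>,
  ): Promise<boolean>;
}

const LEASE_MS = 30_000; // illustrative lease duration

async function claim(
  store: StateStore,
  item: ClaimableItem,
  workerId: string,
): Promise<boolean> {
  const now = Date.now();
  const expired = item.lease_expires_at !== null && item.lease_expires_at < now;
  if (item.lease_owner !== null && !expired) return false; // held by a live worker
  // Atomic claim/takeover: succeeds only if the row is still as we read it.
  return store.compareAndUpdate(
    item.id,
    { lease_owner: item.lease_owner, lease_expires_at: item.lease_expires_at },
    { lease_owner: workerId, lease_expires_at: now + LEASE_MS },
  );
}
```

Renewal is the same conditional update extending `lease_expires_at`; the lane leases described below reuse this pattern keyed by `(session_key, lane)`.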
Lane serialization uses explicit lane lease rows keyed by `(session_key, lane)` with the same expiry/renew/takeover behavior as work leases.
Idempotency is durable dedupe with cached outcomes: when an executor observes a duplicate `(scope, kind, idempotency_key)`, it returns the stored outcome instead of repeating the side effect.
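A sketch of that dedupe path, assuming a durable outcome table keyed by `(scope, kind, idempotency_key)`:

```ts
interface OutcomeStore {
  get(scope: string, kind: string, key: string): Promise<unknown | undefined>;
  put(scope: string, kind: string, key: string, outcome: unknown): Promise<void>;
}

async function executeOnce(
  store: OutcomeStore,
  scope: string,
  kind: string,
  idempotencyKey: string,
  effect: () => Promise<unknown>,
): Promise<unknown> {
  const cached = await store.get(scope, kind, idempotencyKey);
  if (cached !== undefined) return cached; // duplicate: replay stored outcome
  const outcome = await effect(); // still at-least-once across crashes (see note)
  await store.put(scope, kind, idempotencyKey, outcome);
  return outcome;
}
```

Note that a crash between performing the effect and persisting the outcome can still repeat the effect; that window is exactly why the underlying guarantee remains at-least-once execution plus idempotent side effects.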
Retry policy is per-step with conservative defaults. Automatic retries apply only when idempotency semantics are enforced for the step.
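One possible shape for that policy, with the idempotency gate made explicit; field names and default values here are assumptions, not the shipped contract:

```ts
interface RetryPolicy {
  maxAttempts: number;       // total attempts, including the first
  backoffMs: number;         // initial backoff
  backoffMultiplier: number; // exponential factor
  maxBackoffMs: number;      // backoff cap
}

const DEFAULT_RETRY: RetryPolicy = {
  maxAttempts: 3,
  backoffMs: 1_000,
  backoffMultiplier: 2,
  maxBackoffMs: 60_000,
};

// Automatic retries apply only when the step enforces idempotency semantics.
function effectivePolicy(step: {
  idempotency_key?: string;
  retry?: RetryPolicy;
}): RetryPolicy {
  if (!step.idempotency_key) return { ...DEFAULT_RETRY, maxAttempts: 1 };
  return step.retry ?? DEFAULT_RETRY;
}
```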
Workspace-backed execution (ToolRunner)
Many Tyrum steps are filesystem- or process-oriented (for example running a CLI tool in a workspace, reading/writing files, generating evidence artifacts). To keep `TYRUM_HOME` durable across runs while still scaling to multi-node clusters, Tyrum treats workspace access as an explicit execution boundary:
- ToolRunner is the execution context that mounts the workspace filesystem and runs side-effecting tools.
- Workers coordinate work in the StateStore (claims/leases, idempotency, lane serialization) and delegate step execution to ToolRunner.
ToolRunner has deployment-parity implementations:
- Single-host/desktop: ToolRunner is a local subprocess (or in-process) operating on the local persistent `TYRUM_HOME`.
- Cluster/Kubernetes: ToolRunner is a sandboxed job/pod that mounts the workspace PVC (RWO) and writes outcomes back to the StateStore.
This keeps execution semantics identical while ensuring that long-lived edge/scheduler replicas do not need to mount shared workspace volumes.
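A sketch of the worker/ToolRunner boundary under these assumptions: the worker owns coordination, the ToolRunner owns workspace/process side effects. Method and field names are illustrative:

```ts
interface StepSpec {
  runId: string;
  stepId: string;
  kind: string; // e.g. a CLI invocation or file operation
  args: Record<string, unknown>;
}

interface StepOutcome {
  status: "succeeded" | "failed";
  result?: unknown;
  artifacts: string[]; // ids of evidence artifacts produced by the step
}

interface ToolRunner {
  // Runs one step inside the mounted workspace (local TYRUM_HOME, or a pod
  // mounting the workspace PVC) and returns a durable outcome.
  execute(step: StepSpec): Promise<StepOutcome>;
}
```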
Non-responsibilities
- The execution engine does not decide what to do from a user message (planning is in the agent/planner).
- The execution engine does not implement device-specific automation (that lives behind node capabilities).
- The execution engine does not store raw secrets (that lives behind the secret provider).
Core concepts
Job vs run
- Job: the queued unit of work (created by a session request, cron, or hook).
- Run: an execution attempt of a job. A job can create multiple runs due to retries or operator-requested replays.
Step and attempt
- Step: one atomic action in a workflow (for example “HTTP request”, “click button”, “send message”).
- Attempt: one execution attempt of a step (attempt count increments on retry).
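A minimal sketch of the resulting hierarchy; these types are illustrative only, and the persisted schemas live in `@tyrum/schemas` (see Data model below):

```ts
interface Job     { id: string; runs: Run[] }        // retries/replays append runs
interface Run     { id: string; jobId: string; steps: Step[] }
interface Step    { id: string; runId: string; attempts: Attempt[] }
interface Attempt { id: string; stepId: string; attempt: number } // increments on retry
```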
Pause/resume
When a run reaches a step that requires approval (or takeover), the engine:
- Persists the run in a paused state.
- Creates an approval request record.
- Returns/emits a resume token that references the paused state.
- Resumes only after the approval is resolved (approved/denied/expired).
Resume tokens are opaque identifiers (random ids) that map to paused-state rows in the StateStore. Tokens support expiry and revocation.
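A minimal sketch of token issuance and the resume check, assuming paused-state rows in the StateStore; only "opaque random id, expiry, revocation" comes from the text above, the rest is illustrative:

```ts
import { randomUUID } from "node:crypto";

interface PausedState {
  token: string;     // opaque resume token handed to clients
  runId: string;
  approvalId: string;
  expiresAt: number; // epoch ms; tokens support expiry
  revoked: boolean;  // ...and revocation
}

function pauseForApproval(runId: string, approvalId: string, ttlMs: number): PausedState {
  return {
    token: randomUUID(), // random id; carries no state itself
    runId,
    approvalId,
    expiresAt: Date.now() + ttlMs,
    revoked: false,
  };
}

function canResume(
  state: PausedState,
  approval: "approved" | "denied" | "expired",
): boolean {
  return approval === "approved" && !state.revoked && Date.now() < state.expiresAt;
}
```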
Evidence + postconditions (hard rule)
For state-changing steps, a postcondition should be defined whenever a verification check is feasible. The engine is responsible for executing and evaluating the postcondition and storing evidence artifacts.
If a step cannot be verified automatically, the engine must:
- Mark the outcome as unverifiable (not “done”), and
- Escalate to the operator (approval/takeover) before proceeding with further dependent side effects.
Unverifiable outcomes are represented as a pause with stored reports describing missing evidence; they are not separate terminal statuses.
Postconditions are typed assertion kinds (not arbitrary expression evaluation). The core set stays small and explicit; extensions are registered via plugins/connectors and validated by contracts.
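A sketch of what typed assertion kinds can look like as a discriminated union; these three kinds are invented for illustration, and the actual core set and its contracts live in `@tyrum/schemas`:

```ts
type Postcondition =
  | { kind: "http_status"; url: string; expect: number }
  | { kind: "file_exists"; path: string }
  | { kind: "artifact_matches"; artifactId: string; sha256: string };

interface PostconditionResult {
  passed: boolean;
  evidenceArtifactId?: string; // evidence is stored alongside the verdict
}
```

Keeping postconditions as a closed set of typed kinds (rather than arbitrary expressions) makes them statically validatable and safe to evaluate inside the engine.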
Topology
Data model
- `jobs(id, created_at, trigger_type, trigger_key, agent_id, lane, status, input, ...)`
- `runs(id, job_id, started_at, finished_at, status, attempt, budgets, ...)`
- `run_steps(id, run_id, index, kind, args, idempotency_key, approval_id?, postcondition, ...)`
- `run_step_attempts(id, run_step_id, attempt, started_at, finished_at, status, result, error, artifacts[])`

Exact schemas belong in `@tyrum/schemas` and exported contracts.
Observability and cost
- Structured logs include stable identifiers: `request_id`, `event_id`, `job_id`, `run_id`, `step_id`, `attempt_id`, and `approval_id` (see the sketch after this list).
- Cost attribution (model tokens, executor time) is persisted per run/step/attempt so budgets and approvals can be evaluated and UIs can aggregate accurately.
- Deployments export tracing and metrics via OpenTelemetry.
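For illustration, a log record carrying those identifiers might look like this; the surrounding field set (`ts`, `level`, `msg`) is an assumption:

```ts
interface RunLogRecord {
  ts: string; // ISO-8601 timestamp
  level: "info" | "warn" | "error";
  msg: string;
  request_id?: string;
  event_id?: string;
  job_id?: string;
  run_id?: string;
  step_id?: string;
  attempt_id?: string;
  approval_id?: string;
}
```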
Client/UI expectations
Operator clients should be able to:
- See run progress as a timeline (queued/running/paused/completed).
- Inspect per-step evidence (artifacts) and postcondition results.
- Resolve approvals and resume/cancel paused runs.
- Request safe retries or rollbacks when supported.