Continuous Context agent runtime for multi-agent inference.
Instead of N independent model calls rebuilding the prompt each step, all agents advance inside one continuous decode process. They fork from shared KV cache state, prefill tool results directly into the attention mechanism, and spawn sub-agents from their own live branches. Context is never serialized, summarized, or reconstructed.
Qwen3 4B + 0.6B reranker · 3 agents · 14 tool calls · 98s · fully offline on M2 MacBook Pro
npm i @lloyal-labs/lloyal-agents @lloyal-labs/lloyal.node
lloyal-agents provides the agent runtime. lloyal.node provides the native inference backend — prebuilt binaries for macOS, Linux, and Windows with CPU/GPU support. Both are required.
import { main, call } from "effection";
import { createContext } from "@lloyal-labs/lloyal.node";
import { initAgents, generate } from "@lloyal-labs/lloyal-agents";
main(function* () {
  const ctx = yield* call(() =>
    createContext({
      modelPath: "model.gguf",
      nCtx: 16384,
      nSeqMax: 8,
      typeK: "q4_0",
      typeV: "q4_0",
    }),
  );

  const { session } = yield* initAgents(ctx);

  const result = yield* generate({
    parent: session.trunk,
    prompt: "In one sentence, explain KV cache sharing.",
  });

  console.log(result.text);
});
The basic mental model is simple: create a context, initialize a session, and generate from the trunk. From there, you can fork branches, run agents in parallel, attach tools, and promote winning branches into the session trunk.
Most agent frameworks orchestrate around a model — prompt, read response, call a tool, prompt again. Each agent is a separate API call with its own context window.
lloyal-agents orchestrates inside the running inference process. Agents are branches of one live model runtime. They share KV cache state up to a fork point, advance together through batched decode steps, and consume tool results by prefilling tokens directly into their own branches.
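As a toy illustration of that fork model (not the library's actual data structures), think of each branch as recording only the tokens decoded after its fork point, while everything earlier is read from the parent chain:

```typescript
// Toy model of KV-cache forking: a branch stores only its own delta;
// the shared prefix lives in the parent chain. Illustrative only —
// this is not lloyal's implementation.
type Branch = { parent: Branch | null; delta: string[] };

function fork(parent: Branch): Branch {
  // A fork copies nothing: it just points at the parent's live state.
  return { parent, delta: [] };
}

function tokens(b: Branch): string[] {
  // Full context = shared prefix (via parents) + this branch's delta.
  return b.parent ? [...tokens(b.parent), ...b.delta] : [...b.delta];
}

const trunk: Branch = { parent: null, delta: ["system", "prompt"] };
const a = fork(trunk);
const b = fork(trunk);
a.delta.push("agent-A-token");
b.delta.push("agent-B-token");

console.log(tokens(a)); // ["system", "prompt", "agent-A-token"]
console.log(tokens(b)); // ["system", "prompt", "agent-B-token"]
```

Both agents attend over the same prefix without it ever being re-tokenized or re-prefilled.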
When an agent calls a tool, the result is fully prefilled into its KV cache before it generates another token. The model sees the complete result and makes a clean decision — call another tool, refine the query, or report. This produces multi-hop reasoning: later tool calls reference discoveries from earlier ones, because the full chain is physically present in the branch's attention state.
When an agent needs to go deeper, it calls a tool that spawns sub-agents. The sub-agents fork from a shared root within the same inference process — same GPU, same KV cache, same event stream. The calling agent's branch stays alive; when the sub-agents report back, their findings return as a tool result into the caller's live context.
@lloyal-labs/lloyal-agents — The high-level runtime for recursive agents, tools, and orchestration.

Includes: `initAgents`, `generate`, `diverge`, `useAgentPool`, `runAgents`, `withSharedRoot`, `createToolkit`, `Tool`

npm i @lloyal-labs/lloyal-agents

@lloyal-labs/sdk — The lower-level branching inference primitives the agent runtime is built on.

Includes: `Branch`, `BranchStore`, `Session`, `Rerank`

npm i @lloyal-labs/sdk
import {
initAgents,
generate,
diverge,
useAgentPool,
runAgents,
withSharedRoot,
createToolkit,
Ctx,
Store,
Events,
} from "@lloyal-labs/lloyal-agents";
That is essentially the framework.
The repo ships four examples demonstrating canonical agent patterns. All examples share corpus tools, resources, and a reranker via examples/shared/. Each defines its own WorkflowEvent = AgentEvent | StepEvent union — AgentEvent is the stable runtime contract, StepEvent is example-specific.
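A minimal sketch of such a union — the AgentEvent members mirror the runtime contract, while the StepEvent shape here is invented for illustration:

```typescript
// Each example extends the stable runtime events with its own step
// events. The StepEvent phases below are hypothetical.
type AgentEvent =
  | { type: "agent:spawn"; agentId: string }
  | { type: "agent:done"; agentId: string };

type StepEvent =
  | { type: "step:start"; phase: "plan" | "research" | "synthesize" }
  | { type: "step:end"; phase: string };

type WorkflowEvent = AgentEvent | StepEvent;

function isAgentEvent(e: WorkflowEvent): e is AgentEvent {
  // Runtime events are namespaced under "agent:".
  return e.type.startsWith("agent:");
}

console.log(isAgentEvent({ type: "agent:done", agentId: "a1" })); // true
console.log(isAgentEvent({ type: "step:end", phase: "plan" }));   // false
```

A TUI or logger can consume the whole stream and split it on this discriminant.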
examples/deep-research — Plan, Research, Synthesize, Evaluate, Promote. Grounded synthesis (1 tool-using agent) separated from entropy sampling (N cheap text-only diverge attempts). Demonstrates shared-root parallelism, grammar-constrained planning, recursive sub-agents via ResearchTool, agreement analysis, and session accumulation.
npx tsx examples/deep-research/main.ts \
--corpus /path/to/docs \
--query "How does the KV cache eviction policy work?"
examples/react-agent — Single agent with corpus tools answers a question. The simplest workflow, demonstrating withSharedRoot + useAgentPool with one agent.
npx tsx examples/react-agent/main.ts \
--corpus /path/to/docs \
--query "What is the main argument?"
examples/reflection — Research, Draft, Critique, Revise. The critic forks from the draft's live branch; the reviser forks from the critic's branch. Demonstrates manual branch lifecycle, buildUserDelta for injecting follow-up turns, and diverge with parent branch for perplexity selection. No re-prompting — KV continuity across phases.
npx tsx examples/reflection/main.ts \
--corpus /path/to/docs \
--query "Explain the key findings"
examples/supervisor — Classify, Route to specialist agents, Execute in parallel, Synthesize. Demonstrates grammar-constrained routing via generate(), dynamic agent count from classifier output, heterogeneous useAgentPool tasks, and warm trunk synthesis for multi-turn follow-ups.
npx tsx examples/supervisor/main.ts \
--corpus /path/to/docs \
--query "Compare the two approaches described in the document"
All examples run in-process, on local weights, fully offline.
yield* withSharedRoot(
  { systemPrompt: RESEARCH_PROMPT, tools: toolsJson },
  function* (root) {
    return yield* runAgents({
      tasks: questions.map((q) => ({
        systemPrompt: RESEARCH_PROMPT,
        content: q,
        tools: toolsJson,
        parent: root,
      })),
      tools: toolMap,
    });
  },
);
Every task forks from the same prefilled root. Everything before the fork is shared KV state. Everything after the fork is independent reasoning.
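The savings are easy to quantify. A back-of-envelope sketch with illustrative numbers (not benchmarks):

```typescript
// Prefill accounting for N agents sharing one root.
// All token counts here are made up for illustration.
const sharedPrefix = 2000; // tokens: system prompt + tool schemas
const perAgentTask = 50;   // tokens: each agent's own question
const agents = 8;

// Independent calls: every agent prefills the full prefix itself.
const independent = agents * (sharedPrefix + perAgentTask);

// Shared root: the prefix is prefilled once; forks copy nothing.
const shared = sharedPrefix + agents * perAgentTask;

console.log(independent); // 16400
console.log(shared);      // 2400
```

The gap grows linearly with agent count and with prefix size, which is why tool schemas and system prompts belong before the fork point.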
Recursion happens at two levels:
Harness-level — the developer writes the pipeline. The deep-research example includes a reportPass: if a research agent gets cut off, a reporter sub-agent forks from its live branch with a narrower mandate.
const reporters = yield* runAgents({
  tasks: hardCut.map((a) => ({
    systemPrompt: REPORT_PROMPT,
    content: "Report your findings.",
    parent: a.branch, // continues from the agent's live KV state
  })),
  tools: new Map([["report", reportTool]]),
  terminalTool: "report",
});
Model-level — the model decides when to recurse. A Tool subclass whose execute() returns an Operation can yield* into any framework primitive. The deep-research example includes a ResearchTool — when an agent calls research(questions), the tool spawns parallel sub-agents via withSharedRoot + useAgentPool, waits for their findings, and returns them as the tool result. The calling agent's branch stays alive; findings flow back into its live context.
In both cases, the sub-agent continues from live state, not from a summary pasted into a prompt.
diverge() forks multiple branches from a shared frontier, generates independently, and returns the attempts plus the surviving best branch.
const result = yield* diverge({
  parent: root,
  attempts: 3,
  params: { temperature: 0.7 },
});
Because those branches share a computational ancestor, agreement and disagreement between them are meaningful signals.
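One way to turn that into a selection signal is mean pairwise token overlap between attempts. This is a hypothetical stand-in, not the deep-research example's actual agreement analysis:

```typescript
// Score each attempt by how much it agrees with its siblings,
// using Jaccard overlap of word sets. Sketch only.
function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((w) => b.has(w)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}

function mostAgreed(attempts: string[]): number {
  const sets = attempts.map((t) => new Set(t.toLowerCase().split(/\s+/)));
  let best = 0;
  let bestScore = -1;
  for (let i = 0; i < sets.length; i++) {
    let score = 0;
    for (let j = 0; j < sets.length; j++) {
      if (i !== j) score += jaccard(sets[i], sets[j]);
    }
    if (score > bestScore) {
      bestScore = score;
      best = i;
    }
  }
  return best; // index of the attempt with highest total agreement
}

console.log(
  mostAgreed(["the cache is shared", "the cache is shared state", "bananas"]),
); // 0
```

An outlier attempt scores near zero against its siblings, so disagreement surfaces automatically.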
When a branch wins, it can be promoted into the session trunk.
That means future work starts from accumulated branch state, not from an empty prompt. Over multiple queries, the session compounds what the system has already established.
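A toy sketch of that compounding, treating promotion as adopting the winner's tokens as the trunk's new frontier (illustrative only, not the runtime's API):

```typescript
// Promotion, schematically: the trunk absorbs the winning branch's
// delta, so the next query forks from accumulated state.
let trunk: string[] = ["system", "prompt"];

function promote(trunkTokens: string[], winnerDelta: string[]): string[] {
  return [...trunkTokens, ...winnerDelta];
}

trunk = promote(trunk, ["query-1", "best-answer-1"]);
trunk = promote(trunk, ["query-2", "best-answer-2"]);

console.log(trunk.length); // 6 — each query builds on the last
```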
Tools are class-based and expose OpenAI-compatible function schemas:
import type { Operation } from "effection";
import { Tool } from "@lloyal-labs/lloyal-agents";
class SearchTool extends Tool<{ query: string }> {
  readonly name = "search";
  readonly description = "Semantic search over the corpus";
  readonly parameters = {
    type: "object",
    properties: { query: { type: "string", description: "Search query" } },
    required: ["query"],
  };

  *execute(args: { query: string }): Operation<unknown> {
    // `this.search` stands in for your retrieval implementation; the
    // returned value becomes the tool result prefilled into the branch.
    return this.search(args.query);
  }
}
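The fields above map onto the OpenAI function-calling format. A sketch of the JSON such a tool would expose — the exact wrapper the runtime emits is assumed here:

```typescript
// Hand-built OpenAI-compatible function schema mirroring SearchTool's
// declared fields. Whether the runtime emits exactly this wrapper is
// an assumption; the inner shape follows the OpenAI format.
const searchSchema = {
  type: "function",
  function: {
    name: "search",
    description: "Semantic search over the corpus",
    parameters: {
      type: "object",
      properties: { query: { type: "string", description: "Search query" } },
      required: ["query"],
    },
  },
};

console.log(searchSchema.function.name); // "search"
```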
createToolkit(tools) turns a tool set into:
- toolMap for runtime dispatch
- toolsJson for prompt formatting

The runtime emits structured events for TUI, logging, and telemetry:
| Event | Payload |
|---|---|
| agent:spawn | agentId, parentAgentId |
| agent:produce | agentId, text, tokenCount, entropy?, surprisal? |
| agent:tool_call | agentId, tool, args |
| agent:tool_result | agentId, tool, result |
| agent:tool_progress | agentId, tool, filled, total |
| agent:report | agentId, findings |
| agent:done | agentId |
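A sketch of consuming that stream for telemetry. Event names and payload fields are taken from the table above; the reducer itself is hypothetical:

```typescript
// Aggregate runtime events into per-run telemetry. The event union
// mirrors the documented payloads; summarize() is a sketch.
type RuntimeEvent =
  | { type: "agent:spawn"; agentId: string; parentAgentId?: string }
  | { type: "agent:produce"; agentId: string; text: string; tokenCount: number; entropy?: number; surprisal?: number }
  | { type: "agent:tool_call"; agentId: string; tool: string; args: unknown }
  | { type: "agent:tool_result"; agentId: string; tool: string; result: unknown }
  | { type: "agent:tool_progress"; agentId: string; tool: string; filled: number; total: number }
  | { type: "agent:report"; agentId: string; findings: string }
  | { type: "agent:done"; agentId: string };

function summarize(events: RuntimeEvent[]) {
  let tokens = 0;
  const toolCalls = new Map<string, number>();
  for (const e of events) {
    if (e.type === "agent:produce") tokens += e.tokenCount;
    if (e.type === "agent:tool_call") {
      toolCalls.set(e.tool, (toolCalls.get(e.tool) ?? 0) + 1);
    }
  }
  return { tokens, toolCalls };
}

const summary = summarize([
  { type: "agent:spawn", agentId: "a1" },
  { type: "agent:produce", agentId: "a1", text: "...", tokenCount: 42 },
  { type: "agent:tool_call", agentId: "a1", tool: "search", args: { query: "kv" } },
  { type: "agent:done", agentId: "a1" },
]);
console.log(summary.tokens); // 42
```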
lloyal-ai.github.io/lloyal-agents — generated from source with TypeDoc.
Built on:
Every pull request must pass:
The GPU gate runs cross-repo: SDK PRs trigger lloyal.node's GPU workflow, which builds the PR packages against the native runtime and runs the full agent integration suite before merge.
GPU integration tests run against six model architectures and chat template families:
| Model | Params | Quant | Template |
|---|---|---|---|
| SmolLM2-1.7B-Instruct | 1.7B | Q4_K_M | ChatML |
| Llama-3.2-1B-Instruct | 1B | Q4_K_M | Llama 3 |
| Phi-3.5-mini-instruct | 3.8B | Q4_K_M | Phi 3 |
| Qwen3-4B-Thinking | 4B | Q4_K_M | ChatML |
| gemma-3-1b-it | 1B | Q4_K_M | Gemma |
| GLM-Edge | — | Q4_K_M | GLM-Edge |
The native backend ships prebuilt binaries for 13 platform/GPU combinations:
| Platform | arm64 | x64 |
|---|---|---|
| macOS | Metal | CPU |
| Linux | CPU, CUDA, Vulkan | CPU, CUDA, Vulkan |
| Windows | CPU, Vulkan | CPU, CUDA, Vulkan |
Apache-2.0