lloyal-agents API Reference

    Class BranchStore

    High-throughput multi-branch decode operations

    The naive approach to N-branch generation is N sequential llama_decode() calls — each paying full GPU kernel launch overhead, memory barrier, and PCIe round-trip. BranchStore eliminates this by packing all branches into a single llama_batch and dispatching once: O(1) GPU round-trips regardless of branch count. The GPU parallelizes across sequences within the batch, so N branches approach the wall-time cost of 1.
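    To make the dispatch-count argument concrete, the sketch below counts calls against a stand-in backend. `CountingBackend` and its `decode` method are hypothetical and exist only for this illustration; they are not part of lloyal-agents or llama.cpp.

```typescript
// Hypothetical stand-in for a decode backend: it only counts dispatches.
class CountingBackend {
  dispatches = 0;

  // One call models one llama_decode() over however many rows the batch holds.
  decode(rows: { seqId: number; token: number }[]): void {
    if (rows.length === 0) return;
    this.dispatches += 1;
  }
}

// 32 branches, one pending token each (token values are arbitrary).
const pendingRows = Array.from({ length: 32 }, (_, seqId) => ({ seqId, token: 100 + seqId }));

// Naive: one dispatch per branch.
const naive = new CountingBackend();
for (const row of pendingRows) naive.decode([row]);

// Packed: every branch's row goes into a single batch.
const packed = new CountingBackend();
packed.decode(pendingRows);
```

    In this sketch `naive.dispatches` ends at 32 and `packed.dispatches` at 1, which is the difference the paragraph above describes.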

    Two operations, two packing strategies:

    commit(): Generation step. Each branch contributes exactly 1 token. Packs N tokens into a single batch via decode_each (one row per sequence, each at its own position) and issues a single llama_decode() call. Logits are captured per branch at batch index i. O(N) total work and O(1) GPU dispatches, so dispatch overhead amortizes to O(1/N) per branch. Accept-first ordering with rollback: each token is accepted into its branch's repeat-penalty window before decode, and the pre-accept state is restored from clones if decode throws.

    prefill(): Bulk token injection. Each branch contributes a variable-length token array. Uses a two-pass bin-packing algorithm:

    • Pass 1 (planning): Greedy first-fit packs items into chunks ≤ nBatch. Items larger than nBatch get a dedicated chunk and fall through to decode_many's internal auto-chunking (ceil(nTokens / nBatch) calls).
    • Pass 2 (dispatch): Normal chunks dispatch via decode_scatter (one llama_decode per chunk). Logits are indexed by flattened cursor position: for item k in a chunk, logits live at cursor + nTokens[k] - 1.
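
    The two passes can be sketched as follows. This is a minimal illustration under assumptions, not the library's implementation: `planChunks`, `logitRows`, `dispatchCount` and their types are hypothetical names, every item is assumed to hold at least one token, and items are placed in input order.

```typescript
// Hypothetical planner types; not part of the lloyal-agents API.
interface PlanItem { index: number; nTokens: number }
interface Chunk { items: PlanItem[]; used: number; oversized: boolean }

// Pass 1: greedy first-fit. Each item goes into the first normal chunk with
// room; an item larger than nBatch gets a dedicated chunk of its own.
function planChunks(lengths: number[], nBatch: number): Chunk[] {
  const chunks: Chunk[] = [];
  lengths.forEach((nTokens, index) => {
    const item: PlanItem = { index, nTokens };
    if (nTokens > nBatch) {
      chunks.push({ items: [item], used: nTokens, oversized: true });
      return;
    }
    const fit = chunks.find(c => !c.oversized && c.used + nTokens <= nBatch);
    if (fit) {
      fit.items.push(item);
      fit.used += nTokens;
    } else {
      chunks.push({ items: [item], used: nTokens, oversized: false });
    }
  });
  return chunks;
}

// Pass 2 bookkeeping: within a chunk, item k's logits sit at
// cursor + nTokens[k] - 1, where cursor is the item's first flattened row.
function logitRows(chunk: Chunk): Map<number, number> {
  const rows = new Map<number, number>();
  let cursor = 0;
  for (const item of chunk.items) {
    rows.set(item.index, cursor + item.nTokens - 1);
    cursor += item.nTokens;
  }
  return rows;
}

// Decode calls implied by a plan: one per normal chunk,
// ceil(nTokens / nBatch) for an oversized one.
function dispatchCount(chunks: Chunk[], nBatch: number): number {
  return chunks.reduce(
    (n, c) => n + (c.oversized ? Math.ceil(c.used / nBatch) : 1),
    0,
  );
}
```

    With lengths [200, 12, 800] and nBatch = 400, this sketch plans one normal chunk of 212 tokens (logits at rows 199 and 211) plus one oversized chunk that needs 2 calls, for 3 decode calls in total.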

    For T total tokens across N branches with batch capacity B:

    • Best case (T ≤ B): 1 GPU dispatch, all branches in one batch.
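    • Lower bound: ceil(T / B) dispatches, reached when every dispatch is fully packed. First-fit keeps each item that fits in nBatch whole, so a plan can use more chunks than this.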
    • Amortized per-token GPU overhead: O(1/B) — vanishes as batch fills.

    Does NOT accept tokens into the sampler penalty window — use for external/replayed tokens where repeat-penalty tracking is unwanted. For model-generated tokens, use commit instead.

    Both methods take [branch, token(s)] tuples — the branch-to-token binding is structural, not positional. After either call, each branch's logits snapshot is updated with the output distribution from its decoded token(s), ready for the next produce()/sample() call.

    // Example: one generation step for every branch
    const store = new BranchStore(ctx);
    const entries = await Promise.all(branches.map(async b => [b, (await b.produce()).token] as [Branch, number]));
    await store.commit(entries); // 32 tokens, 1 llama_decode()

    // Example: fork three branches and generate until every branch stops
    const store = new BranchStore(ctx);
    const branches = [];
    for (const _ of [1, 2, 3]) branches.push(await root.fork());

    for (let step = 0; step < 50; step++) {
      const produced = await Promise.all(branches.map(async b => [b, await b.produce()] as const));
      const live = produced.filter(([, p]) => !p.isStop);
      if (!live.length) break;
      await store.commit(live.map(([b, p]) => [b, p.token]));
    }

    // Example: variable-length prefill
    await store.prefill([
      [branchA, systemPromptTokens], // 200 tokens
      [branchB, shortQueryTokens],   // 12 tokens
      [branchC, longDocumentTokens], // 800 tokens
    ]);
    // Bin-packed: at least ceil(1012 / nBatch) GPU dispatches

    Constructors

    Accessors

    Methods

    • Batched single-token commit for model-generated tokens

      Each tuple [branch, token] binds one token to one branch. Accepts each token into its branch's repeat-penalty window (for correct PPL measurement), then decodes all N tokens in a single llama_decode() call via decode_each and captures logits per-branch. Accept-first ordering with rollback: if decode throws, sampler/grammar/metrics are restored from clones taken before the accept.

      Parameters

      • entries: [Branch, number][]

        Array of [branch, token] tuples (branches must not be disposed)

      Returns Promise<void>

      Throws if any branch is disposed
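
      The accept-first ordering with rollback described for this method can be sketched as below. It is an illustration under assumptions: `SketchBranch`, `SamplerState`, `cloneState` and `commitSketch` are hypothetical, only the penalty window is modeled (grammar and metrics are omitted), and the decode callback is synchronous for brevity.

```typescript
// Hypothetical sampler state; only the repeat-penalty window is modeled here.
interface SamplerState { penaltyWindow: number[] }
interface SketchBranch { state: SamplerState }

function cloneState(s: SamplerState): SamplerState {
  return { penaltyWindow: [...s.penaltyWindow] };
}

// Accept-first ordering: snapshot every branch, accept every token, then
// decode once. If decode throws, each branch gets its snapshot back and the
// error propagates to the caller.
function commitSketch(
  entries: [SketchBranch, number][],
  decodeBatch: (tokens: number[]) => void,
): void {
  const snapshots = entries.map(([branch]) => ({ branch, saved: cloneState(branch.state) }));
  for (const [branch, token] of entries) branch.state.penaltyWindow.push(token);
  try {
    decodeBatch(entries.map(([, token]) => token));
  } catch (err) {
    for (const { branch, saved } of snapshots) branch.state = saved;
    throw err;
  }
}
```

      Taking the clones before the first accept means a failed decode leaves every branch exactly as it was, so the caller can retry the same entries.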

    • Batched variable-length prefill for external tokens

      Each tuple [branch, tokens] binds a token array to one branch. Each branch can receive a different number of tokens — decode_scatter handles variable-length runs and auto-chunks to fit nBatch.

      Does NOT call accept_token — use for external/replayed tokens where repeat-penalty tracking is unwanted. For model-generated tokens, use commit instead.

      Parameters

      • entries: [Branch, number[]][]

        Array of [branch, tokens] tuples (branches must not be disposed)

      Returns Promise<void>

      Throws if any branch is disposed
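
      One way to picture variable-length runs sharing a single batch is the flattening below. It is a sketch under assumptions: `BatchRow`, `Run` and `scatterRows` are hypothetical, the row fields are loosely modeled on llama_batch (token, pos, seq_id, logits), and requesting logits only on the last token of each run is an assumption chosen to match reading logits at cursor + nTokens[k] - 1.

```typescript
// Hypothetical flattened batch row, loosely modeled on llama_batch fields.
interface BatchRow { token: number; pos: number; seqId: number; wantLogits: boolean }

// One branch's contribution: its sequence id, its next free position, its tokens.
interface Run { seqId: number; startPos: number; tokens: number[] }

// Flatten runs of different lengths into consecutive rows. Each run keeps its
// own sequence id and continues from its own position.
function scatterRows(runs: Run[]): BatchRow[] {
  const rows: BatchRow[] = [];
  for (const run of runs) {
    run.tokens.forEach((token, i) => {
      rows.push({
        token,
        pos: run.startPos + i,
        seqId: run.seqId,
        // Assumption: only the final token of a run needs an output distribution.
        wantLogits: i === run.tokens.length - 1,
      });
    });
  }
  return rows;
}
```

      For runs of 3 tokens and 1 token, the flattened batch has 4 rows and the logits rows are 2 and 3, which matches the cursor arithmetic used by prefill.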

    • Retain only the winner branch — evict all other leases and free their slots.

      Nuclear operation: calls kv::seq_keep on the winner's seq_id (stripping all other sequences from KV cache in a single pass), then frees all loser slots and rebuilds the vacancy list. The winner's topology is reset (no parent, no children).

      Parameters

      • winner: Branch

        The branch to keep (must not be disposed, must hold a lease)

      Returns Promise<void>

      Throws if winner is disposed or has no lease
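
      The slot bookkeeping behind this operation can be sketched with a toy lease table. Everything here is hypothetical (`SlotTable`, `lease`, `keepWinner`, the lowest-id-first vacancy order), and the KV cache step is reduced to a comment; the sketch only shows losers being freed, the vacancy list being rebuilt, and the winner's topology being reset.

```typescript
// Hypothetical slot record: one leased sequence id plus its place in the fork tree.
interface Slot { seqId: number; parent: number | null; children: number[] }

class SlotTable {
  private readonly capacity: number;
  private readonly live = new Map<number, Slot>();
  private vacant: number[] = [];

  constructor(capacity: number) {
    this.capacity = capacity;
    this.rebuildVacancy();
  }

  // Rebuild the free list so the lowest free sequence id is handed out first.
  private rebuildVacancy(): void {
    this.vacant = [];
    for (let id = this.capacity - 1; id >= 0; id--) {
      if (!this.live.has(id)) this.vacant.push(id);
    }
  }

  lease(parent: number | null = null): number {
    const seqId = this.vacant.pop();
    if (seqId === undefined) throw new Error("no free slots");
    this.live.set(seqId, { seqId, parent, children: [] });
    if (parent !== null) this.live.get(parent)?.children.push(seqId);
    return seqId;
  }

  keepWinner(winner: number): void {
    if (!this.live.has(winner)) throw new Error("winner has no lease");
    // A real implementation would strip every other sequence from the KV cache here.
    this.live.clear();
    this.live.set(winner, { seqId: winner, parent: null, children: [] });
    this.rebuildVacancy();
  }

  get liveCount(): number { return this.live.size; }
  get freeCount(): number { return this.vacant.length; }
}
```

      After `keepWinner`, every slot except the winner's is available again, so a fresh round of forks can start from the surviving branch.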