Child branch handles
Whether this branch has been disposed
Position at which this branch was forked from its parent (0 for root branches)
Internal handle (for debugging)
True if this branch holds a KV lease
True if this branch has no children
Parent branch handle, or null if root
Branch's perplexity (exp of mean surprisal)
Branch's current position (number of tokens decoded)
Sampling-level perplexity (from filtered distribution)
Returns perplexity from the distribution actually sampled from (after top-k/p/temp/penalties). Useful for policy priors and monitoring sampler chain impact.
Compare with perplexity, which is model-level (computed from the raw logits).
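A monitoring sketch, assuming `perplexity` and `samplePerplexity` are numeric getters as described above:

```typescript
// Model-level PPL (raw logits) vs sampling-level PPL (post top-k/p/temp/penalties).
// A large gap suggests the sampler chain is heavily reshaping the distribution.
const modelPpl = branch.perplexity;
const samplerPpl = branch.samplePerplexity;
console.log(`model=${modelPpl.toFixed(2)} sampler=${samplerPpl.toFixed(2)}`);
```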
Async iterator — generate tokens until EOG
Commit-before-yield semantics: every yielded token is already written to KV and accepted into the sampler. Breaking out of the loop is clean — no orphaned uncommitted tokens, perplexity reflects all yielded tokens.
For inspect-before-commit (speculative decoding, tree search), use the produce/commit protocol directly.
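A minimal consumption sketch, assuming the branch itself is async-iterable as described above (yielded tokens are already committed):

```typescript
// Stream tokens until EOG; breaking early is clean because every
// yielded token has already been written to KV and accepted.
const tokens: number[] = [];
for await (const token of branch) {
  tokens.push(token);
  if (tokens.length >= 256) break; // clean early exit — no orphaned state
}
console.log(`ppl over ${tokens.length} tokens: ${branch.perplexity}`);
```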
Record token in the sampler's repeat/presence penalty window
Token to accept
Clear all static logit biases from this branch
Clear all steer biases from this branch
Removes any dynamic logit adjustments set by steer(). Call this after
each generation step if your steer constraints are computed per-step
(e.g., N-gram blocking where the blocked set changes as text grows).
for (let i = 0; i < maxTokens; i++) {
// Compute constraints based on current state
const blocked = computeConstraints(generatedTokens);
branch.steer(blocked.map(t => ({ token: t, bias: -Infinity })));
const { token, isStop } = await branch.produce();
if (isStop) break;
await branch.commit(token);
branch.clearSteer(); // Reset for next iteration
generatedTokens.push(token);
}
Accept and decode — update branch state, then write token to KV
Accepts the token into the sampler penalty window (for correct PPL measurement), then decodes (writing to KV cache via AsyncWorker on the libuv thread pool) and captures the resulting logits for the next produce() call. Accept-first ordering with rollback: if decode throws, sampler/grammar/metrics are restored from clones.
Token to commit (from produce())
Fork this branch to a new sequence (sync)
The child shares the parent's KV prefix in memory (metadata-only under unified KV, no KV buffer copy). Logits, sampler state, and perplexity tracker are cloned so the child can diverge independently. Fork from any branch — root or intermediate — to build arbitrarily deep trees.
Call reseedSampler() on each child for stochastic diversity.
New forked Branch
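A fan-out sketch using `fork()` and `reseedSampler()` as described above; the root is assumed to have been prefilled already:

```typescript
// Fork N siblings off the root. The prompt KV prefix is shared in memory
// (metadata-only), so forking is cheap regardless of prompt length.
const children = [];
for (let i = 0; i < 4; i++) {
  const child = root.fork();
  child.reseedSampler(1000 + i); // distinct PRNG streams → diverse outputs
  children.push(child);
}
```

Without the reseed, every child would replay the same PRNG state and sample identical continuations.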
Get a copy of this branch's captured logits snapshot.
Returns n_vocab floats — the raw logit distribution from the last prefill() or commit() call.
Returns an independent copy of the branch's internal snapshot. The returned Float32Array is safe to hold across async boundaries and is not affected by subsequent decode operations.
Independent copy of the logits snapshot (n_vocab elements)
Compute entropy of the branch's logits distribution
Measures model uncertainty from the branch's captured logits snapshot:
Operates directly on state->logits_snapshot — no JS round-trip.
Logarithm base: "nats" (default) or "bits"
Entropy value in specified base
COST: O(n_vocab) - must sum over all token probabilities
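For reference, the computation is equivalent to this self-contained JS sketch (softmax over the snapshot, then Shannon entropy); the branch method does the same work natively, without the JS round-trip:

```typescript
function entropyOf(logits: Float32Array, base: "nats" | "bits" = "nats"): number {
  // Softmax with max-subtraction for numerical stability
  let max = -Infinity;
  for (const l of logits) if (l > max) max = l;
  let sum = 0;
  for (const l of logits) sum += Math.exp(l - max);
  // H = -Σ p * log(p)
  let h = 0;
  for (const l of logits) {
    const p = Math.exp(l - max) / sum;
    if (p > 0) h -= p * Math.log(p);
  }
  return base === "bits" ? h / Math.LN2 : h;
}

// Uniform distribution over 4 tokens → ln(4) nats = 2 bits
entropyOf(new Float32Array([0, 0, 0, 0]), "bits"); // → 2
```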
Compute surprisal (negative log-likelihood) for a specific token
Measures how "surprising" the model finds the given token from the branch's captured logits snapshot:
Operates directly on state->logits_snapshot — no JS round-trip.
Token ID to compute surprisal for
Logarithm base: "nats" (default) or "bits"
Surprisal value in specified base
COST: O(n_vocab) - softmax normalization required
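For reference, the computation is equivalent to this self-contained JS sketch (surprisal via log-sum-exp); the branch method does the same work natively:

```typescript
function surprisalOf(logits: Float32Array, token: number, base: "nats" | "bits" = "nats"): number {
  // -log p(token) = logsumexp(logits) - logits[token]
  let max = -Infinity;
  for (const l of logits) if (l > max) max = l;
  let sum = 0;
  for (const l of logits) sum += Math.exp(l - max);
  const s = (max + Math.log(sum)) - logits[token];
  return base === "bits" ? s / Math.LN2 : s;
}

// Uniform over 4 tokens: every token has p = 0.25 → surprisal = 2 bits
surprisalOf(new Float32Array([0, 0, 0, 0]), 0, "bits"); // → 2
```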
Bulk-decode tokens into the branch's KV cache and capture logits.
tokens.length is the total count to process; the branch's nBatch
(set at Branch.create) controls how many are sent per llama_decode
call. E.g. 500 tokens with nBatch=64 → 8 calls (7×64 + 1×52).
Advances position by tokens.length. Stores final logits into the
branch's internal snapshot — the next produce()/sample() reads
from it.
Does NOT accept tokens into the repeat-penalty window — for external
tokens (user input between turns), not model-generated tokens.
For model output, use commit() which does accept + decode.
The primary way to feed tokens into a branch's KV cache.
Token IDs to decode
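A typical flow, assuming `promptTokens` is an already-tokenized prompt:

```typescript
// Bulk-decode the prompt (or a user turn between generations) in
// nBatch-sized chunks, then sample from the captured final-position logits.
await branch.prefill(promptTokens);       // advances position by promptTokens.length
const { token } = await branch.produce(); // reads the snapshot prefill() stored
await branch.commit(token);
```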
Sample next token without advancing state (async)
Async contract: local branches resolve immediately; cloud branches may perform an HTTP round-trip. Use produceSync when you know the branch is local and want zero-overhead sampling.
Discard this branch (async)
Async contract: local branches resolve immediately; cloud branches may perform an HTTP round-trip. Use pruneSync when you know the branch is local.
RESTRICT mode: throws if children exist. Use pruneSubtree to cascade-delete an entire subtree.
Discard this branch and all its descendants (async)
Async contract: local branches resolve immediately; cloud branches may perform an HTTP round-trip. Use pruneSubtreeSync when you know the branch is local.
Discard this branch and all its descendants — CASCADE delete (sync)
Iterative post-order traversal: prunes children first, then this branch. Use when tearing down an entire subtree (e.g. abandoned search path). Sets disposed synchronously.
Discard this branch — remove its divergent KV entries and free the handle (sync)
Only removes KV entries divergent from the shared prefix; sibling branches are unaffected. The disposed flag is set synchronously — any call to produce(), commit(), etc. after prune() will throw immediately.
RESTRICT mode: throws if children exist. Use pruneSubtreeSync to cascade-delete an entire subtree.
Reseed the sampler's PRNG for diversity after fork()
CRITICAL for parallel generation: Without reseeding, all forked branches produce identical outputs because they share the same PRNG state.
Only affects stochastic samplers (temperature > 0). Greedy samplers are unchanged.
New seed for the PRNG
Sample next token from branch's logits snapshot
Applies the branch's full sampler chain (top-k, top-p, temperature, repeat/presence penalties) to the captured logits.
Sampled token ID
Replace or remove the grammar constraint
Pass a GBNF grammar string to constrain generation. Pass empty string or undefined to remove the constraint. The grammar state is cloned on fork(), so sibling branches can diverge independently after hot-swap.
grammarStr (optional): string — GBNF grammar string, or empty/undefined to remove
Set lazy grammar — unconstrained until trigger, then grammar-constrained
Generation runs freely until a trigger pattern or token fires, at which
point the grammar activates and constrains subsequent tokens. Used for
tool-call generation: model writes freely until <tool_call>, then
grammar forces valid XML structure.
The grammar state is cloned on fork(), so sibling branches can diverge independently. Call again after a tool result prefill to reset.
GBNF grammar string
Trigger conditions from formatChat().grammarTriggers
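A tool-call sketch; the method name `setLazyGrammar` and the shape of the `formatChat()` result are assumptions based on the parameter descriptions above:

```typescript
// Generation runs free until a trigger fires (e.g. "<tool_call>"),
// then the grammar constrains output to valid tool-call structure.
const { promptTokens, grammar, grammarTriggers } = formatChat(messages, { tools });
branch.setLazyGrammar(grammar, grammarTriggers); // assumed method name
await branch.prefill(promptTokens);
```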
Set static logit biases on this branch
Unlike steer (which is NOT inherited on fork), logit biases ARE cloned when forking. Use for persistent constraints that should propagate to child branches.
Applied during sample() in order: Grammar → Logit Bias → Steer → Sampler Chain
Array of token adjustments. Use -Infinity to block,
positive to boost, negative to reduce.
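A sketch of a persistent, fork-inherited constraint; the method name `setLogitBias` and the token variables are illustrative assumptions:

```typescript
// Persistent constraints: cloned into children on fork(), unlike steer().
branch.setLogitBias([
  { token: bannedToken, bias: -Infinity }, // never sample this token
  { token: favoredToken, bias: 2.0 },      // boost its probability
]);
const child = branch.fork(); // child inherits the same biases
```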
Replace the sampler chain with new parameters (memoized)
If the new params match the current chain's params, this is a no-op. Otherwise the old chain is freed and a new one is created. Use for Entropy-Driven Temperature (EDT) and other adaptive sampling strategies that adjust parameters per-step.
New sampling parameters
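An Entropy-Driven Temperature sketch; the method names `entropy` and `setSamplerParams`, the entropy accessor signature, and the schedule constants are illustrative assumptions:

```typescript
// EDT: raise temperature when the model is uncertain, lower it when confident.
// Memoization makes the per-step call cheap when entropy (and thus the
// derived temperature) hasn't changed.
for (let i = 0; i < maxTokens; i++) {
  const h = branch.entropy("nats");                // assumed entropy accessor
  const temperature = 0.4 + 0.2 * h;               // simple illustrative schedule
  branch.setSamplerParams({ ...baseParams, temperature }); // no-op if unchanged
  const { token, isStop } = await branch.produce();
  if (isStop) break;
  await branch.commit(token);
}
```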
Apply dynamic logit adjustments for this branch only
Unlike logit_bias in sampling params (which is cloned on fork), steer biases
are NOT inherited by child branches. Each branch manages its own steer state
independently. This makes steer ideal for path-dependent constraints.
Use cases:
Sampling order: Grammar → Logit Bias → Steer → Sampler Chain
Array of token adjustments. Use -Infinity to completely
block a token, positive values to boost probability, negative to reduce.
// Compute which tokens would create repeated 4-grams
const blocked = computeNgramBlocks(generatedTokens, 4); // n = 4
// Block those tokens for this sample only
branch.steer(blocked.map(t => ({ token: t, bias: -Infinity })));
const { token } = await branch.produce(); // Blocked tokens won't be sampled
await branch.commit(token);
// Clear for next iteration (recompute based on new history)
branch.clearSteer();
// Each beam penalizes tokens chosen by siblings this step
for (const beam of beams) {
// Collect tokens chosen by other beams
const siblingTokens = beams
.filter(b => b !== beam && b.lastToken !== undefined)
.map(b => b.lastToken);
// Penalize sibling choices to encourage diversity
beam.branch.steer(siblingTokens.map(t => ({ token: t, bias: -2.0 })));
const { token } = await beam.branch.produce();
await beam.branch.commit(token);
beam.lastToken = token;
beam.branch.clearSteer();
}
Static create: Create a root branch at the given position
The branch takes ownership of the sequence and creates its own sampler chain from the provided params. Call prefill() to decode prompt tokens and capture the logit distribution before forking.
SessionContext to create branch on
Starting position (typically prompt token count)
params (optional): SamplingParams — Sampling parameters (temperature, topP, etc.)
nBatch (optional): number — Per-branch batch size override (defaults to context nBatch). Controls chunk size for prefill(). Has no effect on single-token commit(), which uses a zero-allocation fast path.
grammar (optional): string — GBNF grammar string for constrained generation. When provided, sample() returns only grammar-valid tokens. The grammar state is cloned on fork(), so sibling branches can diverge independently.
New Branch instance
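A creation sketch, assuming `ctx` is a SessionContext and the positional argument order shown matches the parameter list above:

```typescript
// Root branch at position 0; prefill() decodes the prompt and captures
// the logit distribution before any forking happens.
const root = Branch.create(ctx, 0, { temperature: 0.8, topP: 0.95 });
await root.prefill(promptTokens);
// ...fork children from root here
```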
Forkable inference handle for covalent generation
A Branch owns everything needed for independent generation: a KV cache sequence, sampler chain, logits snapshot, and perplexity tracker.
Forking is cheap — the KV prefix is shared in memory (metadata-only operation under unified KV — no KV tensor buffers are copied), so sibling branches read from the same physical KV entries. Only tokens decoded after the fork point are exclusive to each branch.
Branches form trees, not just flat lists. Fork from root for best-of-N, fork from children for tree search/beam search, fork from a draft for speculative decoding.
The produce/commit protocol separates sampling from state advancement: produce() samples without writing to KV, letting you inspect the result before deciding to commit().
Example: Best-of-N with perplexity selection
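A best-of-N sketch under the API described above (`fork`, `reseedSampler`, `produce`/`commit`, `perplexity`, `prune`); the budget and seed values are illustrative:

```typescript
// Fork N candidates off a prefilled root, generate a fixed token budget
// on each, then keep the branch with the lowest perplexity.
const candidates = Array.from({ length: 4 }, (_, i) => {
  const b = root.fork();
  b.reseedSampler(1234 + i); // distinct PRNG streams → diverse candidates
  return b;
});
for (const b of candidates) {
  for (let i = 0; i < 64; i++) {
    const { token, isStop } = await b.produce();
    if (isStop) break;
    await b.commit(token);
  }
}
candidates.sort((a, b) => a.perplexity - b.perplexity);
const best = candidates[0];
for (const b of candidates.slice(1)) await b.prune(); // free losing branches
```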