Batched single-token commit for model-generated tokens
Each tuple [branch, token] binds one token to one branch.
Accepts each token into its branch's repeat-penalty window (for correct
PPL measurement), then decodes all N tokens in a single llama_decode()
call via decode_each and captures logits per-branch. Accept-first
ordering with rollback: if decode throws, sampler/grammar/metrics are
restored from clones taken before the accept.
Array of [branch, token] tuples (branches must not be disposed)
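The accept-first ordering with rollback described above can be sketched in isolation. This is a toy simulation, not the real API: `ToySampler`, `commitSketch`, and the `decodeFn` callback are illustrative stand-ins for the sampler clone/restore and the single batched `llama_decode()` call.

```typescript
// Toy sketch of commit()'s accept-first ordering with rollback.
// ToySampler / commitSketch / decodeFn are illustrative assumptions.
type Token = number;

class ToySampler {
  constructor(public window: Token[] = []) {}
  acceptToken(t: Token) { this.window.push(t); }          // repeat-penalty window
  clone(): ToySampler { return new ToySampler([...this.window]); }
}

interface Branch { sampler: ToySampler; }

function commitSketch(
  items: Array<[Branch, Token]>,
  decodeFn: (tokens: Token[]) => void,   // stands in for one llama_decode()
): void {
  // 1. Snapshot sampler state so a decode failure can be rolled back.
  const snapshots = items.map(([b]) => b.sampler.clone());
  // 2. Accept-first: push every token into its branch's penalty window.
  for (const [branch, token] of items) branch.sampler.acceptToken(token);
  try {
    // 3. One batched decode for all N tokens.
    decodeFn(items.map(([, t]) => t));
  } catch (e) {
    // 4. Rollback: restore samplers from the pre-accept clones.
    items.forEach(([branch], i) => { branch.sampler = snapshots[i]; });
    throw e;
  }
}
```

The point of the ordering is that the penalty window already reflects the new tokens when decode runs, yet a thrown decode leaves no trace: the clones taken in step 1 fully restore step 2.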
Batched variable-length prefill for external tokens
Each tuple [branch, tokens] binds a token array to one branch.
Each branch can receive a different number of tokens — decode_scatter
handles variable-length runs and auto-chunks to fit nBatch.
Does NOT call accept_token — use for external/replayed tokens where repeat-penalty tracking is unwanted. For model-generated tokens, use commit instead.
Array of [branch, tokens] tuples (branches must not be disposed)
Retain only the winner branch — evict all other leases and free their slots.
Nuclear operation: calls kv::seq_keep on the winner's seq_id (stripping all
other sequences from KV cache in a single pass), then frees all loser slots
and rebuilds the vacancy list. The winner's topology is reset (no parent, no children).
The branch to keep (must not be disposed, must hold a lease)
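The eviction pass above can be sketched with an in-memory stand-in. The store shape (`leases`, `vacant`) and the injected `seqKeep` callback are assumptions for illustration; the real operation delegates to `kv::seq_keep`.

```typescript
// Toy sketch of winner retention: one seq_keep call strips every other
// sequence from the KV cache, then all loser slots return to the vacancy list.
interface Lease { seqId: number; }

class ToyBranchStore {
  vacant: number[] = [];
  leases = new Map<number, Lease>();           // branchId -> lease

  keepOnly(winnerId: number, seqKeep: (seqId: number) => void): void {
    const winner = this.leases.get(winnerId);
    if (!winner) throw new Error("winner holds no lease");
    seqKeep(winner.seqId);                      // single-pass KV eviction
    for (const [id, lease] of this.leases) {
      if (id !== winnerId) this.vacant.push(lease.seqId);  // free loser slot
    }
    this.leases = new Map([[winnerId, winner]]);
    this.vacant.sort((x, y) => x - y);          // rebuild vacancy list
  }
}
```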
High-throughput multi-branch decode operations
The naive approach to N-branch generation is N sequential llama_decode() calls — each paying full GPU kernel launch overhead, memory barrier, and PCIe round-trip. BranchStore eliminates this by packing all branches into a single llama_batch and dispatching once: O(1) GPU round-trips regardless of branch count. The GPU parallelizes across sequences within the batch, so N branches approach the wall-time cost of 1.
Two operations, two packing strategies:
commit() — Generation step. Each branch contributes exactly 1 token. Packs N tokens into a single batch via decode_each (one row per sequence, each at its branch's current position). Single llama_decode() call. Logits are captured per-branch at batch index i. O(N) total work, O(1) GPU dispatches, O(1) amortized dispatch overhead per branch. Accept-first ordering with rollback: accepts each token into its branch's repeat-penalty window before decode, and restores from clones if decode throws.

prefill() — Bulk token injection. Each branch contributes a variable-length token array. Uses a two-pass bin-packing algorithm via decode_scatter (one llama_decode per chunk). Logits are indexed by flattened cursor position: for item k in a chunk, logits live at cursor + nTokens[k] - 1. For T total tokens across N branches with batch capacity B, the work splits into roughly ⌈T/B⌉ chunks, so dispatch count scales with total tokens, not branch count.
Does NOT accept tokens into the sampler penalty window — use for external/replayed tokens where repeat-penalty tracking is unwanted. For model-generated tokens, use commit instead.
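The chunk planning behind prefill() — auto-chunking to fit nBatch and cursor-based logits indexing — can be sketched as a pure function. `planChunks` and its return shape are illustrative assumptions, not the library's internal types.

```typescript
// Toy sketch of prefill()'s chunker: greedily pack each branch's token run
// into chunks of at most nBatch tokens, splitting runs that do not fit, and
// record each item's flattened start offset (cursor) within its chunk.
type Chunk = { items: Array<{ branch: number; tokens: number[]; cursor: number }> };

function planChunks(
  runs: Array<[number, number[]]>,   // [branchId, tokens]
  nBatch: number,
): Chunk[] {
  const chunks: Chunk[] = [];
  let current: Chunk = { items: [] };
  let used = 0;
  for (const [branch, tokens] of runs) {
    let rest = tokens;
    while (rest.length > 0) {
      if (used === nBatch) { chunks.push(current); current = { items: [] }; used = 0; }
      const take = Math.min(rest.length, nBatch - used);
      current.items.push({ branch, tokens: rest.slice(0, take), cursor: used });
      used += take;
      rest = rest.slice(take);
    }
  }
  if (current.items.length > 0) chunks.push(current);
  return chunks;
}
```

Per the indexing rule, an item's logits live at `cursor + tokens.length - 1`; that index only matters for the piece that contains the run's final token.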
Both methods take [branch, token(s)] tuples: the branch-to-token binding is structural, not positional. After either call, each branch's logits snapshot is updated with the output distribution from its decoded token(s), ready for the next produce()/sample() call.

Example: 32-branch generation step — one GPU dispatch
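A minimal sketch of the heading above. The `commit` function here is a stand-in that only counts dispatches; the real call signature and Branch shape are assumptions for illustration.

```typescript
// 32 branches, each contributing one sampled token, decoded in ONE dispatch.
type Token = number;
interface Branch { id: number; lastToken: Token; }

let dispatches = 0;
function commit(items: Array<[Branch, Token]>): void {
  dispatches += 1;                       // one llama_decode() for the whole batch
  for (const [branch, token] of items) branch.lastToken = token;
}

const branches: Branch[] = Array.from({ length: 32 }, (_, id) => ({ id, lastToken: -1 }));
// Pretend each branch sampled a token (in practice this comes from sample()).
const step = branches.map((b): [Branch, Token] => [b, 100 + b.id]);
commit(step);                            // 32 branches, 1 GPU dispatch
```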
Example: Best-of-N with batched commit
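A toy Best-of-N loop under stated assumptions: the batched `commit`, the per-token scoring, and the final winner retention are all simulated (scores here are synthetic, not real logprobs).

```typescript
// Best-of-N: N candidate branches advance via batched commits, then only
// the highest-scoring branch survives (losers' slots would be freed).
type Token = number;
interface Branch { id: number; tokens: Token[]; logprob: number; }

function commit(items: Array<[Branch, Token]>): void {
  // One batched decode for all branches (simulated); accumulate a toy score.
  for (const [b, t] of items) { b.tokens.push(t); b.logprob += -t / 100; }
}

const n = 4;
let branches: Branch[] = Array.from({ length: n }, (_, id) => ({ id, tokens: [], logprob: 0 }));
for (let step = 0; step < 3; step++) {
  // Each branch "samples" its next token; all n advance in one commit.
  commit(branches.map((b): [Branch, Token] => [b, b.id + step]));
}
// Pick the winner by cumulative score and evict everyone else.
const winner = branches.reduce((best, b) => (b.logprob > best.logprob ? b : best));
branches = [winner];                     // i.e. retain-winner: losers evicted
```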
Example: Asymmetric prefill — variable-length injections, auto-chunked
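A toy version of the heading above, assuming a hypothetical `prefill` signature: three branches inject runs of different lengths, and dispatch count follows total tokens over batch capacity rather than branch count.

```typescript
// Asymmetric prefill: variable-length runs, auto-chunked to fit nBatch.
type Token = number;

let decodes = 0;
function prefill(runs: Array<[number, Token[]]>, nBatch: number): void {
  const total = runs.reduce((s, [, t]) => s + t.length, 0);
  // Runs may split across chunks, so tight greedy packing yields ceil(T/B).
  decodes += Math.ceil(total / nBatch);  // one llama_decode per chunk
}

// 10 + 3 + 19 = 32 external tokens across three branches, nBatch = 16.
prefill([[0, Array(10).fill(1)], [1, Array(3).fill(2)], [2, Array(19).fill(3)]], 16);
// 2 chunked decodes, not 3 per-branch llama_decode() calls.
```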