liblloyal 1.0.0
Branched Inference for llama.cpp
Handle table and batched decode orchestrator for branch management.
#include <lloyal/branch.hpp>
Classes | |
| struct | Allocation |
| Result of allocate(): a slot handle + its leased seq_id. | |
Public Member Functions | |
| BranchStore (size_t initial_capacity=16) | |
| Construct a branch store with initial slot capacity. | |
| ~BranchStore () | |
| Destructor — frees CPU resources. | |
| Allocation | allocate () |
| Allocate a branch slot + KV lease atomically. | |
| void | release (BranchHandle handle) |
| Release a branch slot + evict its KV lease. | |
| void | init_tenancy (llama_context *ctx) |
| Initialize KV tenancy after context creation. | |
| void | drain () |
| Explicit teardown — evict all leases while context is alive. | |
| void | retainOnly (BranchHandle winner) |
| Keep only the winner — nuclear KV + CPU cleanup. | |
| size_t | available () const |
| Number of vacant seq_ids available for acquisition. | |
| KvPressure | kv_pressure () const |
| KV cache pressure snapshot — O(1), no tree walking. | |
| void | add_cells_used (uint32_t n) |
| Increment cells_used counter (for standalone prefill/step outside BranchStore methods) | |
| BranchHandle | parent (BranchHandle h) const |
| Get a branch's parent handle. | |
| llama_pos | fork_head (BranchHandle h) const |
| Get a branch's fork head (parent position at fork time) | |
| const std::vector< BranchHandle > & | children (BranchHandle h) const |
| Get a branch's child handles. | |
| bool | isLeaf (BranchHandle h) const |
| Test whether a branch is a leaf (no children) | |
| bool | isActive (BranchHandle h) const |
| Test whether a branch holds a KV lease. | |
| BranchState * | get (BranchHandle handle) |
| Look up branch state by handle. | |
| const BranchState * | get (BranchHandle handle) const |
| Look up branch state by handle. | |
| template<SamplingParamsLike P> | |
| SamplerChainHandle | create_sampler (const P &params) |
| Create a sampler chain and register it. | |
| SamplerChainHandle | clone_sampler (SamplerChainHandle h) |
| Clone a sampler chain (for fork) | |
| void | free_sampler (SamplerChainHandle h) |
| Free a sampler chain. | |
| llama_sampler * | get_sampler_chain (SamplerChainHandle h) const |
| Dereference a sampler chain handle (non-owning) | |
| bool | sampler_has_dist (SamplerChainHandle h) const |
| Check if a sampler chain ends with dist (stochastic) or greedy. | |
| GrammarHandle | create_grammar (const llama_model *model, const char *grammar_str, const char *root="root") |
| Create a grammar sampler and register it. | |
| GrammarHandle | create_grammar_lazy (const llama_model *model, const char *grammar_str, const std::vector< std::string > &trigger_patterns, const std::vector< llama_token > &trigger_tokens, const char *root="root") |
| Create a lazy grammar (unconstrained until trigger fires) | |
| GrammarHandle | clone_grammar (GrammarHandle h) |
| Clone a grammar (for fork) | |
| void | free_grammar (GrammarHandle h) |
| Free a grammar. | |
| llama_sampler * | get_grammar_sampler (GrammarHandle h) const |
| Dereference a grammar handle (non-owning) | |
| MetricsHandle | create_metrics () |
| Create a metrics tracker and register it. | |
| MetricsHandle | clone_metrics (MetricsHandle h) |
| Clone a metrics tracker (for fork) | |
| void | free_metrics (MetricsHandle h) |
| Free a metrics tracker. | |
| void | add_model_surprisal (MetricsHandle h, float surprisal) |
| Add model-level surprisal to a metrics tracker. | |
| void | add_sampling_surprisal (MetricsHandle h, float surprisal) |
| Add sampling-level surprisal to a metrics tracker. | |
| float | get_model_ppl (MetricsHandle h) const |
| Get model-level perplexity from a metrics tracker. | |
| float | get_sampling_ppl (MetricsHandle h) const |
| Get sampling-level perplexity from a metrics tracker. | |
| void | decode_each (std::span< const DecodeEachItem > items) |
| Decode one token per branch in a single GPU dispatch. | |
| void | decode_scatter (std::span< const DecodeScatterItem > items) |
| Decode variable token counts per branch with auto-chunking. | |
Handle table and batched decode orchestrator for branch management.
BranchStore covers two concerns:
Slot management — A pool of BranchState slots addressed by opaque handles with generation counters for ABA prevention. Slot 0 is permanently reserved (handle 0 = INVALID_HANDLE). Auto-grows by doubling up to 65535 slots. Methods: allocate(), release(), get().
Batched decode — Orchestrates multi-branch GPU dispatches that amortize llama_decode() overhead across N branches. Each method validates handles, builds the appropriate decode primitive's input, dispatches, captures logits into per-branch snapshots, and advances positions atomically. Methods: decode_each(), decode_scatter().
Batched decode methods vs free-function decode:
| Method | Tokens/branch | Chunking | Logit capture |
|---|---|---|---|
| decode_each() | 1 | No (1 call) | Per-branch |
| decode_scatter() | Variable | Auto | Per-branch |
| branch::step() | 1 | No | Single branch |
The number of leasable seq_ids is bounded by llama_n_seq_max(ctx) (typically 256). allocate() acquires both a slot and a lease atomically; release()/drain() return both resources symmetrically.
Definition at line 392 of file branch.hpp.
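The slot-management scheme described above (opaque handles, generation counters, reserved slot 0) can be illustrated with a minimal sketch. Everything here is hypothetical: the `HandleTable` and `Slot` names, the 16/16 bit split between generation and index, and the omission of auto-growth are assumptions made for illustration, not the layout used in branch.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layout: high 16 bits = generation, low 16 bits = slot index.
using Handle = std::uint32_t;
constexpr Handle kInvalidHandle = 0;

struct Slot {
    std::uint16_t generation = 0;
    bool in_use = false;
};

class HandleTable {
public:
    explicit HandleTable(std::size_t capacity = 16)
        : slots_(capacity < 2 ? 2 : capacity) {
        assert(slots_.size() <= 65536);
        // Slot 0 is never handed out, so handle 0 can mean "invalid".
        for (std::size_t i = slots_.size() - 1; i >= 1; --i)
            freelist_.push_back(static_cast<std::uint16_t>(i));
    }

    Handle allocate() {
        if (freelist_.empty()) return kInvalidHandle;  // growth omitted here
        std::uint16_t idx = freelist_.back();
        freelist_.pop_back();
        slots_[idx].in_use = true;
        return make(idx, slots_[idx].generation);
    }

    void release(Handle h) {
        Slot* s = get(h);
        if (s == nullptr) return;  // invalid or stale handle: no-op
        s->in_use = false;
        // Bumping the generation invalidates every outstanding copy of h.
        s->generation = static_cast<std::uint16_t>(s->generation + 1);
        freelist_.push_back(index_of(h));
    }

    // Validates index, in-use flag, and generation before dereferencing.
    Slot* get(Handle h) {
        std::uint16_t idx = index_of(h);
        if (idx == 0 || idx >= slots_.size()) return nullptr;
        Slot& s = slots_[idx];
        if (!s.in_use || s.generation != generation_of(h)) return nullptr;
        return &s;
    }

private:
    static std::uint16_t index_of(Handle h) {
        return static_cast<std::uint16_t>(h & 0xFFFFu);
    }
    static std::uint16_t generation_of(Handle h) {
        return static_cast<std::uint16_t>(h >> 16);
    }
    static Handle make(std::uint16_t idx, std::uint16_t gen) {
        return (static_cast<Handle>(gen) << 16) | idx;
    }

    std::vector<Slot> slots_;
    std::vector<std::uint16_t> freelist_;
};
```

In this sketch, a handle kept after release() fails the generation check even when its slot has been reused, which is the ABA case the generation counter guards against.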
| BranchStore (size_t initial_capacity=16) | inline explicit |
Construct a branch store with initial slot capacity.
| initial_capacity | Number of slots to pre-allocate (minimum 2) |
Definition at line 398 of file branch.hpp.
| ~BranchStore () | inline |
Destructor — frees CPU resources.
drain() must be called first while the llama_context is still alive.
Definition at line 417 of file branch.hpp.
| void add_cells_used (uint32_t n) | inline |
Increment cells_used counter (for standalone prefill/step outside BranchStore methods)
Definition at line 578 of file branch.hpp.
| void add_model_surprisal (MetricsHandle h, float surprisal) | inline |
Add model-level surprisal to a metrics tracker.
| h | Metrics handle |
| surprisal | Surprisal in nats |
Definition at line 846 of file branch.hpp.
| void add_sampling_surprisal (MetricsHandle h, float surprisal) | inline |
Add sampling-level surprisal to a metrics tracker.
| h | Metrics handle |
| surprisal | Surprisal in nats |
Definition at line 860 of file branch.hpp.
| Allocation allocate () | inline |
Allocate a branch slot + KV lease atomically.
Acquires a seq_id from tenancy, then a slot from the freelist. If either fails, both are rolled back cleanly.
Definition at line 439 of file branch.hpp.
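The two-step acquisition with rollback can be sketched as follows. This is an illustrative model only: the `Pool` type, the `handle`/`seq_id` field names, and the `kNoLease` value are assumptions, and plain vectors stand in for the real tenancy and freelist.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kInvalidHandle = 0;
constexpr std::int32_t kNoLease = -1;  // hypothetical sentinel value

struct Allocation {
    std::uint32_t handle;
    std::int32_t seq_id;
};

struct Pool {
    std::vector<std::int32_t> vacant_seq_ids;  // stands in for KV tenancy
    std::vector<std::uint32_t> free_slots;     // stands in for the slot freelist

    Allocation allocate() {
        // Step 1: lease a seq_id.
        if (vacant_seq_ids.empty()) return {kInvalidHandle, kNoLease};
        std::int32_t seq = vacant_seq_ids.back();
        vacant_seq_ids.pop_back();

        // Step 2: take a slot; on failure, give the lease back.
        if (free_slots.empty()) {
            vacant_seq_ids.push_back(seq);
            return {kInvalidHandle, kNoLease};
        }
        std::uint32_t slot = free_slots.back();
        free_slots.pop_back();
        return {slot, seq};
    }

    // Returns both resources together, mirroring allocate().
    void release(const Allocation& a) {
        if (a.handle == kInvalidHandle) return;
        assert(a.seq_id != kNoLease);
        free_slots.push_back(a.handle);
        vacant_seq_ids.push_back(a.seq_id);
    }
};
```

A caller in this model either receives both resources or neither, so a failed allocate() never leaves a seq_id leased without a slot.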
| size_t available () const | inline |
Number of vacant seq_ids available for acquisition.
Definition at line 561 of file branch.hpp.
| const std::vector< BranchHandle > & children (BranchHandle h) const | inline |
Get a branch's child handles.
| h | Branch handle |
Definition at line 605 of file branch.hpp.
| GrammarHandle clone_grammar (GrammarHandle h) | inline |
Clone a grammar (for fork)
| h | Source grammar handle |
Definition at line 777 of file branch.hpp.
| MetricsHandle clone_metrics (MetricsHandle h) | inline |
Clone a metrics tracker (for fork)
| h | Source metrics handle |
Definition at line 824 of file branch.hpp.
| SamplerChainHandle clone_sampler (SamplerChainHandle h) | inline |
Clone a sampler chain (for fork)
| h | Source sampler chain handle |
Definition at line 687 of file branch.hpp.
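| GrammarHandle create_grammar (const llama_model *model, const char *grammar_str, const char *root="root") | inline |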
Create a grammar sampler and register it.
| model | Llama model (for vocab) |
| grammar_str | GBNF grammar string |
| root | Root rule name (default "root") |
Definition at line 738 of file branch.hpp.
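| GrammarHandle create_grammar_lazy (const llama_model *model, const char *grammar_str, const std::vector< std::string > &trigger_patterns, const std::vector< llama_token > &trigger_tokens, const char *root="root") | inline |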
Create a lazy grammar (unconstrained until trigger fires)
| model | Llama model (for vocab) |
| grammar_str | GBNF grammar string |
| trigger_patterns | Regex patterns that activate the grammar |
| trigger_tokens | Token IDs that activate the grammar |
| root | Root rule name (default "root") |
Definition at line 757 of file branch.hpp.
| MetricsHandle create_metrics () | inline |
Create a metrics tracker and register it.
Definition at line 813 of file branch.hpp.
| template<SamplingParamsLike P> SamplerChainHandle create_sampler (const P &params) | inline |
Create a sampler chain and register it.
| params | Sampling parameters (any SamplingParamsLike type) |
Definition at line 672 of file branch.hpp.
| void decode_each (std::span< const DecodeEachItem > items) | inline |
Decode one token per branch in a single GPU dispatch.
Packs N tokens (one per branch) into a single llama_batch and calls decode::each(), amortizing GPU dispatch overhead across all branches. After decode, captures logits from the batch into each branch's logits_snapshot and advances each branch's position by 1.
Logits for items[i] are read via llama_get_logits_ith(ctx, i). This is a 1:1 mapping because decode::each places exactly one token per batch slot.
| items | Span of {handle, token} pairs (all handles must be valid) |
| std::runtime_error | if any handle is invalid, contexts don't match, or decode fails |
Definition at line 921 of file branch.hpp.
| void decode_scatter (std::span< const DecodeScatterItem > items) | inline |
Decode variable token counts per branch with auto-chunking.
Two-pass algorithm:
Pass 1 — Build chunks: Greedily bin-packs items into chunks up to llama_n_batch(ctx) tokens. Oversized items (tokens.size() > n_batch) get their own chunk and are dispatched via decode::many(). Zero-length items are silently skipped.
Pass 2 — Dispatch: Iterates chunks, dispatching normal chunks via decode::scatter() and oversized chunks via decode::many(). Captures logits into per-branch snapshots and advances positions.
| items | Span of {handle, tokens} pairs (all handles must be valid) |
| std::runtime_error | if any handle is invalid, contexts don't match, or decode fails |
Definition at line 993 of file branch.hpp.
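Pass 1 of the algorithm above can be sketched in isolation. This is a simplified model, not the code in branch.hpp: `Chunk` and `build_chunks` are hypothetical names, items are reduced to their token counts, and "greedy" is interpreted here as next-fit (the open chunk is closed as soon as the next item does not fit), which is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Chunk {
    std::vector<std::size_t> item_indices;  // indices into the input list
    std::size_t total_tokens = 0;
    bool oversized = false;  // one item larger than n_batch, handled separately
};

// token_counts[i] is the number of tokens queued for item i.
std::vector<Chunk> build_chunks(const std::vector<std::size_t>& token_counts,
                                std::size_t n_batch) {
    assert(n_batch > 0);
    std::vector<Chunk> chunks;
    Chunk current;

    // Close the open chunk, if it holds anything.
    auto flush = [&]() {
        if (!current.item_indices.empty()) {
            chunks.push_back(current);
            current = Chunk();
        }
    };

    for (std::size_t i = 0; i < token_counts.size(); ++i) {
        const std::size_t n = token_counts[i];
        if (n == 0) continue;  // zero-length items are skipped

        if (n > n_batch) {
            // Oversized item: gets a chunk of its own. Flushing first keeps
            // chunk order equal to item order (a choice made for this sketch).
            flush();
            Chunk big;
            big.item_indices.push_back(i);
            big.total_tokens = n;
            big.oversized = true;
            chunks.push_back(big);
            continue;
        }

        if (current.total_tokens + n > n_batch) flush();
        current.item_indices.push_back(i);
        current.total_tokens += n;
    }
    flush();
    return chunks;
}
```

With token counts {3, 0, 4, 10, 2, 2} and n_batch = 8, this sketch yields three chunks: items 0 and 2 (7 tokens), item 3 alone as oversized, and items 4 and 5 (4 tokens). Pass 2 would then route each chunk to the matching decode primitive based on the `oversized` flag.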
| void drain () | inline |
Explicit teardown — evict all leases while context is alive.
Must be called before llama_free(ctx). Idempotent. Terminal — BranchStore is not reusable after drain(). freelist_ is not repopulated; call init_tenancy() on a fresh store if you need a new cycle. After drain(), allocate() returns {INVALID_HANDLE, NO_LEASE}.
Definition at line 512 of file branch.hpp.
| llama_pos fork_head (BranchHandle h) const | inline |
Get a branch's fork head (parent position at fork time)
| h | Branch handle |
Definition at line 595 of file branch.hpp.
| void free_grammar (GrammarHandle h) | inline |
Free a grammar.
| h | Handle to free (0 is a safe no-op) |
Definition at line 792 of file branch.hpp.
| void free_metrics (MetricsHandle h) | inline |
Free a metrics tracker.
| h | Handle to free (0 is a safe no-op) |
Definition at line 837 of file branch.hpp.
| void free_sampler (SamplerChainHandle h) | inline |
Free a sampler chain.
| h | Handle to free (0 is a safe no-op) |
Definition at line 703 of file branch.hpp.
| BranchState * get (BranchHandle handle) | inline |
Look up branch state by handle.
Validates the handle's index, generation, and in-use flag. Slot 0 always returns nullptr (reserved for INVALID_HANDLE).
| handle | Branch handle to look up |
Definition at line 640 of file branch.hpp.
| const BranchState * get (BranchHandle handle) const | inline |
Look up branch state by handle.
Validates the handle's index, generation, and in-use flag. Slot 0 always returns nullptr (reserved for INVALID_HANDLE).
| handle | Branch handle to look up |
Definition at line 660 of file branch.hpp.
| llama_sampler * get_grammar_sampler (GrammarHandle h) const | inline |
Dereference a grammar handle (non-owning)
| h | Grammar handle |
Definition at line 801 of file branch.hpp.
| float get_model_ppl (MetricsHandle h) const | inline |
Get model-level perplexity from a metrics tracker.
| h | Metrics handle |
Definition at line 874 of file branch.hpp.
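Since surprisal is supplied in nats, perplexity is conventionally the exponential of the mean surprisal. The sketch below shows that relationship; `SurprisalTracker` is a hypothetical stand-in, and both the accumulation strategy and the value returned for an empty tracker are assumptions rather than the behavior of the library's metrics tracker.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

struct SurprisalTracker {
    double sum_nats = 0.0;
    std::size_t count = 0;

    // surprisal_nats = -ln p(token), so it is never negative.
    void add(float surprisal_nats) {
        assert(surprisal_nats >= 0.0f);
        sum_nats += surprisal_nats;
        ++count;
    }

    // Perplexity = exp(mean surprisal). Returning infinity for an empty
    // tracker is a choice made for this sketch only.
    float perplexity() const {
        if (count == 0) return std::numeric_limits<float>::infinity();
        return static_cast<float>(std::exp(sum_nats / static_cast<double>(count)));
    }
};
```

As a sanity check on the formula, a model that assigns probability 1/4 to every observed token has surprisal ln 4 per token and therefore perplexity 4.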
| llama_sampler * get_sampler_chain (SamplerChainHandle h) const | inline |
Dereference a sampler chain handle (non-owning)
| h | Sampler chain handle |
Definition at line 712 of file branch.hpp.
| float get_sampling_ppl (MetricsHandle h) const | inline |
Get sampling-level perplexity from a metrics tracker.
| h | Metrics handle |
Definition at line 888 of file branch.hpp.
| void init_tenancy (llama_context *ctx) | inline |
Initialize KV tenancy after context creation.
| ctx | Llama context (must outlive BranchStore or call drain() first) |
Definition at line 499 of file branch.hpp.
| bool isActive (BranchHandle h) const | inline |
Test whether a branch holds a KV lease.
| h | Branch handle |
Definition at line 626 of file branch.hpp.
| bool isLeaf (BranchHandle h) const | inline |
Test whether a branch is a leaf (no children)
| h | Branch handle |
Definition at line 616 of file branch.hpp.
| KvPressure kv_pressure () const | inline |
KV cache pressure snapshot — O(1), no tree walking.
cells_used tracks unique KV cells per branch. Incremented on decode_each/decode_scatter, decremented on release (position - fork_head), reset on drain/retainOnly/init_tenancy.
Definition at line 570 of file branch.hpp.
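The counter bookkeeping described above can be modeled with a small sketch. The names `CellCounter` and `KvPressureSketch`, and the fields of the snapshot, are hypothetical; the actual members of KvPressure are not shown in this reference, so treat the struct as an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot shape; the real KvPressure fields may differ.
struct KvPressureSketch {
    std::uint32_t n_ctx;
    std::uint32_t cells_used;
    std::uint32_t remaining;
};

class CellCounter {
public:
    explicit CellCounter(std::uint32_t n_ctx) : n_ctx_(n_ctx) {}

    // Called after a decode that wrote n_tokens new KV cells.
    void on_decode(std::uint32_t n_tokens) { cells_used_ += n_tokens; }

    // A released branch owns only the cells written after its fork point.
    void on_release(std::int32_t position, std::int32_t fork_head) {
        assert(position >= fork_head);
        const std::uint32_t own = static_cast<std::uint32_t>(position - fork_head);
        cells_used_ = own > cells_used_ ? 0 : cells_used_ - own;
    }

    // Called on drain / retainOnly / init_tenancy in this model.
    void reset() { cells_used_ = 0; }

    // O(1): reads two integers, no traversal of the branch tree.
    KvPressureSketch snapshot() const {
        const std::uint32_t rem = cells_used_ > n_ctx_ ? 0 : n_ctx_ - cells_used_;
        return KvPressureSketch{n_ctx_, cells_used_, rem};
    }

private:
    std::uint32_t n_ctx_;
    std::uint32_t cells_used_ = 0;
};
```

For example, a root that decodes 10 tokens followed by a child forked at position 10 that decodes 5 more gives 15 cells used; releasing the child subtracts 15 - 10 = 5 and leaves the 10 shared prefix cells counted.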
| BranchHandle parent (BranchHandle h) const | inline |
Get a branch's parent handle.
| h | Branch handle |
Definition at line 585 of file branch.hpp.
| void release (BranchHandle handle) | inline |
Release a branch slot + evict its KV lease.
Removes parent→child edge, evicts the seq_id (stripping KV tags), frees CPU resources, and returns the slot to the freelist.
| handle | Branch handle to release (INVALID_HANDLE is a safe no-op) |
Definition at line 463 of file branch.hpp.
| void retainOnly (BranchHandle winner) | inline |
Keep only the winner — nuclear KV + CPU cleanup.
Calls seq_keep(winner_seq) for a single KV pass, then releases all other slots (CPU only — KV already stripped by seq_keep).
| winner | Handle to the branch to retain (must be valid + leased) |
| std::runtime_error | if winner is invalid or has no lease |
Definition at line 534 of file branch.hpp.
| bool sampler_has_dist (SamplerChainHandle h) const | inline |
Check if a sampler chain ends with dist (stochastic) or greedy.
| h | Sampler chain handle |
Definition at line 723 of file branch.hpp.