liblloyal 1.0.0
Branched Inference for llama.cpp
Classes | |
| class | Branch |
| struct | BranchState |
| Consolidated mutable state for a single branch. More... | |
| class | BranchStore |
| Handle table and batched decode orchestrator for branch management. More... | |
| struct | CachedSamplingParams |
| Concrete sampling params snapshot for memoization. More... | |
| struct | DecodeEachItem |
| Item for decode_each: one token per branch. More... | |
| struct | DecodeScatterItem |
| Item for decode_scatter: variable tokens per branch. More... | |
| struct | GrammarEntry |
| RAII entry for a grammar sampler in the registry. More... | |
| struct | KvPressure |
| Snapshot of KV cache pressure from BranchStore. More... | |
| struct | SamplerChainEntry |
| RAII entry for a sampler chain in the registry. More... | |
Typedefs | |
| using | BranchHandle = uint32_t |
| Opaque handle to a branch slot. | |
| using | SamplerChainHandle = int32_t |
| Handle to a sampler chain in BranchStore's registry (0 = invalid/none) | |
| using | GrammarHandle = int32_t |
| Handle to a grammar sampler in BranchStore's registry (0 = invalid/none) | |
| using | MetricsHandle = int32_t |
| Handle to a metrics tracker in BranchStore's registry (0 = invalid/none) | |
Functions | |
| uint16_t | handle_index (BranchHandle h) |
| Extract slot index from a branch handle. | |
| uint16_t | handle_generation (BranchHandle h) |
| Extract generation counter from a branch handle. | |
| BranchHandle | make_handle (uint16_t index, uint16_t generation) |
| Construct a branch handle from index and generation. | |
| template<SamplingParamsLike P> | |
| CachedSamplingParams | snapshot_params (const P &p) |
| Snapshot sampling params for memoization comparison. | |
| template<SamplingParamsLike P> | |
| BranchHandle | create (llama_context *ctx, const llama_model *model, BranchStore &s, llama_pos start_pos, const P &params, int n_batch=DEFAULT_N_BATCH, const char *grammar_str=nullptr, boundaries::BoundaryTracker *boundary_tracker=nullptr) |
| Create a new branch with sampler chain, optional grammar, and metrics. | |
| BranchHandle | fork (BranchHandle source, BranchStore &s) |
| Fork a branch into a new independent sequence. | |
| void | set_logit_bias (BranchHandle handle, const llama_logit_bias *biases, size_t n_biases, BranchStore &s) |
| void | clear_logit_bias (BranchHandle handle, BranchStore &s) |
| Clear all logit biases from a branch. | |
| void | set_steer (BranchHandle handle, std::function< void(llama_token_data_array &)> steer_fn, BranchStore &s) |
| void | clear_steer (BranchHandle handle, BranchStore &s) |
| Clear the steer callback from a branch. | |
| template<SamplingParamsLike P> | |
| void | set_sampler_params (BranchHandle handle, const P &params, BranchStore &s) |
| Replace a branch's sampler chain with new parameters. | |
| void | set_grammar (BranchHandle handle, const llama_model *model, const char *grammar_str, BranchStore &s) |
| Replace a branch's grammar constraint. | |
| void | set_grammar_lazy (BranchHandle handle, const llama_model *model, const char *grammar_str, const std::vector< std::string > &trigger_patterns, const std::vector< llama_token > &trigger_tokens, BranchStore &s) |
| Set lazy grammar on a branch (unconstrained until trigger fires) | |
| void | prune (BranchHandle handle, BranchStore &s) |
| Prune a leaf branch (RESTRICT — throws if children exist) | |
| void | pruneSubtree (BranchHandle h, BranchStore &s) |
| Prune a branch and all descendants (CASCADE — iterative post-order) | |
| void | force_snapshot_logits (BranchHandle handle, BranchStore &s) |
| Force-copy the shared llama.cpp logits buffer into this branch's private snapshot. | |
| void | prefill (BranchHandle handle, const llama_token *tokens, size_t n_tokens, BranchStore &s) |
| Decode multiple tokens and capture logits atomically (prompt prefill) | |
| void | step (BranchHandle handle, llama_token token, BranchStore &s) |
| Decode a single token and capture logits (generation step) | |
| const float * | get_logits (BranchHandle handle, BranchStore &s) |
| Get the branch's captured logits snapshot. | |
| llama_token | sample (BranchHandle handle, BranchStore &s) |
| Sample a token from the branch's captured logits. | |
| void | accept_token (BranchHandle handle, llama_token token, BranchStore &s) |
| Accept a sampled token, advancing grammar and sampler state. | |
| void | apply_grammar (BranchHandle handle, float *logits, int n_vocab, BranchStore &s) |
| Apply grammar constraints to an external logits buffer. | |
| std::vector< std::pair< llama_token, float > > | get_legal_priors (BranchHandle handle, BranchStore &s) |
| Get grammar-legal tokens with renormalized probabilities. | |
| float | get_legal_logsumexp (BranchHandle handle, BranchStore &s) |
| Compute log-sum-exp over grammar-legal logits. | |
| bool | is_token_legal (BranchHandle handle, llama_token token, BranchStore &s) |
| Check if a token is legal under grammar constraints. | |
| float | get_token_prior_assume_legal (BranchHandle handle, llama_token token, float logsumexp, BranchStore &s) |
| Compute prior probability for a token known to be grammar-legal. | |
| float | get_token_prior (BranchHandle handle, llama_token token, float logsumexp, BranchStore &s) |
| Compute prior probability for a token, checking grammar legality first. | |
| llama_pos | get_position (BranchHandle handle, BranchStore &s) |
| Get the branch's current decode position. | |
| llama_pos | get_fork_head (BranchHandle handle, BranchStore &s) |
| Get the branch's fork head (parent position at fork time) | |
| float | get_perplexity (BranchHandle handle, BranchStore &s) |
| Get model-level perplexity (from raw logits) | |
| float | get_sampling_perplexity (BranchHandle handle, BranchStore &s) |
| Get sampling-level perplexity (from filtered distribution) | |
| float | get_last_sampling_prior (BranchHandle handle, BranchStore &s) |
| Get the last sampled token's prior from the filtered distribution. | |
| int | get_n_vocab (BranchHandle handle, BranchStore &s) |
| Get the branch's vocabulary size. | |
Variables | |
| constexpr BranchHandle | INVALID_HANDLE = 0 |
| Null handle sentinel. | |
| constexpr llama_seq_id | NO_LEASE = kv::NO_LEASE |
| Branch has no KV residency. | |
| constexpr int | DEFAULT_N_BATCH = 512 |
| Default batch size for decode operations. | |
| constexpr uint32_t | GEN_SHIFT = 16 |
| Bit shift for generation field. | |
| constexpr uint32_t | INDEX_MASK = 0xFFFF |
| Mask for slot index field. | |
| using lloyal::branch::BranchHandle = typedef uint32_t |
Opaque handle to a branch slot.
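Encoded as (generation << 16) | index: the upper 16 bits hold the generation counter and the lower 16 bits hold the slot index (see GEN_SHIFT and INDEX_MASK).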
Definition at line 94 of file branch.hpp.
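The packing described above can be illustrated with a minimal sketch. It is a standalone re-implementation of the documented layout using the documented constant values (GEN_SHIFT = 16, INDEX_MASK = 0xFFFF), not the code from branch.hpp; the namespace handle_sketch is hypothetical.

```cpp
#include <cstdint>

// Illustrative sketch of the documented handle layout; not the library source.
namespace handle_sketch {

using BranchHandle = std::uint32_t;

constexpr std::uint32_t GEN_SHIFT  = 16;      // generation sits in the upper 16 bits
constexpr std::uint32_t INDEX_MASK = 0xFFFF;  // slot index sits in the lower 16 bits

// Pack (generation, index) into a single 32-bit handle.
constexpr BranchHandle make_handle(std::uint16_t index, std::uint16_t generation) {
  return (static_cast<std::uint32_t>(generation) << GEN_SHIFT) | index;
}

// Recover the slot index (low 16 bits).
constexpr std::uint16_t handle_index(BranchHandle h) {
  return static_cast<std::uint16_t>(h & INDEX_MASK);
}

// Recover the generation counter (high 16 bits).
constexpr std::uint16_t handle_generation(BranchHandle h) {
  return static_cast<std::uint16_t>(h >> GEN_SHIFT);
}

// Compile-time round-trip check.
static_assert(handle_index(make_handle(7, 3)) == 7, "index round-trips");
static_assert(handle_generation(make_handle(7, 3)) == 3, "generation round-trips");

}  // namespace handle_sketch
```

With this layout, a slot that is reused under a new generation yields a different handle value, so a stale handle to the same slot can be told apart from the current one by comparing generations.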
| using lloyal::branch::GrammarHandle = typedef int32_t |
Handle to a grammar sampler in BranchStore's registry (0 = invalid/none)
Definition at line 154 of file branch.hpp.
| using lloyal::branch::MetricsHandle = typedef int32_t |
Handle to a metrics tracker in BranchStore's registry (0 = invalid/none)
Definition at line 157 of file branch.hpp.
| using lloyal::branch::SamplerChainHandle = typedef int32_t |
Handle to a sampler chain in BranchStore's registry (0 = invalid/none)
Definition at line 151 of file branch.hpp.
|
inline |
Accept a sampled token, advancing grammar and sampler state.
Updates:
Definition at line 1913 of file branch.hpp.
|
inline |
Apply grammar constraints to an external logits buffer.
Sets logits of grammar-illegal tokens to -INFINITY in the provided buffer. Uses the branch's internal candidates_buffer as scratch space when vocab sizes match; allocates a temporary buffer otherwise.
| handle | Branch with grammar to apply |
| logits | Logits buffer to modify in place (n_vocab floats) |
| n_vocab | Number of entries in the logits buffer |
| s | Branch store |
Definition at line 1981 of file branch.hpp.
|
inline |
Clear all logit biases from a branch.
| std::runtime_error | if handle is invalid |
Definition at line 1421 of file branch.hpp.
|
inline |
Clear the steer callback from a branch.
| std::runtime_error | if handle is invalid |
Definition at line 1485 of file branch.hpp.
|
inline |
Create a new branch with sampler chain, optional grammar, and metrics.
Allocates a slot + KV lease from the store, initializes the sampler chain from params, optionally attaches a GBNF grammar and boundary tracker, and pre-allocates logits/candidates buffers sized to the model's vocabulary.
| P | Any type satisfying the SamplingParamsLike concept |
| ctx | Llama context (not owned, must outlive branch) |
| model | Llama model (not owned, used for vocab size and sampler init) |
| s | Branch store to allocate from |
| start_pos | Starting decode position (typically prompt length after prefill) |
| params | Sampling parameters (temperature, top_k, top_p, penalties, etc.) |
| n_batch | Batch size for decode operations (default 512) |
| grammar_str | GBNF grammar string, or nullptr for unconstrained generation |
| boundary_tracker | Boundary detector (ownership transferred), or nullptr |
Definition at line 1225 of file branch.hpp.
|
inline |
Force-copy the shared llama.cpp logits buffer into this branch's private snapshot.
Copies the logits from the llama_context into the branch's internal buffer without performing a decode.
| std::runtime_error | if handle is invalid, vocab size is zero, or no logits are available (no prior decode with logits enabled) |
Definition at line 1675 of file branch.hpp.
|
inline |
Fork a branch into a new independent sequence.
Allocates a slot + KV lease, deep copies source state under the new seq_id. Records parent→child topology edge.
Cloned state:
NOT cloned:
| source | Handle of the branch to fork from |
| s | Branch store |
Definition at line 1300 of file branch.hpp.
|
inline |
Get the branch's fork head (parent position at fork time)
Definition at line 2291 of file branch.hpp.
|
inline |
Get the last sampled token's prior from the filtered distribution.
Returns P(token) from the post-filter sampling distribution. This is the correct prior for UCT-family algorithms since it matches what was actually sampled.
Definition at line 2346 of file branch.hpp.
|
inline |
Compute log-sum-exp over grammar-legal logits.
Returns log(sum(exp(logit_i))) over tokens that pass grammar constraints. Use for efficient per-token prior computation: P(token) = exp(logit[token] - logsumexp)
Numerically stable (max-subtraction trick).
Definition at line 2116 of file branch.hpp.
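A minimal sketch of the max-subtraction trick and the prior formula quoted above. It assumes the legal set is given as a boolean mask, which is a simplification for illustration; the names legal_logsumexp and token_prior and the namespace lse_sketch are hypothetical and do not reflect how the library queries the grammar.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Illustrative sketch; the boolean mask stands in for a real grammar check.
namespace lse_sketch {

// log(sum(exp(logit_i))) over entries whose `legal` flag is set.
// Returns -infinity when no token is legal.
inline float legal_logsumexp(const std::vector<float>& logits,
                             const std::vector<bool>& legal) {
  assert(logits.size() == legal.size());

  // Pass 1: find the largest legal logit.
  float max_logit = -std::numeric_limits<float>::infinity();
  for (std::size_t i = 0; i < logits.size(); ++i) {
    if (legal[i]) max_logit = std::max(max_logit, logits[i]);
  }
  if (std::isinf(max_logit)) return max_logit;  // empty legal set

  // Pass 2: subtracting the max keeps every exponent <= 0, so exp() cannot overflow.
  float sum = 0.0f;
  for (std::size_t i = 0; i < logits.size(); ++i) {
    if (legal[i]) sum += std::exp(logits[i] - max_logit);
  }
  return max_logit + std::log(sum);
}

// P(token) = exp(logit[token] - logsumexp); O(1) once logsumexp is known.
inline float token_prior(const std::vector<float>& logits, std::size_t token,
                         float logsumexp) {
  assert(token < logits.size());
  return std::exp(logits[token] - logsumexp);
}

}  // namespace lse_sketch
```

Computing the log-sum-exp once per step and reusing it for each candidate token is what makes the per-token prior a constant-time lookup.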
|
inline |
Get grammar-legal tokens with renormalized probabilities.
Returns (token, probability) pairs for tokens that pass grammar constraints. Probabilities are softmax-normalized over the legal set only (sum to 1.0).
Essential for policy priors in tree search: priors must only cover legal moves.
Definition at line 2037 of file branch.hpp.
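The renormalization over the legal set can be sketched as follows. This is an illustration only: it assumes the caller already has a list of grammar-legal token IDs, whereas the library derives that set from the branch's grammar. The names legal_priors, Token, and priors_sketch are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative sketch; legal_tokens stands in for the grammar's legal set.
namespace priors_sketch {

using Token = std::int32_t;

// Softmax restricted to legal_tokens: returned probabilities sum to 1.0.
inline std::vector<std::pair<Token, float>> legal_priors(
    const std::vector<float>& logits, const std::vector<Token>& legal_tokens) {
  std::vector<std::pair<Token, float>> out;
  if (legal_tokens.empty()) return out;

  auto at = [&](Token t) {
    assert(t >= 0 && static_cast<std::size_t>(t) < logits.size());
    return logits[static_cast<std::size_t>(t)];
  };

  // Max over the legal set only, for numerical stability.
  float max_logit = at(legal_tokens.front());
  for (Token t : legal_tokens) max_logit = std::max(max_logit, at(t));

  // Unnormalized weights, then divide by their sum.
  float sum = 0.0f;
  out.reserve(legal_tokens.size());
  for (Token t : legal_tokens) {
    float w = std::exp(at(t) - max_logit);
    out.emplace_back(t, w);
    sum += w;
  }
  for (auto& p : out) p.second /= sum;
  return out;
}

}  // namespace priors_sketch
```

Because illegal tokens never enter the sum, a high-logit illegal token (index 3 in the test data below, for instance) takes no probability mass away from the legal moves.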
|
inline |
Get the branch's captured logits snapshot.
Returns a pointer to the internal logits buffer (n_vocab floats). Only valid after force_snapshot_logits(), prefill(), or step().
Definition at line 1795 of file branch.hpp.
|
inline |
Get the branch's vocabulary size.
Definition at line 2383 of file branch.hpp.
|
inline |
Get model-level perplexity (from raw logits)
Returns perplexity computed from the full logit distribution before any sampler filtering. For the distribution actually sampled from, use get_sampling_perplexity().
Definition at line 2307 of file branch.hpp.
|
inline |
Get the branch's current decode position.
Definition at line 2279 of file branch.hpp.
|
inline |
Get sampling-level perplexity (from filtered distribution)
Returns perplexity from the distribution actually sampled from (after top-k/p/temp/penalties). Useful for policy priors and monitoring sampler chain impact.
Definition at line 2327 of file branch.hpp.
|
inline |
Compute prior probability for a token, checking grammar legality first.
O(grammar_complexity) — uses is_token_legal() before computing the prior. Safe for ad-hoc callers who don't know whether the token is grammar-legal.
| handle | Branch with captured logits and optional grammar |
| token | Token ID to compute prior for |
| logsumexp | Pre-computed value from get_legal_logsumexp() |
| s | Branch store |
Definition at line 2259 of file branch.hpp.
|
inline |
Compute prior probability for a token known to be grammar-legal.
O(1) operation — use in search inner loops where sample() already enforced grammar. Does NOT validate grammar legality; caller must ensure token is legal.
| handle | Branch with captured logits |
| token | Token ID (must be legal under grammar) |
| logsumexp | Pre-computed value from get_legal_logsumexp() |
| s | Branch store |
Definition at line 2228 of file branch.hpp.
|
inline |
Extract generation counter from a branch handle.
| h | Branch handle |
Definition at line 134 of file branch.hpp.
|
inline |
Extract slot index from a branch handle.
| h | Branch handle |
Definition at line 125 of file branch.hpp.
|
inline |
Check if a token is legal under grammar constraints.
Uses a 1-element candidate array for O(grammar_complexity) check instead of the O(n_vocab) full scan used by get_legal_priors().
Definition at line 2177 of file branch.hpp.
|
inline |
Construct a branch handle from index and generation.
| index | Slot index (0–65535) |
| generation | Generation counter |
Definition at line 144 of file branch.hpp.
|
inline |
Decode multiple tokens and capture logits atomically (prompt prefill)
Feeds tokens through the model in n_batch-sized chunks, advances the branch position, and snapshots logits. After this call, sample() and get_logits() are available.
| handle | Branch to decode into |
| tokens | Array of token IDs |
| n_tokens | Number of tokens in the array |
| s | Branch store |
| std::runtime_error | if handle is invalid, decode fails, or logits capture fails |
Definition at line 1711 of file branch.hpp.
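The chunking behaviour described above can be sketched in isolation. The decode step is abstracted as a callback, so this shows only how the token array is split into n_batch-sized pieces and how the position advances; the logits capture and error handling are omitted. The names prefill_chunked and prefill_sketch, and the callback signature, are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>

// Illustrative sketch; decode_chunk stands in for the real batched decode call.
namespace prefill_sketch {

using Token = int;

// Feeds `tokens` to decode_chunk in pieces of at most n_batch tokens.
// Returns the position after the last token.
inline int prefill_chunked(
    const Token* tokens, std::size_t n_tokens, int start_pos, int n_batch,
    const std::function<void(const Token*, std::size_t, int)>& decode_chunk) {
  assert(n_batch > 0);
  const std::size_t batch = static_cast<std::size_t>(n_batch);

  int pos = start_pos;
  for (std::size_t off = 0; off < n_tokens; off += batch) {
    // The final chunk may be shorter than n_batch.
    const std::size_t n = std::min(batch, n_tokens - off);
    decode_chunk(tokens + off, n, pos);
    pos += static_cast<int>(n);
  }
  return pos;
}

}  // namespace prefill_sketch
```

With the default n_batch of 512, a 1200-token prompt would be fed as chunks of 512, 512, and 176 tokens.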
|
inline |
Prune a leaf branch (RESTRICT — throws if children exist)
Evicts the KV lease and frees all resources via BranchStore::release(). If the branch has children, throws — use pruneSubtree() for CASCADE.
| std::runtime_error | if branch has children |
Definition at line 1627 of file branch.hpp.
|
inline |
Prune a branch and all descendants (CASCADE — iterative post-order)
Traverses the subtree rooted at h, collecting all descendants, then prunes leaves-first so RESTRICT on prune() always passes.
| h | Root of subtree to prune |
| s | Branch store |
Definition at line 1644 of file branch.hpp.
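The RESTRICT/CASCADE relationship can be sketched on a toy tree. This is an illustration under stated assumptions, not the library's implementation: the Tree type, its maps, and the pruned log are hypothetical stand-ins for BranchStore's topology, and no KV leases are involved. The traversal collects nodes with an explicit stack (parents before descendants) and prunes in reverse, which gives a leaves-first order.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <unordered_map>
#include <vector>

// Illustrative sketch; Tree is a hypothetical stand-in for the store's topology.
namespace prune_sketch {

using Handle = std::uint32_t;

struct Tree {
  std::unordered_map<Handle, std::vector<Handle>> children;
  std::unordered_map<Handle, Handle> parent;
  std::vector<Handle> pruned;  // prune order, kept for inspection

  void add_root(Handle h) { children.emplace(h, std::vector<Handle>{}); }

  void add_child(Handle p, Handle c) {
    children[p].push_back(c);
    children.emplace(c, std::vector<Handle>{});
    parent[c] = p;
  }

  // RESTRICT: refuses to remove a node that still has children.
  void prune(Handle h) {
    auto it = children.find(h);
    if (it == children.end()) throw std::runtime_error("unknown handle");
    if (!it->second.empty()) throw std::runtime_error("branch has children");
    children.erase(it);

    // Detach from the parent's child list, if there is a parent.
    auto pit = parent.find(h);
    if (pit != parent.end()) {
      auto cit = children.find(pit->second);
      if (cit != children.end()) {
        auto& sib = cit->second;
        sib.erase(std::remove(sib.begin(), sib.end(), h), sib.end());
      }
      parent.erase(pit);
    }
    pruned.push_back(h);
  }

  // CASCADE: iterative traversal, then prune descendants before ancestors.
  void prune_subtree(Handle root) {
    std::vector<Handle> order;
    std::vector<Handle> stack{root};
    while (!stack.empty()) {
      Handle h = stack.back();
      stack.pop_back();
      order.push_back(h);  // a parent is always recorded before its descendants
      auto it = children.find(h);
      if (it != children.end()) {
        for (Handle c : it->second) stack.push_back(c);
      }
    }
    // Reversed order visits every descendant before its ancestor,
    // so the RESTRICT check in prune() never fires.
    for (auto it = order.rbegin(); it != order.rend(); ++it) prune(*it);
  }
};

}  // namespace prune_sketch
```

Using an explicit stack instead of recursion keeps the traversal depth independent of the call stack, which matters for long chains of forks.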
|
inline |
Sample a token from the branch's captured logits.
Applies modifiers in order: Grammar → Logit Bias → Steer → Sampler Chain, then selects a token. Also records filtered candidates for metrics.
Requires prior force_snapshot_logits(), prefill(), or step().
Definition at line 1821 of file branch.hpp.
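The modifier ordering can be sketched with a simplified pipeline. Several things here are assumptions made for illustration: the grammar is modelled as a predicate that masks illegal tokens to negative infinity (as apply_grammar() is documented to do), the sampler chain is replaced by a plain greedy argmax, and the Candidate, Modifiers, and sample_greedy names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <utility>
#include <vector>

// Illustrative sketch; greedy argmax stands in for the real sampler chain.
namespace sample_sketch {

struct Candidate {
  int id;
  float logit;
};

struct Modifiers {
  std::function<bool(int)> grammar_legal;               // empty = no grammar
  std::vector<std::pair<int, float>> logit_bias;        // (token id, additive bias)
  std::function<void(std::vector<Candidate>&)> steer;   // empty = no steer callback
};

inline int sample_greedy(const std::vector<float>& logits, const Modifiers& m) {
  assert(!logits.empty());
  std::vector<Candidate> c;
  c.reserve(logits.size());
  for (std::size_t i = 0; i < logits.size(); ++i) {
    c.push_back({static_cast<int>(i), logits[i]});
  }

  // 1. Grammar: mask illegal tokens.
  if (m.grammar_legal) {
    for (auto& x : c) {
      if (!m.grammar_legal(x.id)) x.logit = -std::numeric_limits<float>::infinity();
    }
  }
  // 2. Logit bias: additive, so a finite bias cannot revive a masked token.
  for (const auto& b : m.logit_bias) {
    if (b.first >= 0 && static_cast<std::size_t>(b.first) < c.size()) {
      c[static_cast<std::size_t>(b.first)].logit += b.second;
    }
  }
  // 3. Steer: arbitrary user callback over the candidate array.
  if (m.steer) m.steer(c);

  // 4. "Sampler chain": here simply the highest remaining logit.
  auto best = std::max_element(
      c.begin(), c.end(),
      [](const Candidate& a, const Candidate& b) { return a.logit < b.logit; });
  return best->id;
}

}  // namespace sample_sketch
```

In this sketch, running the grammar first means that later stages operate on an already-constrained candidate set; whether the library relies on the same property is not stated in the source.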
|
inline |
Replace a branch's grammar constraint.
Frees the old grammar (if any) and attaches a new one. Pass nullptr or empty string to remove grammar constraints entirely.
| handle | Branch to modify |
| model | Llama model (for vocab) |
| grammar_str | GBNF grammar string, or nullptr to remove |
| s | Branch store |
| std::runtime_error | if handle is invalid |
Definition at line 1552 of file branch.hpp.
|
inline |
Set lazy grammar on a branch (unconstrained until trigger fires)
Replaces any existing grammar. The lazy grammar accepts all tokens until a trigger pattern or token fires, then constrains subsequent generation.
| handle | Branch handle |
| model | Llama model (for vocab) |
| grammar_str | GBNF grammar string |
| trigger_patterns | Regex patterns that activate the grammar |
| trigger_tokens | Token IDs that activate the grammar |
| s | Branch store |
| std::runtime_error | if handle is invalid |
Definition at line 1591 of file branch.hpp.
|
inline |
Definition at line 1395 of file branch.hpp.
|
inline |
Replace a branch's sampler chain with new parameters.
Memoized: if the new params match the cached snapshot, this is a no-op. Otherwise frees the old chain and creates a new one.
Primary use case: Entropy-based Dynamic Temperature (EDT), where temperature changes per-token based on model uncertainty.
| P | Any type satisfying the SamplingParamsLike concept |
| std::runtime_error | if handle is invalid |
Definition at line 1516 of file branch.hpp.
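The memoization described above can be sketched as a compare-then-rebuild step. This is a simplified illustration: ParamsSnapshot holds only a hypothetical subset of fields (the real CachedSamplingParams is not shown in this page), Chain is a stand-in for a sampler chain, and the rebuilds counter exists only to make the no-op path observable.

```cpp
#include <cstdint>
#include <memory>

// Illustrative sketch; field set and types are assumptions, not the library's.
namespace memo_sketch {

struct ParamsSnapshot {
  float temperature = 0.8f;
  int top_k = 40;
  float top_p = 0.95f;
  std::uint32_t seed = 0;  // fixed default, so equal inputs give equal snapshots

  // Exact comparison is intentional: any change in value must trigger a rebuild.
  bool operator==(const ParamsSnapshot& o) const {
    return temperature == o.temperature && top_k == o.top_k &&
           top_p == o.top_p && seed == o.seed;
  }
};

// Stand-in for a sampler chain built from a parameter set.
struct Chain {
  ParamsSnapshot built_from;
};

struct BranchSampler {
  std::unique_ptr<Chain> chain;
  ParamsSnapshot cached;
  int rebuilds = 0;  // instrumentation for the sketch only

  void set_params(const ParamsSnapshot& p) {
    if (chain && cached == p) return;          // memo hit: keep the existing chain
    chain = std::make_unique<Chain>(Chain{p}); // frees the old chain, builds a new one
    cached = p;
    ++rebuilds;
  }
};

}  // namespace memo_sketch
```

For a per-token caller such as EDT, this means repeated calls with an unchanged temperature cost one comparison rather than a chain teardown and rebuild.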
|
inline |
Definition at line 1462 of file branch.hpp.
|
inline |
Snapshot sampling params for memoization comparison.
Extracts resolved values from any SamplingParamsLike type using the same defaults as sampler::create_chain(), except seed defaults to 0 (not time) to avoid false cache misses from non-deterministic defaults.
Definition at line 237 of file branch.hpp.
|
inline |
Decode a single token and capture logits (generation step)
Uses decode::one() which maintains a thread_local batch — heap-allocated once per thread, reused across calls. No per-call allocation.
| std::runtime_error | if handle is invalid, decode fails, or logits capture fails |
Definition at line 1756 of file branch.hpp.
|
constexpr |
Default batch size for decode operations.
Definition at line 98 of file branch.hpp.
|
constexpr |
Bit shift for generation field.
Definition at line 99 of file branch.hpp.
|
constexpr |
Mask for slot index field.
Definition at line 100 of file branch.hpp.
|
constexpr |
Null handle sentinel.
Definition at line 96 of file branch.hpp.
|
constexpr |
Branch has no KV residency.
Definition at line 97 of file branch.hpp.