liblloyal 1.0.0
Branched Inference for llama.cpp
Classes
    struct EachItem
        Input item for decode::each — one token for one sequence.
    struct PackedChunk
        A chunk of item indices produced by bin_pack().
    struct ScatterItem
        Input item for decode::scatter — multiple tokens for one sequence.
    struct Scratch
        Reusable scratch buffers for multi-sequence batch construction.
Functions
    int many(llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id = 0)
        Decode multiple tokens into the KV cache with auto-chunking.
    int many(llama_context *ctx, const std::vector<llama_token> &tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id = 0)
        This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
    int one(llama_context *ctx, llama_token tok, llama_pos pos, llama_seq_id seq_id = 0, bool want_logits = true)
        Decode a single token into the KV cache.
    int each(llama_context *ctx, const EachItem *items, int32_t n, Scratch &scratch)
        Decode one token per sequence in a single llama_decode() call.
    int each(llama_context *ctx, const std::vector<EachItem> &items, Scratch &scratch)
        This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
    int scatter(llama_context *ctx, const ScatterItem *items, int32_t n, Scratch &scratch)
        Decode multiple tokens per sequence in a single llama_decode() call.
    int scatter(llama_context *ctx, const std::vector<ScatterItem> &items, Scratch &scratch)
        This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
    std::vector<PackedChunk> bin_pack(const std::span<const llama_token> *items, int32_t n, int32_t n_batch)
        Greedy first-fit bin-packing of token spans into n_batch-sized chunks.
bin_pack()
inline std::vector<PackedChunk> bin_pack(const std::span<const llama_token> *items, int32_t n, int32_t n_batch)
Greedy first-fit bin-packing of token spans into n_batch-sized chunks.
Pure packing algorithm — no decoding, no logit capture, no context. Callers use the returned chunks to drive their own dispatch logic (decode::scatter for normal chunks, decode::many for oversized).
Empty spans (size 0) are skipped. Items exceeding n_batch get a solo oversized chunk.
Parameters
    items    Array of token spans (only .size() is inspected)
    n        Number of items
    n_batch  Maximum total tokens per normal chunk
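The packing rules above can be sketched self-contained. PackedChunk's exact fields are not shown on this page, so the Chunk struct below (item indices, token total, oversized flag) is an illustrative stand-in operating on span sizes only:

```cpp
#include <cstdint>
#include <vector>

// Illustrative chunk: field names are assumed, not PackedChunk's real layout.
struct Chunk {
    std::vector<int32_t> item_indices;
    int32_t n_tokens = 0;   // total tokens packed into this chunk
    bool oversized = false; // true when a single item exceeds n_batch
};

// Greedy first-fit over item sizes, mirroring the documented bin_pack() rules:
// empty items are skipped; items larger than n_batch get a solo oversized chunk.
std::vector<Chunk> bin_pack_sizes(const std::vector<int32_t>& sizes, int32_t n_batch) {
    std::vector<Chunk> chunks;
    for (int32_t i = 0; i < (int32_t)sizes.size(); ++i) {
        const int32_t sz = sizes[i];
        if (sz == 0) continue;              // skip empty spans
        if (sz > n_batch) {                 // oversized: solo chunk
            chunks.push_back({{i}, sz, true});
            continue;
        }
        bool placed = false;                // first-fit: first chunk with room
        for (auto& c : chunks) {
            if (!c.oversized && c.n_tokens + sz <= n_batch) {
                c.item_indices.push_back(i);
                c.n_tokens += sz;
                placed = true;
                break;
            }
        }
        if (!placed) chunks.push_back({{i}, sz, false});
    }
    return chunks;
}
```

A caller would then dispatch normal chunks to decode::scatter and oversized ones to decode::many, as described above.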
Definition at line 480 of file decode.hpp.
each()
inline int each(llama_context *ctx, const EachItem *items, int32_t n, Scratch &scratch)
Decode one token per sequence in a single llama_decode() call.
"each" = each sequence gets one token. Packs N tokens (each targeting a different seq_id) into one llama_batch. Amortizes GPU dispatch overhead across N sequences.
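The packing idea can be pictured with plain parallel arrays. Item and FlatBatch below are illustrative stand-ins, not the library's EachItem or llama.cpp's llama_batch:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for EachItem (field names assumed).
struct Item { int32_t token; int32_t pos; int32_t seq_id; bool output_logits; };

// Parallel-array batch layout: slot i carries items[i]'s token, position,
// and target sequence, so one decode call advances every sequence by one token.
struct FlatBatch {
    std::vector<int32_t> token, pos, seq_id;
    std::vector<int8_t>  logits;
};

FlatBatch pack_each(const std::vector<Item>& items) {
    FlatBatch b;
    for (const Item& it : items) {
        b.token.push_back(it.token);
        b.pos.push_back(it.pos);
        b.seq_id.push_back(it.seq_id);
        b.logits.push_back(it.output_logits ? 1 : 0);
    }
    return b;
}
```

The real function fills a llama_batch via the library's scratch buffers; this sketch only shows the slot-per-sequence shape.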
Parameters
    ctx      Llama context (must not be null)
    items    Array of (token, pos, seq_id, output_logits) tuples
    n        Number of items
    scratch  Reusable scratch buffers
Exceptions
    std::runtime_error  if ctx is NULL
Definition at line 339 of file decode.hpp.
each()
inline int each(llama_context *ctx, const std::vector<EachItem> &items, Scratch &scratch)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Definition at line 370 of file decode.hpp.
many()
inline int many(llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id = 0)
Decode multiple tokens into the KV cache with auto-chunking.
Orchestration logic: the token array is split into n_batch-sized chunks, and each chunk is decoded in order, with KV cache positions continuing from n_past.
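The auto-chunking schedule can be sketched self-contained, assuming each chunk's first KV position is n_past plus its offset into the token array (ChunkPlan and plan_chunks are illustrative names, not the library's internals):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One chunk: offset into the token array, chunk length, and the KV-cache
// position of the chunk's first token. Names are illustrative.
struct ChunkPlan { int32_t offset, count, pos; };

// Cut n_tokens into n_batch-sized pieces decoded in order, each chunk
// continuing at position n_past + offset.
std::vector<ChunkPlan> plan_chunks(int32_t n_tokens, int32_t n_past, int32_t n_batch) {
    std::vector<ChunkPlan> plan;
    for (int32_t off = 0; off < n_tokens; off += n_batch) {
        const int32_t count = std::min(n_batch, n_tokens - off);
        plan.push_back({off, count, n_past + off});
    }
    return plan;
}
```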
The seq_id parameter specifies which KV cache sequence to update. Default is 0 (single-sequence mode, backward compatible).
Use different seq_ids to keep parallel branches independent in the KV cache.
There are TWO different n_seq_max parameters - don't confuse them:
    llama_batch_init(n_tokens, embd, n_seq_max)
    llama_context_params.n_seq_max
Example: 4 parallel steppers, each decoding its own branch.
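The steppers example itself is not reproduced on this page. The following self-contained mock only illustrates the property it relies on: distinct seq_ids keep each branch's cache entries separate (MockKV and mock_many are hypothetical stand-ins for the KV cache and decode::many):

```cpp
#include <map>
#include <vector>

// Mock KV cache keyed by sequence id; real code passes seq_id to decode::many
// and llama.cpp routes the tokens into that sequence's cache.
using MockKV = std::map<int, std::vector<int>>;

// Each "stepper" decodes its own branch under its own seq_id, so branches
// never overwrite one another's cache entries.
void mock_many(MockKV& kv, const std::vector<int>& tokens, int seq_id) {
    auto& cache = kv[seq_id];
    cache.insert(cache.end(), tokens.begin(), tokens.end());
}
```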
Parameters
    ctx       Llama context (must be initialized)
    tokens    Token array to decode
    n_tokens  Number of tokens in array
    n_past    Position to start decoding from (KV cache position)
    n_batch   Batch size for chunking
    seq_id    Sequence ID to update in KV cache (default: 0)
Exceptions
    std::runtime_error  if ctx is NULL or tokens are invalid (validation errors)
CRITICAL: Call kv::remove_range() BEFORE this function, never after.
Definition at line 124 of file decode.hpp.
many()
inline int many(llama_context *ctx, const std::vector<llama_token> &tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id = 0)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Definition at line 206 of file decode.hpp.
one()
inline int one(llama_context *ctx, llama_token tok, llama_pos pos, llama_seq_id seq_id = 0, bool want_logits = true)
Decode a single token into the KV cache.
Fast path for autoregressive generation. Uses a thread_local batch (one-time init per thread) so repeated calls avoid allocation entirely.
Typical usage in a generation loop:
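The original snippet is not shown on this page; the self-contained mock below sketches only the loop shape, with decode_one and sample_next as hypothetical stand-ins for decode::one and a sampler:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for llama_context: the KV cache is just a token list.
struct MockContext { std::vector<int> kv; };

// Mock of decode::one: appends the token to the KV cache at `pos`.
// Returns 0 on success, mirroring llama_decode()'s convention.
int decode_one(MockContext& ctx, int tok, int pos) {
    if (pos != (int)ctx.kv.size()) throw std::runtime_error("position mismatch");
    ctx.kv.push_back(tok);
    return 0;
}

// Mock sampler: emits tokens 1, 2, 3, then an end-of-sequence token (0).
int sample_next(int step) { return step < 3 ? step + 1 : 0; }

// Loop shape: decode the current token at the next KV position, sample the
// next one, stop on end-of-sequence or after max_new tokens.
int generate(MockContext& ctx, int first_tok, int max_new) {
    int tok = first_tok;
    int n_generated = 0;
    for (int i = 0; i < max_new; ++i) {
        if (decode_one(ctx, tok, (int)ctx.kv.size()) != 0) break;
        tok = sample_next(i);
        ++n_generated;
        if (tok == 0) break; // end-of-sequence
    }
    return n_generated;
}
```

In real code the thread_local batch inside decode::one makes each iteration allocation-free, as noted above.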
Parameters
    ctx          Llama context (must not be null)
    tok          Token to decode
    pos          KV cache position for this token
    seq_id       Sequence ID to update (default: 0)
    want_logits  Whether to compute logits after this token (default: true). Set to false when prefilling tokens that don't need sampling.
Exceptions
    std::runtime_error  if ctx is NULL
Definition at line 238 of file decode.hpp.
scatter()
inline int scatter(llama_context *ctx, const ScatterItem *items, int32_t n, Scratch &scratch)
Decode multiple tokens per sequence in a single llama_decode() call.
Single-batch primitive: packs token runs from multiple sequences into one llama_batch. Does NOT auto-chunk — total tokens must fit in n_batch.
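The flattening can be pictured with plain arrays. Run and ScatterBatch below are illustrative stand-ins for ScatterItem and llama.cpp's llama_batch, not the library's real types:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for ScatterItem (field names assumed).
struct Run { std::vector<int32_t> tokens; int32_t start_pos; int32_t seq_id; };

struct ScatterBatch {
    std::vector<int32_t> token, pos, seq_id;
};

// Flattens each sequence's token run into one batch: token j of a run sits
// at position start_pos + j and keeps its run's seq_id. No chunking here:
// the caller must ensure the flattened total fits in n_batch.
ScatterBatch pack_scatter(const std::vector<Run>& runs) {
    ScatterBatch b;
    for (const Run& r : runs)
        for (int32_t j = 0; j < (int32_t)r.tokens.size(); ++j) {
            b.token.push_back(r.tokens[j]);
            b.pos.push_back(r.start_pos + j);
            b.seq_id.push_back(r.seq_id);
        }
    return b;
}
```

For inputs that may exceed n_batch, bin_pack() above produces chunks sized for this primitive.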
Parameters
    ctx      Llama context (must not be null)
    items    Array of (tokens_span, start_pos, seq_id) tuples
    n        Number of items
    scratch  Reusable scratch buffers
Exceptions
    std::runtime_error  if ctx is NULL or items are invalid
Definition at line 395 of file decode.hpp.
scatter()
inline int scatter(llama_context *ctx, const std::vector<ScatterItem> &items, Scratch &scratch)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
Definition at line 443 of file decode.hpp.