liblloyal 1.0.0
Branched Inference for llama.cpp
lloyal::decode Namespace Reference

Classes

struct  EachItem
 Input item for decode::each — one token for one sequence. More...
 
struct  PackedChunk
 A chunk of item indices produced by bin_pack() More...
 
struct  ScatterItem
 Input item for decode::scatter — multiple tokens for one sequence. More...
 
struct  Scratch
 Reusable scratch buffers for multi-sequence batch construction. More...
 

Functions

int many (llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 Decode multiple tokens into the KV cache with auto-chunking.
 
int many (llama_context *ctx, const std::vector< llama_token > &tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
int one (llama_context *ctx, llama_token tok, llama_pos pos, llama_seq_id seq_id=0, bool want_logits=true)
 Decode a single token into the KV cache.
 
int each (llama_context *ctx, const EachItem *items, int32_t n, Scratch &scratch)
 Decode one token per sequence in a single llama_decode() call.
 
int each (llama_context *ctx, const std::vector< EachItem > &items, Scratch &scratch)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
int scatter (llama_context *ctx, const ScatterItem *items, int32_t n, Scratch &scratch)
 Decode multiple tokens per sequence in a single llama_decode() call.
 
int scatter (llama_context *ctx, const std::vector< ScatterItem > &items, Scratch &scratch)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
std::vector< PackedChunk > bin_pack (const std::span< const llama_token > *items, int32_t n, int32_t n_batch)
 Greedy first-fit bin-packing of token spans into n_batch-sized chunks.
 

Function Documentation

◆ bin_pack()

std::vector< PackedChunk > lloyal::decode::bin_pack ( const std::span< const llama_token > *  items,
int32_t  n,
int32_t  n_batch 
)
inline

Greedy first-fit bin-packing of token spans into n_batch-sized chunks.

Pure packing algorithm — no decoding, no logit capture, no context. Callers use the returned chunks to drive their own dispatch logic (decode::scatter for normal chunks, decode::many for oversized).

Empty spans (size 0) are skipped. Items exceeding n_batch get a solo oversized chunk.

Parameters
items    Array of token spans (only .size() is inspected)
n        Number of items
n_batch  Maximum total tokens per normal chunk
Returns
Vector of PackedChunks with indices into the input array
Examples
include/lloyal/branch.hpp, and include/lloyal/logits.hpp.

Definition at line 480 of file decode.hpp.

◆ each() [1/2]

int lloyal::decode::each ( llama_context *  ctx,
const EachItem *  items,
int32_t  n,
Scratch &  scratch
)
inline

Decode one token per sequence in a single llama_decode() call.

"each" = each sequence gets one token. Packs N tokens (each targeting a different seq_id) into one llama_batch. Amortizes GPU dispatch overhead across N sequences.
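A sketch of one generation step across N parallel branches. The EachItem field names below are assumed from the "(token, pos, seq_id, output_logits)" parameter description and should be checked against decode.hpp:

```cpp
// Advance each of n_seq branches by its own sampled token in one
// llama_decode() call. Field names are assumed, not verified.
std::vector<lloyal::decode::EachItem> items;
items.reserve(n_seq);
for (llama_seq_id s = 0; s < n_seq; ++s) {
    items.push_back({
        .token         = next_token[s], // token sampled for branch s
        .pos           = n_past[s],     // branch s's current KV position
        .seq_id        = s,
        .output_logits = true,          // keep logits so each branch can sample again
    });
    n_past[s] += 1;
}
lloyal::decode::Scratch scratch;        // reuse across steps to avoid reallocation
if (lloyal::decode::each(ctx, items, scratch) != 0) {
    // decode failed for this step
}
```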

Parameters
ctx      Llama context (must not be null)
items    Array of (token, pos, seq_id, output_logits) tuples
n        Number of items
scratch  Reusable scratch buffers
Returns
0 on success, non-zero on failure
Exceptions
std::runtime_error  if ctx is NULL
See also
one() for single-sequence single-token decode
scatter() for multi-token-per-sequence variant
Examples
include/lloyal/branch.hpp.

Definition at line 339 of file decode.hpp.

◆ each() [2/2]

int lloyal::decode::each ( llama_context *  ctx,
const std::vector< EachItem > &  items,
Scratch &  scratch
)
inline

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Definition at line 370 of file decode.hpp.

◆ many() [1/2]

int lloyal::decode::many ( llama_context *  ctx,
const llama_token *  tokens,
int32_t  n_tokens,
int32_t  n_past,
int32_t  n_batch,
llama_seq_id  seq_id = 0 
)
inline

Decode multiple tokens into the KV cache with auto-chunking.

Orchestration logic:

  1. Uses a thread_local batch (heap-allocated once per thread, grows on demand)
  2. Chunks tokens into n_batch-sized pieces
  3. For each chunk: clear batch, add tokens, call llama_decode

Sequence ID Parameter

The seq_id parameter specifies which KV cache sequence to update. Default is 0 (single-sequence mode, backward compatible).

Use different seq_ids for:

  • Parallel generations (multiple steppers, each with own seq_id)
  • Branching/tree search (System 2)
  • Shared prefix optimization (decode prefix to seq_id=0, copy to others)

IMPORTANT: n_seq_max Clarification

There are TWO different n_seq_max parameters - don't confuse them:

  1. llama_batch_init(n_tokens, embd, n_seq_max)
    • Controls how many sequences A SINGLE TOKEN can belong to
    • Keep at 1 for normal decode (one token → one sequence)
    • Only increase for beam search where one token updates multiple branches
  2. llama_context_params.n_seq_max
    • Controls max TOTAL sequences (distinct KV cache states)
    • Increase for parallel generations or tree search

Example: 4 parallel steppers, each decoding its own branch

  • Context n_seq_max: 4 (four distinct sequences)
  • Batch n_seq_max: 1 (each token belongs to one sequence)
  • Call: decode::many(ctx, tokens, n, pos, batch, seq_id=stepper_id)
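The four-stepper setup above can be sketched as follows (the prompt bookkeeping names are illustrative, not library API):

```cpp
// Four distinct KV cache sequences at the context level.
llama_context_params cparams = llama_context_default_params();
cparams.n_seq_max = 4;
llama_context* ctx = llama_init_from_model(model, cparams);

// Each stepper prefills its own prompt into its own sequence. decode::many
// chunks internally, so prompts longer than n_batch are handled for you.
for (llama_seq_id stepper_id = 0; stepper_id < 4; ++stepper_id) {
    const std::vector<llama_token>& prompt = prompts[stepper_id]; // illustrative
    int rc = lloyal::decode::many(ctx, prompt, /*n_past=*/0,
                                  /*n_batch=*/512, stepper_id);
    if (rc != 0) { /* prefill failed for this stepper */ }
}
```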
Parameters
ctx       Llama context (must be initialized)
tokens    Token array to decode
n_tokens  Number of tokens in array
n_past    Position to start decoding from (KV cache position)
n_batch   Batch size for chunking
seq_id    Sequence ID to update in KV cache (default: 0)
Returns
0 on success, non-zero on decode failure
Exceptions
std::runtime_error  if ctx is NULL or tokens are invalid (validation errors)

CRITICAL: Call kv::remove_range() BEFORE this function, never after.

See also
one() for single-token decode (autoregressive generation)
scatter() for multi-token decode across multiple sequences
Examples
include/lloyal/branch.hpp, include/lloyal/kv.hpp, and include/lloyal/logits.hpp.

Definition at line 124 of file decode.hpp.

◆ many() [2/2]

int lloyal::decode::many ( llama_context *  ctx,
const std::vector< llama_token > &  tokens,
int32_t  n_past,
int32_t  n_batch,
llama_seq_id  seq_id = 0 
)
inline

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Definition at line 206 of file decode.hpp.

◆ one()

int lloyal::decode::one ( llama_context *  ctx,
llama_token  tok,
llama_pos  pos,
llama_seq_id  seq_id = 0,
bool  want_logits = true 
)
inline

Decode a single token into the KV cache.

Fast path for autoregressive generation. Uses a thread_local batch (one-time init per thread) so repeated calls avoid allocation entirely.

Typical usage in a generation loop:

llama_token tok = sampler::sample(ctx, vocab);
if (decode::one(ctx, tok, n_past++) != 0) { /* handle error */ }
Parameters
ctx          Llama context (must not be null)
tok          Token to decode
pos          KV cache position for this token
seq_id       Sequence ID to update (default: 0)
want_logits  Whether to compute logits after this token (default: true). Set to false when prefilling tokens that don't need sampling.
Returns
0 on success, non-zero on decode failure
Exceptions
std::runtime_error  if ctx is NULL
See also
many() for batched multi-token decode with auto-chunking
each() for single-token decode across multiple sequences
Examples
include/lloyal/branch.hpp.

Definition at line 238 of file decode.hpp.

◆ scatter() [1/2]

int lloyal::decode::scatter ( llama_context *  ctx,
const ScatterItem *  items,
int32_t  n,
Scratch &  scratch
)
inline

Decode multiple tokens per sequence in a single llama_decode() call.

Single-batch primitive: packs token runs from multiple sequences into one llama_batch. Does NOT auto-chunk — total tokens must fit in n_batch.
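A sketch combining bin_pack() with scatter(), per the dispatch pattern the bin_pack() docs describe. The ScatterItem and PackedChunk field names here are assumed from their parameter descriptions and should be checked against decode.hpp:

```cpp
// spans[i] is the pending token run for sequence seq_ids[i], starting at
// start_pos[i]. bin_pack groups them so each chunk fits in n_batch.
auto chunks = lloyal::decode::bin_pack(spans.data(), (int32_t)spans.size(), n_batch);

lloyal::decode::Scratch scratch;
for (const auto& chunk : chunks) {
    std::vector<lloyal::decode::ScatterItem> items;
    for (int32_t i : chunk.indices) {      // field name assumed
        // (tokens_span, start_pos, seq_id) per the parameter description
        items.push_back({ spans[i], start_pos[i], seq_ids[i] });
    }
    // Solo oversized chunks exceed n_batch: route those through decode::many,
    // which auto-chunks. Normal chunks go through scatter().
    if (lloyal::decode::scatter(ctx, items, scratch) != 0) {
        // decode failed for this chunk
    }
}
```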

Parameters
ctx      Llama context (must not be null)
items    Array of (tokens_span, start_pos, seq_id) tuples
n        Number of items
scratch  Reusable scratch buffers
Returns
0 on success, non-zero on failure
Exceptions
std::runtime_error  if ctx is NULL or items are invalid
Note
Does NOT auto-chunk. Total tokens must fit in n_batch.
See also
many() for single-sequence multi-token decode with auto-chunking
each() for single-token-per-sequence variant
BranchStore::decode_scatter for auto-chunking branch-level variant
Examples
include/lloyal/branch.hpp, and include/lloyal/logits.hpp.

Definition at line 395 of file decode.hpp.

◆ scatter() [2/2]

int lloyal::decode::scatter ( llama_context *  ctx,
const std::vector< ScatterItem > &  items,
Scratch &  scratch
)
inline

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Definition at line 443 of file decode.hpp.