liblloyal 1.0.0
Composable primitives for llama.cpp inference
lloyal::decoder Namespace Reference

Functions

void decode_tokens (llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 Process tokens through the model to update the KV cache.
 
void decode_tokens (llama_context *ctx, const std::vector< llama_token > &tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 Convenience overload for std::vector<llama_token>
 
void decode_one (llama_context *ctx, llama_token tok, llama_pos pos, llama_seq_id seq_id=0, bool want_logits=true)
 Decode a single token with zero heap allocation (when LLOYAL_STACK_BATCH=1)
 

Function Documentation

◆ decode_one()

inline void lloyal::decoder::decode_one(llama_context* ctx,
                                        llama_token    tok,
                                        llama_pos      pos,
                                        llama_seq_id   seq_id = 0,
                                        bool           want_logits = true)

Decode a single token with zero heap allocation (when LLOYAL_STACK_BATCH=1)

Uses stack-allocated llama_batch to avoid llama_batch_init() overhead. This is the fast path for MCTS single-token expansion.

If LLOYAL_STACK_BATCH=0, uses thread_local batch for ABI safety.

Parameters
    ctx          Llama context
    tok          Token to decode
    pos          Position in KV cache
    seq_id       Sequence ID (default: 0)
    want_logits  Request logits for this token (default: true)
Exceptions
    std::runtime_error  if decode fails

Definition at line 202 of file decoder.hpp.
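
A minimal usage sketch: bulk-decode a prompt, then advance one token. The include path, the batch size of 512, and the sampled token are assumptions; only decode_tokens and decode_one come from this page.

    #include <lloyal/decoder.hpp>  // include path assumed
    #include <vector>

    // Bulk-decode a prompt, then expand a single node (MCTS-style step).
    void expand_once(llama_context* ctx, const std::vector<llama_token>& prompt) {
        // Fill the KV cache with the prompt at positions [0, prompt.size()).
        lloyal::decoder::decode_tokens(ctx, prompt, /*n_past=*/0, /*n_batch=*/512);

        // Choose the next token from llama_get_logits(ctx) with your sampler.
        llama_token tok = /* sampled token */ 0;
        llama_pos   pos = (llama_pos) prompt.size();

        // Fast path: stack batch, logits requested for the new token.
        lloyal::decoder::decode_one(ctx, tok, pos, /*seq_id=*/0, /*want_logits=*/true);
    }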

◆ decode_tokens() [1/2]

inline void lloyal::decoder::decode_tokens(llama_context*      ctx,
                                           const llama_token*  tokens,
                                           int32_t             n_tokens,
                                           int32_t             n_past,
                                           int32_t             n_batch,
                                           llama_seq_id        seq_id = 0)

Process tokens through the model to update the KV cache.

Orchestration logic:

  1. Initializes batch with RAII cleanup
  2. Chunks tokens into n_batch-sized pieces
  3. For each chunk: clear batch, add tokens, call llama_decode
  4. Automatic batch cleanup via RAII guard
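
To make the loop concrete, here is an illustrative sketch of that orchestration in raw llama.cpp calls. It is not the library's implementation (that lives at decoder.hpp line 127 and uses an RAII guard rather than the manual llama_batch_free shown here):

    #include <algorithm>
    #include <stdexcept>

    // Illustrative chunked decode; mirrors the four steps above.
    inline void chunked_decode_sketch(llama_context* ctx, const llama_token* tokens,
                                      int32_t n_tokens, int32_t n_past,
                                      int32_t n_batch, llama_seq_id seq_id) {
        // Step 1: one batch, reused across chunks (batch-level n_seq_max stays 1).
        llama_batch batch = llama_batch_init(n_batch, /*embd=*/0, /*n_seq_max=*/1);

        for (int32_t i = 0; i < n_tokens; i += n_batch) {          // step 2: chunk
            const int32_t n = std::min(n_batch, n_tokens - i);
            batch.n_tokens = 0;                                    // step 3: clear
            for (int32_t j = 0; j < n; ++j) {                      // step 3: fill
                const int32_t k = batch.n_tokens++;
                batch.token[k]     = tokens[i + j];
                batch.pos[k]       = n_past + i + j;
                batch.n_seq_id[k]  = 1;
                batch.seq_id[k][0] = seq_id;
                batch.logits[k]    = (i + j == n_tokens - 1);      // last token only
            }
            if (llama_decode(ctx, batch) != 0) {                   // step 3: decode
                llama_batch_free(batch);
                throw std::runtime_error("decode_tokens: llama_decode failed");
            }
        }
        llama_batch_free(batch);                                   // step 4: cleanup
    }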

Sequence ID Parameter

The seq_id parameter specifies which KV cache sequence to update. Default is 0 (single-sequence mode, backward compatible).

Use different seq_ids for:

  • Parallel generations (multiple steppers, each with own seq_id)
  • Branching/tree search (System 2)
  • Shared prefix optimization (decode prefix to seq_id=0, copy to others; sketched below)
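
For the shared-prefix case, a sketch; the sequence-copy call is llama.cpp's, not liblloyal's, and its name has moved between releases (llama_kv_cache_seq_cp in older builds, llama_memory_seq_cp in newer ones), so use whichever your build exports:

    #include <lloyal/decoder.hpp>  // include path assumed
    #include <vector>

    // Decode a shared prefix once into sequence 0, then fork it to 1..3.
    void fork_prefix(llama_context* ctx, const std::vector<llama_token>& prefix) {
        lloyal::decoder::decode_tokens(ctx, prefix, /*n_past=*/0, /*n_batch=*/512,
                                       /*seq_id=*/0);
        const llama_pos end = (llama_pos) prefix.size();
        for (llama_seq_id dst = 1; dst <= 3; ++dst) {
            // Older llama.cpp spelling; newer builds use llama_memory_seq_cp.
            llama_kv_cache_seq_cp(ctx, /*seq_id_src=*/0, dst, /*p0=*/0, /*p1=*/end);
        }
        // Sequences 0..3 now share the prefix and can diverge independently.
    }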

IMPORTANT: n_seq_max Clarification

There are TWO different n_seq_max parameters; don't confuse them:

  1. llama_batch_init(n_tokens, embd, n_seq_max)
    • Controls how many sequences A SINGLE TOKEN can belong to
    • Keep at 1 for normal decode (one token → one sequence)
    • Only increase for beam search where one token updates multiple branches
  2. llama_context_params.n_seq_max
    • Controls max TOTAL sequences (distinct KV cache states)
    • Increase for parallel generations or tree search

Example: 4 parallel steppers, each decoding its own branch

  • Context n_seq_max: 4 (four distinct sequences)
  • Batch n_seq_max: 1 (each token belongs to one sequence)
  • Call: decode_tokens(ctx, tokens, n, pos, batch, seq_id=stepper_id) (sketched below)
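
Sketched below, assuming llama.cpp's llama_context_params (its n_seq_max field) and llama_init_from_model (llama_new_context_with_model in older builds); the per-stepper arrays are placeholders for whatever state your steppers track:

    // Four independent branches, one KV sequence each.
    void run_steppers(llama_model* model,
                      const llama_token* tokens[4],  // per-stepper token spans
                      const int32_t n[4],            // tokens per stepper
                      const int32_t pos[4]) {        // current position per stepper
        llama_context_params cparams = llama_context_default_params();
        cparams.n_seq_max = 4;  // context-level: four distinct sequences
        llama_context* ctx = llama_init_from_model(model, cparams);

        for (llama_seq_id s = 0; s < 4; ++s) {
            // Batch-level n_seq_max stays 1: each token joins one sequence.
            lloyal::decoder::decode_tokens(ctx, tokens[s], n[s], pos[s],
                                           /*n_batch=*/512, /*seq_id=*/s);
        }
        llama_free(ctx);
    }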
Parameters
    ctx       Llama context (must be initialized)
    tokens    Token array to decode
    n_tokens  Number of tokens in the array
    n_past    Position to start decoding from (KV cache position)
    n_batch   Batch size for chunking
    seq_id    Sequence ID to update in KV cache (default: 0)
Exceptions
    std::runtime_error  if decode fails

CRITICAL: Call kv::remove_range() BEFORE this function, never after: removing the range afterwards would erase the KV cache entries this call just wrote. The required ordering is sketched below.
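
A sketch of that ordering for a rewind-and-redecode. kv::remove_range's exact signature is not shown on this page, so the (ctx, seq_id, p0, p1) form below, mirroring llama.cpp's removal calls, is an assumption to verify against kv.hpp:

    // Rewind sequence 0 back to position `keep` FIRST...
    // (assumed signature; p1 = -1 meaning "to the end", as in llama.cpp)
    lloyal::kv::remove_range(ctx, /*seq_id=*/0, /*p0=*/keep, /*p1=*/-1);
    // ...THEN decode the replacement tokens from that same position.
    lloyal::decoder::decode_tokens(ctx, new_tokens, n_new, /*n_past=*/keep,
                                   /*n_batch=*/512, /*seq_id=*/0);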

Examples
    include/lloyal/kv.hpp

Definition at line 127 of file decoder.hpp.

◆ decode_tokens() [2/2]

inline void lloyal::decoder::decode_tokens(llama_context*                   ctx,
                                           const std::vector<llama_token>&  tokens,
                                           int32_t                          n_past,
                                           int32_t                          n_batch,
                                           llama_seq_id                     seq_id = 0)

Convenience overload for std::vector<llama_token>

Definition at line 179 of file decoder.hpp.
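
Typical call shape for this overload; the token values are stand-ins for a tokenized prompt:

    std::vector<llama_token> tokens = /* tokenize your prompt */ {1, 15043, 3186};
    // Fresh sequence: start at position 0; chunking by n_batch happens internally.
    lloyal::decoder::decode_tokens(ctx, tokens, /*n_past=*/0, /*n_batch=*/512);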