liblloyal 1.0.0
Composable primitives for llama.cpp inference
Functions

| void | decode_tokens (llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0) |
| | Process tokens through model to update KV cache. |
| void | decode_tokens (llama_context *ctx, const std::vector< llama_token > &tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0) |
| | Convenience overload for std::vector<llama_token>. |
| void | decode_one (llama_context *ctx, llama_token tok, llama_pos pos, llama_seq_id seq_id=0, bool want_logits=true) |
| | Decode a single token with zero heap allocation (when LLOYAL_STACK_BATCH=1). |
decode_one()

inline void decode_one (llama_context *ctx, llama_token tok, llama_pos pos, llama_seq_id seq_id=0, bool want_logits=true)

Decode a single token with zero heap allocation (when LLOYAL_STACK_BATCH=1).
Uses a stack-allocated llama_batch to avoid llama_batch_init() overhead. This is the fast path for MCTS single-token expansion. If LLOYAL_STACK_BATCH=0, a thread_local batch is used instead for ABI safety.
Parameters
| ctx | Llama context |
| tok | Token to decode |
| pos | Position in KV cache |
| seq_id | Sequence ID (default: 0) |
| want_logits | Request logits for this token (default: true) |

Exceptions
| std::runtime_error | if decode fails |
Definition at line 202 of file decoder.hpp.
decode_tokens() [1/2]

inline void decode_tokens (llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)

Process tokens through model to update KV cache.
Orchestration logic: the token array is decoded in chunks of at most n_batch tokens, starting at KV cache position n_past.
The seq_id parameter specifies which KV cache sequence to update. The default is 0 (single-sequence mode, backward compatible). Use different seq_ids to keep parallel branches in separate KV cache sequences within a single context.
There are TWO different n_seq_max parameters - don't confuse them:
| llama_batch_init(n_tokens, embd, n_seq_max) | per-batch: how many sequence IDs each token slot in the batch can carry |
| llama_context_params.n_seq_max | per-context: how many distinct sequences the context's KV cache supports |
Example: 4 parallel steppers, each decoding its own branch under its own seq_id.
Parameters
| ctx | Llama context (must be initialized) |
| tokens | Token array to decode |
| n_tokens | Number of tokens in array |
| n_past | Position to start decoding from (KV cache position) |
| n_batch | Batch size for chunking |
| seq_id | Sequence ID to update in KV cache (default: 0) |

Exceptions
| std::runtime_error | if decode fails |
CRITICAL: Call kv::remove_range() BEFORE this function, never after.
Definition at line 127 of file decoder.hpp.
decode_tokens() [2/2]

inline void decode_tokens (llama_context *ctx, const std::vector< llama_token > &tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)

Convenience overload for std::vector<llama_token>.

Definition at line 179 of file decoder.hpp.