liblloyal 1.0.0
Composable primitives for llama.cpp inference
lloyal::decoder Namespace Reference

Functions

void decode_tokens (llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 Process tokens through the model to update the KV cache.
 
void decode_tokens (llama_context *ctx, const std::vector< llama_token > &tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 Convenience overload for std::vector<llama_token>
 
void decode_one (llama_context *ctx, llama_token tok, llama_pos pos, llama_seq_id seq_id=0, bool want_logits=true)
 Decode a single token with zero heap allocation (when LLOYAL_STACK_BATCH=1)
 

Function Documentation

◆ decode_one()

inline void lloyal::decoder::decode_one(llama_context* ctx,
                                        llama_token    tok,
                                        llama_pos      pos,
                                        llama_seq_id   seq_id = 0,
                                        bool           want_logits = true)

Decode a single token with zero heap allocation (when LLOYAL_STACK_BATCH=1)

Uses stack-allocated llama_batch to avoid llama_batch_init() overhead. This is the fast path for MCTS single-token expansion.

If LLOYAL_STACK_BATCH=0, uses thread_local batch for ABI safety.

Parameters
    ctx          Llama context
    tok          Token to decode
    pos          Position in KV cache
    seq_id       Sequence ID (default: 0)
    want_logits  Request logits for this token (default: true)
Exceptions
    std::runtime_error  if decode fails

Definition at line 202 of file decoder.hpp.
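
A minimal usage sketch: bulk-decode a prompt, then advance one token. The include path, the batch size of 512, and the sampled token are assumptions; only decode_tokens and decode_one come from this page.

    #include <lloyal/decoder.hpp>  // include path assumed
    #include <vector>

    // Bulk-decode a prompt, then expand a single node (MCTS-style step).
    void expand_once(llama_context* ctx, const std::vector<llama_token>& prompt) {
        // Fill the KV cache with the prompt at positions [0, prompt.size()).
        lloyal::decoder::decode_tokens(ctx, prompt, /*n_past=*/0, /*n_batch=*/512);

        // Choose the next token from llama_get_logits(ctx) with your sampler.
        llama_token tok = /* sampled token */ 0;
        llama_pos   pos = (llama_pos) prompt.size();

        // Fast path: stack batch, logits requested for the new token.
        lloyal::decoder::decode_one(ctx, tok, pos, /*seq_id=*/0, /*want_logits=*/true);
    }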

◆ decode_tokens() [1/2]

inline void lloyal::decoder::decode_tokens(llama_context*      ctx,
                                           const llama_token*  tokens,
                                           int32_t             n_tokens,
                                           int32_t             n_past,
                                           int32_t             n_batch,
                                           llama_seq_id        seq_id = 0)

Process tokens through the model to update the KV cache.

Orchestration logic:

  1. Initializes batch with RAII cleanup
  2. Chunks tokens into n_batch-sized pieces
  3. For each chunk: clear batch, add tokens, call llama_decode
  4. Automatic batch cleanup via RAII guard
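
To make the loop concrete, here is an illustrative sketch of that orchestration in raw llama.cpp calls. It is not the library's implementation (that lives at decoder.hpp line 127 and uses an RAII guard rather than the manual llama_batch_free shown here):

    #include <algorithm>
    #include <stdexcept>

    // Illustrative chunked decode; mirrors the four steps above.
    inline void chunked_decode_sketch(llama_context* ctx, const llama_token* tokens,
                                      int32_t n_tokens, int32_t n_past,
                                      int32_t n_batch, llama_seq_id seq_id) {
        // Step 1: one batch, reused across chunks (batch-level n_seq_max stays 1).
        llama_batch batch = llama_batch_init(n_batch, /*embd=*/0, /*n_seq_max=*/1);

        for (int32_t i = 0; i < n_tokens; i += n_batch) {          // step 2: chunk
            const int32_t n = std::min(n_batch, n_tokens - i);
            batch.n_tokens = 0;                                    // step 3: clear
            for (int32_t j = 0; j < n; ++j) {                      // step 3: fill
                const int32_t k = batch.n_tokens++;
                batch.token[k]     = tokens[i + j];
                batch.pos[k]       = n_past + i + j;
                batch.n_seq_id[k]  = 1;
                batch.seq_id[k][0] = seq_id;
                batch.logits[k]    = (i + j == n_tokens - 1);      // last token only
            }
            if (llama_decode(ctx, batch) != 0) {                   // step 3: decode
                llama_batch_free(batch);
                throw std::runtime_error("decode_tokens: llama_decode failed");
            }
        }
        llama_batch_free(batch);                                   // step 4: cleanup
    }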

Sequence ID Parameter

The seq_id parameter specifies which KV cache sequence to update. Default is 0 (single-sequence mode, backward compatible).

Use different seq_ids for:

  • Parallel generations (multiple steppers, each with own seq_id)
  • Branching/tree search (System 2)
  • Shared prefix optimization (decode prefix to seq_id=0, copy to others; sketched below)
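
For the shared-prefix case, a sketch; the sequence-copy call is llama.cpp's, not liblloyal's, and its name has moved between releases (llama_kv_cache_seq_cp in older builds, llama_memory_seq_cp in newer ones), so use whichever your build exports:

    #include <lloyal/decoder.hpp>  // include path assumed
    #include <vector>

    // Decode a shared prefix once into sequence 0, then fork it to 1..3.
    void fork_prefix(llama_context* ctx, const std::vector<llama_token>& prefix) {
        lloyal::decoder::decode_tokens(ctx, prefix, /*n_past=*/0, /*n_batch=*/512,
                                       /*seq_id=*/0);
        const llama_pos end = (llama_pos) prefix.size();
        for (llama_seq_id dst = 1; dst <= 3; ++dst) {
            // Older llama.cpp spelling; newer builds use llama_memory_seq_cp.
            llama_kv_cache_seq_cp(ctx, /*seq_id_src=*/0, dst, /*p0=*/0, /*p1=*/end);
        }
        // Sequences 0..3 now share the prefix and can diverge independently.
    }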

IMPORTANT: n_seq_max Clarification

There are TWO different n_seq_max parameters; don't confuse them:

  1. llama_batch_init(n_tokens, embd, n_seq_max)
    • Controls how many sequences A SINGLE TOKEN can belong to
    • Keep at 1 for normal decode (one token → one sequence)
    • Only increase for beam search where one token updates multiple branches
  2. llama_context_params.n_seq_max
    • Controls max TOTAL sequences (distinct KV cache states)
    • Increase for parallel generations or tree search

Example: 4 parallel steppers, each decoding its own branch

  • Context n_seq_max: 4 (four distinct sequences)
  • Batch n_seq_max: 1 (each token belongs to one sequence)
  • Call: decode_tokens(ctx, tokens, n, pos, batch, seq_id=stepper_id) (sketched below)
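
Sketched below, assuming llama.cpp's llama_context_params (its n_seq_max field) and llama_init_from_model (llama_new_context_with_model in older builds); the per-stepper arrays are placeholders for whatever state your steppers track:

    // Four independent branches, one KV sequence each.
    void run_steppers(llama_model* model,
                      const llama_token* tokens[4],  // per-stepper token spans
                      const int32_t n[4],            // tokens per stepper
                      const int32_t pos[4]) {        // current position per stepper
        llama_context_params cparams = llama_context_default_params();
        cparams.n_seq_max = 4;  // context-level: four distinct sequences
        llama_context* ctx = llama_init_from_model(model, cparams);

        for (llama_seq_id s = 0; s < 4; ++s) {
            // Batch-level n_seq_max stays 1: each token joins one sequence.
            lloyal::decoder::decode_tokens(ctx, tokens[s], n[s], pos[s],
                                           /*n_batch=*/512, /*seq_id=*/s);
        }
        llama_free(ctx);
    }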
Parameters
    ctx       Llama context (must be initialized)
    tokens    Token array to decode
    n_tokens  Number of tokens in the array
    n_past    Position to start decoding from (KV cache position)
    n_batch   Batch size for chunking
    seq_id    Sequence ID to update in KV cache (default: 0)
Exceptions
    std::runtime_error  if decode fails

CRITICAL: Call kv::remove_range() BEFORE this function, never after: removing the range afterwards would erase the KV cache entries this call just wrote. The required ordering is sketched below.
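
A sketch of that ordering for a rewind-and-redecode. kv::remove_range's exact signature is not shown on this page, so the (ctx, seq_id, p0, p1) form below, mirroring llama.cpp's removal calls, is an assumption to verify against kv.hpp:

    // Rewind sequence 0 back to position `keep` FIRST...
    // (assumed signature; p1 = -1 meaning "to the end", as in llama.cpp)
    lloyal::kv::remove_range(ctx, /*seq_id=*/0, /*p0=*/keep, /*p1=*/-1);
    // ...THEN decode the replacement tokens from that same position.
    lloyal::decoder::decode_tokens(ctx, new_tokens, n_new, /*n_past=*/keep,
                                   /*n_batch=*/512, /*seq_id=*/0);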

Examples
    include/lloyal/kv.hpp

Definition at line 127 of file decoder.hpp.

◆ decode_tokens() [2/2]

inline void lloyal::decoder::decode_tokens(llama_context*                   ctx,
                                           const std::vector<llama_token>&  tokens,
                                           int32_t                          n_past,
                                           int32_t                          n_batch,
                                           llama_seq_id                     seq_id = 0)

Convenience overload for std::vector<llama_token>

Definition at line 179 of file decoder.hpp.
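
Typical call shape for this overload; the token values are stand-ins for a tokenized prompt:

    std::vector<llama_token> tokens = /* tokenize your prompt */ {1, 15043, 3186};
    // Fresh sequence: start at position 0; chunking by n_batch happens internally.
    lloyal::decoder::decode_tokens(ctx, tokens, /*n_past=*/0, /*n_batch=*/512);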