liblloyal 1.0.0
Branched Inference for llama.cpp
lloyal::decode Namespace Reference

Classes

struct  EachItem
 Input item for decode::each — one token for one sequence. More...
 
struct  PackedChunk
 A chunk of item indices produced by bin_pack() More...
 
struct  ScatterItem
 Input item for decode::scatter — multiple tokens for one sequence. More...
 
struct  Scratch
 Reusable scratch buffers for multi-sequence batch construction. More...
 

Functions

int many (llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 Decode multiple tokens into the KV cache with auto-chunking.
 
int many (llama_context *ctx, const std::vector< llama_token > &tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
int one (llama_context *ctx, llama_token tok, llama_pos pos, llama_seq_id seq_id=0, bool want_logits=true)
 Decode a single token into the KV cache.
 
int each (llama_context *ctx, const EachItem *items, int32_t n, Scratch &scratch)
 Decode one token per sequence in a single llama_decode() call.
 
int each (llama_context *ctx, const std::vector< EachItem > &items, Scratch &scratch)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
int scatter (llama_context *ctx, const ScatterItem *items, int32_t n, Scratch &scratch)
 Decode multiple tokens per sequence in a single llama_decode() call.
 
int scatter (llama_context *ctx, const std::vector< ScatterItem > &items, Scratch &scratch)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
std::vector< PackedChunk > bin_pack (const std::span< const llama_token > *items, int32_t n, int32_t n_batch)
 Greedy first-fit bin-packing of token spans into n_batch-sized chunks.
 

Function Documentation

◆ bin_pack()

std::vector< PackedChunk > lloyal::decode::bin_pack ( const std::span< const llama_token > *  items,
int32_t  n,
int32_t  n_batch 
)
inline

Greedy first-fit bin-packing of token spans into n_batch-sized chunks.

Pure packing algorithm — no decoding, no logit capture, no context. Callers use the returned chunks to drive their own dispatch logic (decode::scatter for normal chunks, decode::many for oversized).

Empty spans (size 0) are skipped. Items exceeding n_batch get a solo oversized chunk.

Parameters
items    Array of token spans (only .size() is inspected)
n        Number of items
n_batch  Maximum total tokens per normal chunk
Returns
Vector of PackedChunks with indices into the input array
Examples
include/lloyal/branch.hpp, and include/lloyal/logits.hpp.

Definition at line 480 of file decode.hpp.

◆ each() [1/2]

int lloyal::decode::each ( llama_context *  ctx,
const EachItem *  items,
int32_t  n,
Scratch &  scratch
)
inline

Decode one token per sequence in a single llama_decode() call.

"each" = each sequence gets one token. Packs N tokens (each targeting a different seq_id) into one llama_batch. Amortizes GPU dispatch overhead across N sequences.
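A sketch of one generation step across N parallel branches. The EachItem field names below are assumed from the "(token, pos, seq_id, output_logits)" parameter description and should be checked against decode.hpp:

```cpp
// Advance each of n_seq branches by its own sampled token in one
// llama_decode() call. Field names are assumed, not verified.
std::vector<lloyal::decode::EachItem> items;
items.reserve(n_seq);
for (llama_seq_id s = 0; s < n_seq; ++s) {
    items.push_back({
        .token         = next_token[s], // token sampled for branch s
        .pos           = n_past[s],     // branch s's current KV position
        .seq_id        = s,
        .output_logits = true,          // keep logits so each branch can sample again
    });
    n_past[s] += 1;
}
lloyal::decode::Scratch scratch;        // reuse across steps to avoid reallocation
if (lloyal::decode::each(ctx, items, scratch) != 0) {
    // decode failed for this step
}
```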

Parameters
ctx      Llama context (must not be null)
items    Array of (token, pos, seq_id, output_logits) tuples
n        Number of items
scratch  Reusable scratch buffers
Returns
0 on success, non-zero on failure
Exceptions
std::runtime_error  if ctx is NULL
See also
one() for single-sequence single-token decode
scatter() for multi-token-per-sequence variant
Examples
include/lloyal/branch.hpp.

Definition at line 339 of file decode.hpp.

◆ each() [2/2]

int lloyal::decode::each ( llama_context *  ctx,
const std::vector< EachItem > &  items,
Scratch &  scratch
)
inline

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Definition at line 370 of file decode.hpp.

◆ many() [1/2]

int lloyal::decode::many ( llama_context *  ctx,
const llama_token *  tokens,
int32_t  n_tokens,
int32_t  n_past,
int32_t  n_batch,
llama_seq_id  seq_id = 0 
)
inline

Decode multiple tokens into the KV cache with auto-chunking.

Orchestration logic:

  1. Uses a thread_local batch (heap-allocated once per thread, grows on demand)
  2. Chunks tokens into n_batch-sized pieces
  3. For each chunk: clear batch, add tokens, call llama_decode

Sequence ID Parameter

The seq_id parameter specifies which KV cache sequence to update. Default is 0 (single-sequence mode, backward compatible).

Use different seq_ids for:

  • Parallel generations (multiple steppers, each with own seq_id)
  • Branching/tree search (System 2)
  • Shared prefix optimization (decode prefix to seq_id=0, copy to others)

IMPORTANT: n_seq_max Clarification

There are TWO different n_seq_max parameters - don't confuse them:

  1. llama_batch_init(n_tokens, embd, n_seq_max)
    • Controls how many sequences A SINGLE TOKEN can belong to
    • Keep at 1 for normal decode (one token → one sequence)
    • Only increase for beam search where one token updates multiple branches
  2. llama_context_params.n_seq_max
    • Controls max TOTAL sequences (distinct KV cache states)
    • Increase for parallel generations or tree search

Example: 4 parallel steppers, each decoding its own branch

  • Context n_seq_max: 4 (four distinct sequences)
  • Batch n_seq_max: 1 (each token belongs to one sequence)
  • Call: decode::many(ctx, tokens, n, pos, batch, seq_id=stepper_id)
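The four-stepper setup above can be sketched as follows (the prompt bookkeeping names are illustrative, not library API):

```cpp
// Four distinct KV cache sequences at the context level.
llama_context_params cparams = llama_context_default_params();
cparams.n_seq_max = 4;
llama_context* ctx = llama_init_from_model(model, cparams);

// Each stepper prefills its own prompt into its own sequence. decode::many
// chunks internally, so prompts longer than n_batch are handled for you.
for (llama_seq_id stepper_id = 0; stepper_id < 4; ++stepper_id) {
    const std::vector<llama_token>& prompt = prompts[stepper_id]; // illustrative
    int rc = lloyal::decode::many(ctx, prompt, /*n_past=*/0,
                                  /*n_batch=*/512, stepper_id);
    if (rc != 0) { /* prefill failed for this stepper */ }
}
```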
Parameters
ctx       Llama context (must be initialized)
tokens    Token array to decode
n_tokens  Number of tokens in array
n_past    Position to start decoding from (KV cache position)
n_batch   Batch size for chunking
seq_id    Sequence ID to update in KV cache (default: 0)
Returns
0 on success, non-zero on decode failure
Exceptions
std::runtime_error  if ctx is NULL or tokens are invalid (validation errors)

CRITICAL: Call kv::remove_range() BEFORE this function, never after.

See also
one() for single-token decode (autoregressive generation)
scatter() for multi-token decode across multiple sequences
Examples
include/lloyal/branch.hpp, include/lloyal/kv.hpp, and include/lloyal/logits.hpp.

Definition at line 124 of file decode.hpp.

◆ many() [2/2]

int lloyal::decode::many ( llama_context *  ctx,
const std::vector< llama_token > &  tokens,
int32_t  n_past,
int32_t  n_batch,
llama_seq_id  seq_id = 0 
)
inline

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Definition at line 206 of file decode.hpp.

◆ one()

int lloyal::decode::one ( llama_context *  ctx,
llama_token  tok,
llama_pos  pos,
llama_seq_id  seq_id = 0,
bool  want_logits = true 
)
inline

Decode a single token into the KV cache.

Fast path for autoregressive generation. Uses a thread_local batch (one-time init per thread) so repeated calls avoid allocation entirely.

Typical usage in a generation loop:

llama_token tok = sampler::sample(ctx, vocab);
if (decode::one(ctx, tok, n_past++) != 0) { /* handle error */ }
Parameters
ctx          Llama context (must not be null)
tok          Token to decode
pos          KV cache position for this token
seq_id       Sequence ID to update (default: 0)
want_logits  Whether to compute logits after this token (default: true). Set to false when prefilling tokens that don't need sampling.
Returns
0 on success, non-zero on decode failure
Exceptions
std::runtime_error  if ctx is NULL
See also
many() for batched multi-token decode with auto-chunking
each() for single-token decode across multiple sequences
Examples
include/lloyal/branch.hpp.

Definition at line 238 of file decode.hpp.

◆ scatter() [1/2]

int lloyal::decode::scatter ( llama_context *  ctx,
const ScatterItem *  items,
int32_t  n,
Scratch &  scratch
)
inline

Decode multiple tokens per sequence in a single llama_decode() call.

Single-batch primitive: packs token runs from multiple sequences into one llama_batch. Does NOT auto-chunk — total tokens must fit in n_batch.
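A sketch combining bin_pack() with scatter(), per the dispatch pattern the bin_pack() docs describe. The ScatterItem and PackedChunk field names here are assumed from their parameter descriptions and should be checked against decode.hpp:

```cpp
// spans[i] is the pending token run for sequence seq_ids[i], starting at
// start_pos[i]. bin_pack groups them so each chunk fits in n_batch.
auto chunks = lloyal::decode::bin_pack(spans.data(), (int32_t)spans.size(), n_batch);

lloyal::decode::Scratch scratch;
for (const auto& chunk : chunks) {
    std::vector<lloyal::decode::ScatterItem> items;
    for (int32_t i : chunk.indices) {      // field name assumed
        // (tokens_span, start_pos, seq_id) per the parameter description
        items.push_back({ spans[i], start_pos[i], seq_ids[i] });
    }
    // Solo oversized chunks exceed n_batch: route those through decode::many,
    // which auto-chunks. Normal chunks go through scatter().
    if (lloyal::decode::scatter(ctx, items, scratch) != 0) {
        // decode failed for this chunk
    }
}
```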

Parameters
ctx      Llama context (must not be null)
items    Array of (tokens_span, start_pos, seq_id) tuples
n        Number of items
scratch  Reusable scratch buffers
Returns
0 on success, non-zero on failure
Exceptions
std::runtime_error  if ctx is NULL or items are invalid
Note
Does NOT auto-chunk. Total tokens must fit in n_batch.
See also
many() for single-sequence multi-token decode with auto-chunking
each() for single-token-per-sequence variant
BranchStore::decode_scatter for auto-chunking branch-level variant
Examples
include/lloyal/branch.hpp, and include/lloyal/logits.hpp.

Definition at line 395 of file decode.hpp.

◆ scatter() [2/2]

int lloyal::decode::scatter ( llama_context *  ctx,
const std::vector< ScatterItem > &  items,
Scratch &  scratch
)
inline

This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.

Definition at line 443 of file decode.hpp.