liblloyal 1.0.0
Branched Inference for llama.cpp
decode.hpp File Reference

Batch Decoding Operations. More...

#include "common.hpp"
#include <common.h>
#include <algorithm>
#include <cstdint>
#include <llama/llama.h>
#include <span>
#include <stdexcept>
#include <vector>

Go to the source code of this file.

Classes

struct  lloyal::decode::EachItem
 Input item for decode::each — one token for one sequence. More...
 
struct  lloyal::decode::ScatterItem
 Input item for decode::scatter — multiple tokens for one sequence. More...
 
struct  lloyal::decode::Scratch
 Reusable scratch buffers for multi-sequence batch construction. More...
 
struct  lloyal::decode::PackedChunk
 A chunk of item indices produced by bin_pack(). More...
 

Namespaces

namespace  lloyal
 Boundary Tracker Stub for OSS liblloyal.
 
namespace  lloyal::decode
 

Functions

int lloyal::decode::many (llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 Decode multiple tokens into the KV cache with auto-chunking.
 
int lloyal::decode::many (llama_context *ctx, const std::vector< llama_token > &tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
int lloyal::decode::one (llama_context *ctx, llama_token tok, llama_pos pos, llama_seq_id seq_id=0, bool want_logits=true)
 Decode a single token into the KV cache.
 
int lloyal::decode::each (llama_context *ctx, const EachItem *items, int32_t n, Scratch &scratch)
 Decode one token per sequence in a single llama_decode() call.
 
int lloyal::decode::each (llama_context *ctx, const std::vector< EachItem > &items, Scratch &scratch)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
int lloyal::decode::scatter (llama_context *ctx, const ScatterItem *items, int32_t n, Scratch &scratch)
 Decode multiple tokens per sequence in a single llama_decode() call.
 
int lloyal::decode::scatter (llama_context *ctx, const std::vector< ScatterItem > &items, Scratch &scratch)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
std::vector< PackedChunk > lloyal::decode::bin_pack (const std::span< const llama_token > *items, int32_t n, int32_t n_batch)
 Greedy first-fit bin-packing of token spans into n_batch-sized chunks.
 

Detailed Description

Batch Decoding Operations.

Wraps llama.cpp decode APIs with batch management, chunking logic, and orchestration primitives. Provides both batched and single-token decode operations.

API naming follows this grid:

                 Single Sequence     Multi Sequence
                ┌─────────────────┬─────────────────┐
   Single Token │   decode::one   │   decode::each  │
                ├─────────────────┼─────────────────┤
   Multi Token  │  decode::many   │ decode::scatter │
                └─────────────────┴─────────────────┘

Uses batch utilities from llama.cpp common (common_batch_clear, common_batch_add).

Logit Indexing: How llama_get_logits_ith() Maps to Batch Positions

llama.cpp packs logits into a dense output buffer — only tokens with batch.logits[i] = true get logits computed. The internal output_ids vector translates batch positions to packed rows:

Batch:      [tok0, tok1, tok2, tok3, tok4, tok5, tok6, tok7]
logits[]:   [   0,    0,    0,    0,    1,    0,    0,    1 ]
output_ids: [  -1,   -1,   -1,   -1,    0,   -1,   -1,    1 ]
                                        ^                 ^
                                      row 0             row 1

llama_get_logits_ith(ctx,  4) → output_ids[4] =  0 → logits + 0*n_vocab ✓
llama_get_logits_ith(ctx,  7) → output_ids[7] =  1 → logits + 1*n_vocab ✓
llama_get_logits_ith(ctx,  0) → output_ids[0] = -1 → throws (no logits)
llama_get_logits_ith(ctx, -1) → n_outputs - 1 =  1 → logits + 1*n_vocab (last output)

Callers always pass batch positions, not packed indices. The output_ids indirection handles the translation. Negative indices bypass output_ids entirely: -1 means the last output row, -2 the second-to-last, etc.

This matters for logit capture in BranchStore:

Decode pattern    logits flag                      Access index
decode::one       Last token only                  -1 (sole output)
decode::many      Last token of final chunk only   -1 (sole output of last dispatch)
decode::each      All items (1:1 with branches)    i (batch pos = item index)
decode::scatter   Last token per item              cursor + n_tokens[k] - 1

For decode::many, each chunk is a separate llama_decode() call that resets the output buffer. Only the final chunk's last token has logits, so after the last dispatch n_outputs = 1 and -1 yields row 0.

Definition in file decode.hpp.