liblloyal 1.0.0
Branched Inference for llama.cpp
decode.hpp File Reference

Batch Decoding Operations. More...

#include "common.hpp"
#include <common.h>
#include <algorithm>
#include <cstdint>
#include <llama/llama.h>
#include <span>
#include <stdexcept>
#include <vector>

Go to the source code of this file.

Classes

struct  lloyal::decode::EachItem
 Input item for decode::each — one token for one sequence. More...
 
struct  lloyal::decode::ScatterItem
 Input item for decode::scatter — multiple tokens for one sequence. More...
 
struct  lloyal::decode::Scratch
 Reusable scratch buffers for multi-sequence batch construction. More...
 
struct  lloyal::decode::PackedChunk
 A chunk of item indices produced by bin_pack(). More...
 

Namespaces

namespace  lloyal
 Boundary Tracker Stub for OSS liblloyal.
 
namespace  lloyal::decode
 

Functions

int lloyal::decode::many (llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 Decode multiple tokens into the KV cache with auto-chunking.
 
int lloyal::decode::many (llama_context *ctx, const std::vector< llama_token > &tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
int lloyal::decode::one (llama_context *ctx, llama_token tok, llama_pos pos, llama_seq_id seq_id=0, bool want_logits=true)
 Decode a single token into the KV cache.
 
int lloyal::decode::each (llama_context *ctx, const EachItem *items, int32_t n, Scratch &scratch)
 Decode one token per sequence in a single llama_decode() call.
 
int lloyal::decode::each (llama_context *ctx, const std::vector< EachItem > &items, Scratch &scratch)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
int lloyal::decode::scatter (llama_context *ctx, const ScatterItem *items, int32_t n, Scratch &scratch)
 Decode multiple tokens per sequence in a single llama_decode() call.
 
int lloyal::decode::scatter (llama_context *ctx, const std::vector< ScatterItem > &items, Scratch &scratch)
 This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
 
std::vector< PackedChunk > lloyal::decode::bin_pack (const std::span< const llama_token > *items, int32_t n, int32_t n_batch)
 Greedy first-fit bin-packing of token spans into n_batch-sized chunks.
 

Detailed Description

Batch Decoding Operations.

Wraps llama.cpp decode APIs with batch management, chunking logic, and orchestration primitives. Provides both batched and single-token decode operations.

API naming follows this grid:

                 Single Sequence     Multi Sequence
                ┌─────────────────┬─────────────────┐
   Single Token │   decode::one   │   decode::each  │
                ├─────────────────┼─────────────────┤
   Multi Token  │  decode::many   │ decode::scatter │
                └─────────────────┴─────────────────┘

Uses batch utilities from llama.cpp common (common_batch_clear, common_batch_add).

Logit Indexing: How llama_get_logits_ith() Maps to Batch Positions

llama.cpp packs logits into a dense output buffer — only tokens with batch.logits[i] = true get logits computed. The internal output_ids vector translates batch positions to packed rows:

Batch:      [tok0, tok1, tok2, tok3, tok4, tok5, tok6, tok7]
logits[]:   [   0,    0,    0,    0,    1,    0,    0,    1 ]
output_ids: [  -1,   -1,   -1,   -1,    0,   -1,   -1,    1 ]
                                        ^                 ^
                                      row 0             row 1

llama_get_logits_ith(ctx,  4) → output_ids[4] =  0 → logits + 0*n_vocab ✓
llama_get_logits_ith(ctx,  7) → output_ids[7] =  1 → logits + 1*n_vocab ✓
llama_get_logits_ith(ctx,  0) → output_ids[0] = -1 → throws (no logits)
llama_get_logits_ith(ctx, -1) → n_outputs - 1 =  1 → logits + 1*n_vocab (last output)

Callers always pass batch positions, not packed indices. The output_ids indirection handles the translation. Negative indices bypass output_ids entirely: -1 means the last output row, -2 the second-to-last, etc.

This matters for logit capture in BranchStore:

Decode pattern    logits flag                      Access index
decode::one       Last token only                  -1 (sole output)
decode::many      Last token of final chunk only   -1 (sole output of last dispatch)
decode::each      All items (1:1 with branches)    i (batch pos = item index)
decode::scatter   Last token per item              cursor + n_tokens[k] - 1

For decode::many, each chunk is a separate llama_decode() call that resets the output buffer. Only the final chunk's last token has logits, so after the last dispatch n_outputs = 1 and -1 yields row 0.

Definition in file decode.hpp.