liblloyal 1.0.0
Branched Inference for llama.cpp
Batch Decoding Operations.
#include "common.hpp"
#include <common.h>
#include <algorithm>
#include <cstdint>
#include <llama/llama.h>
#include <span>
#include <stdexcept>
#include <vector>
Classes

struct lloyal::decode::EachItem
    Input item for decode::each — one token for one sequence.
struct lloyal::decode::ScatterItem
    Input item for decode::scatter — multiple tokens for one sequence.
struct lloyal::decode::Scratch
    Reusable scratch buffers for multi-sequence batch construction.
struct lloyal::decode::PackedChunk
    A chunk of item indices produced by bin_pack().
Namespaces

namespace lloyal
    Boundary Tracker Stub for OSS liblloyal.
namespace lloyal::decode
Functions

int lloyal::decode::many(llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id = 0)
    Decode multiple tokens into the KV cache with auto-chunking.
int lloyal::decode::many(llama_context *ctx, const std::vector<llama_token> &tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id = 0)
    Overload of the above, accepting a std::vector of tokens.
int lloyal::decode::one(llama_context *ctx, llama_token tok, llama_pos pos, llama_seq_id seq_id = 0, bool want_logits = true)
    Decode a single token into the KV cache.
int lloyal::decode::each(llama_context *ctx, const EachItem *items, int32_t n, Scratch &scratch)
    Decode one token per sequence in a single llama_decode() call.
int lloyal::decode::each(llama_context *ctx, const std::vector<EachItem> &items, Scratch &scratch)
    Overload of the above, accepting a std::vector of items.
int lloyal::decode::scatter(llama_context *ctx, const ScatterItem *items, int32_t n, Scratch &scratch)
    Decode multiple tokens per sequence in a single llama_decode() call.
int lloyal::decode::scatter(llama_context *ctx, const std::vector<ScatterItem> &items, Scratch &scratch)
    Overload of the above, accepting a std::vector of items.
std::vector<PackedChunk> lloyal::decode::bin_pack(const std::span<const llama_token> *items, int32_t n, int32_t n_batch)
    Greedy first-fit bin-packing of token spans into n_batch-sized chunks.
Batch Decoding Operations.
Wraps llama.cpp decode APIs with batch management, chunking logic, and orchestration primitives. Provides both batched and single-token decode operations.
API naming follows this grid:

                 Single Sequence    Multi Sequence
              ┌─────────────────┬─────────────────┐
 Single Token │ decode::one     │ decode::each    │
              ├─────────────────┼─────────────────┤
 Multi Token  │ decode::many    │ decode::scatter │
              └─────────────────┴─────────────────┘
Uses batch utilities from llama.cpp common (common_batch_clear, common_batch_add).
llama.cpp packs logits into a dense output buffer — only tokens with batch.logits[i] = true get logits computed. The internal output_ids vector translates batch positions to packed rows.
Callers always pass batch positions, not packed indices. The output_ids indirection handles the translation. Negative indices bypass output_ids entirely: -1 means the last output row, -2 the second-to-last, etc.
This matters for logit capture in BranchStore:
| Decode pattern  | Logits flag set on             | Access index                      |
|-----------------|--------------------------------|-----------------------------------|
| decode::one     | Last token only                | -1 (sole output)                  |
| decode::many    | Last token of final chunk only | -1 (sole output of last dispatch) |
| decode::each    | All items (1:1 with branches)  | i (batch pos = item index)        |
| decode::scatter | Last token per item            | cursor + n_tokens[k] - 1          |
For decode::many, each chunk is a separate llama_decode() call that resets the output buffer. Only the final chunk's last token has logits, so after the last dispatch n_outputs = 1 and -1 yields row 0.
Definition in file decode.hpp.