liblloyal 1.0.0
Composable primitives for llama.cpp inference
decoder.hpp File Reference

Batch Decoding Operations. More...

#include "common.hpp"
#include "helpers.hpp"
#include <algorithm>
#include <cstdint>
#include <llama/llama.h>
#include <stdexcept>
#include <vector>

Go to the source code of this file.

Classes

struct  lloyal::detail::BatchGuard
 RAII guard for automatic batch cleanup; ensures llama_batch_free is called even if an exception occurs. More...
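 As a rough sketch of the shape of such a guard (illustrative only; the member name is an assumption and the actual definition is in decoder.hpp):

    struct BatchGuard {
        llama_batch batch;
        explicit BatchGuard(llama_batch b) : batch(b) {}
        ~BatchGuard() { llama_batch_free(batch); }   // runs even when an exception unwinds
        BatchGuard(const BatchGuard &) = delete;     // non-copyable: owns the batch
        BatchGuard &operator=(const BatchGuard &) = delete;
    };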
 

Namespaces

namespace  lloyal
 JSON Schema to Grammar Converter (Header-Only)
 
namespace  lloyal::detail
 
namespace  lloyal::decoder
 

Macros

#define LLOYAL_STACK_BATCH   1
 LLOYAL_STACK_BATCH - Controls llama_batch construction strategy.
 

Functions

void lloyal::detail::add_tokens_to_batch (llama_batch &batch, const llama_token *tokens, int32_t start_idx, int32_t n_eval, int32_t n_past, int32_t capacity, llama_seq_id seq_id=0)
 Add tokens to batch with position info.
 
void lloyal::decoder::decode_tokens (llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 Process tokens through model to update KV cache.
 
void lloyal::decoder::decode_tokens (llama_context *ctx, const std::vector< llama_token > &tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
 Convenience overload for std::vector<llama_token>.
 
void lloyal::decoder::decode_one (llama_context *ctx, llama_token tok, llama_pos pos, llama_seq_id seq_id=0, bool want_logits=true)
 Decode a single token with zero heap allocation (when LLOYAL_STACK_BATCH=1).
 

Detailed Description

Batch Decoding Operations.

Wraps llama.cpp decode APIs with batch management, chunking logic, and orchestration primitives. Provides both batched and single-token decode operations.

Uses batch utilities from helpers.hpp (batch_clear, batch_add) for token management.
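
A typical call sequence, as a minimal sketch (model and context setup are omitted, and prefill_and_step is a hypothetical wrapper, not part of the API):

    #include "decoder.hpp"
    #include <vector>

    void prefill_and_step(llama_context *ctx,
                          const std::vector<llama_token> &prompt,
                          llama_token next_tok,
                          int32_t n_batch) {
        // Prefill: decode_tokens chunks the prompt into n_batch-sized
        // llama_decode calls, starting at position n_past = 0.
        lloyal::decoder::decode_tokens(ctx, prompt, /*n_past=*/0, n_batch);

        // Incremental step: decode one sampled token at the next position,
        // requesting logits so the following token can be sampled.
        const llama_pos pos = static_cast<llama_pos>(prompt.size());
        lloyal::decoder::decode_one(ctx, next_tok, pos, /*seq_id=*/0,
                                    /*want_logits=*/true);
    }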

Definition in file decoder.hpp.

Macro Definition Documentation

◆ LLOYAL_STACK_BATCH

#define LLOYAL_STACK_BATCH   1

LLOYAL_STACK_BATCH - Controls llama_batch construction strategy.

When 1 (default): use a zero-allocation, stack-constructed batch in decode_one()

  • Fastest: no heap allocation per decode
  • Risk: breaks if the llama_batch struct layout changes

When 0: use a thread_local batch via llama_batch_init()

  • Slightly slower: one-time initialization per thread
  • Safe: uses llama.cpp's own initializer, which handles new fields
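
As a rough illustration of the difference (the field names below follow a recent llama_batch layout and may not match your llama.cpp version, which is precisely the risk noted above):

    // LLOYAL_STACK_BATCH == 1: point a zero-initialized llama_batch at
    // stack storage; no heap allocation, but tied to the struct layout.
    llama_token   tok_storage = tok;
    llama_pos     pos_storage = pos;
    int32_t       n_seq       = 1;
    llama_seq_id  seq_storage = seq_id;
    llama_seq_id *seq_ptr     = &seq_storage;
    int8_t        logits_flag = want_logits ? 1 : 0;

    llama_batch batch{};
    batch.n_tokens = 1;
    batch.token    = &tok_storage;
    batch.pos      = &pos_storage;
    batch.n_seq_id = &n_seq;
    batch.seq_id   = &seq_ptr;
    batch.logits   = &logits_flag;

    // LLOYAL_STACK_BATCH == 0: let llama.cpp allocate and lay out the
    // batch once per thread, then reuse it for every decode_one() call.
    thread_local llama_batch tl_batch =
        llama_batch_init(/*n_tokens*/ 1, /*embd*/ 0, /*n_seq_max*/ 1);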

If the build breaks after a llama.cpp update due to llama_batch changes:

  1. Set LLOYAL_STACK_BATCH=0 to unblock immediately
  2. Update decode_one() to match new struct layout
  3. Update ABI stability test assertions
  4. Re-enable LLOYAL_STACK_BATCH=1
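
For step 1, the macro can be set at build time; assuming decoder.hpp wraps the default in an #ifndef guard (the usual convention for such switches), either of the following works:

    // Define before the header is included:
    #define LLOYAL_STACK_BATCH 0
    #include "decoder.hpp"

    // ...or pass it as a compiler flag:
    //   -DLLOYAL_STACK_BATCH=0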

Definition at line 32 of file decoder.hpp.