Introduction
liblloyal is a C++20 header-only library providing composable building blocks for llama.cpp inference. It offers clean abstractions over llama.cpp primitives, covering tokenization, sampling, embeddings, KV-cache management, and advanced patterns such as multi-sequence operations and handle-based APIs.
Core Features
- Tokenization - Two-pass safe buffer sizing, special token handling
- Decoding - Batch orchestration, sequence-aware operations
- KV Cache - Sequence operations, state snapshots, long-context patterns
- Sampling - Grammar-constrained, persistent chains, 52 parameters
- Metrics - Dual-level entropy/surprisal, rolling perplexity, cloneable state
- Embeddings - Pooled extraction, L2 normalization, similarity
- Chat Templates - Jinja2 formatting with fallbacks
Advanced Patterns
Handle-Based APIs
Create reusable sampler chains and grammar handles for efficient token generation:
void apply(llama_sampler *chain, llama_token_data_array *cur_p)
Apply a sampler chain to a candidate array.
llama_sampler * create_chain(const P &params)
Create a persistent sampler chain from parameters.
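A minimal usage sketch (assuming these functions live in a lloyal::sampler namespace with a lloyal/sampler.hpp header, and given a populated llama_token_data_array of candidates; names are illustrative):
#include <lloyal/sampler.hpp>  // header name assumed

// Build the chain once and reuse it for every token; rebuilding a
// sampler chain per step is exactly the overhead handle-based APIs avoid.
llama_sampler *chain = lloyal::sampler::create_chain(params);

// Per generation step: apply the persistent chain to the candidate array.
lloyal::sampler::apply(chain, &candidates);
llama_token next = candidates.data[candidates.selected].id;

// Free the chain once generation is complete.
llama_sampler_free(chain);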
Shared Model Weights
Multiple contexts can share the same loaded model via ModelRegistry:
static std::shared_ptr< llama_model > acquire(const std::string &fsPath, const llama_model_params &params)
Acquire a model from cache or load if not present.
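A sketch of the sharing pattern (assuming ModelRegistry lives in the lloyal namespace with a lloyal/model_registry.hpp header):
#include <lloyal/model_registry.hpp>  // header name assumed

auto mparams = llama_model_default_params();

// Both calls return shared_ptrs to the same weights; the GGUF file
// is loaded only once and freed when the last holder releases it.
auto model_a = lloyal::ModelRegistry::acquire("model.gguf", mparams);
auto model_b = lloyal::ModelRegistry::acquire("model.gguf", mparams);

// Each context owns its own KV cache but shares the weights.
llama_context *ctx_a = llama_init_from_model(model_a.get(), llama_context_default_params());
llama_context *ctx_b = llama_init_from_model(model_b.get(), llama_context_default_params());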
Multi-Sequence Operations
All primitives support sequence IDs for parallel execution paths:
void seq_cp(llama_context *ctx, llama_seq_id src, llama_seq_id dst, llama_pos p0=0, llama_pos p1=-1)
Copy KV cache from one sequence to another.
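For example, a shared prompt can be decoded once on sequence 0 and then forked, so each branch continues from the same KV-cache state (a sketch; the lloyal::kv namespace is assumed, and ctx is an initialized llama_context):
// Fork sequence 0 into sequences 1 and 2 after the prompt is decoded.
lloyal::kv::seq_cp(ctx, /*src=*/0, /*dst=*/1);
lloyal::kv::seq_cp(ctx, /*src=*/0, /*dst=*/2);
// Later decode_tokens calls with seq_id=1 or seq_id=2 diverge independently.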
Quick Start
llama_context* ctx = llama_init_from_model(model.get(), ctx_params);
Text Tokenization Operations
std::vector< llama_token > tokenize(const llama_vocab *vocab, const std::string &text, bool add_special, bool parse_special)
Tokenize text to a token array.
Batch Decoding Operations
void decode_tokens(llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id=0)
Process tokens through the model to update the KV cache.
Token Sampling Operations
llama_token sample_with_params(llama_context *ctx, const llama_vocab *vocab, const P &params, llama_sampler *grammarSampler=nullptr)
Sample with configurable parameters (the template accepts any SamplingParams type).
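Putting these together, a minimal generation loop might look like the following sketch. The lloyal::tokenizer, lloyal::decoder, and lloyal::sampler namespaces, the header names, and the SamplingParams type are assumptions for illustration; see docs/guide.md for the canonical flow:
#include <lloyal/tokenizer.hpp>  // header names assumed
#include <lloyal/decoder.hpp>
#include <lloyal/sampler.hpp>

const llama_vocab *vocab = llama_model_get_vocab(model.get());

// Tokenize the prompt (two-pass buffer sizing is handled internally).
std::vector<llama_token> toks = lloyal::tokenizer::tokenize(
    vocab, "Once upon a time", /*add_special=*/true, /*parse_special=*/false);

// Prime the KV cache with the prompt.
lloyal::decoder::decode_tokens(ctx, toks.data(), (int32_t)toks.size(),
                               /*n_past=*/0, /*n_batch=*/512);

// Generate up to 32 tokens (detokenization omitted).
lloyal::SamplingParams params;  // hypothetical params type
int32_t n_past = (int32_t)toks.size();
for (int i = 0; i < 32; ++i) {
  llama_token t = lloyal::sampler::sample_with_params(ctx, vocab, params);
  if (t == llama_vocab_eos(vocab)) break;
  lloyal::decoder::decode_tokens(ctx, &t, 1, n_past++, /*n_batch=*/512);
}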
Architecture
- Header-only - All implementations are inline in include/lloyal/*.hpp
- Composable primitives - Building blocks combine into diverse patterns
- Handle-based APIs - Persistent samplers, grammar chains for efficiency
- Shared model weights - Thread-safe registry enables multi-context with single model load
- Multi-sequence support - All primitives sequence-aware (default seq=0)
- llama.cpp binding - Compile-time dependency, validated by build system
- Zero runtime dependencies - Only requires C++20 standard library
Documentation
- Usage Guide: See docs/guide.md for comprehensive patterns, examples, and best practices
- API Reference: Navigate using the tabs above (Namespaces, Classes, Files)
- Examples: Check the Examples tab for usage patterns
- Headers: All APIs are fully documented inline in include/lloyal/*.hpp
Installation
Add as a git submodule:
git submodule add -b v0.1.0 https://github.com/lloyal-ai/liblloyal.git
CMake integration:
add_subdirectory(liblloyal)
target_link_libraries(your_target PRIVATE lloyal llama)
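For context, a minimal consuming CMakeLists.txt might look like this (project and target names are illustrative):
cmake_minimum_required(VERSION 3.20)
project(my_app CXX)
set(CMAKE_CXX_STANDARD 20)

add_subdirectory(liblloyal)  # the submodule added above

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE lloyal llama)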
License
Apache 2.0 - See LICENSE file for details