liblloyal - Composable Primitives for llama.cpp Inference

Introduction

liblloyal is a C++20 header-only library providing composable building blocks for llama.cpp inference. It offers clean abstractions over llama.cpp primitives with support for tokenization, sampling, embeddings, KV cache management, and advanced patterns like multi-sequence operations and handle-based APIs.

Core Features

  • Tokenization - Two-pass safe buffer sizing, special token handling
  • Decoding - Batch orchestration, sequence-aware operations
  • KV Cache - Sequence operations, state snapshots, long-context patterns
  • Sampling - Grammar-constrained, persistent chains, 52 parameters
  • Metrics - Dual-level entropy/surprisal, rolling perplexity, cloneable state
  • Embeddings - Pooled extraction, L2 normalization, similarity
  • Chat Templates - Jinja2 formatting with fallbacks

Advanced Patterns

Handle-Based APIs

Create reusable sampler chains and grammar handles for efficient token generation:

auto chain = lloyal::sampler::create_chain(params); // build once
lloyal::sampler::apply(chain, &cur_p);              // reuse across tokens (cur_p: candidate array for the current step, see the sketch below)

Relevant signatures:

  • llama_sampler* create_chain(const P& params) - Create a persistent sampler chain from parameters (sampler.hpp:465)
  • void apply(llama_sampler* chain, llama_token_data_array* cur_p) - Apply a sampler chain to a candidate array (sampler.hpp:581)
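
A chain built this way is applied once per generated token without being reconstructed. The sketch below is a minimal illustration of that loop, assuming llama.cpp's llama_get_logits_ith and llama_vocab_n_tokens for raw logits and vocabulary size, and a chain whose final stage selects a token by setting cur_p.selected:

auto chain = lloyal::sampler::create_chain(params);          // build once
const int n_vocab = llama_vocab_n_tokens(vocab);
std::vector<llama_token_data> candidates(n_vocab);
for (int step = 0; step < n_predict; ++step) {
    const float* logits = llama_get_logits_ith(ctx, -1);     // logits of the last decoded token
    for (llama_token id = 0; id < n_vocab; ++id)
        candidates[id] = { id, logits[id], 0.0f };
    llama_token_data_array cur_p = { candidates.data(), candidates.size(), /*selected=*/-1, /*sorted=*/false };
    lloyal::sampler::apply(chain, &cur_p);                   // same handle every iteration
    llama_token token = cur_p.data[cur_p.selected].id;
    // ... decode `token`, stop on end-of-generation, etc. ...
}
llama_sampler_free(chain);                                   // release the handle when done

For grammar-constrained generation, a persistent grammar handle can likewise be passed to sample_with_params through its grammarSampler argument (see the Quick Start signatures below).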

Shared Model Weights

Multiple contexts can share the same loaded model via ModelRegistry:

auto model1 = lloyal::ModelRegistry::acquire(path, params); // Loads model
auto model2 = lloyal::ModelRegistry::acquire(path, params); // Cache hit
// model1 and model2 share weights, independent KV caches
Relevant signature:

  • static std::shared_ptr<llama_model> acquire(const std::string& fsPath, const llama_model_params& params) - Acquire a model from cache or load if not present
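
Because acquire returns a std::shared_ptr, any number of contexts can be created over one set of weights and freed independently. A minimal sketch, assuming llama.cpp's default model and context parameters:

auto model = lloyal::ModelRegistry::acquire("model.gguf", llama_model_default_params());
llama_context_params cparams = llama_context_default_params();
llama_context* ctx_a = llama_init_from_model(model.get(), cparams);
llama_context* ctx_b = llama_init_from_model(model.get(), cparams); // no second load
// ctx_a and ctx_b decode independently but read the same weights
llama_free(ctx_a);
llama_free(ctx_b); // weights are released once the last shared_ptr is dropped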

Multi-Sequence Operations

All primitives support sequence IDs for parallel execution paths:

lloyal::kv::seq_cp(ctx, 0, 1); // Branch sequence 0 to sequence 1
lloyal::kv::seq_cp(ctx, 0, 2); // Branch sequence 0 to sequence 2
// Each sequence maintains independent state
Relevant signature:

  • void seq_cp(llama_context* ctx, llama_seq_id src, llama_seq_id dst, llama_pos p0 = 0, llama_pos p1 = -1) - Copy KV cache from one sequence to another (kv.hpp:114)
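
Combined with the decoder, sequence copies enable prefix sharing: decode a prompt once on sequence 0, branch it, then continue each branch separately. The sketch below assumes tok_a and tok_b are branch-specific continuation tokens and n_batch is the configured batch size:

auto prompt = lloyal::tokenizer::tokenize(vocab, "Once upon a time", true, true);
lloyal::decoder::decode_tokens(ctx, prompt.data(), (int32_t)prompt.size(), /*n_past=*/0, n_batch, /*seq_id=*/0);
lloyal::kv::seq_cp(ctx, 0, 1); // both branches now share the prompt prefix
lloyal::kv::seq_cp(ctx, 0, 2);
const int32_t n_past = (int32_t)prompt.size();
lloyal::decoder::decode_tokens(ctx, &tok_a, 1, n_past, n_batch, /*seq_id=*/1); // branch A
lloyal::decoder::decode_tokens(ctx, &tok_b, 1, n_past, n_batch, /*seq_id=*/2); // branch B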

Quick Start

// Load model (shared weights)
auto model = lloyal::ModelRegistry::acquire("model.gguf", params);
llama_context* ctx = llama_init_from_model(model.get(), ctx_params);
const llama_vocab* vocab = llama_model_get_vocab(model.get());
// Tokenize and decode
auto tokens = lloyal::tokenizer::tokenize(vocab, "Hello, world!", /*add_special=*/true, /*parse_special=*/true);
lloyal::decoder::decode_tokens(ctx, tokens.data(), (int32_t)tokens.size(), /*n_past=*/0, n_batch);
// Sample next token
auto token = lloyal::sampler::sample_with_params(ctx, vocab, params);

Relevant signatures:

  • std::vector<llama_token> tokenize(const llama_vocab* vocab, const std::string& text, bool add_special, bool parse_special) - Tokenize text to token array (tokenizer.hpp:38)
  • void decode_tokens(llama_context* ctx, const llama_token* tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id = 0) - Process tokens through the model to update the KV cache (decoder.hpp:127)
  • llama_token sample_with_params(llama_context* ctx, const llama_vocab* vocab, const P& params, llama_sampler* grammarSampler = nullptr) - Sample with configurable parameters; the template accepts any SamplingParams type (sampler.hpp:179)
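
The quick start extends naturally into a generation loop. A sketch, assuming llama.cpp's llama_vocab_is_eog and llama_token_to_piece for stop detection and detokenization:

int32_t n_past = (int32_t)tokens.size();
for (int i = 0; i < n_predict; ++i) {
    llama_token token = lloyal::sampler::sample_with_params(ctx, vocab, params);
    if (llama_vocab_is_eog(vocab, token)) break;             // end-of-generation token
    char piece[128];
    int n = llama_token_to_piece(vocab, token, piece, sizeof(piece), 0, true);
    fwrite(piece, 1, n, stdout);                             // stream text as it is produced
    lloyal::decoder::decode_tokens(ctx, &token, 1, n_past++, n_batch, /*seq_id=*/0);
}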

Architecture

  • Header-only - All implementations inline in include/lloyal/*.hpp
  • Composable primitives - Building blocks combine into diverse patterns
  • Handle-based APIs - Persistent samplers, grammar chains for efficiency
  • Shared model weights - Thread-safe registry enables multi-context with single model load
  • Multi-sequence support - All primitives sequence-aware (default seq=0)
  • llama.cpp binding - Compile-time dependency, validated by build system
  • Zero runtime dependencies - Only requires C++20 standard library

Key Namespaces

  • lloyal::tokenizer - Text tokenization operations
  • lloyal::decoder - Batch decoding operations
  • lloyal::sampler - Token sampling operations
  • lloyal::kv - KV cache and sequence operations
  • lloyal::ModelRegistry - Thread-safe model cache (class, not a namespace)

Documentation

  • Usage Guide: See docs/guide.md for comprehensive patterns, examples, and best practices
  • API Reference: Navigate using the tabs above (Namespaces, Classes, Files)
  • Examples: Check the Examples tab for usage patterns
  • Headers: All APIs are fully documented inline in include/lloyal/*.hpp

Installation

Add as git submodule:

git submodule add -b v0.1.0 https://github.com/lloyal-ai/liblloyal.git

CMake integration:

add_subdirectory(liblloyal)
target_link_libraries(your_target PRIVATE lloyal llama)

License

Apache 2.0 - See LICENSE file for details