
# Covalent Inference for llama.cpp

Composable C++ primitives for forkable decode state and shared-prefix (KV) branching in llama.cpp. Fork a generation into a tree — branches share a prefix while keeping independent machinery (sampler chain, seed, grammar, logits snapshot, perplexity tracker) for controlled divergence at decode time.
## Continuous Tree Batching

Tree search with N branches means N calls to `llama_decode()` — each paying GPU dispatch overhead, memory barriers, and PCIe round-trips. Continuous tree batching eliminates this: `BranchStore` packs tokens from N branches — each at a different position, under a different `seq_id`, each needing independent logits captured — into a single `llama_batch` and dispatches once. N branches, 1 GPU call.
```cpp
store.decode_each({{child1.handle(), tok1},
                   {child2.handle(), tok2},
                   {child3.handle(), tok3}});
```
Two packing strategies for different access patterns:

```cpp
// Uniform shape: one token per branch
store.decode_each(items);

// Variable token counts per branch, auto-chunked
store.decode_scatter({
    {branchA.handle(), system_tokens},
    {branchB.handle(), query_tokens},
    {branchC.handle(), doc_tokens},
});
```
The underlying decode grid (`decode.hpp`):

|              | Single Sequence | Multi Sequence    |
| ------------ | --------------- | ----------------- |
| Single Token | `decode::one`   | `decode::each`    |
| Multi Token  | `decode::many`  | `decode::scatter` |
## The Branch API

Each branch owns a KV cache lease, sampler chain, grammar, metrics tracker, and logits snapshot — everything needed for independent generation. `fork()` deep-clones all of it. Branches compose into best-of-N, speculative decoding, tree search, and beam search.
```cpp
auto root = Branch::create(ctx, model, store, 0, params);
root.prefill(prompt_tokens.data(), prompt_tokens.size());

auto analogy  = root.fork();
auto formal   = root.fork();
auto socratic = root.fork();
auto visual   = root.fork();

// Four explanations diverge from one shared prefix — one dispatch
store.decode_scatter({
    {analogy.handle(),  tokenize("Think of it like two coins...")},
    {formal.handle(),   tokenize("In quantum mechanics, the...")},
    {socratic.handle(), tokenize("What happens when you measure...")},
    {visual.handle(),   tokenize("Imagine two particles...")},
});
```
```cpp
std::vector<Branch*> active = {&analogy, &formal, &socratic, &visual};
std::vector<Branch*> finished;   // done generating

while (!active.empty()) {
    std::vector<decode::EachItem> items;
    for (auto* b : active) {
        auto tok = b->sample();
        if (b->is_eog(tok)) { finished.push_back(b); continue; }
        b->accept(tok);
        items.push_back({b->handle(), tok});
    }
    std::erase_if(active, [&](auto* b) {
        return std::find(finished.begin(), finished.end(), b) != finished.end();
    });
    if (!items.empty())
        store.decode_each(items);   // every active branch advances in one dispatch
}

auto* winner = *std::min_element(finished.begin(), finished.end(),
    [](auto* a, auto* b) { return a->perplexity() < b->perplexity(); });
```
`BranchStore` — handle table and batched decode orchestrator for branch management:

- `void decode_each(std::span<const DecodeEachItem> items)` — decode one token per branch in a single GPU dispatch
- `void retainOnly(BranchHandle winner)` — keep only the winner; nuclear KV + CPU cleanup
- `void init_tenancy(llama_context *ctx)` — initialize KV tenancy after context creation
- `void decode_scatter(std::span<const DecodeScatterItem> items)` — decode variable token counts per branch with auto-chunking
- `std::vector<llama_token> tokenize(const llama_vocab *vocab, const std::string &text, bool add_special, bool parse_special)` — tokenize text to a token array
What `fork()` clones: KV cache sequence, sampler chain handle (penalties, PRNG, filters), grammar handle (GBNF parser state), metrics handle (model + sampling perplexity), logits snapshot, logit bias, cached sampler params.

What `fork()` does NOT clone: the steer callback (it captures references and is unsafe to copy).
## Hot-Swap Sampler & Grammar

Sampler chains, grammars, and metrics live in handle-based registries on `BranchStore` — instance-scoped, no global state. `setSamplerParams()` rebuilds the sampler chain with memoization (a no-op if the params are unchanged). `setGrammar()` hot-swaps the grammar constraint.
```cpp
// Entropy-adaptive temperature: the rebuild is memoized, so unchanged params are free
for (int i = 0; i < max_tokens; i++) {
    float entropy = metrics::model_entropy(root.logits(), root.n_vocab());
    float temp = T0 * std::pow(N, THETA / std::max(entropy, 0.1f));
    root.setSamplerParams(MyParams{.temperature = temp});

    auto tok = root.sample();
    if (root.is_eog(tok)) break;
    root.accept(tok);
    root.step(tok);
}

// Constrain one span to JSON, then release the constraint
root.setGrammar(json_gbnf);
auto tok = root.sample();
root.setGrammar(nullptr);
```
Handles are freed automatically on prune() — no manual cleanup. fork() deep-clones all registry entries.
## KV Tenancy

Two resources, two scales. Slots (65K) are how many branches can exist — cheap CPU state. Leases (256) are how many can decode — scarce KV cache residency. `kv::tenancy` manages the scarce resource as leases: acquired on `create()`/`fork()`, evicted on `prune()`, rebuilt on `retainOnly()`. No manual `seq_id` tracking, ever.
- `size_t available() const` — number of vacant seq_ids available for acquisition
- `void drain()` — explicit teardown; evicts all leases while the context is still alive
The turn lifecycle: search is surgical (N × `prune()`), promotion is nuclear (1 × `retainOnly()`). Per turn: fork → expand → evaluate → prune losers → repeat. Between turns: promote the winner → the tree is gone → the next turn starts fresh.
## Topology
Parent/child edges are always-on. Simple chat → best-of-N → deep search is one continuum — the library provides topology queries at every point on the spectrum.
- `bool isActive(BranchHandle h) const` — test whether a branch holds a KV lease
- `bool isLeaf(BranchHandle h) const` — test whether a branch is a leaf (no children)
- `const std::vector<BranchHandle>& children(BranchHandle h) const` — get a branch's child handles
- `BranchHandle parent(BranchHandle h) const` — get a branch's parent handle
| Method          | FK analogy | Behavior                          |
| --------------- | ---------- | --------------------------------- |
| `prune()`       | RESTRICT   | Throws if children exist          |
| `pruneSubtree()`| CASCADE    | Iterative post-order traversal    |
RAII ~Branch() uses CASCADE — cleanup always succeeds, even with deep trees. Multi-tag KV cells ensure pruning a parent doesn't corrupt children's cache — a cell is freed only when ALL tags are removed.
## Primitives
The building blocks that compose into the above:
- Tokenization — Two-pass safe buffer sizing, special token handling
- Decoding — Continuous tree batching, cross-sequence dispatch packing
- KV Cache — Tenancy (vacancy manager), sequence ops, state snapshots, long-context compression
- Sampling — Grammar-constrained, persistent chains, hot-swap with memoization
- Metrics — Dual-level entropy/surprisal, rolling perplexity, cloneable state (BranchStore-scoped)
- Embeddings — Pooled extraction, L2 normalization, similarity
- Chat Templates — Jinja2 formatting with fallbacks
Lower-level handles for fine-grained control:

- `llama_sampler* init_sampler(const llama_model *model, const std::string &grammar_str, const std::string &root_rule = "root")` — initialize a grammar sampler from a GBNF grammar string
- `llama_sampler* create_chain(const P &params)` — create a persistent sampler chain from parameters
Shared model weights — multiple contexts, one model load:

- `static std::shared_ptr<llama_model> acquire(const std::string &fsPath, const llama_model_params &params)` — acquire a model from cache, or load from disk on cache miss
## From Simple to Complex
Single-sequence streaming — the baseline everyone has (body sketched from the primitives below):

```cpp
llama_pos pos = (llama_pos)prompt_tokens.size();
bool done = false;
while (!done) {
    llama_token tok = sample_with_params(ctx, vocab, params);
    done = llama_vocab_is_eog(vocab, tok);
    if (!done)
        decode::one(ctx, tok, pos++);   // one token, one sequence
}
```
- `int one(llama_context *ctx, llama_token tok, llama_pos pos, llama_seq_id seq_id = 0, bool want_logits = true)` — decode a single token into the KV cache
- `int many(llama_context *ctx, const llama_token *tokens, int32_t n_tokens, int32_t n_past, int32_t n_batch, llama_seq_id seq_id = 0)` — decode multiple tokens into the KV cache with auto-chunking
- `llama_token sample_with_params(llama_context *ctx, const llama_vocab *vocab, const P &params, llama_sampler *grammarSampler = nullptr)` — sample with configurable parameters (the template accepts any SamplingParams type)
Best-of-N — fork once, diverge everywhere, keep the best:
```cpp
auto root = Branch::create(ctx, model, store, 0, params);
root.prefill(prompt_tokens.data(), prompt_tokens.size());

std::vector<Branch> candidates;
for (int i = 0; i < 8; i++) {
    candidates.push_back(root.fork());
    reseed_chain(candidates.back().handle(), store, 1000 + i);  // distinct PRNG per branch
}

for (int t = 0; t < 64; t++) {
    std::vector<decode::EachItem> items;
    for (auto& c : candidates) {
        auto tok = c.sample();
        c.accept(tok);
        items.push_back({c.handle(), tok});
    }
    store.decode_each(items);   // 8 branches, one GPU dispatch per step
}

auto& winner = *std::min_element(candidates.begin(), candidates.end(),
    [](auto& a, auto& b) { return a.perplexity() < b.perplexity(); });
```
Tree search with continuous tree batching — the full show:
```cpp
auto root = Branch::create(ctx, model, store, 0, params);
root.prefill(prompt_tokens.data(), prompt_tokens.size());

for (int turn = 0; turn < max_turns; turn++) {
    // Fork as wide as remaining KV capacity allows
    std::vector<Branch> leaves;
    int width = std::min((int)store.available(), max_width);
    for (int i = 0; i < width; i++) {
        auto leaf = root.fork();
        reseed_chain(leaf.handle(), store, turn * 1000 + i);
        leaves.push_back(std::move(leaf));
    }

    // Advance all leaves in lockstep — one GPU dispatch per depth step
    for (int d = 0; d < depth; d++) {
        std::vector<decode::EachItem> items;
        for (auto& leaf : leaves) {
            auto tok = leaf.sample();
            leaf.accept(tok);
            items.push_back({leaf.handle(), tok});
        }
        store.decode_each(items);
    }

    // Keep the lowest-perplexity leaf, prune the rest
    std::sort(leaves.begin(), leaves.end(),
              [](auto& a, auto& b) { return a.perplexity() < b.perplexity(); });
    for (size_t i = 1; i < leaves.size(); i++)
        leaves[i].prune();
    root = std::move(leaves[0]);
}
```
- `void prune(BranchHandle handle, BranchStore &s)` — prune a leaf branch (RESTRICT; throws if children exist)
## Architecture

- Header-only — all implementations inline in `include/lloyal/*.hpp`
- Managed KV residency — `kv::tenancy` tracks seq_id leases; consumers never see raw seq_ids
- Handle-based APIs — generation counters prevent ABA bugs on slot reuse; sampler chains, grammars, and metrics live in BranchStore-scoped registries (no global state)
- Shared model weights — thread-safe registry enables multi-context with a single model load
- Zero runtime dependencies — only requires the C++20 standard library + llama.cpp
- Multi-binding — C++20 concepts decouple from binding-specific types (Node.js, React Native, CLI)
## Integration

### Git Submodule

```shell
git submodule add -b v0.1.0 https://github.com/lloyal-ai/liblloyal.git
```

### CMake

```cmake
add_subdirectory(liblloyal)
target_link_libraries(your_target PRIVATE lloyal llama)
```

### CocoaPods (iOS)

```ruby
s.header_dir = "lloyal"
s.source_files = "liblloyal/include/**/*.{hpp,h}"
```
## Documentation

- Usage Guide: `docs/guide.md`
- API Reference: auto-generated from Doxygen-annotated headers
  - Online: lloyal-ai.github.io/liblloyal
  - Local: `./scripts/generate-docs.sh` → `docs/api/html/index.html`
  - Headers: `include/lloyal/*.hpp` — fully documented with Doxygen
## Testing
- 256 unit tests (tenancy, topology, continuous tree batching, RESTRICT/CASCADE, handle registries)
- 128 integration tests with real llama.cpp (multi-step generation, ABA prevention, batch error paths, retainOnly, hot-swap sampler/grammar)
- Sanitizer validation (ASan, UBSan, LeakSan)
```shell
# Unit tests (stub-based, no model required)
cd tests && cmake -B build && cmake --build build && ./build/TestRunner

# Integration tests (real llama.cpp)
cd tests && cmake -B build -DLLOYAL_BUILD_INTEGRATION_TESTS=ON \
  -DLLAMA_CPP_DIR=../llama.cpp && cmake --build build
LLAMA_TEST_MODEL=path/to/model.gguf ./build/IntegrationRunner
```
## Design Principles
- Primitives, not opinions — Build your patterns, we provide the tools
- Managed scarcity — KV leases are automatic; capacity is queryable
- Explicit over implicit — No hidden state, clear contracts
- Testable — No framework coupling, works standalone
- Version-isolated — Absorbs llama.cpp API changes
## Contributing
See CONTRIBUTING.md for development guidelines.
## License

Apache 2.0 — see the LICENSE file for details.