Readonly memory: Memory used by this context (bytes)
Reports native memory for monitoring. Includes model weights, KV cache, and context state.
Readonly vocab: Model vocabulary size (number of possible tokens)
This is the length of the logits array from Branch.getLogits().
Internal: KV cache pressure snapshot from native BranchStore. cells_used is a monotonic counter reset on drain/retainOnly.
Blink KV — cache-local reconstruction for bounded-memory streaming
Implements the Blink KV
protocol (Naqvi, 2026): when the KV cache fills, clear it entirely and
re-decode retained tokens at contiguous positions [0, 1, ..., N-1].
This achieves cache-local position IDs — the operative requirement for
stable bounded-memory streaming — without backend-specific knowledge of
key storage format. Works on post-RoPE engines (where StreamingLLM's
pos-shift is unavailable) and any backend exposing clear() + decode().
Why not naive eviction? Selective eviction (kvCacheRemove) preserves
original position IDs, which grow without bound. Across 5 architectures,
naive eviction produces PPL spanning 3 orders of magnitude — ranging from
1.15x baseline (Llama, lucky config) to 198x (Phi, sinks present).
Under Blink KV reconstruction, all 5 converge to within 3-16% of baseline.
Sinks are optional. Under reconstruction, the 0+N (sinkless) config matches 4+N (with sinks) within <2% across all tested architectures. Pass an empty sinks array if you don't need them.
Algorithm:
1. Clear the KV cache entirely.
2. Re-decode sinks at position 0 (optional attention anchors).
3. Re-decode tail at position sinks.length (recent context).
Cost: Re-decodes sinks.length + tail.length tokens. At per-boundary trigger (reconstruct when cache reaches nCtx), amortized cost is O(cacheSize / interval) decode ops per token — ~0.14 at typical settings.
sinks: First N tokens from conversation start (typically 4, or empty). Must be the same tokens every reseed — reseeding with different tokens degrades any attention-sink patterns the model may have learned for early positions.
tail: Recent M tokens to preserve (typically 252-1020)
Promise that resolves when reconstruction completes.
Next decode continues at position sinks.length + tail.length.
// Capture sinks once at conversation start
const SINKS = allTokens.slice(0, 4);
// On cache fill: compress to 512 tokens (4 sinks + 508 tail)
if (position >= ctx.nCtx) {
const tail = allTokens.slice(-508);
await ctx.clearAndReseed(SINKS, tail);
position = 512; // sinks.length + tail.length
}
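The ~0.14 amortized figure can be checked with simple arithmetic: each reconstruction re-decodes the retained tokens (sinks + tail), and reconstructions recur every nCtx - retained generated tokens. A self-contained sketch (the 4096-context / 512-retained numbers are illustrative, not mandated by the API):

```typescript
// Amortized re-decode cost per generated token under per-boundary triggering:
// each cycle generates (nCtx - retained) new tokens, then pays `retained`
// re-decode operations at the next reconstruction.
function amortizedDecodeCost(nCtx: number, retained: number): number {
  if (retained >= nCtx) throw new Error("retained must be smaller than nCtx");
  return retained / (nCtx - retained);
}

// Example settings: 4 sinks + 508 tail = 512 retained in a 4096-token cache.
const cost = amortizedDecodeCost(4096, 512); // 512 / 3584 ≈ 0.143
```

Smaller caches pay proportionally more: at nCtx = 1024 with the same 512 retained tokens, the cost rises to 1.0 re-decode per generated token.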
Detokenize array of tokens back to text
Inverse of tokenize(). Use for reconstructing complete text from token sequences (e.g., after KV cache operations).
Optimized for batch conversion of many tokens. For single-token conversion during generation, use tokenToText().
Cost: ~1ms per 100 tokens
Array of token IDs
Complete text representation
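A round-trip sketch of the tokenize()/detokenize() pair. The CtxLike interface below is an assumption for illustration: a minimal structural view of just the two calls used, not the real SessionContext type.

```typescript
// Hypothetical structural stand-in for the two documented calls used here.
interface CtxLike {
  tokenize(text: string, addSpecial?: boolean): Promise<number[]>;
  detokenize(tokens: number[]): Promise<string>;
}

// detokenize() is the inverse of tokenize(): reconstructing text from a
// token sequence (e.g. after KV cache operations) should round-trip.
async function roundTrips(ctx: CtxLike, text: string): Promise<boolean> {
  const tokens = await ctx.tokenize(text, false); // no BOS/EOS for a pure round-trip
  return (await ctx.detokenize(tokens)) === text;
}
```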
Free native resources
Call when done with context to release model and KV cache memory. Context becomes unusable after disposal.
Encode tokens for embedding extraction
Unlike decode(), this marks ALL tokens with logits=true, which is required for embedding extraction. Use with an embeddings=true context.
Workflow: tokenize() → kvCacheClear() → encode() → getEmbeddings()
Cost: ~5-50ms depending on text length and model
Token IDs from tokenize()
// Create embedding context
const ctx = await createContext({
modelPath: './nomic-embed.gguf',
embeddings: true,
poolingType: PoolingType.MEAN
});
// Get embedding for text
const tokens = await ctx.tokenize("Hello world");
await ctx.kvCacheClear(); // Important between texts!
await ctx.encode(tokens);
const embedding = ctx.getEmbeddings();
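Embedding vectors from getEmbeddings() are usually compared with cosine similarity; with the default L2 normalization a plain dot product would suffice, but the general form is shown. This helper is application code, not part of the API:

```typescript
// Cosine similarity between two embedding vectors.
// For L2-normalized vectors (the getEmbeddings() default) this equals the dot product.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```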
Format messages using model's chat template
Converts [{role, content}] -> formatted prompt string with full format awareness. Uses model's built-in template (ChatML, Llama, Mistral, etc.).
The returned format and reasoningFormat fields should be passed to
parseChatOutput() after generation to correctly decode the response.
Cost: ~1-5ms depending on message count
JSON string containing array of messages
Optional options (string | FormatChatOptions): Formatting options (tools, reasoning, grammar, etc.)
Formatted prompt with format-awareness metadata
const result = await ctx.formatChat(JSON.stringify([
{ role: "system", content: "You are a helpful assistant" },
{ role: "user", content: "Hello!" }
]));
const tokens = await ctx.tokenize(result.prompt);
const branch = Branch.create(ctx, 0, { temperature: 0.7 });
await branch.prefill(tokens);
Format messages using model's chat template (sync — inline on main thread)
Same as formatChat but synchronous. Use from Effection generators
to avoid yield* call() overhead for CPU-only work.
JSON string containing array of messages
Optional options (string | FormatChatOptions): Formatting options (tools, reasoning, grammar, etc.)
Formatted prompt with format-awareness metadata
Get embedding dimension for model
Returns the size of embedding vectors this model produces. Common values: 768 (BERT-like), 1024, 2048, 4096.
Cost: <0.01ms (fast model property lookup)
Embedding dimension
Get embedding vector from context (after encode)
Returns the embedding vector for the encoded text. Call after encode() to extract embeddings.
The vector dimension depends on the model (e.g., 768 for nomic-embed). Use getEmbeddingDimension() to get the size.
Cost: ~0.5ms (extraction from model state)
Optional normalize (boolean): Apply L2 normalization (default: true for cosine similarity)
Float32Array of embedding values
Get the model's end-of-generation token ID
Returns the EOT token (e.g. <|im_end|> for ChatML), falling back to EOS (e.g. </s>) for Zephyr-style models. This is the inverse of isStopToken() — "what IS the stop token?" vs "is this a stop token?"
Use case: warm multi-turn continuation prepends this token to close the previous assistant turn before injecting new user content.
Token ID (integer)
Get the model's turn separator token IDs
Returns the tokens that close an assistant turn and transition to the next message, as determined by the model's chat template. Computed once per model, cached.
For ChatML templates: [im_end_id, newline_id] (e.g., [2, 198]).
For Llama 3 templates: [eot_id] (e.g., [128009]).
Use case: warm multi-turn prefill to achieve exact parity with cold path.
Array of token IDs (cached after first call)
Check if context has pooling enabled
Returns true if context was created with embeddings=true and a pooling type other than NONE.
Cost: <0.01ms
True if pooling is enabled
Check if token is a model stop token
Returns true for built-in end-of-generation tokens (EOS, EOT, and similar tokens defined by the model vocabulary).
Note: This checks vocabulary stop tokens, not custom stop sequences. For custom stops (e.g., "\n\n", "###"), compare generated text against your stop strings in application code.
Cost: <0.01ms (fast vocabulary lookup)
Token ID to check
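Per the note above, custom stop sequences are the application's job. A minimal matcher that finds the earliest occurrence of any stop string in accumulated output (application code, not part of the API):

```typescript
// isStopToken() covers vocabulary-level stops only; custom stop sequences
// like "\n\n" or "###" must be matched against the generated text stream.
// Returns the index where the earliest stop string begins, or -1 if none match.
function findStop(text: string, stops: string[]): number {
  let first = -1;
  for (const s of stops) {
    const i = text.indexOf(s);
    if (i !== -1 && (first === -1 || i < first)) first = i;
  }
  return first;
}

// Usage in a generation loop: accumulate text, then stop and trim when found.
// const i = findStop(output, ["\n\n", "###"]);
// if (i !== -1) output = output.slice(0, i);
```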
Convert JSON schema to GBNF grammar
Generates grammar string for constrained JSON generation. Use with Branch.create grammar parameter for constrained generation.
Cost: ~1-10ms depending on schema complexity
JSON schema string
GBNF grammar string
Convert JSON schema to GBNF grammar (sync — inline on main thread)
Same as jsonSchemaToGrammar but synchronous. Use from Effection
generators to avoid yield* call() overhead for CPU-only work.
JSON schema string
GBNF grammar string
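A sketch of the schema-to-grammar flow, assuming only the documented call; GrammarCtx is a hypothetical structural stand-in and the schema shape is illustrative:

```typescript
// Hypothetical structural stand-in for the single documented call used here.
interface GrammarCtx {
  jsonSchemaToGrammar(schema: string): Promise<string>;
}

// Convert a JSON schema to a GBNF grammar string, suitable for passing as
// the grammar parameter of Branch.create for constrained JSON generation.
async function grammarFor(ctx: GrammarCtx): Promise<string> {
  const schema = JSON.stringify({
    type: "object",
    properties: { name: { type: "string" }, age: { type: "number" } },
    required: ["name"],
  });
  return ctx.jsonSchemaToGrammar(schema);
}
```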
Clear all KV cache (fresh start)
Removes all cached tokens. Model returns to initial state as if no text has been processed.
Use when starting a completely new conversation.
Cost: ~1ms
Restore KV cache from previous snapshot
Loads saved model state. Context returns to exact state when snapshot was taken.
Cost: ~100-500ms depending on snapshot size
Sequence ID (use 0 for single sequence)
Buffer from kvCacheSave()
Read KV cache state + tokens from file
Restores KV cache state from a previous kvCacheWriteFile call.
Sequence ID to restore to
Path to saved file
Promise resolving to tokens and bytes read
Remove token range from KV cache
Deletes tokens from the model's memory (e.g. context editing, or selective eviction of a stale range).
CRITICAL: Call BEFORE next decode(), not after! The model needs to know about the removal before processing new tokens.
Cost: ~1-5ms depending on range
Sequence ID (use 0 for single sequence)
Start position (inclusive)
End position (exclusive), -1 = to end
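A sketch of the before-decode contract, with KvCtx as a hypothetical structural stand-in for the one documented call:

```typescript
// Hypothetical structural stand-in for the documented kvCacheRemove call.
interface KvCtx {
  kvCacheRemove(sequenceId: number, start: number, end: number): Promise<void>;
}

// Truncate the cache so only positions [0, keepUpTo) survive, e.g. to drop
// a rejected turn. Must run BEFORE the next decode(), per the note above.
async function truncateTo(ctx: KvCtx, keepUpTo: number): Promise<void> {
  await ctx.kvCacheRemove(0, keepUpTo, -1); // -1 = remove to end of sequence
}
```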
Snapshot KV cache state for branching/undo
Serializes entire model state to a Buffer. Restore later with kvCacheLoad() for branching, undo, or checkpointing long conversations.
Size: ~500MB-2GB depending on context length and model
Cost: ~100-500ms depending on cache size
Optional sequenceId (number): Sequence ID (use 0 for single sequence)
Serialized state buffer
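The save/load pair composes into a simple undo pattern. SnapshotCtx and withUndo are illustrative application code, assuming only the two documented signatures:

```typescript
// Hypothetical structural stand-in for the documented save/load pair.
interface SnapshotCtx {
  kvCacheSave(sequenceId?: number): Promise<Buffer>;
  kvCacheLoad(sequenceId: number, snapshot: Buffer): Promise<void>;
}

// Undo pattern: snapshot before a speculative generation, restore on failure.
// Note the cost figures above: ~100-500ms and up to ~2GB per snapshot.
async function withUndo(ctx: SnapshotCtx, run: () => Promise<void>): Promise<void> {
  const snapshot = await ctx.kvCacheSave(0);
  try {
    await run();
  } catch {
    await ctx.kvCacheLoad(0, snapshot); // roll back to the exact saved state
  }
}
```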
Get max position in the KV cache for a sequence
Returns the highest position index in the specified sequence, or -1 if the sequence is empty. This is the same value as kvSeqPosMax. To get the token count, add 1.
Think of this as: "How much has the model read so far?"
Cost: <0.01ms (fast sync operation - safe to call frequently)
Optional sequenceId (number): Sequence ID (defaults to 0 for single conversation)
Highest position index, or -1 if empty
Write KV cache state + tokens to file
Persists KV cache state for later restoration. Useful for checkpointing long conversations.
Sequence ID to save
Path to save file
Tokens that were decoded into this sequence
Promise resolving to bytes written
Fork a KV cache sequence — the primitive behind Branch.fork
Copies all KV cache entries from srcSeqId to dstSeqId. Under
llama.cpp's unified KV cache, this is a metadata-only operation —
no key/value tensors are copied. Both sequences reference the same
physical KV entries for the shared prefix; only tokens decoded after
the fork point allocate new storage. This is what makes tree-structured
generation (best-of-N, beam search, speculative decoding) memory-efficient:
N branches sharing a 1000-token prefix cost ~1000 KV entries, not N*1000.
The higher-level Branch.fork wraps this and additionally clones
the sampler chain, grammar state, logits snapshot, and perplexity tracker.
Use kvSeqCopy directly when you need raw sequence management without
the Branch abstraction.
NOTE: Only full-sequence copies are supported. The p0/p1 parameters must use default values (0 and -1).
Cost: O(1) metadata — no tensor copy under unified KV
Source sequence to copy from
Destination sequence to copy to
Optional p0 (number): Start position (must be 0, default: 0)
Optional p1 (number): End position (must be -1 for full copy, default: -1)
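A best-of-N setup sketch: copy the shared prompt prefix from sequence 0 into the other branches. SeqCtx is a hypothetical structural stand-in for the documented call:

```typescript
// Hypothetical structural stand-in for the documented kvSeqCopy call.
interface SeqCtx {
  kvSeqCopy(srcSeqId: number, dstSeqId: number): Promise<void>;
}

// Copy the prompt prefix from sequence 0 into sequences 1..n-1. Under the
// unified KV cache each copy is metadata-only, so all n branches share the
// prefix's physical KV entries until they diverge.
async function forkPrefix(ctx: SeqCtx, n: number): Promise<void> {
  for (let dst = 1; dst < n; dst++) {
    await ctx.kvSeqCopy(0, dst); // full-sequence copy (p0=0, p1=-1 defaults)
  }
}
```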
Keep only specified sequence, remove all others
Removes all sequences except the one specified. For complete cleanup of unwanted sequences, consider using kvCacheRemove(seqId, 0, -1) on each sequence instead.
Sequence ID to keep
Get max position in sequence
Returns the highest position index in the specified sequence, or -1 if the sequence is empty.
Cost: <0.01ms (fast sync operation)
Sequence ID to query
Max position index, or -1 if empty
Parse model output into structured content
Extracts plain text, reasoning/thinking blocks, and tool calls from raw model output. Uses the format detected by formatChat to apply the correct parser for the model's output format.
Cost: <0.1ms (synchronous string parsing, no I/O)
Raw model output text
Chat format enum (from FormattedChatResult.format)
Optional options (ParseChatOutputOptions): Optional parsing parameters
Parsed content with tool calls and reasoning
const fmt = await ctx.formatChat(JSON.stringify(messages), { tools: toolsJson });
// ... generate tokens ...
const parsed = ctx.parseChatOutput(generatedText, fmt.format, {
reasoningFormat: fmt.reasoningFormat,
thinkingForcedOpen: fmt.thinkingForcedOpen,
parser: fmt.parser
});
if (parsed.toolCalls.length > 0) {
// Handle tool calls
}
// parseChatOutput separates <think>...</think> blocks into reasoningContent.
// This is REQUIRED for correct warm continuation on thinking models (e.g. Qwen3):
// if raw output containing <think> tags is stored as content, re-formatting
// the conversation produces different tokens, breaking cold/warm parity.
const messages: Array<{role: string; content: string; reasoning_content?: string}> = [];
const sep = ctx.getTurnSeparator();
let branch: Branch | null = null;
let fmt: FormattedChatResult;
async function handleTurn(userContent: string) {
messages.push({ role: 'user', content: userContent });
if (!branch) {
// Cold path: format full conversation, tokenize with BOS, prefill
fmt = await ctx.formatChat(JSON.stringify(messages));
const tokens = await ctx.tokenize(fmt.prompt);
branch = Branch.create(ctx, 0, { temperature: 0.7 });
await branch.prefill(tokens);
} else {
// Warm path: string-diff for delta tokens
const { prompt: full } = await ctx.formatChat(JSON.stringify(messages));
const { prompt: prefix } = await ctx.formatChat(
JSON.stringify(messages.slice(0, -1)),
{ addGenerationPrompt: false }
);
const delta = await ctx.tokenize(full.substring(prefix.length), false);
await branch.prefill([...sep, ...delta]);
}
// Generate
let rawOutput = '';
while (true) {
const { token, text, isStop } = await branch.produce();
if (isStop) break;
rawOutput += text;
await branch.commit(token);
}
// Parse output: separates reasoning from content
const parsed = ctx.parseChatOutput(rawOutput, fmt.format, {
reasoningFormat: fmt.reasoningFormat,
thinkingForcedOpen: fmt.thinkingForcedOpen,
parser: fmt.parser
});
// Store parsed fields — formatChat reconstructs thinking blocks correctly
messages.push({
role: 'assistant',
content: parsed.content,
reasoning_content: parsed.reasoningContent || undefined
});
}
Tokenize text into model's vocabulary
Converts human text → token IDs for decode(). Same text always produces same tokens for a given model.
Cost: ~1ms per 100 characters
Text to tokenize
Optional addSpecial (boolean): Whether to add special tokens (BOS/EOS). Defaults to model metadata setting (typically true). Pass false for mid-sequence tokenization (e.g., warm multi-turn continuation deltas).
Array of token IDs
Tokenize text into model's vocabulary (sync — inline on main thread)
Same as tokenize but synchronous. Use from Effection generators
to avoid yield* call() overhead for CPU-only work.
Text to tokenize
Optional addSpecial (boolean): Whether to add special tokens (BOS/EOS). Defaults to model metadata setting (typically true). Pass false for mid-sequence tokenization.
Array of token IDs
Convert token ID to text piece
Fast synchronous lookup in vocabulary table. Call this on each generated token for streaming display.
Optimized for per-token conversion during generation. For batch conversion of many tokens, use detokenize() instead.
Cost: ~0.05ms
Token ID
Text string for this token
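A streaming-display sketch; PieceCtx is a hypothetical structural stand-in for the one synchronous call used:

```typescript
// Hypothetical structural stand-in for the documented tokenToText call.
interface PieceCtx {
  tokenToText(token: number): string; // synchronous vocabulary lookup
}

// Streaming display: convert each generated token to its text piece as it
// arrives. For bulk conversion after the fact, prefer detokenize().
function streamPieces(ctx: PieceCtx, tokens: number[], write: (s: string) => void): void {
  for (const t of tokens) write(ctx.tokenToText(t));
}
```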
Validate chat template syntax
Checks if template string is valid before using.
Cost: ~0.1-1ms
Template string to validate
True if template syntax is valid
Inference context — the runtime surface for a loaded model
A SessionContext owns a llama_context (KV cache + compute graph) bound to a shared model. It provides tokenization, logit access, KV cache management, chat template formatting, and embedding extraction.
All generation flows through Branch. Create a branch at position 0, prefill prompt tokens, then use the produce/commit loop or async iterator:
For tree-structured generation (best-of-N, beam search, speculative decoding), use Branch.fork and BranchStore — they manage per-branch KV sequences, sampler chains, and logits snapshots with O(1) GPU dispatches via batched decode.
Logits: For branch-level logits, use Branch.getLogits which returns an independent copy of the branch's snapshot. For metrics, use Branch.modelEntropy and Branch.modelSurprisal which operate directly on the branch's logits without JS round-trips.
KV cache: Supports multi-sequence operation (nSeqMax > 1), per-sequence copy/clear/eviction, file-based persistence, and context compression via clearAndReseed().
Chat templates: formatChat() and parseChatOutput() handle the full round-trip of chat formatting, including tool calls, reasoning blocks, and grammar-constrained generation — using the model's native Jinja template.
Use createContext to initialize, and dispose() when done to free GPU/CPU memory.