Readonly memory: Memory used by this context (bytes)
Reports native memory for monitoring. Includes model weights, KV cache, and context state.
Readonly vocab: Model vocabulary size (number of possible tokens)
This is the length of the scores buffer from getTokenScores().
Internal: Accept token (update sampler state for penalties)
Internal: Capture logits into branch's snapshot
Internal: Create a new branch for parallel generation
Optional params: SamplingParams
Internal: Decode a single token and capture logits
Internal: Destroy branch (free handle without removing KV cache)
Internal: Fork a branch to a new sequence
Internal: Get branch's perplexity
Internal: Get branch's current position
Internal: Get branch's sequence ID
Internal: Prune branch (remove KV cache entries and free handle)
Internal: Sample next token from branch's logits snapshot
Internal: Reseed branch sampler PRNG for diversity after fork
Accept token to advance grammar parser state (handle-based)
Must be called after sampling to advance the grammar parser.
Sampler handle from createSampler()
Token that was sampled
Add a surprisal value to the rolling tracker.
Tracker handle from createPerplexityTracker()
Surprisal value (from modelSurprisal or computed)
Apply grammar constraints using handle-based sampler
Masks invalid tokens with -Infinity based on parser state. Modifies the logits buffer in-place.
Sampler handle from createSampler()
ArrayBuffer or TypedArray containing logits
Atomic clear+reseed operation
Implements a KV cache compression strategy: keep the original first N tokens from the conversation start (typically 4) and the most recent M tokens (typically 508-1020); tokens in between are removed.
Promise that resolves when reseed completes
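For illustration, a rough sketch of the same keep-first/keep-recent idea built from the documented kvCacheRemove() call; keepFirst, keepRecent, and the position bookkeeping are assumptions, and the sketch omits the position reseeding that the atomic operation handles for you.
// Sketch only: manual keep-first-N / keep-recent-M compression via kvCacheRemove()
const keepFirst = 4;        // tokens kept from the conversation start
const keepRecent = 508;     // most recent tokens kept
const seqId = 0;
// `position` is the running token count you maintain while decoding
if (position > keepFirst + keepRecent) {
  // Drop the middle range [keepFirst, position - keepRecent)
  await ctx.kvCacheRemove(seqId, keepFirst, position - keepRecent);
}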
Clone a perplexity tracker (for fork/branch scenarios).
Handle to clone from
New handle with same accumulated state
// Branch A and B start from same base perplexity
const baseTracker = ctx.createPerplexityTracker();
// ... accumulate base surprisals ...
const branchA = ctx.clonePerplexityTracker(baseTracker);
const branchB = ctx.clonePerplexityTracker(baseTracker);
// Branch A and B now track independently
ctx.addSurprisal(branchA, surprisalA);
ctx.addSurprisal(branchB, surprisalB);
Clone a grammar sampler
Creates a copy of the sampler with identical parser state. Both handles can then be used independently with their own state.
Sampler handle to clone
New handle to cloned sampler
const original = ctx.createSampler(jsonGrammar);
ctx.acceptSamplerToken(original, openBrace);
// Clone preserves parser state (already accepted openBrace)
const copy = ctx.cloneSampler(original);
// Both can now continue independently
ctx.acceptSamplerToken(original, tokenA);
ctx.acceptSamplerToken(copy, tokenB);
Create a new perplexity tracker.
Integer handle to the tracker
const tracker = ctx.createPerplexityTracker();
// Add surprisals during generation
for (let i = 0; i < tokens.length; i++) {
const surprisal = ctx.modelSurprisal(tokens[i]);
ctx.addSurprisal(tracker, surprisal);
}
const ppl = ctx.getPerplexity(tracker);
console.log(`Sequence perplexity: ${ppl.toFixed(2)}`);
ctx.freePerplexityTracker(tracker);
Create a new grammar sampler (returns handle)
Creates an independent grammar sampler instance with its own state. Returns a handle that can be used with applySampler/acceptSamplerToken. Multiple handles can coexist with independent parser states.
Cost: ~0.1-1ms depending on grammar complexity
GBNF grammar string
Handle to the created sampler
const grammarHandle = ctx.createSampler(jsonGrammar);
// Apply grammar constraints to logits
ctx.applySampler(grammarHandle, logitsBuffer);
ctx.acceptSamplerToken(grammarHandle, token);
// Create independent copy with same grammar
const clonedHandle = ctx.cloneSampler(grammarHandle);
// Cleanup when done
ctx.freeSamplerHandle(grammarHandle);
ctx.freeSamplerHandle(clonedHandle);
STEP 1: Process tokens through the model (forward pass)
This feeds tokens through the transformer and updates the KV cache. After decoding, the model has "read" this text and is ready to predict.
Think of this as: "the model reads your prompt"
Why async? Model inference takes time (~45ms per token). Why position? The model needs to know where in the conversation this text appears.
Cost: ~45ms per token (generation), ~120ms for 50 tokens (prompt)
Token IDs from tokenize()
Where these tokens start in the sequence
Optional seqId: number. Sequence ID (default: 0)
const tokens = await ctx.tokenize("Hello world");
await ctx.decode(tokens, 0);
let position = tokens.length;
// Generate next token
await ctx.decode([nextToken], position++);
// Multi-sequence: decode to different sequences
await ctx.decode(tokens, 0, 0); // Sequence 0
await ctx.decode(tokens, 0, 1); // Sequence 1
Decode tokens and capture logits atomically
Performs decode and logits capture as a single atomic operation, ensuring the captured logits correspond exactly to the decoded tokens.
Use this instead of separate decode() + getLogits() calls when you need guaranteed consistency between decode and logits capture.
Token IDs to decode
Start position in sequence
Sequence ID
Pre-allocated buffer to receive logits (vocabSize floats)
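A minimal sketch; the method name decodeWithLogits is hypothetical (only its parameters are documented above), while ctx.vocab and modelSurprisal() are documented elsewhere in this reference.
// Sketch only: decodeWithLogits is an assumed name for this atomic operation
const logits = new Float32Array(ctx.vocab);        // pre-allocated: vocabSize floats
await ctx.decodeWithLogits(tokens, 0, 0, logits);  // decode + capture in one step
// `logits` is your own copy, so it stays valid across later decode() calls,
// e.g. for best-of-n scoring: ctx.modelSurprisal(candidateToken, "nats", logits)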
Detokenize array of tokens back to text
Inverse of tokenize(). Use for reconstructing complete text from token sequences (e.g., after KV cache operations).
Optimized for batch conversion of many tokens. For single-token conversion during generation, use tokenToText().
Cost: ~1ms per 100 tokens
Array of token IDs
Complete text representation
Free native resources
Call when done with context to release model and KV cache memory. Context becomes unusable after disposal.
Encode tokens for embedding extraction
Unlike decode(), this marks ALL tokens with logits=true, which is required for embedding extraction. Use with an embeddings=true context.
Workflow: tokenize() → kvCacheClear() → encode() → getEmbeddings() (see the example below).
Cost: ~5-50ms depending on text length and model
Token IDs from tokenize()
// Create embedding context
const ctx = await createContext({
modelPath: './nomic-embed.gguf',
embeddings: true,
poolingType: PoolingType.MEAN
});
// Get embedding for text
const tokens = await ctx.tokenize("Hello world");
await ctx.kvCacheClear(); // Important between texts!
await ctx.encode(tokens);
const embedding = ctx.getEmbeddings();
Format messages using model's chat template
Converts [{role, content}] → formatted prompt string. Uses model's built-in template (ChatML, Llama, Mistral, etc.).
Cost: ~1-5ms depending on message count
JSON string containing array of messages
Optional templateOverride: string. Custom template string
Formatted prompt and stop tokens from template
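A minimal usage sketch; the method name formatChat and the result field names (prompt, stopTokens) are assumptions, since only "formatted prompt and stop tokens" is documented.
// Sketch only: formatChat and the result field names are assumed
const messages = JSON.stringify([
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Hello!" }
]);
const { prompt, stopTokens } = await ctx.formatChat(messages); // awaited in case the call is async
const tokens = await ctx.tokenize(prompt);
await ctx.decode(tokens, 0);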
Free perplexity tracker resources.
Tracker handle to free
NOTE: Auto-freed in dispose() if not manually freed
Free a grammar sampler handle
Releases memory for the specified sampler. Handle becomes invalid after this call.
Sampler handle to free
Get embedding dimension for model
Returns the size of embedding vectors this model produces. Common values: 768 (BERT-like), 1024, 2048, 4096.
Cost: <0.01ms (fast model property lookup)
Embedding dimension
Get embedding vector from context (after encode)
Returns the embedding vector for the encoded text. Call after encode() to extract embeddings.
The vector dimension depends on the model (e.g., 768 for nomic-embed). Use getEmbeddingDimension() to get the size.
Cost: ~0.5ms (extraction from model state)
Optional normalize: boolean. Apply L2 normalization (default: true, for cosine similarity)
Float32Array of embedding values
STEP 2b: Get logits for reading (zero-copy, readonly usage pattern)
Returns Float32Array for computational tasks like entropy calculation. For custom sampling or grammar, use getTokenScores() instead.
WARNING: Buffer is only valid until next decode() call!
Float32Array of unnormalized logits (vocabSize elements)
Get current perplexity value.
Tracker handle
Perplexity = exp(average_surprisal_in_nats)
Get number of tokens tracked.
Tracker handle
Number of surprisal values added
STEP 2a: Get token scores for custom sampling (zero-copy, mutable)
Returns unnormalized scores for every possible next token. Higher score = model thinks this token is more likely.
Use this for custom sampling logic or grammar-constrained generation. For reading scores (entropy computation), use getLogits() instead.
⚠️ CRITICAL LIFETIME CONSTRAINTS: the buffer is only valid until the next decode() call; read or modify it immediately, on the current JS turn.
Cost: ~0.5ms (zero-copy pointer)
Buffer containing vocabSize floats (Float32Array compatible)
const buffer = ctx.getTokenScores();
const scores = new Float32Array(buffer.buffer, buffer.byteOffset, buffer.length / 4);
// Modify immediately (safe - still on JS thread)
scores[BANNED_TOKEN] = -Infinity;
// Use immediately
const token = customSample(scores);
// Now decode invalidates the buffer
await ctx.decode([token], position++);
// Buffer is now INVALID - do not access!
Sample greedily from current logits
Selects token with highest logit value (deterministic). Equivalent to sample() with temperature=0.
Token ID with highest probability
Check if context has pooling enabled
Returns true if context was created with embeddings=true and a pooling type other than NONE.
Cost: <0.01ms
True if pooling is enabled
Check if token is a model stop token
Returns true for the model's built-in end-of-generation tokens (EOS/EOT entries in the vocabulary).
Note: This checks vocabulary stop tokens, not custom stop sequences. For custom stops (e.g., "\n\n", "###"), compare generated text against your stop strings in application code.
Cost: <0.01ms (fast vocabulary lookup)
Token ID to check
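A hedged generation-loop sketch showing where the stop check fits; isStopToken is an assumed name, prompt and maxTokens are placeholders, and decode(), sample(), and tokenToText() are documented in this reference.
// Sketch only: isStopToken is an assumed name for this check
const tokens = await ctx.tokenize(prompt);
await ctx.decode(tokens, 0);
let position = tokens.length;
let output = "";
for (let i = 0; i < maxTokens; i++) {
  const token = ctx.sample({ temperature: 0.8 });
  if (ctx.isStopToken(token)) break;      // built-in EOS/EOT style stop token
  output += ctx.tokenToText(token);       // per-token text for streaming display
  // Compare `output` against custom stop strings here if you use any
  await ctx.decode([token], position++);
}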
Convert JSON schema to GBNF grammar
Generates grammar string for constrained JSON generation. Use with createSampler() for grammar-constrained generation.
Cost: ~1-10ms depending on schema complexity
JSON schema string
GBNF grammar string
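A short sketch; jsonSchemaToGrammar is an assumed name for this conversion (createSampler() is documented above), and the call is awaited in case it is async.
// Sketch only: jsonSchemaToGrammar is an assumed name
const schema = JSON.stringify({
  type: "object",
  properties: { name: { type: "string" }, age: { type: "number" } },
  required: ["name", "age"]
});
const gbnf = await ctx.jsonSchemaToGrammar(schema);  // GBNF grammar string
const sampler = ctx.createSampler(gbnf);             // grammar-constrained generation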
Clear all KV cache (fresh start)
Removes all cached tokens. Model returns to initial state as if no text has been processed.
Use when starting a completely new conversation.
Cost: ~1ms
Restore KV cache from previous snapshot
Loads saved model state. Context returns to exact state when snapshot was taken.
Cost: ~100-500ms depending on snapshot size
Sequence ID (use 0 for single sequence)
Buffer from kvCacheSave()
Read KV cache state + tokens from file
Restores KV cache state from a previous kvCacheWriteFile call.
Sequence ID to restore to
Path to saved file
Promise resolving to tokens and bytes read
Remove token range from KV cache
Deletes tokens from the model's memory. Use cases include rolling-window context compression (keep the first N and most recent M tokens, drop the middle) and undoing or pruning speculative tokens and abandoned branches.
⚠️ CRITICAL: Call BEFORE next decode(), not after! The model needs to know about the removal before processing new tokens.
Cost: ~1-5ms depending on range
Sequence ID (use 0 for single sequence)
Start position (inclusive)
End position (exclusive), -1 = to end
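A short rollback sketch; the undo count, position bookkeeping, and replacementTokens are placeholders, and the removal happens before any further decode() as required.
// Roll back the last 3 generated tokens before decoding anything new
const undoCount = 3;
await ctx.kvCacheRemove(0, position - undoCount, -1);  // -1 = remove to end of sequence
position -= undoCount;
// Decode replacement tokens at the rewound position
await ctx.decode(replacementTokens, position);
position += replacementTokens.length;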
Snapshot KV cache state for branching/undo
Serializes the entire model state to a Buffer. Restore later with kvCacheLoad() for branching, undo, or checkpointing.
Size: ~500MB-2GB depending on context length and model
Cost: ~100-500ms depending on cache size
Optional sequenceId: number. Sequence ID (use 0 for single sequence)
Serialized state buffer
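A snapshot/restore sketch combining kvCacheSave() and kvCacheLoad() as documented; the speculative decode in the middle is illustrative and speculativeTokens is a placeholder.
// Take a snapshot before exploring a branch
const snapshot = await ctx.kvCacheSave(0);
// ... decode/sample speculatively ...
await ctx.decode(speculativeTokens, position);
// Roll everything back to the exact saved state
await ctx.kvCacheLoad(0, snapshot);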
Get current sequence length (number of decoded tokens)
The KV cache stores model state for all decoded tokens. This tells you how many tokens are currently in memory.
Think of this as: "How much has the model read so far?"
Cost: <0.01ms (fast sync operation - safe to call frequently)
Optional sequenceId: number. Sequence ID (defaults to 0 for single conversation)
Number of tokens in cache, or -1 if empty
Write KV cache state + tokens to file
Persists KV cache state for later restoration. Useful for checkpointing long conversations.
Sequence ID to save
Path to save file
Tokens that were decoded into this sequence
Promise resolving to bytes written
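A checkpoint-to-disk sketch; kvCacheWriteFile is named above, while kvCacheReadFile and the result field names (tokens, bytesRead) are assumptions, and decodedTokens is a placeholder.
// Checkpoint: persist sequence 0 plus the tokens that produced it
const bytesWritten = await ctx.kvCacheWriteFile(0, "./checkpoint.bin", decodedTokens);
// Later (e.g. a new session): restore state and tokens from disk
const { tokens, bytesRead } = await ctx.kvCacheReadFile(0, "./checkpoint.bin");
let position = tokens.length;   // continue decoding from here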
Copy KV cache from one sequence to another
Duplicates the KV cache state from source to destination sequence. After copying, both sequences can continue independently.
NOTE: Only full sequence copies are currently supported. The p0/p1 parameters must use default values (0 and -1).
Cost: ~1-5ms depending on sequence length
Source sequence to copy from
Destination sequence to copy to
Optional p0: number. Start position (must be 0, default: 0)
Optional p1: number. End position (must be -1 for full copy, default: -1)
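A branching sketch using the full-sequence copy described above; kvCacheSeqCopy is an assumed name, and promptTokens, tokenA, and tokenB are placeholders.
// Sketch only: kvCacheSeqCopy is an assumed name for this operation
await ctx.decode(promptTokens, 0, 0);    // build up sequence 0
await ctx.kvCacheSeqCopy(0, 1);          // duplicate into sequence 1
// The two sequences now continue independently
await ctx.decode([tokenA], promptTokens.length, 0);
await ctx.decode([tokenB], promptTokens.length, 1);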
Keep only specified sequence, remove all others
Removes all sequences except the one specified. For complete cleanup of unwanted sequences, consider using kvCacheRemove(seqId, 0, -1) on each sequence instead.
Sequence ID to keep
Get max position in sequence
Returns the highest position index in the specified sequence, or -1 if the sequence is empty.
Cost: <0.01ms (fast sync operation)
Sequence ID to query
Max position index, or -1 if empty
Compute entropy of the entire logits distribution.
Measures model uncertainty: low entropy means the model is confident about the next token; high entropy means many tokens are roughly equally plausible.
Call after decode() to analyze the current prediction distribution, or pass captured logits for offline analysis.
Optional base: "nats" | "bits". Logarithm base: "nats" (default) or "bits"
Optional logits: Float32Array. Float32Array of logits (uses current context logits if omitted)
Entropy value in specified base
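A short usage sketch; modelEntropy is an assumed name for this method (modelSurprisal below follows the same call pattern).
// Sketch only: modelEntropy is an assumed name for this method
await ctx.decode(tokens, position);
const entropyBits = ctx.modelEntropy("bits");        // uncertainty of the next-token prediction
// Or analyze previously captured logits offline
const captured = new Float32Array(ctx.getLogits());
const entropyNats = ctx.modelEntropy("nats", captured);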
Compute surprisal (negative log-likelihood) for a specific token.
Measures how "surprising" the model finds the given token: low surprisal means the token was expected (high probability); high surprisal means it was unexpected (low probability).
Call after decode() to compute surprisal for any token based on the current logits distribution, or pass captured logits for offline computation (e.g., best-of-n scoring from prefill logits).
Token ID to compute surprisal for
Optional base: "nats" | "bits". Logarithm base: "nats" (default) or "bits"
Optional logits: Float32Array. Float32Array of logits (uses current context logits if omitted)
Surprisal value in specified base
await ctx.decode(tokens, position);
const token = ctx.sample();
const surprisal = ctx.modelSurprisal(token, "bits");
console.log(`Model surprise: ${surprisal.toFixed(2)} bits`);
// Capture logits after prefill
const capturedLogits = new Float32Array(ctx.getLogits());
// Later: compute surprisal from captured logits
const surprisal = ctx.modelSurprisal(token, "nats", capturedLogits);
Cost: O(n_vocab) (softmax normalization required)
Reset tracker to initial state (count=0, sum=0).
Tracker handle to reset
STEP 3: Sample a token from scores
Converts raw scores into a token decision using the supplied SamplingParams (temperature, penalties, etc.); greedy if omitted.
This is where generation strategy happens.
Cost: ~0.1ms (native sampling)
Optional params: SamplingParams. Sampling strategy (greedy if omitted)
Selected token ID
// Greedy (always pick most likely)
const token = ctx.sample();
// Creative generation
const token = ctx.sample({ temperature: 0.9 });
// Constrained to valid JSON (handle-based API)
const grammarHandle = ctx.createSampler(grammar);
ctx.applySampler(grammarHandle, ctx.getLogits());
const token = ctx.sample({ temperature: 0.7 });
ctx.acceptSamplerToken(grammarHandle, token);
Tokenize text into model's vocabulary
Converts human text → token IDs for decode(). Same text always produces same tokens for a given model.
Cost: ~1ms per 100 characters
Text to tokenize
Array of token IDs
Convert token ID to text piece
Fast synchronous lookup in vocabulary table. Call this on each generated token for streaming display.
Optimized for per-token conversion during generation. For batch conversion of many tokens, use detokenize() instead.
Cost: ~0.05ms
Token ID from sample()
Text string for this token
Validate chat template syntax
Checks if template string is valid before using.
Cost: ~0.1-1ms
Template string to validate
True if template syntax is valid
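A small sketch combining validation with the chat-formatting call above; validateChatTemplate and formatChat are assumed names, and customTemplate / messagesJson are placeholders.
// Sketch only: validateChatTemplate and formatChat are assumed names
if (ctx.validateChatTemplate(customTemplate)) {
  // Safe to pass as templateOverride
  const result = await ctx.formatChat(messagesJson, customTemplate);
}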
A llama.cpp context for text generation
Represents a loaded model with KV cache for maintaining conversation state. Use createContext() to initialize, and dispose() when done to free memory.