lloyal.node API Reference - v1.0.7

    Interface SessionContext

    A llama.cpp context for text generation

    Represents a loaded model with KV cache for maintaining conversation state. Use createContext() to initialize, and dispose() when done to free memory.

    interface SessionContext {
        memorySize: number;
        vocabSize: number;
        _branchAccept(handle: number, token: number): void;
        _branchCaptureLogits(handle: number): void;
        _branchCreate(
            seqId: number,
            position: number,
            params?: SamplingParams,
        ): number;
        _branchDecodeAndCaptureOne(handle: number, token: number): void;
        _branchDestroy(handle: number): void;
        _branchFork(handle: number, newSeqId: number): number;
        _branchGetPerplexity(handle: number): number;
        _branchGetPosition(handle: number): number;
        _branchGetSeqId(handle: number): number;
        _branchPrune(handle: number): void;
        _branchSample(handle: number): number;
        _branchSamplerChainReseed(handle: number, seed: number): void;
        acceptSamplerToken(handle: number, tokenId: number): void;
        addSurprisal(handle: number, surprisal: number): void;
        applySampler(
            handle: number,
            logitsBuffer: ArrayBuffer | Float32Array<ArrayBuffer>,
        ): void;
        clearAndReseed(sinks: number[], tail: number[]): Promise<void>;
        clonePerplexityTracker(sourceHandle: number): number;
        cloneSampler(handle: number): number;
        createPerplexityTracker(): number;
        createSampler(grammarStr: string): number;
        decode(tokens: number[], position: number, seqId?: number): Promise<void>;
        decodeAndCapture(
            tokens: number[],
            position: number,
            seqId: number,
            destBuffer: ArrayBuffer | Float32Array<ArrayBuffer>,
        ): void;
        detokenize(tokens: number[]): Promise<string>;
        dispose(): void;
        encode(tokens: number[]): Promise<void>;
        formatChat(
            messagesJson: string,
            templateOverride?: string,
        ): Promise<FormattedChatResult>;
        freePerplexityTracker(handle: number): void;
        freeSamplerHandle(handle: number): void;
        getEmbeddingDimension(): number;
        getEmbeddings(normalize?: boolean): Float32Array;
        getLogits(): Float32Array;
        getPerplexity(handle: number): number;
        getPerplexityCount(handle: number): number;
        getTokenScores(): Buffer;
        greedySample(): number;
        hasPooling(): boolean;
        isStopToken(token: number): boolean;
        jsonSchemaToGrammar(schemaJson: string): string;
        kvCacheClear(): Promise<void>;
        kvCacheLoad(sequenceId: number, state: Buffer): Promise<void>;
        kvCacheReadFile(
            sequenceId: number,
            filepath: string,
        ): Promise<{ bytesRead: number; tokens: number[] }>;
        kvCacheRemove(
            sequenceId: number,
            start: number,
            end: number,
        ): Promise<void>;
        kvCacheSave(sequenceId?: number): Promise<Buffer>;
        kvCacheSize(sequenceId?: number): number;
        kvCacheWriteFile(
            sequenceId: number,
            filepath: string,
            tokens: number[],
        ): Promise<number>;
        kvSeqCopy(
            srcSeqId: number,
            dstSeqId: number,
            p0?: number,
            p1?: number,
        ): void;
        kvSeqKeep(seqId: number): void;
        kvSeqPosMax(seqId: number): number;
        modelEntropy(base?: "nats" | "bits", logits?: Float32Array): number;
        modelSurprisal(
            pickedTokenId: number,
            base?: "nats" | "bits",
            logits?: Float32Array,
        ): number;
        resetPerplexityTracker(handle: number): void;
        sample(params?: SamplingParams): number;
        tokenize(text: string): Promise<number[]>;
        tokenToText(token: number): string;
        validateChatTemplate(templateString: string): Promise<boolean>;
    }

    Properties

    memorySize: number

    Memory used by this context (bytes)

    Reports native memory for monitoring. Includes model weights, KV cache, and context state.

    vocabSize: number

    Model vocabulary size (number of possible tokens)

    This is the length of the scores buffer from getTokenScores().
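
      For example, a minimal sketch (assuming ctx is a SessionContext) of viewing the scores buffer as vocabSize floats:

      const buffer = ctx.getTokenScores();
      const scores = new Float32Array(buffer.buffer, buffer.byteOffset, ctx.vocabSize);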

    Methods

    • _branchAccept (Internal)

      Accept token (update sampler state for penalties)

      Parameters

      • handle: number
      • token: number

      Returns void

    • _branchCaptureLogits (Internal)

      Capture logits into branch's snapshot

      Parameters

      • handle: number

      Returns void

    • _branchCreate (Internal)

      Create a new branch for parallel generation

      Parameters

      • seqId: number
      • position: number
      • Optional params: SamplingParams

      Returns number

    • _branchDecodeAndCaptureOne (Internal)

      Decode a single token and capture logits

      Parameters

      • handle: number
      • token: number

      Returns void

    • _branchDestroy (Internal)

      Destroy branch (free handle without removing KV cache)

      Parameters

      • handle: number

      Returns void

    • _branchFork (Internal)

      Fork a branch to a new sequence

      Parameters

      • handle: number
      • newSeqId: number

      Returns number

    • _branchGetPerplexity (Internal)

      Get branch's perplexity

      Parameters

      • handle: number

      Returns number

    • _branchGetPosition (Internal)

      Get branch's current position

      Parameters

      • handle: number

      Returns number

    • _branchGetSeqId (Internal)

      Get branch's sequence ID

      Parameters

      • handle: number

      Returns number

    • _branchPrune (Internal)

      Prune branch (remove KV cache entries and free handle)

      Parameters

      • handle: number

      Returns void

    • _branchSample (Internal)

      Sample next token from branch's logits snapshot

      Parameters

      • handle: number

      Returns number

    • _branchSamplerChainReseed (Internal)

      Reseed branch sampler PRNG for diversity after fork

      Parameters

      • handle: number
      • seed: number

      Returns void
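
      These internal branch primitives are normally driven by a higher-level branching helper. The following is only a rough lifecycle sketch assembled from the signatures above; the call order, sequence ID, and sampling parameters are assumptions:

      // Assumes the shared prompt has already been decoded on the main context
      const branch = ctx._branchCreate(1, position, { temperature: 0.8 });
      ctx._branchCaptureLogits(branch);              // snapshot current logits into the branch

      for (let i = 0; i < 8; i++) {
        const token = ctx._branchSample(branch);     // sample from the branch's logits snapshot
        ctx._branchAccept(branch, token);            // update sampler state (penalties)
        ctx._branchDecodeAndCaptureOne(branch, token); // decode and capture the next logits
      }

      const ppl = ctx._branchGetPerplexity(branch);
      console.log(`Branch PPL: ${ppl.toFixed(2)}`);

      ctx._branchPrune(branch);                      // remove KV entries and free the handle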

    • Accept token to advance grammar parser state (handle-based)

      Must be called after sampling to advance the grammar parser.

      Parameters

      • handle: number

        Sampler handle from createSampler()

      • tokenId: number

        Token that was sampled

      Returns void

    • Add a surprisal value to the rolling tracker.

      Parameters

      • handle: number

        Tracker handle from createPerplexityTracker()

      • surprisal: number

        Surprisal value (from modelSurprisal or computed)

      Returns void

      const surprisal = ctx.modelSurprisal(tokenId, "nats");
      ctx.addSurprisal(tracker, surprisal);

      COST: O(1) - numerically stable accumulation
      THREAD-SAFETY: Not thread-safe (handle is session-local)

    • Apply grammar constraints using handle-based sampler

      Masks invalid tokens with -Infinity based on parser state. Modifies the logits buffer in-place.

      Parameters

      • handle: number

        Sampler handle from createSampler()

      • logitsBuffer: ArrayBuffer | Float32Array<ArrayBuffer>

        ArrayBuffer or TypedArray containing logits

      Returns void
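
      A minimal grammar-constrained generation sketch combining applySampler() with sample() and acceptSamplerToken(); schemaJson, position, and the stop handling are illustrative:

      const handle = ctx.createSampler(ctx.jsonSchemaToGrammar(schemaJson));

      while (true) {
        ctx.applySampler(handle, ctx.getLogits()); // mask grammar-invalid tokens in-place
        const token = ctx.sample({ temperature: 0.7 });
        if (ctx.isStopToken(token)) break;

        ctx.acceptSamplerToken(handle, token);     // advance the parser state
        process.stdout.write(ctx.tokenToText(token));
        await ctx.decode([token], position++);
      }

      ctx.freeSamplerHandle(handle);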

    • Atomic clear+reseed operation

      Implements a KV cache compression strategy:

      1. Clear entire KV cache
      2. Re-decode original sinks (first N tokens from conversation start)
      3. Re-decode tail (last M recent tokens)

      Parameters

      • sinks: number[]

        ORIGINAL first N tokens from conversation start (typically 4)

      • tail: number[]

        Recent M tokens to preserve (typically 508-1020)

      Returns Promise<void>

      Promise that resolves when reseed completes

      // Keep the ORIGINAL first 4 tokens (sinks) and the most recent 508 tokens
      const ORIGINAL_SINKS = allTokens.slice(0, 4);
      const tail = allTokens.slice(-508);
      await ctx.clearAndReseed(ORIGINAL_SINKS, tail);

      // Cache now holds 4 + 508 = 512 tokens; continue generating at position 512
      const nextToken = ctx.greedySample();
      await ctx.decode([nextToken], 512);
    • Clone a perplexity tracker (for fork/branch scenarios).

      Parameters

      • sourceHandle: number

        Handle to clone from

      Returns number

      New handle with same accumulated state

      // Branch A and B start from same base perplexity
      const baseTracker = ctx.createPerplexityTracker();
      // ... accumulate base surprisals ...

      const branchA = ctx.clonePerplexityTracker(baseTracker);
      const branchB = ctx.clonePerplexityTracker(baseTracker);

      // Branch A and B now track independently
      ctx.addSurprisal(branchA, surprisalA);
      ctx.addSurprisal(branchB, surprisalB);
    • Clone a grammar sampler

      Creates a copy of the sampler with identical parser state. Both handles can then be used independently with their own state.

      Parameters

      • handle: number

        Sampler handle to clone

      Returns number

      New handle to cloned sampler

      const original = ctx.createSampler(jsonGrammar);
      ctx.acceptSamplerToken(original, openBrace);

      // Clone preserves parser state (already accepted openBrace)
      const copy = ctx.cloneSampler(original);

      // Both can now continue independently
      ctx.acceptSamplerToken(original, tokenA);
      ctx.acceptSamplerToken(copy, tokenB);
    • Create a new perplexity tracker.

      Returns number

      Integer handle to the tracker

      const tracker = ctx.createPerplexityTracker();

      // Add surprisals during generation
      for (let i = 0; i < tokens.length; i++) {
        const surprisal = ctx.modelSurprisal(tokens[i]);
        ctx.addSurprisal(tracker, surprisal);
      }

      const ppl = ctx.getPerplexity(tracker);
      console.log(`Sequence perplexity: ${ppl.toFixed(2)}`);

      ctx.freePerplexityTracker(tracker);
    • Create a new grammar sampler (returns handle)

      Creates an independent grammar sampler instance with its own state. Returns a handle that can be used with applySampler/acceptSamplerToken. Multiple handles can coexist with independent parser states.

      Cost: ~0.1-1ms depending on grammar complexity

      Parameters

      • grammarStr: string

        GBNF grammar string

      Returns number

      Handle to the created sampler

      const grammarHandle = ctx.createSampler(jsonGrammar);

      // Apply grammar constraints to logits
      ctx.applySampler(grammarHandle, logitsBuffer);
      ctx.acceptSamplerToken(grammarHandle, token);

      // Create independent copy with same grammar
      const clonedHandle = ctx.cloneSampler(grammarHandle);

      // Cleanup when done
      ctx.freeSamplerHandle(grammarHandle);
      ctx.freeSamplerHandle(clonedHandle);
    • STEP 1: Process tokens through the model (forward pass)

      This feeds tokens through the transformer and updates the KV cache. After decoding, the model has "read" this text and is ready to predict.

      Think of this as: "the model reads your prompt"

      Why async? Model inference takes time (~45ms per token).
      Why position? The model needs to know where in the conversation this text appears.

      Cost: ~45ms per token (generation), ~120ms for 50 tokens (prompt)

      Parameters

      • tokens: number[]

        Token IDs from tokenize()

      • position: number

        Where these tokens start in the sequence

      • Optional seqId: number

        Sequence ID (default: 0)

      Returns Promise<void>

      const tokens = await ctx.tokenize("Hello world");
      await ctx.decode(tokens, 0);
      let position = tokens.length;

      // Generate next token
      await ctx.decode([nextToken], position++);

      // Multi-sequence: decode to different sequences
      await ctx.decode(tokens, 0, 0); // Sequence 0
      await ctx.decode(tokens, 0, 1); // Sequence 1
    • Decode tokens and capture logits atomically

      Performs decode and logits capture as a single atomic operation, ensuring the captured logits correspond exactly to the decoded tokens.

      Use this instead of separate decode() + getLogits() calls when you need guaranteed consistency between decode and logits capture.

      Parameters

      • tokens: number[]

        Token IDs to decode

      • position: number

        Start position in sequence

      • seqId: number

        Sequence ID

      • destBuffer: ArrayBuffer | Float32Array<ArrayBuffer>

        Pre-allocated buffer to receive logits (vocabSize floats)

      Returns void

      // Pre-allocate buffer (reuse across calls)
      const logitsBuffer = new Float32Array(ctx.vocabSize);

      // Atomic decode + capture
      ctx.decodeAndCapture([token], position, seqId, logitsBuffer);

      // Safe to process logitsBuffer - it's an independent copy
      const nextToken = sampleFromLogits(logitsBuffer);
    • Detokenize array of tokens back to text

      Inverse of tokenize(). Use for reconstructing complete text from token sequences (e.g., after KV cache operations).

      Optimized for batch conversion of many tokens. For single-token conversion during generation, use tokenToText().

      Cost: ~1ms per 100 tokens

      Parameters

      • tokens: number[]

        Array of token IDs

      Returns Promise<string>

      Complete text representation

      const tokens = [15496, 1917]; // "Hello world"
      const text = await ctx.detokenize(tokens);
      console.log(text); // "Hello world"
    • Free native resources

      Call when done with context to release model and KV cache memory. Context becomes unusable after disposal.

      Returns void
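
      A minimal sketch of tying disposal to a try/finally block (model path and prompt are illustrative):

      const ctx = await createContext({ modelPath: './model.gguf' });
      try {
        const tokens = await ctx.tokenize("Hello");
        await ctx.decode(tokens, 0);
      } finally {
        ctx.dispose(); // context must not be used after this point
      }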

    • Encode tokens for embedding extraction

      Unlike decode(), this marks ALL tokens with logits=true which is required for embedding extraction. Use with embeddings=true context.

      Workflow:

      1. Create context with { embeddings: true, poolingType: PoolingType.MEAN }
      2. Tokenize your text
      3. Clear KV cache (important between different texts!)
      4. Call encode() with tokens
      5. Call getEmbeddings() to get the vector

      Cost: ~5-50ms depending on text length and model

      Parameters

      • tokens: number[]

        Token IDs from tokenize()

      Returns Promise<void>

      // Create embedding context
      const ctx = await createContext({
        modelPath: './nomic-embed.gguf',
        embeddings: true,
        poolingType: PoolingType.MEAN
      });

      // Get embedding for text
      const tokens = await ctx.tokenize("Hello world");
      await ctx.kvCacheClear(); // Important between texts!
      await ctx.encode(tokens);
      const embedding = ctx.getEmbeddings();
    • Format messages using model's chat template

      Converts [{role, content}] → formatted prompt string. Uses model's built-in template (ChatML, Llama, Mistral, etc.).

      Cost: ~1-5ms depending on message count

      Parameters

      • messagesJson: string

        JSON string containing array of messages

      • Optional templateOverride: string

        Optional custom template string

      Returns Promise<FormattedChatResult>

      Formatted prompt and stop tokens from template

      const result = await ctx.formatChat(JSON.stringify([
        { role: "system", content: "You are a helpful assistant" },
        { role: "user", content: "Hello!" }
      ]));

      const tokens = await ctx.tokenize(result.prompt);
      await ctx.decode(tokens, 0);
    • Free perplexity tracker resources.

      Parameters

      • handle: number

        Tracker handle to free

        NOTE: Auto-freed in dispose() if not manually freed

      Returns void

    • Free a grammar sampler handle

      Releases memory for the specified sampler. Handle becomes invalid after this call.

      Parameters

      • handle: number

        Sampler handle to free

      Returns void

    • Get embedding dimension for model

      Returns the size of embedding vectors this model produces. Common values: 768 (BERT-like), 1024, 2048, 4096.

      Cost: <0.01ms (fast model property lookup)

      Returns number

      Embedding dimension

      const dim = ctx.getEmbeddingDimension();
      console.log(`Model produces ${dim}-dimensional embeddings`);
    • Get embedding vector from context (after encode)

      Returns the embedding vector for the encoded text. Call after encode() to extract embeddings.

      The vector dimension depends on the model (e.g., 768 for nomic-embed). Use getEmbeddingDimension() to get the size.

      Cost: ~0.5ms (extraction from model state)

      Parameters

      • Optional normalize: boolean

        Apply L2 normalization (default: true for cosine similarity)

      Returns Float32Array

      Float32Array of embedding values

      await ctx.encode(tokens);

      // Get L2-normalized embedding (for cosine similarity)
      const embedding = ctx.getEmbeddings();

      // Or raw embedding without normalization
      const rawEmbedding = ctx.getEmbeddings(false);
    • STEP 2b: Get logits for reading (zero-copy, readonly usage pattern)

      Returns Float32Array for computational tasks like entropy calculation. For custom sampling or grammar, use getTokenScores() instead.

      WARNING: Buffer is only valid until next decode() call!

      Returns Float32Array

      Float32Array of unnormalized logits (vocabSize elements)
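
      A short sketch of the read-only pattern, using the logits before the next decode() invalidates them:

      const logits = ctx.getLogits();              // valid only until the next decode()
      const entropy = ctx.modelEntropy("bits", logits);

      // Copy first if the values are needed across an async boundary
      const snapshot = new Float32Array(logits);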

    • Get current perplexity value.

      Parameters

      • handle: number

        Tracker handle

      Returns number

      Perplexity = exp(average_surprisal_in_nats)

      const ppl = ctx.getPerplexity(tracker);
      console.log(`Current PPL: ${ppl.toFixed(2)}`);

      FORMULA: PPL = exp(sum_surprisals / count)
      RANGE: [1, ∞) where 1 = perfect prediction

    • Get number of tokens tracked.

      Parameters

      • handle: number

        Tracker handle

      Returns number

      Number of surprisal values added
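
      For example, guarding against an empty tracker before reporting (the zero-count check is an illustrative convention):

      const count = ctx.getPerplexityCount(tracker);
      if (count > 0) {
        console.log(`PPL over ${count} tokens: ${ctx.getPerplexity(tracker).toFixed(2)}`);
      }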

    • STEP 2a: Get token scores for custom sampling (zero-copy, mutable)

      Returns unnormalized scores for every possible next token. Higher score = model thinks this token is more likely.

      Use this for custom sampling logic or grammar-constrained generation. For reading scores (entropy computation), use getLogits() instead.

      ⚠️ CRITICAL LIFETIME CONSTRAINTS:

      • This is a zero-copy buffer (points directly to model memory)
      • Valid ONLY until next decode() call
      • NOT thread-safe - use only on JS thread
      • DO NOT retain reference across async boundaries
      • Buffer is invalidated by: decode(), sample() with grammar

      Cost: ~0.5ms (zero-copy pointer)

      Returns Buffer

      Buffer containing vocabSize floats (Float32Array compatible)

      const buffer = ctx.getTokenScores();
      const scores = new Float32Array(buffer.buffer, buffer.byteOffset, buffer.length / 4);

      // Modify immediately (safe - still on JS thread)
      scores[BANNED_TOKEN] = -Infinity;

      // Use immediately
      const token = customSample(scores);

      // Now decode invalidates the buffer
      await ctx.decode([token], position++);
      // Buffer is now INVALID - do not access!
    • Sample greedily from current logits

      Selects token with highest logit value (deterministic). Equivalent to sample() with temperature=0.

      Returns number

      Token ID with highest probability
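
      A minimal deterministic decoding loop, assuming the prompt is already decoded and position tracks the sequence length:

      let output = "";
      while (true) {
        const token = ctx.greedySample();
        if (ctx.isStopToken(token)) break;

        output += ctx.tokenToText(token);
        await ctx.decode([token], position++);
      }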

    • Check if context has pooling enabled

      Returns true if context was created with embeddings=true and a pooling type other than NONE.

      Cost: <0.01ms

      Returns boolean

      True if pooling is enabled
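
      For example, as a guard before extracting embeddings (the error message is illustrative):

      if (!ctx.hasPooling()) {
        throw new Error("Context was not created with embeddings and pooling enabled");
      }
      const embedding = ctx.getEmbeddings();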

    • Check if token is a model stop token

      Returns true for built-in end-of-generation tokens:

      • </s> (Llama 2)
      • <|endoftext|> (GPT)
      • <|eot_id|> (Llama 3)
      • Model-specific EOS tokens

      Note: This checks vocabulary stop tokens, not custom stop sequences. For custom stops (e.g., "\n\n", "###"), compare generated text against your stop strings in application code.

      Cost: <0.01ms (fast vocabulary lookup)

      Parameters

      • token: number

        Token ID to check

      Returns boolean

      const token = ctx.sample();
      if (ctx.isStopToken(token)) {
        console.log('Generation complete');
        break;
      }
    • Convert JSON schema to GBNF grammar

      Generates grammar string for constrained JSON generation. Use with createSampler() for grammar-constrained generation.

      Cost: ~1-10ms depending on schema complexity

      Parameters

      • schemaJson: string

        JSON schema string

      Returns string

      GBNF grammar string

      const schema = {
        type: "object",
        properties: {
          name: { type: "string" },
          age: { type: "number" }
        },
        required: ["name"]
      };

      const grammar = ctx.jsonSchemaToGrammar(JSON.stringify(schema));
      const handle = ctx.createSampler(grammar);
    • Clear all KV cache (fresh start)

      Removes all cached tokens. Model returns to initial state as if no text has been processed.

      Use when starting a completely new conversation.

      Cost: ~1ms

      Returns Promise<void>

      // Start fresh conversation
      await ctx.kvCacheClear();

      const tokens = await ctx.tokenize("New conversation");
      await ctx.decode(tokens, 0);
    • Restore KV cache from previous snapshot

      Loads saved model state. Context returns to exact state when snapshot was taken.

      Cost: ~100-500ms depending on snapshot size

      Parameters

      • sequenceId: number

        Sequence ID (use 0 for single sequence)

      • state: Buffer

        Buffer from kvCacheSave()

      Returns Promise<void>

      const snapshot = await ctx.kvCacheSave(0);

      // ... many operations later ...

      // Restore to saved state
      await ctx.kvCacheLoad(0, snapshot);
    • Read KV cache state + tokens from file

      Restores KV cache state from a previous kvCacheWriteFile call.

      Parameters

      • sequenceId: number

        Sequence ID to restore to

      • filepath: string

        Path to saved file

      Returns Promise<{ bytesRead: number; tokens: number[] }>

      Promise resolving to tokens and bytes read
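
      A sketch of restoring a checkpoint written by kvCacheWriteFile() (the file path is illustrative):

      const { tokens, bytesRead } = await ctx.kvCacheReadFile(0, './session.bin');
      console.log(`Restored ${tokens.length} tokens (${bytesRead} bytes)`);

      // Continue generating from the end of the restored sequence
      let position = tokens.length;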

    • Remove token range from KV cache

      Deletes tokens from model's memory. Use cases:

      • Removing old context when hitting limit (sliding window)
      • Implementing conversation pruning
      • Forgetting specific messages
      • Preparing for injection of new context

      ⚠️ CRITICAL: Call BEFORE next decode(), not after! The model needs to know about the removal before processing new tokens.

      Cost: ~1-5ms depending on range

      Parameters

      • sequenceId: number

        Sequence ID (use 0 for single sequence)

      • start: number

        Start position (inclusive)

      • end: number

        End position (exclusive), -1 = to end

      Returns Promise<void>

      // Remove old tokens to stay under context limit
      const currentLength = ctx.kvCacheSize(0);
      if (currentLength > 2000) {
        // Remove oldest 500 tokens
        await ctx.kvCacheRemove(0, 0, 500);

        // THEN decode new tokens
        await ctx.decode(newTokens, currentLength - 500);
      }
    • Snapshot KV cache state for branching/undo

      Serializes entire model state to Buffer. Restore later with kvCacheLoad() for:

      • Conversation branching ("what if I said X instead?")
      • Undo/redo functionality
      • Checkpointing long conversations

      Size: ~500MB-2GB depending on context length and model

      Cost: ~100-500ms depending on cache size

      Parameters

      • Optional sequenceId: number

        Sequence ID (use 0 for single sequence)

      Returns Promise<Buffer>

      Serialized state buffer

      // Save state before risky operation
      const snapshot = await ctx.kvCacheSave(0);

      // Try something
      await ctx.decode(riskyTokens, position);

      // Didn't work - restore previous state
      await ctx.kvCacheLoad(0, snapshot);
    • Get current sequence length (number of decoded tokens)

      The KV cache stores model state for all decoded tokens. This tells you how many tokens are currently in memory.

      Think of this as: "How much has the model read so far?"

      Cost: <0.01ms (fast sync operation - safe to call frequently)

      Parameters

      • Optional sequenceId: number

        Sequence ID (defaults to 0 for single conversation)

      Returns number

      Number of tokens in cache, or -1 if empty

      const tokens = await ctx.tokenize("Hello world");
      await ctx.decode(tokens, 0);

      const length = ctx.kvCacheSize(0);
      console.log(length); // 2 (number of tokens)
    • Write KV cache state + tokens to file

      Persists KV cache state for later restoration. Useful for checkpointing long conversations.

      Parameters

      • sequenceId: number

        Sequence ID to save

      • filepath: string

        Path to save file

      • tokens: number[]

        Tokens that were decoded into this sequence

      Returns Promise<number>

      Promise resolving to bytes written
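
      A sketch of checkpointing sequence 0 to disk, assuming allTokens holds every token decoded into that sequence (the path is illustrative):

      const bytesWritten = await ctx.kvCacheWriteFile(0, './session.bin', allTokens);
      console.log(`Checkpoint written: ${bytesWritten} bytes`);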

    • Copy KV cache from one sequence to another

      Duplicates the KV cache state from source to destination sequence. After copying, both sequences can continue independently.

      NOTE: Only full sequence copies are currently supported. The p0/p1 parameters must use default values (0 and -1).

      Cost: ~1-5ms depending on sequence length

      Parameters

      • srcSeqId: number

        Source sequence to copy from

      • dstSeqId: number

        Destination sequence to copy to

      • Optional p0: number

        Start position (must be 0, default: 0)

      • Optional p1: number

        End position (must be -1 for full copy, default: -1)

      Returns void

      // Decode initial prompt to seq 0
      await ctx.decode(promptTokens, 0);

      // Copy seq 0 -> seq 1
      ctx.kvSeqCopy(0, 1);

      // Now both sequences can continue independently
      await ctx.decode([tokenA], position, 0);
      await ctx.decode([tokenB], position, 1);
    • Keep only specified sequence, remove all others

      Removes all sequences except the one specified. For complete cleanup of unwanted sequences, consider using kvCacheRemove(seqId, 0, -1) on each sequence instead.

      Parameters

      • seqId: number

        Sequence ID to keep

      Returns void
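
      A sketch of keeping only a winning branch after multi-sequence exploration (the sequence numbering is illustrative):

      // Fork the prompt into two candidate sequences
      ctx.kvSeqCopy(0, 1);
      ctx.kvSeqCopy(0, 2);
      // ... decode candidate continuations into seq 1 and seq 2 ...

      // Keep sequence 1, drop everything else
      ctx.kvSeqKeep(1);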

    • Get max position in sequence

      Returns the highest position index in the specified sequence, or -1 if the sequence is empty.

      Cost: <0.01ms (fast sync operation)

      Parameters

      • seqId: number

        Sequence ID to query

      Returns number

      Max position index, or -1 if empty

      const pos = ctx.kvSeqPosMax(0);
      if (pos === -1) {
        console.log('Sequence is empty');
      } else {
        console.log(`Sequence has ${pos + 1} tokens`);
      }
    • Compute entropy of the entire logits distribution.

      Measures model uncertainty:

      • Low entropy: Model is confident (peaked distribution)
      • High entropy: Model is uncertain (flat distribution)

      Call after decode() to analyze the current prediction distribution, or pass captured logits for offline analysis.

      Parameters

      • Optional base: "nats" | "bits"

        Logarithm base: "nats" (default) or "bits"

      • Optional logits: Float32Array

        Optional Float32Array of logits (uses current context logits if omitted)

      Returns number

      Entropy value in specified base

      await ctx.decode(tokens, position);
      const entropy = ctx.modelEntropy("bits");
      if (entropy > 5.0) {
        console.log("Model is very uncertain - consider adjusting parameters");
      }

      // Or compute from captured logits for offline analysis
      const capturedLogits = new Float32Array(ctx.getLogits());
      const offlineEntropy = ctx.modelEntropy("nats", capturedLogits);

      COST: O(n_vocab) - must sum over all token probabilities

    • Compute surprisal (negative log-likelihood) for a specific token.

      Measures how "surprising" the model finds the given token:

      • Low surprisal: Model expected this token (high probability)
      • High surprisal: Model didn't expect this token (low probability)

      Call after decode() to compute surprisal for any token based on the current logits distribution, or pass captured logits for offline computation (e.g., best-of-n scoring from prefill logits).

      Parameters

      • pickedTokenId: number

        Token ID to compute surprisal for

      • Optional base: "nats" | "bits"

        Logarithm base: "nats" (default) or "bits"

      • Optional logits: Float32Array

        Optional Float32Array of logits (uses current context logits if omitted)

      Returns number

      Surprisal value in specified base

      await ctx.decode(tokens, position);
      const token = ctx.sample();
      const surprisal = ctx.modelSurprisal(token, "bits");
      console.log(`Model surprise: ${surprisal.toFixed(2)} bits`);

      // Capture logits after prefill
      const capturedLogits = new Float32Array(ctx.getLogits());

      // Later: compute surprisal from captured logits
      const offlineSurprisal = ctx.modelSurprisal(token, "nats", capturedLogits);

      COST: O(n_vocab) - softmax normalization required

    • Reset tracker to initial state (count=0, sum=0).

      Parameters

      • handle: number

        Tracker handle to reset

      Returns void

      // Reuse tracker for multiple sequences
      const tracker = ctx.createPerplexityTracker();

      for (const sequence of sequences) {
        ctx.resetPerplexityTracker(tracker);
        // ... process sequence ...
        const ppl = ctx.getPerplexity(tracker);
      }
    • STEP 3: Sample a token from scores

      Converts raw scores into a token decision using:

      • Temperature: controls randomness
      • Top-K/Top-P: filters unlikely tokens
      • Grammar: enforces format constraints (if grammar initialized)

      This is where generation strategy happens.

      Cost: ~0.1ms (native sampling)

      Parameters

      • Optional params: SamplingParams

        Sampling parameters (temperature, top-K, top-P, etc.)

      Returns number

      Selected token ID

      // Greedy (always pick most likely)
      const greedyToken = ctx.sample();

      // Creative generation
      const creativeToken = ctx.sample({ temperature: 0.9 });

      // Constrained to valid JSON (handle-based API)
      const grammarHandle = ctx.createSampler(grammar);
      ctx.applySampler(grammarHandle, ctx.getLogits());
      const token = ctx.sample({ temperature: 0.7 });
      ctx.acceptSamplerToken(grammarHandle, token);
    • Tokenize text into model's vocabulary

      Converts human text → token IDs for decode(). Same text always produces same tokens for a given model.

      Cost: ~1ms per 100 characters

      Parameters

      • text: string

        Text to tokenize

      Returns Promise<number[]>

      Array of token IDs

      const tokens = await ctx.tokenize("Hello world");
      console.log(tokens); // [15496, 1917] for Llama models

      await ctx.decode(tokens, 0);
    • Convert token ID to text piece

      Fast synchronous lookup in vocabulary table. Call this on each generated token for streaming display.

      Optimized for per-token conversion during generation. For batch conversion of many tokens, use detokenize() instead.

      Cost: ~0.05ms

      Parameters

      • token: number

        Token ID from sample()

      Returns string

      Text string for this token

      while (true) {
        const token = ctx.sample({ temperature: 0.8 });
        if (ctx.isStopToken(token)) break;

        const text = ctx.tokenToText(token);
        process.stdout.write(text); // Stream to output

        await ctx.decode([token], position++);
      }
    • Validate chat template syntax

      Checks if template string is valid before using.

      Cost: ~0.1-1ms

      Parameters

      • templateString: string

        Template string to validate

      Returns Promise<boolean>

      True if template syntax is valid
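
      For example, falling back to the model's built-in template when a custom one fails validation (loadCustomTemplate is a hypothetical helper):

      const customTemplate = loadCustomTemplate(); // hypothetical: reads a custom template string
      const isValid = await ctx.validateChatTemplate(customTemplate);

      const result = await ctx.formatChat(
        JSON.stringify(messages),
        isValid ? customTemplate : undefined,
      );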