lloyal-agents API Reference

    Interface SessionContext

    Inference context — the runtime surface for a loaded model

    A SessionContext owns a llama_context (KV cache + compute graph) bound to a shared model. It provides tokenization, logit access, KV cache management, chat template formatting, and embedding extraction.

    All generation flows through Branch. Create a branch at position 0, prefill prompt tokens, then use the produce/commit loop or async iterator:

const branch = Branch.create(ctx, 0, { temperature: 0.7 });
await branch.prefill(promptTokens);
for await (const { token, text } of branch) {
  process.stdout.write(text);
}

    For tree-structured generation (best-of-N, beam search, speculative decoding), use Branch.fork and BranchStore — they manage per-branch KV sequences, sampler chains, and logits snapshots with O(1) GPU dispatches via batched decode.
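A sketch of the fork pattern for best-of-N (the Branch surface assumed here, fork(), produce(), commit(), prune(), and a perplexity accessor mirroring _branchGetSamplingPerplexity, is an assumption; the real Branch API is documented separately):

```typescript
// Minimal structural shape assumed for Branch; names mirror the _branch*
// internals on SessionContext, not necessarily the real Branch class.
interface BranchLike {
  fork(): BranchLike;
  produce(): Promise<{ token: number; text: string; isStop: boolean }>;
  commit(token: number): Promise<void>;
  getPerplexity(): number;
  prune(): void;
}

// Pure helper: index of the minimum value (used to pick the best candidate).
function argmin(xs: number[]): number {
  let best = 0;
  for (let i = 1; i < xs.length; i++) if (xs[i] < xs[best]) best = i;
  return best;
}

// Best-of-N sketch: fork N branches off a shared prefix, generate on each,
// keep the lowest-perplexity candidate and prune the rest.
async function bestOfN(root: BranchLike, n: number, maxTokens: number): Promise<BranchLike> {
  const branches = Array.from({ length: n }, () => root.fork());
  for (const b of branches) {
    for (let i = 0; i < maxTokens; i++) {
      const { token, isStop } = await b.produce();
      if (isStop) break;
      await b.commit(token);
    }
  }
  const winner = branches[argmin(branches.map((b) => b.getPerplexity()))];
  for (const b of branches) if (b !== winner) b.prune();
  return winner;
}
```

Because forks share the prefix's KV entries, the N candidates cost roughly one prefix plus N short suffixes, not N full sequences.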

    Logits: For branch-level logits, use Branch.getLogits which returns an independent copy of the branch's snapshot. For metrics, use Branch.modelEntropy and Branch.modelSurprisal which operate directly on the branch's logits without JS round-trips.
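As a plain-JS sketch of what the native metrics are understood to compute (assumed semantics: Shannon entropy and surprisal over the softmax of the logits, here in bits; the native versions produce this without copying the logits into JS):

```typescript
// Softmax over a logits vector, with the usual max-subtraction for stability.
function softmax(logits: Float32Array): Float64Array {
  let max = -Infinity;
  for (let i = 0; i < logits.length; i++) if (logits[i] > max) max = logits[i];
  const p = new Float64Array(logits.length);
  let sum = 0;
  for (let i = 0; i < logits.length; i++) {
    p[i] = Math.exp(logits[i] - max);
    sum += p[i];
  }
  for (let i = 0; i < p.length; i++) p[i] /= sum;
  return p;
}

// Shannon entropy of the model's next-token distribution, in bits.
function entropyBits(logits: Float32Array): number {
  const p = softmax(logits);
  let h = 0;
  for (let i = 0; i < p.length; i++) if (p[i] > 0) h -= p[i] * Math.log2(p[i]);
  return h;
}

// Surprisal of one token under that distribution, in bits.
function surprisalBits(logits: Float32Array, token: number): number {
  return -Math.log2(softmax(logits)[token]);
}
```

A uniform distribution over 4 tokens gives 2 bits of entropy and 2 bits of surprisal for any token, which is a handy sanity check.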

    KV cache: Supports multi-sequence operation (nSeqMax > 1), per-sequence copy/clear/eviction, file-based persistence, and context compression via clearAndReseed().

    Chat templates: formatChat() and parseChatOutput() handle the full round-trip of chat formatting, including tool calls, reasoning blocks, and grammar-constrained generation — using the model's native Jinja template.

    Use createContext to initialize, and dispose() when done to free GPU/CPU memory.

    interface SessionContext {
        memorySize: number;
        vocabSize: number;
        _branchAccept(handle: number, token: number): void;
        _branchChildren(handle: number): number[];
        _branchClearLogitBias(handle: number): void;
        _branchClearSteer(handle: number): void;
        _branchCreate(
            position: number,
            params?: SamplingParams,
            nBatch?: number,
            grammar?: string,
        ): number;
        _branchFork(handle: number): number;
        _branchForkHead(handle: number): number;
        _branchGetLogits(handle: number): Float32Array;
        _branchGetPerplexity(handle: number): number;
        _branchGetPosition(handle: number): number;
        _branchGetSamplingPerplexity(handle: number): number;
        _branchIsActive(handle: number): boolean;
        _branchIsLeaf(handle: number): boolean;
        _branchModelEntropy(handle: number, base?: string): number;
        _branchModelSurprisal(handle: number, token: number, base?: string): number;
        _branchParent(handle: number): number;
        _branchPrefill(handle: number, tokens: number[]): Promise<void>;
        _branchPrune(handle: number): void;
        _branchPruneSubtree(handle: number): void;
        _branchSample(handle: number): number;
        _branchSamplerChainReseed(handle: number, seed: number): void;
        _branchSetGrammar(handle: number, grammarStr: string): void;
        _branchSetGrammarLazy(
            handle: number,
            grammar: string,
            patterns: string[],
            tokens: number[],
        ): void;
        _branchSetLogitBias(
            handle: number,
            biases: { bias: number; token: number }[],
        ): void;
        _branchSetSamplerParams(handle: number, params: SamplingParams): void;
        _branchSteer(
            handle: number,
            biases: { bias: number; token: number }[],
        ): void;
        _scoreGroup(
            tokenArrays: number[][],
        ): Promise<Float32Array<ArrayBufferLike>[]>;
        _storeAvailable(): number;
        _storeCommit(handles: number[], tokens: number[]): Promise<void>;
        _storeKvPressure(): { cellsUsed: number; nCtx: number; remaining: number };
        _storePrefill(handles: number[], tokenArrays: number[][]): Promise<void>;
        _storeRetainOnly(handle: number): void;
        clearAndReseed(sinks: number[], tail: number[]): Promise<void>;
        detokenize(tokens: number[]): Promise<string>;
        dispose(): void;
        encode(tokens: number[]): Promise<void>;
        formatChat(
            messagesJson: string,
            options?: string | FormatChatOptions,
        ): Promise<FormattedChatResult>;
        formatChatSync(
            messagesJson: string,
            options?: string | FormatChatOptions,
        ): FormattedChatResult;
        getEmbeddingDimension(): number;
        getEmbeddings(normalize?: boolean): Float32Array;
        getEogToken(): number;
        getTurnSeparator(): number[];
        hasPooling(): boolean;
        isStopToken(token: number): boolean;
        jsonSchemaToGrammar(schemaJson: string): Promise<string>;
        jsonSchemaToGrammarSync(schemaJson: string): string;
        kvCacheClear(): Promise<void>;
        kvCacheLoad(sequenceId: number, state: Buffer): Promise<void>;
        kvCacheReadFile(
            sequenceId: number,
            filepath: string,
        ): Promise<{ bytesRead: number; tokens: number[] }>;
        kvCacheRemove(
            sequenceId: number,
            start: number,
            end: number,
        ): Promise<void>;
        kvCacheSave(sequenceId?: number): Promise<Buffer<ArrayBufferLike>>;
        kvCacheSize(sequenceId?: number): number;
        kvCacheWriteFile(
            sequenceId: number,
            filepath: string,
            tokens: number[],
        ): Promise<number>;
        kvSeqCopy(
            srcSeqId: number,
            dstSeqId: number,
            p0?: number,
            p1?: number,
        ): void;
        kvSeqKeep(seqId: number): void;
        kvSeqPosMax(seqId: number): number;
        parseChatOutput(
            output: string,
            format: number,
            options?: ParseChatOutputOptions,
        ): ParseChatOutputResult;
        tokenize(text: string, addSpecial?: boolean): Promise<number[]>;
        tokenizeSync(text: string, addSpecial?: boolean): number[];
        tokenToText(token: number): string;
        validateChatTemplate(templateString: string): Promise<boolean>;
    }

    Properties

    memorySize: number

    Memory used by this context (bytes)

    Reports native memory for monitoring. Includes model weights, KV cache, and context state.

    vocabSize: number

    Model vocabulary size (number of possible tokens)

    This is the length of the logits array from Branch.getLogits().

    Methods

    • Internal plumbing for Branch and BranchStore. These underscore-prefixed methods repeat the signatures in the interface declaration above; prefer the Branch and BranchStore wrappers.

      • _branchAccept(handle: number, token: number): void
      • _branchChildren(handle: number): number[]
      • _branchClearLogitBias(handle: number): void
      • _branchClearSteer(handle: number): void
      • _branchCreate(position: number, params?: SamplingParams, nBatch?: number, grammar?: string): number
      • _branchFork(handle: number): number
      • _branchForkHead(handle: number): number
      • _branchGetLogits(handle: number): Float32Array
      • _branchGetPerplexity(handle: number): number
      • _branchGetPosition(handle: number): number
      • _branchGetSamplingPerplexity(handle: number): number
      • _branchIsActive(handle: number): boolean
      • _branchIsLeaf(handle: number): boolean
      • _branchModelEntropy(handle: number, base?: string): number
      • _branchModelSurprisal(handle: number, token: number, base?: string): number
      • _branchParent(handle: number): number
      • _branchPrefill(handle: number, tokens: number[]): Promise<void>
      • _branchPrune(handle: number): void
      • _branchPruneSubtree(handle: number): void
      • _branchSample(handle: number): number
      • _branchSamplerChainReseed(handle: number, seed: number): void
      • _branchSetGrammar(handle: number, grammarStr: string): void
      • _branchSetGrammarLazy(handle: number, grammar: string, patterns: string[], tokens: number[]): void
      • _branchSetLogitBias(handle: number, biases: { bias: number; token: number }[]): void
      • _branchSetSamplerParams(handle: number, params: SamplingParams): void
      • _branchSteer(handle: number, biases: { bias: number; token: number }[]): void
      • _scoreGroup(tokenArrays: number[][]): Promise<Float32Array<ArrayBufferLike>[]> (processes ≤ n_seq_max prompts in a single group)
      • _storeAvailable(): number
      • _storeCommit(handles: number[], tokens: number[]): Promise<void>
      • _storeKvPressure(): { cellsUsed: number; nCtx: number; remaining: number } (KV cache pressure snapshot from the native BranchStore; cellsUsed is a monotonic counter reset on drain/retainOnly)
      • _storePrefill(handles: number[], tokenArrays: number[][]): Promise<void>
      • _storeRetainOnly(handle: number): void

    • clearAndReseed: Blink KV — cache-local reconstruction for bounded-memory streaming

      Implements the Blink KV protocol (Naqvi, 2026): when the KV cache fills, clear it entirely and re-decode retained tokens at contiguous positions [0, 1, ..., N-1]. This achieves cache-local position IDs — the operative requirement for stable bounded-memory streaming — without backend-specific knowledge of key storage format. Works on post-RoPE engines (where StreamingLLM's pos-shift is unavailable) and any backend exposing clear() + decode().

      Why not naive eviction? Selective eviction (kvCacheRemove) preserves original position IDs, which grow without bound. Across 5 architectures, naive eviction produces PPL spanning 3 orders of magnitude — ranging from 1.15x baseline (Llama, lucky config) to 198x (Phi, sinks present). Under Blink KV reconstruction, all 5 converge to within 3-16% of baseline.

      Sinks are optional. Under reconstruction, the 0+N (sinkless) config matches 4+N (with sinks) within <2% across all tested architectures. Pass an empty sinks array if you don't need them.

      Algorithm:

      1. Clear entire KV cache (zero fragmentation)
      2. Re-decode sinks at position 0 (optional attention anchors)
      3. Re-decode tail at position sinks.length (recent context)

      Cost: Re-decodes sinks.length + tail.length tokens. At per-boundary trigger (reconstruct when cache reaches nCtx), amortized cost is O(cacheSize / interval) decode ops per token — ~0.14 at typical settings.

      Parameters

      • sinks: number[]

        First N tokens from conversation start (typically 4, or empty). Must be the same tokens every reseed — reusing different tokens degrades any attention-sink patterns the model may have learned for early positions.

      • tail: number[]

        Recent M tokens to preserve (typically 252-1020)

      Returns Promise<void>

      Promise that resolves when reconstruction completes. Next decode continues at position sinks.length + tail.length.

      // Capture sinks once at conversation start
      const SINKS = allTokens.slice(0, 4);

      // On cache fill: compress to 512 tokens (4 sinks + 508 tail)
      if (position >= ctx.nCtx) {
        const tail = allTokens.slice(-508);
        await ctx.clearAndReseed(SINKS, tail);
        position = 512; // sinks.length + tail.length
      }

      // Sinkless variant: pass an empty sinks array
      const tail = allTokens.slice(-256);
      await ctx.clearAndReseed([], tail); // No sinks needed
      position = 256;
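The amortized figure can be sanity-checked: each fill cycle re-decodes the retained tokens once, then generates nCtx minus retained fresh tokens before the next boundary. Assuming, for illustration, nCtx = 4096 with 512 retained (4 sinks + 508 tail):

```typescript
// Amortized re-decode cost per generated token for per-boundary reconstruction:
// each cycle re-decodes `retained` tokens, then generates (nCtx - retained)
// fresh tokens before the next reconstruction triggers.
function amortizedRedecodes(retained: number, nCtx: number): number {
  return retained / (nCtx - retained);
}

console.log(amortizedRedecodes(512, 4096).toFixed(2)); // 0.14 at these assumed settings
```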
    • detokenize: Detokenize array of tokens back to text

      Inverse of tokenize(). Use for reconstructing complete text from token sequences (e.g., after KV cache operations).

      Optimized for batch conversion of many tokens. For single-token conversion during generation, use tokenToText().

      Cost: ~1ms per 100 tokens

      Parameters

      • tokens: number[]

        Array of token IDs

      Returns Promise<string>

      Complete text representation

      const tokens = [15496, 1917]; // "Hello world"
      const text = await ctx.detokenize(tokens);
      console.log(text); // "Hello world"
    • dispose: Free native resources

      Call when done with context to release model and KV cache memory. Context becomes unusable after disposal.

      Returns void

    • encode: Encode tokens for embedding extraction

      Unlike decode(), this marks ALL tokens with logits=true which is required for embedding extraction. Use with embeddings=true context.

      Workflow:

      1. Create context with { embeddings: true, poolingType: PoolingType.MEAN }
      2. Tokenize your text
      3. Clear KV cache (important between different texts!)
      4. Call encode() with tokens
      5. Call getEmbeddings() to get the vector

      Cost: ~5-50ms depending on text length and model

      Parameters

      • tokens: number[]

        Token IDs from tokenize()

      Returns Promise<void>

      // Create embedding context
      const ctx = await createContext({
        modelPath: './nomic-embed.gguf',
        embeddings: true,
        poolingType: PoolingType.MEAN
      });

      // Get embedding for text
      const tokens = await ctx.tokenize("Hello world");
      await ctx.kvCacheClear(); // Important between texts!
      await ctx.encode(tokens);
      const embedding = ctx.getEmbeddings();
    • formatChat: Format messages using model's chat template

      Converts [{role, content}] -> formatted prompt string with full format awareness. Uses model's built-in template (ChatML, Llama, Mistral, etc.).

      The returned format and reasoningFormat fields should be passed to parseChatOutput() after generation to correctly decode the response.

      Cost: ~1-5ms depending on message count

      Parameters

      • messagesJson: string

        JSON string containing array of messages

      • Optional options: string | FormatChatOptions

        Formatting options (tools, reasoning, grammar, etc.)

      Returns Promise<FormattedChatResult>

      Formatted prompt with format-awareness metadata

      const result = await ctx.formatChat(JSON.stringify([
        { role: "system", content: "You are a helpful assistant" },
        { role: "user", content: "Hello!" }
      ]));

      const tokens = await ctx.tokenize(result.prompt);
      const branch = Branch.create(ctx, 0, { temperature: 0.7 });
      await branch.prefill(tokens);
    • formatChatSync: Format messages using model's chat template (sync — inline on main thread)

      Same as formatChat but synchronous. Use from Effection generators to avoid yield* call() overhead for CPU-only work.

      Parameters

      • messagesJson: string

        JSON string containing array of messages

      • Optional options: string | FormatChatOptions

        Formatting options (tools, reasoning, grammar, etc.)

      Returns FormattedChatResult

      Formatted prompt with format-awareness metadata

    • getEmbeddingDimension: Get embedding dimension for model

      Returns the size of embedding vectors this model produces. Common values: 768 (BERT-like), 1024, 2048, 4096.

      Cost: <0.01ms (fast model property lookup)

      Returns number

      Embedding dimension

      const dim = ctx.getEmbeddingDimension();
      console.log(`Model produces ${dim}-dimensional embeddings`);
    • getEmbeddings: Get embedding vector from context (after encode)

      Returns the embedding vector for the encoded text. Call after encode() to extract embeddings.

      The vector dimension depends on the model (e.g., 768 for nomic-embed). Use getEmbeddingDimension() to get the size.

      Cost: ~0.5ms (extraction from model state)

      Parameters

      • Optional normalize: boolean

        Apply L2 normalization (default: true for cosine similarity)

      Returns Float32Array

      Float32Array of embedding values

      await ctx.encode(tokens);

      // Get L2-normalized embedding (for cosine similarity)
      const embedding = ctx.getEmbeddings();

      // Or raw embedding without normalization
      const rawEmbedding = ctx.getEmbeddings(false);
    • getEogToken: Get the model's end-of-generation token ID

      Returns the EOT token (e.g. <|im_end|> for ChatML), falling back to EOS (e.g. </s>) for Zephyr-style models. This is the inverse of isStopToken() — "what IS the stop token?" vs "is this a stop token?"

      Use case: warm multi-turn continuation prepends this token to close the previous assistant turn before injecting new user content.

      Returns number

      Token ID (integer)

      Throws if the model has neither an EOT nor an EOS token
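A sketch of that warm-continuation use (the helper is hypothetical, not library API; getTurnSeparator covers the multi-token variant):

```typescript
// Hypothetical helper: close the previous assistant turn with the model's
// end-of-generation token before prefilling the next turn's delta tokens.
function closeTurn(eogToken: number, deltaTokens: number[]): number[] {
  return [eogToken, ...deltaTokens];
}

// usage sketch: await branch.prefill(closeTurn(ctx.getEogToken(), delta));
```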

    • getTurnSeparator: Get the model's turn separator token IDs

      Returns the tokens that close an assistant turn and transition to the next message, as determined by the model's chat template. Computed once per model, cached.

      For ChatML templates: [im_end_id, newline_id] (e.g., [2, 198]) For Llama 3 templates: [eot_id] (e.g., [128009])

      Use case: warm multi-turn prefill to achieve exact parity with cold path.

      Returns number[]

      Array of token IDs (cached after first call)

      const separator = ctx.getTurnSeparator();
      console.log(separator.map(t => ctx.tokenToText(t)).join('')); // "<|im_end|>\n"

      // Warm prefill with exact cold/warm parity
      const deltaTokens = await ctx.tokenize(deltaPrompt, false);
      await branch.prefill([...separator, ...deltaTokens]);
    • hasPooling: Check if context has pooling enabled

      Returns true if context was created with embeddings=true and a pooling type other than NONE.

      Cost: <0.01ms

      Returns boolean

      True if pooling is enabled

    • isStopToken: Check if token is a model stop token

      Returns true for built-in end-of-generation tokens:

      • </s> (Llama 2)
      • <|endoftext|> (GPT)
      • <|eot_id|> (Llama 3)
      • Model-specific EOS tokens

      Note: This checks vocabulary stop tokens, not custom stop sequences. For custom stops (e.g., "\n\n", "###"), compare generated text against your stop strings in application code.

      Cost: <0.01ms (fast vocabulary lookup)

      Parameters

      • token: number

        Token ID to check

      Returns boolean
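The application-side check for custom stop strings mentioned in the note might look like this (illustrative helper, not a library API):

```typescript
// Scan generated text for custom stop sequences; returns the text before the
// earliest match, or null if no stop sequence has appeared yet.
function matchCustomStop(generated: string, stops: string[]): string | null {
  let cut = -1;
  for (const s of stops) {
    const i = generated.indexOf(s);
    if (i !== -1 && (cut === -1 || i < cut)) cut = i;
  }
  return cut === -1 ? null : generated.slice(0, cut);
}
```

Run it against the accumulated output after each produced token; vocabulary stop tokens still go through isStopToken().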

    • jsonSchemaToGrammar: Convert JSON schema to GBNF grammar

      Generates grammar string for constrained JSON generation. Use with Branch.create grammar parameter for constrained generation.

      Cost: ~1-10ms depending on schema complexity

      Parameters

      • schemaJson: string

        JSON schema string

      Returns Promise<string>

      GBNF grammar string

      const schema = {
        type: "object",
        properties: {
          name: { type: "string" },
          age: { type: "number" }
        },
        required: ["name"]
      };

      const grammar = await ctx.jsonSchemaToGrammar(JSON.stringify(schema));
      const branch = Branch.create(ctx, 0, params, undefined, grammar);
    • jsonSchemaToGrammarSync: Convert JSON schema to GBNF grammar (sync — inline on main thread)

      Same as jsonSchemaToGrammar but synchronous. Use from Effection generators to avoid yield* call() overhead for CPU-only work.

      Parameters

      • schemaJson: string

        JSON schema string

      Returns string

      GBNF grammar string

    • kvCacheClear: Clear all KV cache (fresh start)

      Removes all cached tokens. Model returns to initial state as if no text has been processed.

      Use when starting a completely new conversation.

      Cost: ~1ms

      Returns Promise<void>

    • kvCacheLoad: Restore KV cache from previous snapshot

      Loads saved model state. Context returns to exact state when snapshot was taken.

      Cost: ~100-500ms depending on snapshot size

      Parameters

      • sequenceId: number

        Sequence ID (use 0 for single sequence)

      • state: Buffer

        Buffer from kvCacheSave()

      Returns Promise<void>

      const snapshot = await ctx.kvCacheSave(0);

      // ... many operations later ...

      // Restore to saved state
      await ctx.kvCacheLoad(0, snapshot);
    • kvCacheReadFile: Read KV cache state + tokens from file

      Restores KV cache state from a previous kvCacheWriteFile call.

      Parameters

      • sequenceId: number

        Sequence ID to restore to

      • filepath: string

        Path to saved file

      Returns Promise<{ bytesRead: number; tokens: number[] }>

      Promise resolving to tokens and bytes read

    • kvCacheRemove: Remove token range from KV cache

      Deletes tokens from model's memory. Use cases:

      • Removing old context when hitting limit (sliding window)
      • Implementing conversation pruning
      • Forgetting specific messages
      • Preparing for injection of new context

      CRITICAL: Call BEFORE next decode(), not after! The model needs to know about the removal before processing new tokens.

      Cost: ~1-5ms depending on range

      Parameters

      • sequenceId: number

        Sequence ID (use 0 for single sequence)

      • start: number

        Start position (inclusive)

      • end: number

        End position (exclusive), -1 = to end

      Returns Promise<void>
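A sliding-window sketch built on this (the helper name and budget are illustrative; positions follow the start-inclusive, end-exclusive convention above):

```typescript
// When the sequence exceeds a token budget, compute the oldest range to drop
// so that `keep` recent tokens remain. Returns null when under budget.
function windowToRemove(posMax: number, keep: number): { start: number; end: number } | null {
  const count = posMax + 1;       // kvSeqPosMax-style: highest index, so add 1
  if (count <= keep) return null; // under budget, nothing to remove
  return { start: 0, end: count - keep };
}

// usage sketch:
// const w = windowToRemove(ctx.kvSeqPosMax(0), 1024);
// if (w) await ctx.kvCacheRemove(0, w.start, w.end); // call BEFORE next decode
```

Note that, per the clearAndReseed entry, selective eviction preserves original position IDs; for long-running bounded-memory streams, reconstruction is usually the better tool.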

    • kvCacheSave: Snapshot KV cache state for branching/undo

      Serializes entire model state to Buffer. Restore later with kvCacheLoad() for:

      • Conversation branching ("what if I said X instead?")
      • Undo/redo functionality
      • Checkpointing long conversations

      Size: ~500MB-2GB depending on context length and model

      Cost: ~100-500ms depending on cache size

      Parameters

      • Optional sequenceId: number

        Sequence ID (use 0 for single sequence)

      Returns Promise<Buffer<ArrayBufferLike>>

      Serialized state buffer

    • kvCacheSize: Get max position in the KV cache for a sequence

      Returns the highest position index in the specified sequence, or -1 if the sequence is empty. This is the same value as kvSeqPosMax. To get the token count, add 1.

      Think of this as: "How much has the model read so far?"

      Cost: <0.01ms (fast sync operation - safe to call frequently)

      Parameters

      • Optional sequenceId: number

        Sequence ID (defaults to 0 for single conversation)

      Returns number

      Highest position index, or -1 if empty
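The add-one relationship can be wrapped in a small budget helper (illustrative; nCtx is assumed to be known from your context-creation options):

```typescript
// Convert a highest-position value (kvCacheSize / kvSeqPosMax style, -1 when
// empty) into a token count and remaining context budget.
function contextBudget(posMax: number, nCtx: number): { used: number; remaining: number } {
  const used = posMax + 1; // -1 (empty sequence) maps to 0 tokens
  return { used, remaining: nCtx - used };
}

// usage sketch: const { remaining } = contextBudget(ctx.kvCacheSize(0), 4096);
```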

    • kvCacheWriteFile: Write KV cache state + tokens to file

      Persists KV cache state for later restoration. Useful for checkpointing long conversations.

      Parameters

      • sequenceId: number

        Sequence ID to save

      • filepath: string

        Path to save file

      • tokens: number[]

        Tokens that were decoded into this sequence

      Returns Promise<number>

      Promise resolving to bytes written

    • kvSeqCopy: Fork a KV cache sequence — the primitive behind Branch.fork

      Copies all KV cache entries from srcSeqId to dstSeqId. Under llama.cpp's unified KV cache, this is a metadata-only operation — no key/value tensors are copied. Both sequences reference the same physical KV entries for the shared prefix; only tokens decoded after the fork point allocate new storage. This is what makes tree-structured generation (best-of-N, beam search, speculative decoding) memory-efficient: N branches sharing a 1000-token prefix cost ~1000 KV entries, not N*1000.

      The higher-level Branch.fork wraps this and additionally clones the sampler chain, grammar state, logits snapshot, and perplexity tracker. Use kvSeqCopy directly when you need raw sequence management without the Branch abstraction.

      NOTE: Only full-sequence copies are supported. The p0/p1 parameters must use default values (0 and -1).

      Cost: O(1) metadata — no tensor copy under unified KV

      Parameters

      • srcSeqId: number

        Source sequence to copy from

      • dstSeqId: number

        Destination sequence to copy to

      • Optional p0: number

        Start position (must be 0, default: 0)

      • Optional p1: number

        End position (must be -1 for full copy, default: -1)

      Returns void
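A sketch of the prefix fan-out this enables, using only kvSeqCopy against a structurally-typed ctx (the helper is illustrative, not library API):

```typescript
// Share one decoded prefix (in sequence 0) across n sequences using
// full-sequence copies; metadata-only under the unified KV cache.
function fanOutPrefix(ctx: { kvSeqCopy(src: number, dst: number): void }, n: number): number[] {
  const seqIds = [0];
  for (let dst = 1; dst < n; dst++) {
    ctx.kvSeqCopy(0, dst); // p0/p1 left at defaults, per the restriction above
    seqIds.push(dst);
  }
  return seqIds;
}
```

Each destination sequence can then diverge independently; only post-fork tokens allocate new KV storage.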

    • kvSeqKeep: Keep only specified sequence, remove all others

      Removes all sequences except the one specified. For complete cleanup of unwanted sequences, consider using kvCacheRemove(seqId, 0, -1) on each sequence instead.

      Parameters

      • seqId: number

        Sequence ID to keep

      Returns void

    • kvSeqPosMax: Get max position in sequence

      Returns the highest position index in the specified sequence, or -1 if the sequence is empty.

      Cost: <0.01ms (fast sync operation)

      Parameters

      • seqId: number

        Sequence ID to query

      Returns number

      Max position index, or -1 if empty

      const pos = ctx.kvSeqPosMax(0);
      if (pos === -1) {
        console.log('Sequence is empty');
      } else {
        console.log(`Sequence has ${pos + 1} tokens`);
      }
    • parseChatOutput: Parse model output into structured content

      Extracts plain text, reasoning/thinking blocks, and tool calls from raw model output. Uses the format detected by formatChat to apply the correct parser for the model's output format.

      Cost: <0.1ms (synchronous string parsing, no I/O)

      Parameters

      • output: string
      • format: number
      • Optional options: ParseChatOutputOptions

      Returns ParseChatOutputResult

      Parsed content with tool calls and reasoning

      const fmt = await ctx.formatChat(JSON.stringify(messages), { tools: toolsJson });
      // ... generate tokens ...
      const parsed = ctx.parseChatOutput(generatedText, fmt.format, {
        reasoningFormat: fmt.reasoningFormat,
        thinkingForcedOpen: fmt.thinkingForcedOpen,
        parser: fmt.parser
      });
      if (parsed.toolCalls.length > 0) {
        // Handle tool calls
      }

      // parseChatOutput separates <think>...</think> blocks into reasoningContent.
      // This is REQUIRED for correct warm continuation on thinking models (e.g. Qwen3):
      // if raw output containing <think> tags is stored as content, re-formatting
      // the conversation produces different tokens, breaking cold/warm parity.

      const messages: Array<{ role: string; content: string; reasoning_content?: string }> = [];
      const sep = ctx.getTurnSeparator();
      let branch: Branch | null = null;
      let fmt: FormattedChatResult;

      async function handleTurn(userContent: string) {
        messages.push({ role: 'user', content: userContent });

        if (!branch) {
          // Cold path: format full conversation, tokenize with BOS, prefill
          fmt = await ctx.formatChat(JSON.stringify(messages));
          const tokens = await ctx.tokenize(fmt.prompt);
          branch = Branch.create(ctx, 0, { temperature: 0.7 });
          await branch.prefill(tokens);
        } else {
          // Warm path: string-diff for delta tokens
          const { prompt: full } = await ctx.formatChat(JSON.stringify(messages));
          const { prompt: prefix } = await ctx.formatChat(
            JSON.stringify(messages.slice(0, -1)),
            { addGenerationPrompt: false }
          );
          const delta = await ctx.tokenize(full.substring(prefix.length), false);
          await branch.prefill([...sep, ...delta]);
        }

        // Generate
        let rawOutput = '';
        while (true) {
          const { token, text, isStop } = await branch.produce();
          if (isStop) break;
          rawOutput += text;
          await branch.commit(token);
        }

        // Parse output: separates reasoning from content
        const parsed = ctx.parseChatOutput(rawOutput, fmt.format, {
          reasoningFormat: fmt.reasoningFormat,
          thinkingForcedOpen: fmt.thinkingForcedOpen,
          parser: fmt.parser
        });

        // Store parsed fields — formatChat reconstructs thinking blocks correctly
        messages.push({
          role: 'assistant',
          content: parsed.content,
          reasoning_content: parsed.reasoningContent || undefined
        });
      }
    • tokenize: Tokenize text into model's vocabulary

      Converts human text → token IDs for decode(). Same text always produces same tokens for a given model.

      Cost: ~1ms per 100 characters

      Parameters

      • text: string

        Text to tokenize

      • Optional addSpecial: boolean

        Whether to add special tokens (BOS/EOS). Defaults to model metadata setting (typically true). Pass false for mid-sequence tokenization (e.g., warm multi-turn continuation deltas).

      Returns Promise<number[]>

      Array of token IDs

      // Full sequence (default — includes BOS)
      const tokens = await ctx.tokenize("Hello world");

      // Mid-sequence delta (no BOS)
      const delta = await ctx.tokenize("continuation text", false);
    • tokenizeSync: Tokenize text into model's vocabulary (sync — inline on main thread)

      Same as tokenize but synchronous. Use from Effection generators to avoid yield* call() overhead for CPU-only work.

      Parameters

      • text: string

        Text to tokenize

      • Optional addSpecial: boolean

        Whether to add special tokens (BOS/EOS). Defaults to model metadata setting (typically true). Pass false for mid-sequence tokenization.

      Returns number[]

      Array of token IDs

    • tokenToText: Convert token ID to text piece

      Fast synchronous lookup in vocabulary table. Call this on each generated token for streaming display.

      Optimized for per-token conversion during generation. For batch conversion of many tokens, use detokenize() instead.

      Cost: ~0.05ms

      Parameters

      • token: number

        Token ID

      Returns string

      Text string for this token

    • validateChatTemplate: Validate chat template syntax

      Checks if template string is valid before using.

      Cost: ~0.1-1ms

      Parameters

      • templateString: string

        Template string to validate

      Returns Promise<boolean>

      True if template syntax is valid