Path to .gguf model file
embeddings (optional): Enable embedding extraction mode
When true, the context is optimized for embedding extraction. Use with the encode() and getEmbeddings() methods. Default: false (text generation mode)
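A minimal embedding-mode configuration, as a sketch: the option names (embeddings, pooling, nCtx) and defaults come from this reference, while the object name and the model path are illustrative assumptions.

```typescript
// Only the option names and defaults below come from this reference;
// the variable name and model path are placeholders.
const embeddingConfig = {
  modelPath: "/path/to/model.gguf", // path to the .gguf model file
  embeddings: true, // switch from text generation to embedding extraction
  pooling: "mean",  // default pooling type for embedding contexts
  nCtx: 2048,       // default context size
};
```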
nBatch (optional): Batch size for token processing
Controls how many tokens are processed per llama_decode call. Higher values improve prompt-prefill throughput at the cost of memory. Also sets llama_context_params.n_batch and n_ubatch at context creation. Default: 512
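Because nBatch caps the tokens submitted per llama_decode call, a prompt longer than nBatch is prefilled in multiple calls. The helper below is an illustrative sketch (not part of the API) that counts them:

```typescript
// Number of llama_decode calls needed to prefill a prompt of
// promptTokens tokens when at most nBatch tokens go in each call.
function decodeCalls(promptTokens: number, nBatch: number): number {
  return Math.ceil(promptTokens / nBatch);
}

// A 1300-token prompt at the default nBatch of 512 takes 3 calls.
console.log(decodeCalls(1300, 512)); // → 3
```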
nCtx (optional): Context size (default: 2048)
nSeqMax (optional): Maximum number of sequences for multi-sequence support
Set > 1 to enable multiple independent KV cache sequences, useful for parallel decoding or conversation branching. Default: 1 (single sequence)
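A configuration sketch for parallel decoding: the option names are from this reference, while the variable name, model path, and the chosen values are illustrative assumptions.

```typescript
// Option names come from this reference; values are illustrative.
const parallelConfig = {
  modelPath: "/path/to/model.gguf",
  nSeqMax: 4, // four independent KV cache sequences
  nCtx: 8192, // context size for the context as a whole
};
```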
nThreads (optional): Number of threads (default: 4)
pooling (optional): Pooling type for embedding extraction
Only relevant when embeddings is true. Default: MEAN for embedding contexts, NONE otherwise
typeK (optional): KV cache data type for keys
Quantize the key cache to reduce GPU memory. For a Q4_K_M model, an F16 cache wastes precision; Q8_0 halves cache memory with minimal quality loss.
KV cache memory at nCtx=8192 (Qwen3-4B: 36 layers, 8 KV heads, 128 head dim):
  f16:  1152 MB
  q8_0: ~576 MB
  q4_0: ~288 MB
Default: 'f16'
typeV (optional): KV cache data type for values
Same options as typeK. The V cache is slightly more quality-sensitive than the K cache. Default: 'f16'
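The cache sizes quoted above follow from the standard KV cache formula: 2 tensors (K and V) × context length × layers × KV heads × head dim × bytes per element. A sketch that reproduces the Qwen3-4B numbers (the byte sizes ignore the small per-block scale overhead of the quantized types, which is why the quoted quantized figures are approximate):

```typescript
// Bytes per cache element: f16 = 2, q8_0 ≈ 1, q4_0 ≈ 0.5
// (per-block scale overhead of the quantized formats is ignored).
const bytesPerElem: Record<string, number> = { f16: 2, q8_0: 1, q4_0: 0.5 };

// Approximate KV cache size in MiB for a given model shape and cache type.
function kvCacheMiB(
  nCtx: number, layers: number, kvHeads: number, headDim: number,
  type: "f16" | "q8_0" | "q4_0",
): number {
  const elems = 2 * nCtx * layers * kvHeads * headDim; // K and V tensors
  return (elems * bytesPerElem[type]) / (1024 * 1024);
}

// Qwen3-4B at nCtx=8192: 36 layers, 8 KV heads, 128 head dim.
console.log(kvCacheMiB(8192, 36, 8, 128, "f16"));  // → 1152
console.log(kvCacheMiB(8192, 36, 8, 128, "q8_0")); // → 576
console.log(kvCacheMiB(8192, 36, 8, 128, "q4_0")); // → 288
```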
Configuration for context creation
Controls the resource envelope for inference: context window size (nCtx), batch throughput (nBatch), compute parallelism (nThreads), and multi-sequence capacity (nSeqMax). These map directly to llama_context_params and are fixed for the context's lifetime.
Key tradeoffs: