lloyal-agents API Reference

    Interface ContextOptions

    Configuration for context creation

    Controls the resource envelope for inference: context window size (nCtx), batch throughput (nBatch), compute parallelism (nThreads), and multi-sequence capacity (nSeqMax). These map directly to llama_context_params and are fixed for the context's lifetime.

    Key tradeoffs:

    • nCtx: Larger = longer conversations, but linear KV memory growth.
    • nBatch: Larger = faster prompt prefill (more tokens per GPU dispatch), but higher peak memory. Also sets the bin-packing capacity for BranchStore.prefill.
    • nSeqMax: Set ≥ your max concurrent branch count + 1 (root sequence). Each sequence shares the same KV cache memory pool — cost is metadata only under unified KV, not a per-sequence memory multiplier.
    interface ContextOptions {
        modelPath: string;
        embeddings?: boolean;
        nBatch?: number;
        nCtx?: number;
        nSeqMax?: number;
        nThreads?: number;
        poolingType?: PoolingType;
        typeK?: KvCacheType;
        typeV?: KvCacheType;
    }
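A sketch of how these options compose for a branching chat workload. The option shape matches the interface above; the model path is hypothetical, and the types are repeated locally (minus `PoolingType`) so the snippet stands alone:

```typescript
// Trimmed local copy of ContextOptions so this sketch is self-contained.
type KvCacheType = "f16" | "q8_0" | "q4_0";

interface ContextOptions {
  modelPath: string;
  embeddings?: boolean;
  nBatch?: number;
  nCtx?: number;
  nSeqMax?: number;
  nThreads?: number;
  typeK?: KvCacheType;
  typeV?: KvCacheType;
}

// Hypothetical sizing for 8 concurrent branches plus the root sequence.
const maxBranches = 8;

const opts: ContextOptions = {
  modelPath: "/models/qwen3-4b-q4_k_m.gguf", // hypothetical path
  nCtx: 8192,               // KV memory grows linearly with this
  nBatch: 512,              // prefill chunk size (the default)
  nThreads: 4,
  nSeqMax: maxBranches + 1, // branches + root sequence
  typeK: "q8_0",            // halve key-cache memory vs f16
  typeV: "f16",             // V cache is more quality-sensitive; keep f16
};
```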

    Properties

    modelPath: string

    Path to .gguf model file

    embeddings?: boolean

    Enable embedding extraction mode

    When true, context is optimized for embedding extraction. Use with encode() and getEmbeddings() methods. Default: false (text generation mode)
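For contrast with the generation config, an embedding-mode option set might look like the following. The `PoolingType` members and the model path are assumptions for illustration; only the field names come from the interface above:

```typescript
// Local stand-ins for the library's types, so the sketch is self-contained.
// The real PoolingType is an enum; the members below are assumed.
type PoolingType = "NONE" | "MEAN" | "CLS";

interface EmbeddingOptions {
  modelPath: string;
  embeddings?: boolean;
  poolingType?: PoolingType;
  nCtx?: number;
}

const embedOpts: EmbeddingOptions = {
  modelPath: "/models/embed-model.gguf", // hypothetical path
  embeddings: true,     // switch the context into embedding mode
  poolingType: "MEAN",  // the documented default for embedding contexts
  nCtx: 512,            // short inputs; no need for a large window
};
```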

    nBatch?: number

    Batch size for token processing

    Controls how many tokens are processed per llama_decode call. Higher values improve throughput for prompt prefill at the cost of memory. Also sets llama_context_params.n_batch and n_ubatch at context creation. Default: 512
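The throughput effect can be seen with a back-of-envelope cost model: a prompt is consumed in roughly ceil(promptTokens / nBatch) `llama_decode` calls, so doubling `nBatch` halves the dispatch count at the cost of higher peak activation memory. A minimal sketch:

```typescript
// Rough prefill cost model: number of llama_decode dispatches needed
// to consume a prompt of the given length.
function prefillDispatches(promptTokens: number, nBatch: number): number {
  return Math.ceil(promptTokens / nBatch);
}

prefillDispatches(2000, 512);  // 4 dispatches at the default batch size
prefillDispatches(2000, 1024); // 2 dispatches with a larger batch
```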

    nCtx?: number

    Context window size in tokens (default: 2048)

    nSeqMax?: number

    Maximum number of sequences for multi-sequence support

    Set > 1 to enable multiple independent KV cache sequences. Useful for parallel decoding or conversation branching. Default: 1 (single sequence)
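The sizing rule from the tradeoffs section above can be written as a one-line helper (a sketch, not part of the library's API):

```typescript
// nSeqMax must cover every concurrent branch plus the root sequence
// they fork from, per the sizing guidance above.
function requiredSeqMax(maxConcurrentBranches: number): number {
  return maxConcurrentBranches + 1; // +1 for the root sequence
}

requiredSeqMax(4); // 5: four branches decoding in parallel, plus the root
```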

    nThreads?: number

    Number of CPU threads used for inference (default: 4)

    poolingType?: PoolingType

    Pooling type for embedding extraction

    Only relevant when embeddings=true. Default: MEAN for embedding contexts, NONE otherwise

    typeK?: KvCacheType

    KV cache data type for keys

    Quantize the key cache to reduce GPU memory. For a Q4_K_M model, an F16 cache carries more precision than the quantized weights can exploit; Q8_0 halves cache memory with minimal quality loss.

    Memory at nCtx=8192 (Qwen3-4B, 36 layers, 8 KV heads, 128 head dim):

    • f16: 1152 MB
    • q8_0: ~576 MB
    • q4_0: ~288 MB

    Default: 'f16'
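The figures above follow from a simple size formula: per token per layer, the cache stores kvHeads × headDim keys and the same number of values. A sketch that reproduces them, treating q8_0 and q4_0 as ~1 and ~0.5 bytes per element and ignoring quantization block overhead:

```typescript
// Approximate bytes per cached element; real q8_0/q4_0 blocks carry a
// small additional scale overhead, hence the "~" in the figures above.
const bytesPerElement = { f16: 2, q8_0: 1, q4_0: 0.5 };

function kvCacheMiB(
  nCtx: number,
  layers: number,
  kvHeads: number,
  headDim: number,
  type: keyof typeof bytesPerElement,
): number {
  const elements = 2 * nCtx * layers * kvHeads * headDim; // 2 = K + V
  return (elements * bytesPerElement[type]) / (1024 * 1024);
}

// Qwen3-4B at nCtx=8192: 36 layers, 8 KV heads, head dim 128.
kvCacheMiB(8192, 36, 8, 128, "f16");  // 1152 MiB
kvCacheMiB(8192, 36, 8, 128, "q8_0"); // ~576 MiB
```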

    typeV?: KvCacheType

    KV cache data type for values

    Same options as typeK. V cache is slightly more quality-sensitive than K. Default: 'f16'