Path to .gguf model file
embeddings (optional): Enable embedding extraction mode
When true, the context is optimized for embedding extraction. Use with the encode() and getEmbeddings() methods. Default: false (text generation mode)
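A minimal embedding-mode configuration, as a sketch: the option names (embeddings, pooling, nCtx) and defaults come from this reference, while the object name and the model path are illustrative assumptions.

```typescript
// Only the option names and defaults below come from this reference;
// the variable name and model path are placeholders.
const embeddingConfig = {
  modelPath: "/path/to/model.gguf", // path to the .gguf model file
  embeddings: true, // switch from text generation to embedding extraction
  pooling: "mean",  // default pooling type for embedding contexts
  nCtx: 2048,       // default context size
};
```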
nBatch (optional): Batch size for token processing
Controls how many tokens are processed per llama_decode call. Higher values improve prompt-prefill throughput at the cost of memory. Also sets llama_context_params.n_batch and n_ubatch at context creation. Default: 512
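Because nBatch caps the tokens submitted per llama_decode call, a prompt longer than nBatch is prefilled in multiple calls. The helper below is an illustrative sketch (not part of the API) that counts them:

```typescript
// Number of llama_decode calls needed to prefill a prompt of
// promptTokens tokens when at most nBatch tokens go in each call.
function decodeCalls(promptTokens: number, nBatch: number): number {
  return Math.ceil(promptTokens / nBatch);
}

// A 1300-token prompt at the default nBatch of 512 takes 3 calls.
console.log(decodeCalls(1300, 512)); // → 3
```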
nCtx (optional): Context size (default: 2048)
nSeqMax (optional): Maximum number of sequences for multi-sequence support
Set > 1 to enable multiple independent KV cache sequences, useful for parallel decoding or conversation branching. Default: 1 (single sequence)
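A configuration sketch for parallel decoding: the option names are from this reference, while the variable name, model path, and the chosen values are illustrative assumptions.

```typescript
// Option names come from this reference; values are illustrative.
const parallelConfig = {
  modelPath: "/path/to/model.gguf",
  nSeqMax: 4, // four independent KV cache sequences
  nCtx: 8192, // context size for the context as a whole
};
```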
nThreads (optional): Number of threads (default: 4)
pooling (optional): Pooling type for embedding extraction
Only relevant when embeddings is true. Default: MEAN for embedding contexts, NONE otherwise
typeK (optional): KV cache data type for keys
Quantize the key cache to reduce GPU memory. For a Q4_K_M model, an F16 cache wastes precision; Q8_0 halves cache memory with minimal quality loss.
KV cache memory at nCtx=8192 (Qwen3-4B: 36 layers, 8 KV heads, 128 head dim):
  f16:  1152 MB
  q8_0: ~576 MB
  q4_0: ~288 MB
Default: 'f16'
typeV (optional): KV cache data type for values
Same options as typeK. The V cache is slightly more quality-sensitive than the K cache. Default: 'f16'
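The cache sizes quoted above follow from the standard KV cache formula: 2 tensors (K and V) × context length × layers × KV heads × head dim × bytes per element. A sketch that reproduces the Qwen3-4B numbers (the byte sizes ignore the small per-block scale overhead of the quantized types, which is why the quoted quantized figures are approximate):

```typescript
// Bytes per cache element: f16 = 2, q8_0 ≈ 1, q4_0 ≈ 0.5
// (per-block scale overhead of the quantized formats is ignored).
const bytesPerElem: Record<string, number> = { f16: 2, q8_0: 1, q4_0: 0.5 };

// Approximate KV cache size in MiB for a given model shape and cache type.
function kvCacheMiB(
  nCtx: number, layers: number, kvHeads: number, headDim: number,
  type: "f16" | "q8_0" | "q4_0",
): number {
  const elems = 2 * nCtx * layers * kvHeads * headDim; // K and V tensors
  return (elems * bytesPerElem[type]) / (1024 * 1024);
}

// Qwen3-4B at nCtx=8192: 36 layers, 8 KV heads, 128 head dim.
console.log(kvCacheMiB(8192, 36, 8, 128, "f16"));  // → 1152
console.log(kvCacheMiB(8192, 36, 8, 128, "q8_0")); // → 576
console.log(kvCacheMiB(8192, 36, 8, 128, "q4_0")); // → 288
```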
Configuration for context creation
Controls the resource envelope for inference: context window size (nCtx), batch throughput (nBatch), compute parallelism (nThreads), and multi-sequence capacity (nSeqMax). These map directly to llama_context_params and are fixed for the context's lifetime.
Key tradeoffs: