pub struct LlamaParams {
    pub ctx_size: Option<u32>,
    pub n_gpu_layers: Option<i32>,
    pub threads: Option<u32>,
    pub batch_size: Option<u32>,
    pub parallel: Option<u32>,
    pub seed: Option<i64>,
    pub flash_attn: Option<bool>,
    pub mlock: Option<bool>,
    pub mmap: Option<bool>,
    pub cache_type_k: Option<String>,
    pub cache_type_v: Option<String>,
    pub enable_thinking: Option<bool>,
}
Per-model llama-server load parameters (the “Load” tab, à la LM Studio).
Absent fields fall back to the server’s own defaults (no flag passed).
Fields

ctx_size: Option<u32>
Context window size (--ctx-size).

n_gpu_layers: Option<i32>
Layers offloaded to the GPU (--n-gpu-layers); 0 = CPU only.

threads: Option<u32>
CPU threads (--threads).

batch_size: Option<u32>
Logical batch size for prompt eval (--batch-size).

parallel: Option<u32>
Parallel sequences / max concurrent predictions (--parallel).

seed: Option<i64>
RNG seed (--seed); omit for random.

flash_attn: Option<bool>
Enable Flash Attention (--flash-attn).

mlock: Option<bool>
Lock the model in RAM (--mlock) - “keep model in memory”.

mmap: Option<bool>
Memory-map the model file (--mmap); false passes --no-mmap.

cache_type_k: Option<String>
KV cache K type, e.g. f16, q8_0, q4_0 (--cache-type-k).

cache_type_v: Option<String>
KV cache V type (--cache-type-v).

enable_thinking: Option<bool>
For reasoning models: when Some(false), disable “thinking” by passing
--reasoning-budget 0 (much faster, lower memory). Only meaningful for
models that support it; None/Some(true) leaves the default on.
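
The field descriptions above imply a simple mapping from set fields to llama-server flags. The following is a minimal sketch of that mapping, not the crate's actual implementation: the helper name `to_args` is hypothetical, the struct is repeated only so the snippet compiles on its own, and the `Default` derive is an assumption made for convenience. It also assumes that `Some(true)` for `mmap` needs no flag and that `--flash-attn` is passed as a bare switch, as the descriptions suggest; some llama-server builds may expect a different form.

```rust
// Mirror of the documented struct so this sketch is self-contained.
// `Default` is derived here for convenience only (an assumption).
#[derive(Clone, Default)]
pub struct LlamaParams {
    pub ctx_size: Option<u32>,
    pub n_gpu_layers: Option<i32>,
    pub threads: Option<u32>,
    pub batch_size: Option<u32>,
    pub parallel: Option<u32>,
    pub seed: Option<i64>,
    pub flash_attn: Option<bool>,
    pub mlock: Option<bool>,
    pub mmap: Option<bool>,
    pub cache_type_k: Option<String>,
    pub cache_type_v: Option<String>,
    pub enable_thinking: Option<bool>,
}

/// Hypothetical helper: build llama-server arguments from the set fields.
/// Fields left as `None` produce no flag, so the server default applies.
pub fn to_args(p: &LlamaParams) -> Vec<String> {
    let mut args: Vec<String> = Vec::new();

    // Flags that take a value: emitted only when the field is `Some`.
    let valued: [(&str, Option<String>); 8] = [
        ("--ctx-size", p.ctx_size.map(|v| v.to_string())),
        ("--n-gpu-layers", p.n_gpu_layers.map(|v| v.to_string())),
        ("--threads", p.threads.map(|v| v.to_string())),
        ("--batch-size", p.batch_size.map(|v| v.to_string())),
        ("--parallel", p.parallel.map(|v| v.to_string())),
        ("--seed", p.seed.map(|v| v.to_string())),
        ("--cache-type-k", p.cache_type_k.clone()),
        ("--cache-type-v", p.cache_type_v.clone()),
    ];
    for (flag, value) in valued {
        if let Some(v) = value {
            args.push(flag.to_string());
            args.push(v);
        }
    }

    // Boolean switches: only the non-default direction emits a flag.
    if p.flash_attn == Some(true) {
        args.push("--flash-attn".to_string());
    }
    if p.mlock == Some(true) {
        args.push("--mlock".to_string());
    }
    // Assumption: Some(true) needs no flag; only Some(false) opts out.
    if p.mmap == Some(false) {
        args.push("--no-mmap".to_string());
    }
    // Thinking is disabled only on an explicit Some(false).
    if p.enable_thinking == Some(false) {
        args.push("--reasoning-budget".to_string());
        args.push("0".to_string());
    }

    args
}
```

With this sketch, a value that sets only `ctx_size: Some(8192)` and `enable_thinking: Some(false)` would yield `--ctx-size 8192 --reasoning-budget 0`, while an all-`None` value yields no arguments at all.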
Trait Implementations

impl Clone for LlamaParams

fn clone(&self) -> LlamaParams

fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.