Local model runtime
Local inference is served by a managed llama.cpp llama-server subprocess,
reached through the OpenAI-compatible local provider. LLMs from the
Model Hub (gguf, plus mlx on Apple Silicon) load
into the server, and the app talks to it over a loopback HTTP API. This is the
“LM Studio role”. A large MoE model cannot run in-process, so the runtime
manages the process lifecycle instead. Inference data never leaves the machine.
The managed server
Section titled “The managed server”The runtime owns the llama-server lifecycle:
- Start spawns
llama-server --model <model> --host 127.0.0.1 --port <free>with the per-model parameter flags. It then polls the models endpoint until it answers, and fails fast if the child exits early. - Stop kills the child. This is idempotent and uses kill-on-drop.
- Status reports
{ running, endpoint, model_path, pid }for the Models-tab indicator.
Keeping more than one model loaded
Section titled “Keeping more than one model loaded”By default the runtime keeps a single model resident: loading a new model stops
the previous one, which is the original behavior. You can raise this with the
max_loaded_models setting (default 1) to keep several models resident at
once, each served by its own llama-server on its own loopback port. With more
than one model loaded:
- Switching to a model that is already loaded is instant, because its server is still running.
- When you hit the model limit or a memory budget (a fraction of total RAM), the least-recently-used model is evicted to make room.
- A model can be preloaded, or warmed up, ahead of time so that a later run which targets it pays no first-use loading delay.
- Each resident model keeps its own request slots, so two loaded models serve requests independently rather than competing for one shared pool.
The Model Hub shows the resident models and lets you preload or unload each one.
Engine resolution
Section titled “Engine resolution”The engine binary is resolved on first Load, in order:
- A previously fetched managed engine (
~/.flow-studio/engines/). - Your saved setting.
- An engine auto-detected on
$PATHor common directories. - Otherwise a managed engine is fetched automatically for your OS and architecture. This needs no setup, shows a determinate progress bar, and is reused on every later Load.
Per-model load parameters
Section titled “Per-model load parameters”Load settings map to llama-server flags and persist per model:
| Setting | Flag |
|---|---|
| Context size | --ctx-size |
| GPU offload layers | --n-gpu-layers |
| Threads / batch / parallel / seed | corresponding flags |
| Flash attention | --flash-attn |
| Memory lock / mmap | --mlock / --mmap |
| KV-cache types | --cache-type-k / --cache-type-v |
| Reasoning toggle | --reasoning-budget 0 disables thinking on capable models |
In Flow Code, the run-config
popover exposes the context size and the reasoning toggle per model, with the
context-size slider wired to --ctx-size.
Invocation path
Section titled “Invocation path”The ai node and the flow generator reach the server through the same
provider path used for cloud providers. They differ only by the gate and the
loopback endpoint. The local provider POSTs to the server’s
OpenAI-compatible chat endpoint and streams the response. Token deltas
stream back live, so the header token ticker and the TUI update as the
model drafts. The internal dispatch is documented in
flow-execution.
Hardware acceleration
Section titled “Hardware acceleration”GPU offload is a per-model setting. Setting offload layers above 0 engages the platform GPU, which is Metal on Apple Silicon and CUDA where available. CPU is always the safe fallback. There are no inference build features to compile.
Observability
Section titled “Observability”Tracing logs land in the daily-rotating app log under ~/.flow-studio/logs/.
FLOW_LOG overrides the default level. No model telemetry is transmitted
externally by the local AI runtime.
Related
Section titled “Related”- Model Hub - downloading and loading models.
- AI overview - how the
ainode binds a model. - Isolation boundaries - the local reasoning boundary.