Local model runtime

Local inference is served by a managed llama.cpp llama-server subprocess, reached through the OpenAI-compatible local provider. LLMs from the Model Hub (GGUF, plus MLX on Apple Silicon) load into the server, and the app talks to it over a loopback HTTP API. The runtime plays the “LM Studio role”: a large MoE model cannot run in-process, so the runtime manages a separate server process and its lifecycle instead. Inference data never leaves the machine.

The runtime owns the llama-server lifecycle:

  • Start spawns llama-server --model <model> --host 127.0.0.1 --port <free> with the per-model parameter flags. It then polls the models endpoint until it answers, and fails fast if the child exits early.
  • Stop kills the child. This is idempotent and uses kill-on-drop.
  • Status reports { running, endpoint, model_path, pid } for the Models-tab indicator.
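The Start sequence above can be sketched in a few lines. This is a minimal illustration, not the runtime's actual code: the helper names (find_free_port, build_command, wait_until_ready), the timeout, and the polling interval are all hypothetical, and the readiness probe is passed in as a callable rather than tied to a specific URL.

```python
import socket
import time
from typing import Callable, List, Optional


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused loopback port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def build_command(binary: str, model_path: str, port: int,
                  extra_flags: List[str]) -> List[str]:
    """Assemble the argv from the Start bullet; extra_flags are per-model flags."""
    return [binary, "--model", model_path,
            "--host", "127.0.0.1", "--port", str(port), *extra_flags]


def wait_until_ready(probe: Callable[[], bool],
                     exit_code: Callable[[], Optional[int]],
                     timeout_s: float = 60.0,      # assumed value
                     interval_s: float = 0.25) -> None:  # assumed value
    """Poll until probe() succeeds; fail fast if the child already exited."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        code = exit_code()
        if code is not None:
            raise RuntimeError(f"llama-server exited early with code {code}")
        if probe():
            return
        time.sleep(interval_s)
    raise TimeoutError("llama-server did not become ready in time")
```

With a child started through subprocess.Popen, exit_code would be the process's poll method and probe would be an HTTP GET against the models endpoint that returns True once the server answers.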

By default the runtime keeps a single model resident: loading a new model stops the previous one, which is the original behavior. Raising the max_loaded_models setting (default 1) keeps several models resident at once, each served by its own llama-server on its own loopback port. With more than one model loaded:

  • Switching to a model that is already loaded is instant, because its server is still running.
  • When you hit the model limit or a memory budget (a fraction of total RAM), the least-recently-used model is evicted to make room.
  • A model can be preloaded, or warmed up, ahead of time so that a later run which targets it pays no first-use loading delay.
  • Each resident model keeps its own request slots, so two loaded models serve requests independently rather than competing for one shared pool.
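The eviction rule above can be illustrated with a small least-recently-used pool. This is a sketch under stated assumptions: the ResidentModels class, its method names, and the byte-based memory budget are hypothetical, and a real runtime would also stop the evicted model's server process.

```python
from collections import OrderedDict
from typing import List


class ResidentModels:
    """Tracks loaded models, oldest use first (hypothetical helper)."""

    def __init__(self, max_loaded_models: int = 1, memory_budget_bytes: int = 0):
        self.max_loaded_models = max_loaded_models
        self.memory_budget_bytes = memory_budget_bytes  # 0 means no budget
        self._models: "OrderedDict[str, int]" = OrderedDict()  # name -> size

    def _over_limit(self, incoming_bytes: int) -> bool:
        """Would adding one more model exceed the count or memory limit?"""
        too_many = len(self._models) + 1 > self.max_loaded_models
        total = sum(self._models.values()) + incoming_bytes
        too_big = self.memory_budget_bytes > 0 and total > self.memory_budget_bytes
        return too_many or too_big

    def use(self, name: str, size_bytes: int) -> List[str]:
        """Mark a model as most recently used, loading it if needed.

        Returns the names evicted to make room (empty if already resident).
        """
        if name in self._models:
            self._models.move_to_end(name)  # instant switch, nothing evicted
            return []
        evicted = []
        while self._models and self._over_limit(size_bytes):
            victim, _ = self._models.popitem(last=False)  # least recently used
            evicted.append(victim)
        self._models[name] = size_bytes
        return evicted

    def resident(self) -> List[str]:
        return list(self._models)
```

With max_loaded_models left at 1, every new load evicts the previous model, which matches the single-resident default.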

The Model Hub shows the resident models and lets you preload or unload each one.

The engine binary is resolved on first Load, in order:

  1. A previously fetched managed engine (~/.flow-studio/engines/).
  2. Your saved setting.
  3. An engine auto-detected on $PATH or common directories.
  4. Otherwise a managed engine is fetched automatically for your OS and architecture. This needs no setup, shows a determinate progress bar, and is reused on every later Load.
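The resolution order above amounts to a first-match chain. The sketch below is illustrative only: resolve_engine and its parameters are hypothetical names, the binary is assumed to sit directly inside the managed directory, and fetch_managed stands in for the automatic download step.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, Optional


def resolve_engine(
    managed_dir: Path,
    saved_setting: Optional[str],
    common_dirs: Iterable[Path],
    fetch_managed: Callable[[], Path],
    binary_name: str = "llama-server",
    is_file: Callable[[Path], bool] = Path.is_file,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Path:
    """Return the first engine found, following the documented order."""
    # 1. A previously fetched managed engine (assumed layout: dir/binary).
    managed = managed_dir / binary_name
    if is_file(managed):
        return managed
    # 2. The saved setting, if it still points at a file.
    if saved_setting and is_file(Path(saved_setting)):
        return Path(saved_setting)
    # 3. Auto-detection on $PATH, then in common directories.
    on_path = which(binary_name)
    if on_path:
        return Path(on_path)
    for directory in common_dirs:
        candidate = directory / binary_name
        if is_file(candidate):
            return candidate
    # 4. Nothing found: fetch a managed engine for this OS and architecture.
    return fetch_managed()
```

Passing is_file and which as parameters keeps the lookup order easy to exercise without touching the real filesystem.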

Load settings map to llama-server flags and persist per model:

| Setting | Flag |
| --- | --- |
| Context size | --ctx-size |
| GPU offload layers | --n-gpu-layers |
| Threads / batch / parallel / seed | corresponding flags |
| Flash attention | --flash-attn |
| Memory lock / mmap | --mlock / --mmap |
| KV-cache types | --cache-type-k / --cache-type-v |
| Reasoning toggle | --reasoning-budget 0 disables thinking on capable models |
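As a rough illustration of how persisted settings could become command-line arguments, the sketch below covers a subset of the settings listed above. The LoadSettings dataclass, its field names, and to_flags are hypothetical; only the flag names come from the list above, and unset fields are assumed to fall back to llama-server defaults.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoadSettings:
    """Per-model load settings (hypothetical shape; None means server default)."""
    ctx_size: Optional[int] = None
    n_gpu_layers: Optional[int] = None
    cache_type_k: Optional[str] = None
    cache_type_v: Optional[str] = None
    mlock: bool = False
    reasoning_enabled: bool = True


def to_flags(settings: LoadSettings) -> List[str]:
    """Translate settings into llama-server flags, skipping unset values."""
    flags: List[str] = []
    if settings.ctx_size is not None:
        flags += ["--ctx-size", str(settings.ctx_size)]
    if settings.n_gpu_layers is not None:
        flags += ["--n-gpu-layers", str(settings.n_gpu_layers)]
    if settings.cache_type_k is not None:
        flags += ["--cache-type-k", settings.cache_type_k]
    if settings.cache_type_v is not None:
        flags += ["--cache-type-v", settings.cache_type_v]
    if settings.mlock:
        flags.append("--mlock")
    if not settings.reasoning_enabled:
        # A budget of 0 disables thinking on capable models.
        flags += ["--reasoning-budget", "0"]
    return flags
```

For example, LoadSettings(ctx_size=8192, reasoning_enabled=False) would yield the arguments --ctx-size 8192 --reasoning-budget 0.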

In Flow Code, the run-config popover exposes the context size and the reasoning toggle per model, with the context-size slider wired to --ctx-size.

The ai node and the flow generator reach the server through the same provider path used for cloud providers. They differ only by the gate and the loopback endpoint. The local provider POSTs to the server’s OpenAI-compatible chat endpoint and streams the response. Token deltas stream back live, so the header token ticker and the TUI update as the model drafts. The internal dispatch is documented in flow-execution.
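To make the streaming step concrete, the sketch below builds a streaming chat request body and extracts token deltas from the server-sent-event lines that an OpenAI-compatible chat endpoint returns. It is not the app's provider code: chat_request_body and iter_token_deltas are hypothetical helpers, the model name "local" is a placeholder, and the HTTP transport is left out so the parsing logic stands alone.

```python
import json
from typing import Dict, Iterable, Iterator, List


def chat_request_body(messages: List[Dict[str, str]], model: str = "local") -> str:
    """JSON body for a streaming chat completion ("local" is a placeholder)."""
    return json.dumps({"model": model, "messages": messages, "stream": True})


def iter_token_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from 'data: {...}' event lines until '[DONE]'."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

A caller would feed the response lines into iter_token_deltas as they arrive and forward each delta to whatever displays progress, such as a token ticker.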

GPU offload is a per-model setting. Setting offload layers above 0 engages the platform GPU, which is Metal on Apple Silicon and CUDA where available. CPU is always the safe fallback. The app itself has no inference build features to compile.

Tracing logs land in the daily-rotating app log under ~/.flow-studio/logs/. FLOW_LOG overrides the default level. No model telemetry is transmitted externally by the local AI runtime.