Model Management

[Screenshot: miLLM Models page]

Loading a Model

  1. Navigate to Models in the sidebar
  2. Enter a HuggingFace repository ID (e.g., google/gemma-2-2b-it)
  3. Select Quantization:
| Mode | Bits | VRAM Savings | Quality | Best For |
|------|------|--------------|---------|----------|
| FP16 | 16 | Baseline | Maximum | Precision research |
| Q8 | 8 | ~50% | Minimal loss | Good balance |
| Q4 | 4 | ~75% | Moderate loss | Consumer GPUs (recommended) |
| Q2 | 2 | ~87% | Significant loss | Maximum compression |
  4. Select Device (auto recommended — places the model on GPU with CPU offload if needed)
  5. Optionally enter a HuggingFace Token for gated models (e.g., Llama)
  6. Check Trust Remote Code if required by the model
  7. Click Download & Load Model
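The form fields above correspond closely to the keyword arguments of a HuggingFace-style load call. Below is a minimal sketch of how they combine; `build_load_kwargs` is a hypothetical helper for illustration, not miLLM's actual code:

```python
def build_load_kwargs(repo_id, quant="Q4", device="auto",
                      hf_token=None, trust_remote_code=False):
    """Map the Models-page form fields onto HuggingFace-style
    from_pretrained keyword arguments (hypothetical helper)."""
    bits = {"FP16": 16, "Q8": 8, "Q4": 4, "Q2": 2}[quant]
    kwargs = {
        "pretrained_model_name_or_path": repo_id,
        "device_map": device,              # "auto" = GPU, with CPU offload if needed
        "trust_remote_code": trust_remote_code,
    }
    if hf_token:
        kwargs["token"] = hf_token         # required for gated models like Llama
    if bits < 16:
        # Quantized loads, e.g. via bitsandbytes-style flags
        kwargs[f"load_in_{bits}bit"] = True
    return kwargs

kwargs = build_load_kwargs("google/gemma-2-2b-it", quant="Q4")
```

With the default Q4 setting this yields a 4-bit load request against the given repository ID; choosing FP16 omits the quantization flag entirely.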
Hybrid Models (Mamba/SSM)

Models with Mamba/SSM layers (e.g., granite-4.0-h-*) require the mamba-ssm package for efficient inference. Without it, the naive fallback creates massive intermediate tensors that cause OOM errors. Check that mamba-ssm is installed in your deployment.
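A quick way to verify this from Python is to check whether the package (imported as `mamba_ssm`) resolves at all:

```python
import importlib.util

def has_mamba_ssm():
    """Return True if the mamba-ssm package (imported as mamba_ssm)
    is importable, avoiding the memory-hungry naive SSM fallback."""
    return importlib.util.find_spec("mamba_ssm") is not None

if not has_mamba_ssm():
    print("mamba-ssm not found: hybrid (Mamba/SSM) models may OOM "
          "under the naive fallback. Install it with: pip install mamba-ssm")
```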

Model Locking

When an SAE is attached, the model is automatically locked — preventing accidental unloading during steering experiments. Unlock manually from the model details if needed.
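The locking behaviour can be summarised in a few lines. This is a sketch with hypothetical class and method names, not miLLM's actual implementation:

```python
class LoadedModel:
    """Sketch of the lock rule: attaching an SAE locks the model,
    and a locked model refuses to unload until manually unlocked."""

    def __init__(self, repo_id):
        self.repo_id = repo_id
        self.locked = False

    def attach_sae(self, sae_id):
        self.sae_id = sae_id
        self.locked = True   # auto-lock on SAE attach

    def unlock(self):
        self.locked = False  # manual unlock from model details

    def unload(self):
        if self.locked:
            raise RuntimeError(
                f"{self.repo_id} is locked (SAE attached); unlock first")
        # ... free GPU memory here ...
```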

Downloaded Models

Previously downloaded models appear in a list below the load form. Click Load to switch to any ready model. The previous model is unloaded first to free GPU memory.
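The unload-before-load ordering matters because two full models rarely fit in GPU memory at once. A minimal sketch of that switching logic, using hypothetical names:

```python
class ModelRegistry:
    """Sketch of switching between downloaded models: the current
    model is unloaded first to free GPU memory, then the next loads."""

    def __init__(self):
        self.downloaded = {}   # repo_id -> "ready" or "loaded"
        self.current = None

    def register(self, repo_id):
        self.downloaded[repo_id] = "ready"

    def load(self, repo_id):
        if repo_id not in self.downloaded:
            raise KeyError(f"{repo_id} has not been downloaded")
        if self.current is not None:
            # Unload the previous model before loading the new one
            self.downloaded[self.current] = "ready"
        self.downloaded[repo_id] = "loaded"
        self.current = repo_id
```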

Dynamic Architecture Support

miLLM uses dynamic layer discovery to support any transformer architecture — Llama, Gemma, GPT-2, LFM, Granite, Mistral, Phi, and more. No configuration needed.
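One common way to implement dynamic layer discovery is to walk the module tree and take the largest homogeneous list of submodules as the decoder stack, since that shape is shared across Llama-, Gemma-, and GPT-2-style models. The sketch below uses plain stand-in classes rather than real torch modules, and is an illustration, not miLLM's actual code:

```python
def discover_layers(module, path=""):
    """Recursively find the longest list-valued attribute whose items
    all share one type -- a stand-in for locating the decoder stack."""
    best = (path, [])
    for name, value in vars(module).items():
        child_path = f"{path}.{name}" if path else name
        if (isinstance(value, list) and value
                and len({type(v) for v in value}) == 1
                and len(value) > len(best[1])):
            best = (child_path, value)
        elif hasattr(value, "__dict__") and not isinstance(value, (list, str)):
            cand = discover_layers(value, child_path)
            if len(cand[1]) > len(best[1]):
                best = cand
    return best

# Mock model shaped like a Llama-style transformer (model.layers[...]):
class Block: pass
class Inner:
    def __init__(self): self.layers = [Block() for _ in range(4)]
class Model:
    def __init__(self): self.model = Inner(); self.name = "demo"

path, layers = discover_layers(Model())
print(path, len(layers))   # model.layers 4
```

Because the search keys on structure rather than class names, a new architecture works as long as its decoder blocks live in one list-like container.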