Inference planning for unified-memory boxes and multi-GPU stacks · APEX + TurboQuant optimization toggles
System
TTM Override
Overrides the Linux kernel TTM page limit. Raises usable VRAM on Strix Halo from 96 GB (BIOS default) to 120 GB.
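The page limit is expressed in pages rather than bytes, so a target size has to be converted. A minimal sketch of that arithmetic follows. It assumes 4 KiB pages and treats the tool's "GB" as GiB; neither is stated above. The helper name `ttm_pages_for_gib` is hypothetical. The resulting number would typically be supplied through a kernel module parameter such as `ttm.pages_limit`, which is likewise an assumption and not something this page confirms.

```python
# Assumption: 4 KiB pages, the usual x86-64 default. Check `getconf PAGESIZE`.
PAGE_SIZE_BYTES = 4096


def ttm_pages_for_gib(target_gib: int, page_size: int = PAGE_SIZE_BYTES) -> int:
    """Convert a memory target in GiB into a page count (hypothetical helper)."""
    if target_gib <= 0 or page_size <= 0:
        raise ValueError("target_gib and page_size must be positive")
    return target_gib * 1024**3 // page_size


if __name__ == "__main__":
    # The two sizes named in the tooltip: BIOS default and the override target.
    for gib in (96, 120):
        print(f"{gib} GiB = {ttm_pages_for_gib(gib)} pages")
```

Under these assumptions the 120 GB target works out to 31457280 pages.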
GPU count
Pipeline-parallel assumption (llama.cpp layer-split default): VRAM adds, bandwidth stays per-card. Tensor parallel with NVLink can scale bandwidth further.
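The layer-split assumption can be sketched as a small budget function. With layers split across cards, each token passes through the cards one after another, so only one card is reading weights at any moment: capacity sums, but effective bandwidth stays at the single-card figure. The `Card` type, the `layer_split_budget` name, and the example numbers are hypothetical and are not taken from the calculator's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    """Per-card specs; the values used below are illustrative only."""
    vram_gb: float
    bandwidth_gbps: float


def layer_split_budget(card: Card, gpu_count: int) -> tuple[float, float]:
    """Return (total VRAM, effective bandwidth) under pipeline-parallel layer split.

    VRAM adds across cards; bandwidth does not, because the cards work
    sequentially on each token rather than in parallel.
    """
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    total_vram = card.vram_gb * gpu_count
    effective_bandwidth = card.bandwidth_gbps  # unchanged by gpu_count
    return total_vram, effective_bandwidth


if __name__ == "__main__":
    card = Card(vram_gb=24.0, bandwidth_gbps=1000.0)  # hypothetical card
    for n in (1, 2, 4):
        vram, bw = layer_split_budget(card, n)
        print(f"{n} GPU(s): {vram:.0f} GB VRAM, {bw:.0f} GB/s effective")
```

Tensor parallelism is deliberately not modeled here, since its bandwidth scaling depends on the interconnect.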
Model calculator
System VRAM
Used
Optimization toggles
APEX-Quant
APEX I-Quality mixed-precision quantization (mudler/apex-quant). Optimizes quality for MoE models by assigning higher precision to critical layers. VRAM use may be higher or lower than with standard quants. No effect on dense models.