MORIE Inference Engine¶
Part of MORIE's statistical-methods reference.
MORIE includes its own LLM inference engine — independent of Ollama, llama.cpp, or HuggingFace. This gives full control over the inference pipeline, including TurboQuant KV-cache compression and MLX GPU acceleration on Apple Silicon.
Architecture¶
┌────────────────────────────────────────────────────┐
│ MORIEEngine │
│ │
│ ┌──────────┐ ┌──────────┐ ┌───────────┐ │
│ │ GGUFModel│ │ Tokenizer│ │ TurboQuant│ │
│ │ (loader) │ │ (BPE) │ │ KV-Cache │ │
│ └────┬─────┘ └────┬─────┘ └─────┬─────┘ │
│ │ │ │ │
│ ┌────┴──────────────┴───────────────┴──────────┐ │
│ │ Transformer Forward Pass │ │
│ │ RMSNorm → RoPE → GQA → SwiGLU → Sampling │ │
│ │ │ │
│ │ Backend: MLX (Metal GPU) or NumPy (CPU) │ │
│ └──────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────┘
Components¶
engine.py — transformer forward pass with MLX / NumPy dual backend, text generation.
tokenizer.py — BPE tokenizer from GGUF metadata or SentencePiece .model files.
gguf_loader.py — GGUF v2 / v3 parser, mmap tensors, dequantize Q4_K / Q8_0 / F16 / F32.
kv_cache.py — TurboQuant-compressed KV cache with per-layer block storage.
quant.py — TurboQuant MSE + QJL quantization (Python path).
quant_ggml.c — TurboQuant C acceleration (WHT + Lloyd-Max + QJL).
engine_kernels.c — C hot-path kernels: RMSNorm, RoPE, matvec, SiLU, softmax (Accelerate.framework).
engine_bridge.py — ctypes bridge for the C kernels, with NumPy fallback.
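The components above combine into a working pipeline. A quick wiring example using only the APIs documented on this page (MORIEEngine presumably performs the same steps internally from a single GGUF path):

from morie.gguf_loader import GGUFModel
from morie.tokenizer import Tokenizer
from morie.engine import MORIEEngine

model = GGUFModel("path/to/model.gguf")                # parse metadata, mmap tensors
tok = Tokenizer(gguf_model=model)                      # BPE vocab from GGUF metadata
engine = MORIEEngine("path/to/model.gguf", kv_bits=3)  # forward pass + TurboQuant KV cache
print(tok.decode(tok.encode("round trip")))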
MLX Integration — Apple Silicon GPU¶
MORIE’s inference engine detects MLX at import time and uses it for GPU-accelerated matrix operations on Apple Silicon (M1/M2/M3/M4):
from morie.engine import backend
print(backend()) # 'mlx' on Python 3.14 with MLX, 'numpy' otherwise
Setup:
# MLX requires Python ≤3.14 (not yet on 3.15)
.venv-314/bin/pip install mlx # installs mlx + mlx-metal
Dual-backend design:
MLX path (.venv-314/): Metal GPU for matmul, RMSNorm, softmax, SiLU. Uses mlx.core arrays for all weight operations.
NumPy path (.venv/): CPU fallback, works on Python 3.15 and all platforms.
KV-cache compression always uses NumPy (TurboQuant is NumPy-based).
This follows the same pattern as the vendored modules (morie.fam, morie.emissions) — optional acceleration with zero-dependency fallback.
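The detection behind backend() can be as simple as an import probe. A minimal sketch of the pattern, assuming nothing beyond what is described above (engine.py's actual logic may differ):

# Illustrative import-time backend detection; not the actual engine.py code.
try:
    import mlx.core as mx   # Metal-backed arrays, Apple Silicon only
    _BACKEND = "mlx"
except ImportError:
    mx = None               # zero-dependency fallback
    _BACKEND = "numpy"

def backend() -> str:
    return _BACKEND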
Five integration paths for TurboQuant on macOS (per Hannecke 2026):
Path A: mlx-optiq — drop-in TurboQuantKVCache for mlx-lm. Informed our design.
Path B: tqkv benchmark — CLI benchmarking tool. Referenced for validation.
Path C: llama.cpp TBQ — native GGML types (PR #21089). Pending upstream merge.
Path D: oMLX — menu-bar inference server. Not applicable to MORIE.
Path E: QJL 1-bit PoC — outlier tracking + sign quantization. Implemented in quant.py (sketched below).
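Path E's core idea fits in a few lines of NumPy. The sketch below is a simplified illustration of sign quantization with outlier tracking, not the exact quant.py implementation; the function names and the 1% outlier fraction are illustrative:

import numpy as np

def sign_quantize(v: np.ndarray, outlier_frac: float = 0.01):
    """1-bit quantization of a 1-D vector: keep the largest-magnitude entries
    exactly, store only a sign per element plus one scale for the rest."""
    k = max(1, int(outlier_frac * v.size))
    outlier_idx = np.argsort(np.abs(v))[-k:]         # track the k largest |v_i|
    outliers = v[outlier_idx].astype(np.float32)
    mask = np.ones(v.size, dtype=bool)
    mask[outlier_idx] = False
    scale = float(np.abs(v[mask]).mean()) if mask.any() else 0.0
    signs = np.signbit(v).astype(np.uint8)           # 1 bit per element
    return signs, scale, outlier_idx, outliers

def sign_dequantize(signs, scale, outlier_idx, outliers):
    v_hat = np.where(signs == 1, -scale, scale).astype(np.float32)
    v_hat[outlier_idx] = outliers                    # restore tracked outliers
    return v_hat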
C Kernel Acceleration¶
The hot-path operations (RMSNorm, RoPE, matmul, SiLU, softmax) have C kernel implementations that use Apple’s Accelerate.framework (vDSP/BLAS) on macOS for hardware-tuned SIMD:
# Compile (macOS — zero warnings with -Wall -Wextra)
cc -O2 -march=native -shared -o engine_kernels.dylib engine_kernels.c -lm -framework Accelerate
# Linux
cc -O2 -march=native -shared -fPIC -o engine_kernels.so engine_kernels.c -lm
from morie.engine_bridge import is_available, matvec, rmsnorm
print(is_available()) # True if .dylib/.so compiled
# Accelerate.framework BLAS for matmul
out = matvec(weight_matrix, input_vec) # cblas_sgemv under the hood
Security design:
All C functions validate inputs (NULL pointers, size bounds)
MAX_DIM cap (16M) prevents integer overflow
Zero heap allocations in hot-path functions
No global mutable state (thread-safe)
Library resolved from file-relative path only
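The last point, file-relative library resolution, can be seen in a minimal sketch of the loading pattern (illustrative; engine_bridge.py's internals may differ):

import ctypes
import pathlib

# Resolve the shared library strictly relative to this module's own file,
# never from the working directory or ambient search paths.
_HERE = pathlib.Path(__file__).resolve().parent
_lib = None
for _name in ("engine_kernels.dylib", "engine_kernels.so"):
    _candidate = _HERE / _name
    if _candidate.exists():
        _lib = ctypes.CDLL(str(_candidate))
        break

def is_available() -> bool:
    """True when the compiled kernels were found next to this module."""
    return _lib is not None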
Three acceleration tiers (from fastest to slowest):
MLX (Apple Silicon Metal GPU) — .venv-314/, Python 3.14
C kernels (Accelerate.framework BLAS) — any Python, macOS
NumPy (CPU fallback) — any platform
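A per-operation dispatcher can then try the tiers in that order. A sketch assuming only the APIs shown above (backend(), engine_bridge.is_available(), engine_bridge.matvec); the real selection logic in engine.py may differ:

import numpy as np
from morie.engine import backend
from morie import engine_bridge

def matvec_dispatch(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Route a matrix-vector product to the fastest available tier."""
    if backend() == "mlx":
        import mlx.core as mx
        return np.array(mx.matmul(mx.array(W), mx.array(x)))   # Metal GPU
    if engine_bridge.is_available():
        return engine_bridge.matvec(W, x)                       # Accelerate BLAS via C kernels
    return W @ x                                                # NumPy fallback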
GGUF Loader — Verified Results¶
Successfully parses real Ollama model files, now with Q4_K dequantization:
from morie.gguf_loader import GGUFModel
model = GGUFModel("~/.ollama/models/blobs/sha256-...")
print(model.config)
# {'architecture': 'llama', 'n_layers': 32, 'n_heads': 32,
# 'head_dim': 128, 'hidden_dim': 4096, 'vocab_size': 128256,
# 'context_length': 131072, 'name': 'Meta Llama 3.1 8B Instruct'}
print(len(model.tensor_names())) # 292 tensors
# Dequantize Q4_K weights (the common GGUF format)
w = model.get_tensor("blk.0.attn_q.weight") # Q4_K → float32
Supported dequantization types:
F32: Direct load
F16: Convert to float32
Q8_0: 32-element blocks, each with a float16 scale + 32 int8 quantized values (see the sketch below)
Q4_K: 256-element super-blocks with 6-bit sub-block scales, 4-bit values
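Q8_0 dequantization is then a per-block multiply. An illustrative sketch operating on already-parsed scales and quantized values (the actual gguf_loader.py routine also handles the raw byte layout):

import numpy as np

def dequantize_q8_0(scales: np.ndarray, qvals: np.ndarray) -> np.ndarray:
    """Dequantize Q8_0 data.

    scales: (n_blocks,) per-block scales, already promoted to float32
    qvals:  (n_blocks, 32) int8 quantized values
    """
    return (scales[:, None] * qvals.astype(np.float32)).reshape(-1)

# One block with scale 0.05: reconstructed weights are scale * q
w = dequantize_q8_0(np.array([0.05], dtype=np.float32),
                    np.arange(-16, 16, dtype=np.int8).reshape(1, 32))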
Tokenizer¶
Loads tokenization data from GGUF metadata (tokenizer.ggml.* keys)
without requiring SentencePiece at runtime:
from morie.tokenizer import Tokenizer
from morie.gguf_loader import GGUFModel
model = GGUFModel("path/to/model.gguf")
tok = Tokenizer(gguf_model=model)
ids = tok.encode("Hello world")
print(tok.decode(ids)) # "Hello world"
print(tok.vocab_size) # e.g. 128256
Falls back to SentencePiece if a .model file is provided.
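That fallback path is plain SentencePiece underneath. A rough sketch of what it amounts to, using the standard sentencepiece API rather than Tokenizer's exact internals:

import sentencepiece as spm

# Equivalent of the fallback when a .model file is supplied.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
ids = sp.encode("Hello world", out_type=int)
print(sp.decode(ids))   # "Hello world"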
Engine — Forward Pass¶
from morie.engine import MORIEEngine
engine = MORIEEngine("path/to/model.gguf", kv_bits=3)
result = engine.generate("The capital of France is", max_tokens=20)
print(result.text)
print(f"{result.tokens_per_second:.1f} tok/s")
print(f"KV compression: {result.kv_compression_ratio:.1f}x")
print(f"Backend: {result.backend}") # 'mlx' or 'numpy'
The engine implements the full Llama-family transformer:
RMSNorm with learned weight
Rotary Position Embedding (RoPE) with configurable frequency base
Grouped-Query Attention (GQA) with TurboQuant-compressed KV-cache
SwiGLU Feed-Forward Network (gate + up + down projections)
Nucleus sampling with temperature and top-p
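Two of these building blocks are compact enough to spell out in NumPy. A sketch of RMSNorm and nucleus (top-p) sampling as described above; illustrative, not the exact engine.py code:

import numpy as np

def rmsnorm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # Normalize by the root mean square of x, then apply the learned weight.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def sample_top_p(logits: np.ndarray, temperature: float = 0.8, top_p: float = 0.9) -> int:
    # Temperature-scaled softmax, then sample from the smallest set of tokens
    # whose cumulative probability reaches top_p.
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))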
KV-Cache Compression — Benchmark Results¶
Tested with Llama 3.1 8B dimensions (32 layers, head_dim=128, 64 tokens):
2-bit — 7.1× compression, cosine similarity 0.938 (FP16 1.00 MB → TQ 0.14 MB).
3-bit — 4.9× compression, cosine similarity 0.983 (FP16 1.00 MB → TQ 0.20 MB).
4-bit — 3.8× compression, cosine similarity 0.996 (FP16 1.00 MB → TQ 0.27 MB).
At scale (128K context, 32 layers):
FP16 baseline: ~128 MB KV-cache
3-bit TurboQuant: ~26 MB (savings: ~102 MB)
2-bit TurboQuant: ~18 MB (savings: ~110 MB)
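These figures can be sanity-checked with straightforward size arithmetic. The sketch below reproduces the 1.00 MB FP16 baseline for the 64-token benchmark (assuming per-head accounting with head_dim=128, which matches the quoted figure) and the ideal bit-width ratios; the measured ratios are lower because TurboQuant also stores per-block scales and metadata:

n_layers, head_dim, n_tokens = 32, 128, 64
fp16_bytes = n_tokens * n_layers * head_dim * 2 * 2   # K and V, 2 bytes each
print(fp16_bytes / 2**20)                              # 1.0 MB, matching the baseline

for bits in (2, 3, 4):
    print(bits, 16 / bits)   # ideal ratios 8.0 / 5.3 / 4.0 vs. measured 7.1 / 4.9 / 3.8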
References¶
Zandieh, A., Daliri, M., Hadian, M., & Mirrokni, V. (2026). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026. arXiv:2504.19874
Karpathy, A. (2023). llama2.c — Inference of Llama 2 in pure C. GitHub
Apple (2023). MLX: An array framework for Apple Silicon. GitHub
Hannecke, M. (2026). TurboQuant on Apple macOS: Five Integration Paths for Local KV-Cache Compression. Medium.
0xSero (2026). TurboQuant: KV Cache Compression for LLM Inference. GitHub