MORIE Inference Engine

Part of MORIE's statistical-methods reference.

MORIE includes its own LLM inference engine — independent of Ollama, llama.cpp, or HuggingFace. This gives full control over the inference pipeline, including TurboQuant KV-cache compression and MLX GPU acceleration on Apple Silicon.

Architecture

┌────────────────────────────────────────────────────┐
│                    MORIEEngine                     │
│                                                    │
│  ┌──────────┐   ┌──────────┐   ┌───────────┐       │
│  │ GGUFModel│   │ Tokenizer│   │ TurboQuant│       │
│  │ (loader) │   │  (BPE)   │   │ KV-Cache  │       │
│  └────┬─────┘   └────┬─────┘   └─────┬─────┘       │
│       │              │               │             │
│  ┌────┴──────────────┴───────────────┴──────────┐  │
│  │           Transformer Forward Pass           │  │
│  │   RMSNorm → RoPE → GQA → SwiGLU → Sampling   │  │
│  │                                              │  │
│  │   Backend: MLX (Metal GPU) or NumPy (CPU)    │  │
│  └──────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────┘

Components

  • engine.py — transformer forward pass with MLX / NumPy dual backend, text generation.

  • tokenizer.py — BPE tokenizer from GGUF metadata or SentencePiece .model files.

  • gguf_loader.py — GGUF v2 / v3 parser, mmap tensors, dequantize Q4_K / Q8_0 / F16 / F32.

  • kv_cache.py — TurboQuant-compressed KV cache with per-layer block storage.

  • quant.py — TurboQuant MSE + QJL quantization (Python path).

  • quant_ggml.c — TurboQuant C acceleration (WHT + Lloyd-Max + QJL).

  • engine_kernels.c — C hot-path kernels: RMSNorm, RoPE, matvec, SiLU, softmax (Accelerate.framework).

  • engine_bridge.py — ctypes bridge for C kernels, NumPy fallback.

MLX Integration — Apple Silicon GPU

MORIE’s inference engine detects MLX at import time and uses it for GPU-accelerated matrix operations on Apple Silicon (M1/M2/M3/M4):

from morie.engine import backend
print(backend())  # 'mlx' on Python 3.14 with MLX, 'numpy' otherwise

Setup:

# MLX requires Python ≤3.14 (not yet on 3.15)
.venv-314/bin/pip install mlx  # installs mlx + mlx-metal

Dual-backend design:

  • MLX path (.venv-314/): Metal GPU for matmul, RMSNorm, softmax, SiLU. Uses mlx.core arrays for all weight operations.

  • NumPy path (.venv/): CPU fallback, works on Python 3.15 and all platforms.

  • KV-cache compression always uses NumPy (TurboQuant is NumPy-based).

This follows the same pattern as the vendored modules (morie.fam, morie.emissions) — optional acceleration with zero-dependency fallback.
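
A minimal sketch of this idiom, assuming nothing beyond the documented backend() helper (the _HAS_MLX flag and the softmax example are illustrative, not the shipped code):

import numpy as np

try:
    import mlx.core as mx  # present only in the .venv-314 environment
    _HAS_MLX = True
except ImportError:
    _HAS_MLX = False

def backend() -> str:
    # Decided once at import time; callers never branch on platform.
    return "mlx" if _HAS_MLX else "numpy"

def softmax(x):
    # Identical math on either backend; only the array library differs.
    xp = mx if _HAS_MLX else np
    e = xp.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)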

Five integration paths for TurboQuant on macOS (per [Hannecke2026]):

  • Path A: mlx-optiq — drop-in TurboQuantKVCache for mlx-lm. Informed our design.

  • Path B: tqkv benchmark — CLI benchmarking tool. Referenced for validation.

  • Path C: llama.cpp TBQ — native GGML types (PR #21089). Pending upstream merge.

  • Path D: oMLX — menu-bar inference server. Not applicable to MORIE.

  • Path E: QJL 1-bit PoC — outlier tracking + sign quantization. Implemented in quant.py; a sketch of the idea follows this list.
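
To give a feel for the Path E idea, here is a toy sketch of sign quantization with the SimHash angle estimate; it omits the outlier tracking, and none of the names come from quant.py:

import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 256                      # head_dim and projection dim, illustrative
S = rng.standard_normal((m, d))      # JL projection shared by keys and queries

def qjl_encode(k):
    # 1 bit per projected dimension, plus the key's norm
    return (S @ k >= 0), float(np.linalg.norm(k))

def qjl_inner(q, bits, k_norm):
    # Sign agreement estimates the angle: P(agree) = 1 - theta/pi,
    # so theta ~= pi * (1 - agreement), and <q, k> ~= |q||k| cos(theta).
    agree = np.mean((S @ q >= 0) == bits)
    return float(np.linalg.norm(q)) * k_norm * np.cos(np.pi * (1.0 - agree))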

C Kernel Acceleration

The hot-path operations (RMSNorm, RoPE, matvec, SiLU, softmax) have C kernel implementations that use Apple's Accelerate.framework (vDSP/BLAS) on macOS for hardware-tuned SIMD:

# Compile (macOS — zero warnings with -Wall -Wextra)
cc -O2 -march=native -shared -o engine_kernels.dylib engine_kernels.c -lm -framework Accelerate

# Linux
cc -O2 -march=native -shared -fPIC -o engine_kernels.so engine_kernels.c -lm

Usage from Python:

from morie.engine_bridge import is_available, matvec, rmsnorm
print(is_available())  # True if the .dylib/.so was compiled

# Accelerate.framework BLAS for the matrix-vector product
out = matvec(weight_matrix, input_vec)  # cblas_sgemv under the hood

Security design:

  • All C functions validate inputs (NULL pointers, size bounds)

  • MAX_DIM cap (16M) prevents integer overflow

  • Zero heap allocations in hot-path functions

  • No global mutable state (thread-safe)

  • Library resolved from file-relative path only
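
The loading side of engine_bridge.py can be sketched as follows; the rmsnorm prototype is an assumed signature for illustration, not the actual engine_kernels.c declaration:

import ctypes, os, sys

# Resolve the library from the package directory only, never from CWD
# or a search path, so a planted library elsewhere is never picked up.
_EXT = ".dylib" if sys.platform == "darwin" else ".so"
_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                     "engine_kernels" + _EXT)

try:
    _lib = ctypes.CDLL(_PATH)
    # Assumed prototype: int rmsnorm(float *out, const float *x,
    #                                const float *w, size_t dim)
    _lib.rmsnorm.argtypes = [ctypes.POINTER(ctypes.c_float)] * 3 + [ctypes.c_size_t]
    _lib.rmsnorm.restype = ctypes.c_int  # non-zero signals rejected input
except OSError:
    _lib = None  # NumPy fallback takes over

def is_available() -> bool:
    return _lib is not None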

Three acceleration tiers (from fastest to slowest):

  1. MLX (Apple Silicon Metal GPU) — .venv-314/, Python 3.14

  2. C kernels (Accelerate.framework BLAS) — any Python, macOS

  3. NumPy (CPU fallback) — any platform

GGUF Loader — Verified Results

The loader has been verified against real Ollama model files, including Q4_K dequantization:

from morie.gguf_loader import GGUFModel

model = GGUFModel("~/.ollama/models/blobs/sha256-...")
print(model.config)
# {'architecture': 'llama', 'n_layers': 32, 'n_heads': 32,
#  'head_dim': 128, 'hidden_dim': 4096, 'vocab_size': 128256,
#  'context_length': 131072, 'name': 'Meta Llama 3.1 8B Instruct'}

print(len(model.tensor_names()))  # 292 tensors

# Dequantize Q4_K weights (the common GGUF format)
w = model.get_tensor("blk.0.attn_q.weight")  # Q4_K → float32

Supported dequantization types:

  • F32: Direct load

  • F16: Convert to float32

  • Q8_0: 32-element blocks with a float16 scale + int8 quantized values

  • Q4_K: 256-element super-blocks with 6-bit sub-block scales and mins, 4-bit values
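
To make the block layouts concrete, here is a NumPy sketch of Q8_0 dequantization under the standard GGML layout (34-byte blocks: a float16 scale followed by 32 int8 values); the function name is ours, not the gguf_loader API:

import numpy as np

def dequantize_q8_0(raw: bytes) -> np.ndarray:
    # Each block stores a scale d and 32 quantized values q; element = d * q.
    blocks = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 34)
    d = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # (n, 1)
    q = blocks[:, 2:].copy().view(np.int8).astype(np.float32)     # (n, 32)
    return (d * q).reshape(-1)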

Tokenizer

Loads tokenization data from GGUF metadata (tokenizer.ggml.* keys) without requiring SentencePiece at runtime:

from morie.tokenizer import Tokenizer
from morie.gguf_loader import GGUFModel

model = GGUFModel("path/to/model.gguf")
tok = Tokenizer(gguf_model=model)
ids = tok.encode("Hello world")
print(tok.decode(ids))  # "Hello world"
print(tok.vocab_size)   # e.g. 128256

Falls back to SentencePiece if a .model file is provided.
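
The heart of such a tokenizer is a score-driven merge loop over the tokenizer.ggml.tokens / tokenizer.ggml.scores arrays. A toy version (our names; byte fallback and special tokens omitted):

def bpe_encode(text, vocab, scores):
    # vocab: token string -> id; scores: id -> SentencePiece merge score
    parts = list(("\u2581" + text).replace(" ", "\u2581"))  # SPM space marker
    while True:
        best_score, best_i = None, -1
        for i in range(len(parts) - 1):
            tid = vocab.get(parts[i] + parts[i + 1])
            if tid is not None and (best_score is None or scores[tid] > best_score):
                best_score, best_i = scores[tid], i
        if best_i < 0:
            break  # no mergeable adjacent pair remains
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return [vocab.get(p, 0) for p in parts]  # 0 stands in for byte fallback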

Engine — Forward Pass

from morie.engine import MORIEEngine

engine = MORIEEngine("path/to/model.gguf", kv_bits=3)
result = engine.generate("The capital of France is", max_tokens=20)
print(result.text)
print(f"{result.tokens_per_second:.1f} tok/s")
print(f"KV compression: {result.kv_compression_ratio:.1f}x")
print(f"Backend: {result.backend}")  # 'mlx' or 'numpy'

The engine implements the full Llama-family transformer:

  • RMSNorm with learned weight

  • Rotary Position Embedding (RoPE) with configurable frequency base

  • Grouped-Query Attention (GQA) with TurboQuant-compressed KV-cache

  • SwiGLU Feed-Forward Network (gate + up + down projections)

  • Nucleus sampling with temperature and top-p
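
In NumPy terms, two of these building blocks look like the following sketch (the epsilon and weight shapes are illustrative; the engine reads them from GGUF metadata):

import numpy as np

def rmsnorm(x, weight, eps=1e-5):
    # Divide by the root-mean-square, then scale by the learned weight.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU: down( silu(gate(x)) * up(x) ), three projections per layer.
    return (silu(x @ w_gate.T) * (x @ w_up.T)) @ w_down.T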

KV-Cache Compression — Benchmark Results

Tested with Llama 3.1 8B dimensions (32 layers, head_dim=128, 64 tokens):

  • 2-bit — 7.1× compression, cosine similarity 0.938 (FP16 1.00 MB → TQ 0.14 MB).

  • 3-bit — 4.9× compression, cosine similarity 0.983 (FP16 1.00 MB → TQ 0.20 MB).

  • 4-bit — 3.8× compression, cosine similarity 0.996 (FP16 1.00 MB → TQ 0.27 MB).

At scale (128K context, 32 layers):

  • FP16 baseline: ~128 MB KV-cache

  • 3-bit TurboQuant: ~26 MB (savings: ~102 MB)

  • 2-bit TurboQuant: ~18 MB (savings: ~110 MB)
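
These figures follow from the bit widths: the quantized payload alone is 128 × b/16 MB, and we read the remainder as TurboQuant's per-block scale and outlier metadata (our interpretation, not a measured breakdown):

fp16_mb = 128.0                     # FP16 KV-cache at 128K context
for bits, reported in ((2, 18), (3, 26)):
    payload = fp16_mb * bits / 16   # quantized values alone
    print(f"{bits}-bit: {payload:.0f} MB payload vs ~{reported} MB reported")
# 2-bit: 16 MB payload vs ~18 MB reported
# 3-bit: 24 MB payload vs ~26 MB reported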

References

[Zandieh2026]

Zandieh, A., Daliri, M., Hadian, M., & Mirrokni, V. (2026). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. ICLR 2026. arXiv:2504.19874

[Karpathy2023]

Karpathy, A. (2023). llama2.c — Inference of Llama 2 in pure C. GitHub

[Apple2023]

Apple (2023). MLX: An array framework for Apple Silicon. GitHub

[Hannecke2026]

Hannecke, M. (2026). TurboQuant on Apple macOS: Five Integration Paths for Local KV-Cache Compression. Medium.

[0xSero2026]

0xSero (2026). TurboQuant: KV Cache Compression for LLM Inference. GitHub