MerlionOS Inference

MerlionOS Inference is a from-scratch operating system that does exactly one thing: serve LLM inference as fast as the hardware allows.

No scheduler overhead, no syscall boundary, no unnecessary abstractions. Boot in under 5 seconds, load a model, serve an OpenAI-compatible API.

Linux is a general-purpose OS. When running LLM inference, you pay for that generality:

  • 10-20% throughput loss from kernel/user transitions, page faults, TLB shootdowns
  • 3-5x P99/P50 latency ratio from scheduler jitter, interrupt storms, GC pauses
  • 30-60 second boot time before first inference

MerlionOS Inference eliminates these by construction:

  • Everything runs in kernel mode — zero syscall overhead
  • All memory pre-allocated with huge pages — zero page faults
  • Static core assignment — zero scheduler jitter
  • Polling instead of interrupts on hot path — deterministic latency
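
The P99/P50 ratio quoted above is a handy way to check how close a deployment gets to deterministic latency. The sketch below assumes you have already collected per-request latencies yourself; the function names are illustrative and not part of MerlionOS Inference.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def tail_ratio(latencies_ms):
    """P99/P50 ratio; values close to 1.0 mean latency is nearly deterministic."""
    return percentile(latencies_ms, 99) / percentile(latencies_ms, 50)
```

A ratio in the 3-5x range would match the Linux behaviour described above, while a polling design with static core assignment should land much nearer to 1.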

OpenAI-Compatible API

Drop-in replacement for any OpenAI client. POST /v1/chat/completions with streaming SSE support.
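
For streaming responses, a client reads server-sent events line by line. The following is a minimal sketch that assumes the server emits the same chunk format as the OpenAI API (`data: {json}` lines carrying `choices[].delta.content`, ended by `data: [DONE]`); that format is an assumption here, not something this page confirms.

```python
import json


def iter_sse_deltas(lines):
    """Yield text fragments from an OpenAI-style SSE stream.

    `lines` is any iterable of decoded text lines, for example an HTTP
    response body split on newlines.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # OpenAI's end-of-stream sentinel (assumed to be used here too)
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Joining the yielded fragments in order reconstructs the full reply as it arrives.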

GGUF Model Support

Load quantized models (Q4_0, Q8_0) directly. Supports Llama, SmolLM, and compatible architectures.
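
To make the formats concrete, here is a small sketch of the public GGUF header layout and the ggml Q8_0 and Q4_0 block layouts (32 weights per block, each block prefixed by a float16 scale). It illustrates the file format only and is not the loader MerlionOS Inference uses; the function names are made up for this example.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(buf):
    """Parse the fixed 24-byte GGUF header (format versions 2 and 3)."""
    if buf[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", buf, 4)
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}


def dequantize_q8_0(block):
    """One Q8_0 block: float16 scale, then 32 signed 8-bit weights (34 bytes)."""
    (scale,) = struct.unpack_from("<e", block, 0)
    quants = struct.unpack_from("<32b", block, 2)
    return [scale * q for q in quants]


def dequantize_q4_0(block):
    """One Q4_0 block: float16 scale, then 16 bytes of packed 4-bit weights (18 bytes)."""
    (scale,) = struct.unpack_from("<e", block, 0)
    packed = block[2:18]
    low = [(b & 0x0F) - 8 for b in packed]   # low nibbles hold weights 0..15
    high = [(b >> 4) - 8 for b in packed]    # high nibbles hold weights 16..31
    return [scale * q for q in low + high]
```

Q4_0 therefore stores 32 weights in 18 bytes (4.5 bits per weight) and Q8_0 stores them in 34 bytes (8.5 bits per weight), which is where the memory savings over float16 come from.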

AVX2/AVX-512 Kernels

Hand-optimized SIMD kernels for x86_64. Automatic runtime detection and dispatch.
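
Runtime detection on x86_64 comes down to reading CPUID feature bits and picking the widest kernel the CPU supports. The sketch below models only the decision logic in Python: the bit positions are the documented ones for CPUID leaf 7, subleaf 0, register EBX, while the kernel names and the function itself are hypothetical stand-ins for the real dispatch code.

```python
AVX2_BIT = 1 << 5       # CPUID.(EAX=7,ECX=0):EBX bit 5
AVX512F_BIT = 1 << 16   # CPUID.(EAX=7,ECX=0):EBX bit 16


def select_kernel(leaf7_ebx):
    """Pick a kernel label from the EBX value of CPUID leaf 7, subleaf 0."""
    if leaf7_ebx & AVX512F_BIT:
        return "avx512"  # hypothetical label for the AVX-512 code path
    if leaf7_ebx & AVX2_BIT:
        return "avx2"    # hypothetical label for the AVX2 code path
    raise RuntimeError("AVX2 is required")
```

Real detection code also has to confirm that the wider register state is enabled in XCR0 before using it; a kernel that owns the whole machine is the one responsible for setting that state up.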

AMD GPU Compute

Native RDNA3 driver — no ROCm, no Linux, no DRM. Direct hardware access for maximum GPU utilization.

# Build
git clone https://github.com/MerlionOS/merlion-infer.git
cd merlion-infer
make build
# Download a model
./tools/download_model.sh
# Run in QEMU with disk + network
make run-full
# In the shell:
merlion> ai-load # Load GGUF model from disk
merlion> ai Hello # Generate text
merlion> ai-serve 8080 # Start OpenAI API server
# From any client:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"smollm-135m","messages":[{"role":"user","content":"Hello"}]}'
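
The same request can be sent from Python using only the standard library. This is a sketch that reuses the URL and model name from the curl command above and assumes the response follows the OpenAI chat completion shape (`choices[0].message.content`); the helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt, model="smollm-135m", base_url="http://localhost:8080"):
    """Build the POST request for /v1/chat/completions without sending it."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json):
    """Return the assistant text from an OpenAI-style (assumed) response body."""
    return response_json["choices"][0]["message"]["content"]


def chat(prompt, **kwargs):
    """Send the request and return the reply; needs a running `ai-serve 8080`."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return extract_reply(json.load(resp))
```

With the server started as shown above, `print(chat("Hello"))` should print the generated text.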

| | merlion-kernel | merlion-infer |
|---|---|---|
| Purpose | General-purpose hobby OS | LLM inference server |
| Architectures | x86_64, aarch64, RISC-V, LoongArch | x86_64 only |
| User mode | Ring 3 processes, syscalls | Everything in Ring 0 |
| Shell commands | 358 | ~25 (essentials only) |
| Modules | 253 | ~35 (stripped to minimum) |
| Networking | Full stack (HTTP, SSH, MQTT, …) | TCP + HTTP (API only) |
| GPU | Software shaders | AMD RDNA3 compute |
| Focus | Feature completeness | Performance per watt |

| Component | Supported |
|---|---|
| CPU | AMD Ryzen 7000/9000, Intel 12th gen+ (AVX2 required) |
| GPU | AMD Radeon RX 7000 series (RDNA3) |
| RAM | DDR5, 32 GB+ recommended |
| Storage | NVMe SSD, virtio-blk (QEMU) |
| Network | virtio-net (QEMU), Intel e1000e |
| Boot | UEFI via Limine bootloader |