Open source · free forever

Run your model.
Know what to ship.

Benchmark HF and vLLM from FP16 down to 4-bit (NF4, FP4, HQQ, Quanto, AWQ, GPTQ) on your own GPU. Get one verdict: which backend, which precision, what it costs.

Join the waitlist
HF · 8 quantization modes
vLLM · FP16, BnB, FP8, AWQ, GPTQ
NVIDIA GPU
GGUF · llama.cpp
Cost prediction
Multi-GPU
litmus-lab — benchmark session

The problem

Every tool tells you “INT4 uses less memory.”
That number decides nothing.

Memory reduction alone doesn't determine deployment quality — and neither does picking HF or vLLM at random. The same model on the wrong backend or wrong precision can quietly:

01

become slower

Lower precision can tank throughput instead of helping it.

02

lose coherence

Aggressive quantization distorts the weights, and generations quietly fall apart.

03

spike TTFT

Latency to first token balloons under some kernels.

04

pick the wrong backend

vLLM can be 3× faster than HF — or crash your GPU. You won't know until you measure.

05

overspend on inference

Running FP16 on HF when vLLM INT4 would do is money left on the table on every request.

litmus-lab exists to decide, from numbers rather than guesswork, which backend, which precision, and what it costs, all measured on your hardware, not someone else's.

Capabilities

Measure everything. Guess nothing.

Multi-backend benchmarking

HuggingFace (FP16, INT8, NF4, FP4, NF4+double-quant, HQQ, Quanto) and vLLM (FP16, BitsAndBytes, FP8) are live now, plus AWQ/GPTQ against a pre-quantized checkpoint on both backends. GGUF / llama.cpp support is actively being built — so CPU and Mac users can benchmark too.

Throughput comparison

HF FP16: 32.7 TPS
HF NF4: 26.0 TPS
vLLM FP16: 111.7 TPS
GGUF: coming soon

↑ vLLM live · GGUF actively being built

Deployment verdict engine

A deterministic engine weighs your measured VRAM, throughput and perplexity delta against deployment thresholds — and outputs one verdict. No prompts. No hallucinations. Just your numbers.
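As a rough illustration of what a deterministic, threshold-based verdict could look like, the sketch below filters measured runs by a VRAM budget and a maximum perplexity delta, then picks the fastest survivor. The `Run` type, the `pick_verdict` function, and both threshold values are hypothetical and are not litmus-lab's actual rules; the measurements are the sample numbers shown on this page.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Run:
    mode: str        # e.g. "vLLM · FP16"
    vram_mb: float   # peak VRAM measured for the pass
    tps: float       # generated tokens per second
    ppl: float       # perplexity on the evaluation text


def pick_verdict(
    runs: Sequence[Run],
    baseline_ppl: float,
    vram_budget_mb: float,
    max_ppl_delta: float,
) -> Optional[Run]:
    """Return the fastest run that fits the VRAM budget and the quality bound."""
    eligible = [
        r for r in runs
        if r.vram_mb <= vram_budget_mb and (r.ppl - baseline_ppl) <= max_ppl_delta
    ]
    # Same inputs always give the same answer: no sampling, no model calls.
    return max(eligible, key=lambda r: r.tps, default=None)


runs = [
    Run("HF · FP16", 7297, 32.7, 5.64),
    Run("HF · NF4", 2334, 26.0, 7.34),
    Run("vLLM · FP16", 12687, 111.7, 5.65),
]

# Hypothetical thresholds: a 16 GB card and at most +0.5 perplexity.
verdict = pick_verdict(runs, baseline_ppl=5.64, vram_budget_mb=16384, max_ppl_delta=0.5)
print(verdict.mode if verdict else "no eligible mode")
```

With an 8 GB budget the same function would fall back to HF · FP16, because vLLM · FP16 no longer fits and NF4 exceeds the assumed quality bound.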

VRAM isolation & cleanup

Every pass runs in an isolated worker with aggressive CUDA cache flushing, GC and IPC clearing — so memory leaks never corrupt your VRAM readings.
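One way to picture per-pass isolation is to launch each pass in a fresh Python process and read its result back as JSON. The sketch below is an assumption about the general technique, not litmus-lab's worker code: `WORKER` and `run_isolated_pass` are hypothetical names, and the model-loading step is left as a comment. The PyTorch calls are only reached when `torch` is installed and a CUDA device is visible.

```python
import json
import subprocess
import sys

# Worker source executed in a brand-new interpreter. The model-loading step is
# deliberately left as a comment; only the bookkeeping around it is shown.
WORKER = r"""
import gc, json, sys

result = {"mode": sys.argv[1], "peak_vram_mb": None}
try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    # ... load the model in the requested mode and generate here ...
    result["peak_vram_mb"] = torch.cuda.max_memory_allocated() / 2**20
    gc.collect()               # drop unreachable Python objects holding tensors
    torch.cuda.empty_cache()   # release unused cached blocks from the allocator
    torch.cuda.ipc_collect()   # reclaim memory tied to released CUDA IPC handles

print(json.dumps(result))
"""


def run_isolated_pass(mode: str) -> dict:
    """Run one benchmark pass in a fresh process and return its JSON result."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER, mode],
        capture_output=True,
        text=True,
        check=True,
    )
    # The worker prints its result as the last line of stdout.
    return json.loads(proc.stdout.strip().splitlines()[-1])


if __name__ == "__main__":
    print(run_isolated_pass("FP16"))
```

Process exit already returns the worker's memory to the driver; the explicit cleanup calls matter mainly when a worker is reused for several passes.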

Cost prediction — in progress

Pass --target-users and --gpu-cost and litmus-lab will project token cost per request at your concurrency. Shipping soon.
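Since this feature has not shipped, the projection formula is not documented here. The sketch below shows only the basic arithmetic such a projection could rest on, assuming the GPU is billed per hour and that the throughput figure is the aggregate tokens per second sustained at your target concurrency. The function names, the $2.00/hour price, and the 256-token response length are illustrative assumptions.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollars per 1M generated tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def cost_per_request(gpu_cost_per_hour: float, tokens_per_second: float,
                     tokens_per_request: int) -> float:
    """Dollars for one request that generates `tokens_per_request` tokens."""
    per_million = cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second)
    return per_million * tokens_per_request / 1_000_000


# 111.7 TPS is the vLLM FP16 sample figure from this page; the $2.00/hour GPU
# price and the 256-token response length are made-up inputs.
per_million = cost_per_million_tokens(gpu_cost_per_hour=2.00, tokens_per_second=111.7)
per_request = cost_per_request(2.00, 111.7, tokens_per_request=256)
print(f"${per_million:.2f} per 1M tokens, ${per_request:.5f} per request")
```

Under these inputs the cost comes to roughly $4.97 per 1M tokens. A faster backend or a cheaper precision lowers it in direct proportion to the throughput gain.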

Beautiful terminal dashboard

Every benchmark renders as a clean, rich-formatted table right in your CLI — readable at a glance, copy-paste ready for a report.

Roadmap

From a CLI to your cost autopilot.

The CLI stays free, always. Everything we build next is designed to save your team real money.

Available now

The CLI

Free. Runs on your GPU. No signup.

  1. pip install litmus-lab on any NVIDIA GPU
  2. Run --backend all: benchmarks HF (FP16, INT8, NF4, FP4, HQQ, Quanto) and vLLM (FP16, BitsAndBytes, FP8) in one shot, plus AWQ/GPTQ on both
  3. Get a single verdict: which backend, which precision, and why
02
Coming soon

The Platform

$199/mo · team dashboards · new-model alerts

  1. Benchmark results auto-upload to your team dashboard
  2. Get alerted when a newer model beats what you're running in production
  3. Share reports, track cost across GPU upgrades, predict $/1M tokens at your concurrency
03
Coming soon

Cost Autopilot

$2K–10K/mo · monitors your live deployment

  1. A lightweight agent runs next to your vLLM in production
  2. Streams live GPU utilization, request concurrency, and cost per request
  3. Surfaces the exact format change to cut your bill before your cloud invoice does
🔔 Stage 3 autopilot alert

DeepSeek-V3 matches your current Llama-3.1-70B quality at 58% of the cost.

Re-benchmark it on your own hardware before you switch — one click, no guesswork.

Four signals

The numbers that actually decide a deployment.

VRAM saved · lower

FP16 → NF4 reclaimed

same model, same GPU

Tokens / sec · higher

vLLM peak throughput

3.4× over HF FP16

TTFT · lower

Time to first token (HF)

lower is better

Perplexity · lower

Quality degradation

FP16 → NF4 delta
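For readers who want to see how the timing and quality signals can be computed, here is a minimal sketch. It assumes tokens arrive as a stream and that per-token natural-log probabilities are available; `time_stream`, `perplexity`, and `fake_stream` are hypothetical helpers, and the fake stream stands in for a real streaming generate call. It is not litmus-lab's measurement code.

```python
import math
import time
from typing import Iterable, Sequence, Tuple


def time_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Consume a token stream; return (ttft_seconds, tokens_per_second, count)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in tokens:
        count += 1
        if first is None:
            first = time.perf_counter()   # moment the first token arrived
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    elapsed = end - start
    tps = count / elapsed if elapsed > 0 else float("inf")
    return first - start, tps, count


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def fake_stream(n: int, delay: float = 0.001):
    """Stand-in for a real streaming generate() call."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"


ttft, tps, n = time_stream(fake_stream(20))
ppl_a = perplexity([math.log(0.25)] * 8)   # uniform p=0.25 gives perplexity 4
ppl_b = perplexity([math.log(0.20)] * 8)   # uniform p=0.20 gives perplexity 5
print(f"TTFT {ttft:.4f}s, {tps:.1f} tok/s over {n} tokens, PPL delta {ppl_b - ppl_a:+.2f}")
```

This throughput figure includes the time to first token; a benchmark could equally report decode-only throughput by starting the clock at the first token.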

How it works

From pip install to a deployment verdict.

01/03
Install

One command. Nothing leaves your machine.

pip install litmus-lab. No signup, no API key, no telemetry you didn't opt into. It runs entirely on your GPU.

$ pip install 'litmus-lab[all]'
02/03
Profile

HF and vLLM. Side by side. One shot.

Pass --backend all and litmus-lab runs HF (FP16, INT8, NF4, FP4, HQQ, Quanto, and more) and vLLM (FP16, BitsAndBytes, FP8) in isolated passes — measuring VRAM, throughput, latency and perplexity for each. Add --awq-model / --gptq-model to include AWQ/GPTQ against a pre-quantized checkpoint.

$ litmus-lab --model Qwen/Qwen2.5-7B \
  --prompt "Explain transformers" \
  --backend all
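If you would rather drive the same run from a script, a thin wrapper can assemble the argument list. The sketch below uses only the flags named on this page; `build_command` is a hypothetical helper, and the assumption that `--awq-model` and `--gptq-model` each take a checkpoint id or path is not confirmed here.

```python
import shlex
from typing import List, Optional


def build_command(model: str, prompt: str, backend: str = "all",
                  awq_model: Optional[str] = None,
                  gptq_model: Optional[str] = None) -> List[str]:
    """Assemble a litmus-lab argv list from the flags shown on this page."""
    cmd = ["litmus-lab", "--model", model, "--prompt", prompt, "--backend", backend]
    if awq_model:
        cmd += ["--awq-model", awq_model]    # assumed to take a checkpoint id or path
    if gptq_model:
        cmd += ["--gptq-model", gptq_model]  # assumed to take a checkpoint id or path
    return cmd


cmd = build_command("Qwen/Qwen2.5-7B", "Explain transformers")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when the prompt contains spaces or quotes.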
03/03
Deploy

One verdict. No more guessing.

A deterministic engine weighs your measured numbers and outputs a single recommendation — which backend, which precision, and why.

Mode          VRAM       TPS     PPL
HF · FP16     7297 MB    32.7    5.64
HF · NF4      2334 MB    26.0    7.34
vLLM · FP16   12687 MB   111.7   5.65
→ Deploy vLLM · FP16
3.4× faster · PPL delta 0.01

Works with any HuggingFace causal LM

Llama 3 · Phi-3 · Qwen 2.5 · Mistral · Gemma 2 · DeepSeek-V3 · Falcon · TinyLlama · CodeLlama · Zephyr · Vicuna · OPT · Pythia · WizardLM · Nous-Hermes

Early access

The CLI is free.
The platform is next.

litmus-lab is available today — install it, run a benchmark, get a verdict. The waitlist is for early access to the web platform: benchmark history, team dashboards, new-model alerts, and cost prediction at scale.

CLI · free forever
Platform early access
New-model alerts
Cost autopilot (Stage 3)

No spam. Just one email when it's ready.