Run your model.
Know what to ship.
Benchmark HF and vLLM across FP16 down to 4-bit — NF4, FP4, HQQ, Quanto, AWQ, GPTQ — on your own GPU. Get one verdict — which backend, which precision, what it costs.
The problem
Every tool tells you “INT4 uses less memory.”
That number decides nothing.
Memory reduction alone doesn't determine deployment quality — and neither does picking HF or vLLM at random. The same model on the wrong backend or wrong precision can quietly:
become slower
Lower precision can tank throughput instead of helping it.
lose coherence
Weights scramble and generations quietly fall apart.
spike TTFT
Latency to first token balloons under some kernels.
pick the wrong backend
vLLM can be 3× faster than HF — or crash your GPU. You won't know until you measure.
overspend on inference
Running FP16 on HF when vLLM INT4 would do is money left on the table every request.
litmus-lab exists to mathematically decide which backend, which precision, and what it costs — measured on your hardware, not someone else's.
Capabilities
Measure everything. Guess nothing.
Multi-backend benchmarking
HuggingFace (FP16, INT8, NF4, FP4, NF4+double-quant, HQQ, Quanto) and vLLM (FP16, BitsAndBytes, FP8) are live now, plus AWQ/GPTQ against a pre-quantized checkpoint on both backends. GGUF / llama.cpp support is actively being built — so CPU and Mac users can benchmark too.
Throughput comparison
↑ vLLM live · GGUF actively being built
Deployment verdict engine
A deterministic engine weighs your measured VRAM, throughput and perplexity delta against deployment thresholds — and outputs one verdict. No prompts. No hallucinations. Just your numbers.
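As a rough illustration of what a deterministic, threshold-based verdict can look like, here is a minimal sketch. It is not litmus-lab's actual engine: the `Result` class, the `verdict` function, and the two thresholds (`vram_budget_mb`, `max_ppl_delta`) are hypothetical names chosen for this example. The numbers are the sample run shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    mode: str       # backend and precision label
    vram_mb: float  # peak VRAM in MB
    tps: float      # tokens per second
    ppl: float      # perplexity


def verdict(results, baseline_mode, vram_budget_mb, max_ppl_delta):
    """Return the fastest run that fits the VRAM budget and stays within
    max_ppl_delta of the baseline perplexity, or None if nothing qualifies.
    Both thresholds are illustrative assumptions, not litmus-lab defaults."""
    baseline = next(r for r in results if r.mode == baseline_mode)
    eligible = [
        r for r in results
        if r.vram_mb <= vram_budget_mb and (r.ppl - baseline.ppl) <= max_ppl_delta
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r.tps)


# Sample measurements taken from the results table on this page.
runs = [
    Result("HF · FP16", 7297, 32.7, 5.64),
    Result("HF · NF4", 2334, 26.0, 7.34),
    Result("vLLM · FP16", 12687, 111.7, 5.65),
]

best = verdict(runs, "HF · FP16", vram_budget_mb=16000, max_ppl_delta=0.5)
print(best.mode if best else "no mode fits")
```

The same inputs always produce the same answer, which is the point of a rule-based verdict: tighten the VRAM budget or loosen the perplexity tolerance and the recommendation changes in a way you can trace.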
VRAM isolation & cleanup
Every pass runs in an isolated worker with aggressive CUDA cache flushing, GC and IPC clearing — so memory leaks never corrupt your VRAM readings.
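One common way to get this kind of isolation is to run each pass in a fresh Python process and clean up before it exits. The sketch below assumes that approach and is not litmus-lab's actual code: `WORKER` and `run_isolated` are hypothetical names, and the model-loading step is left as a comment. PyTorch is treated as optional so the sketch also runs on a machine without it.

```python
import json
import subprocess
import sys

# Source for the worker, executed in a fresh interpreter so that its CUDA
# context and any leaked allocations disappear when the process exits.
WORKER = r"""
import gc, json, sys
try:
    import torch
except ImportError:
    torch = None

def cleanup():
    # Drop Python garbage first, then release cached CUDA blocks and IPC handles.
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()

mode = sys.argv[1]
peak_mb = None
if torch is not None and torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    # ... load the model in `mode` and run generation here ...
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
cleanup()
print(json.dumps({"mode": mode, "peak_vram_mb": peak_mb}))
"""


def run_isolated(mode: str) -> dict:
    """Run one benchmark pass in its own process and return its JSON report."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER, mode],
        capture_output=True, text=True, check=True,
    )
    # The report is the last line the worker prints to stdout.
    return json.loads(proc.stdout.strip().splitlines()[-1])


if __name__ == "__main__":
    print(run_isolated("HF-FP16"))
```

Because each pass starts from an empty process, a peak-memory reading from one precision cannot be inflated by tensors left behind by the previous one.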
Cost prediction — in progress
Pass --target-users and --gpu-cost and litmus-lab will project token cost per request at your concurrency. Shipping soon.
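Until that feature ships, the arithmetic behind such a projection can be sketched by hand. The example below is a back-of-envelope estimate, not litmus-lab's formula: it assumes --gpu-cost means dollars per GPU-hour, that measured throughput holds steady under load, and the function names and the per-user token rate are hypothetical.

```python
import math


def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    """Dollars per 1M generated tokens on one GPU at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def cost_per_request(gpu_cost_per_hour, tokens_per_second, tokens_per_request):
    """Dollars for a single request that generates tokens_per_request tokens."""
    per_million = cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second)
    return per_million * tokens_per_request / 1_000_000


def gpus_needed(target_users, tokens_per_second_per_user, tokens_per_second_per_gpu):
    """GPUs required if every user streams at the given rate simultaneously."""
    demand = target_users * tokens_per_second_per_user
    return math.ceil(demand / tokens_per_second_per_gpu)


# Example: 111.7 tokens/s (the vLLM FP16 sample on this page) at an assumed $1.00/hour.
print(round(cost_per_million_tokens(1.0, 111.7), 2))
print(gpus_needed(target_users=50, tokens_per_second_per_user=10,
                  tokens_per_second_per_gpu=111.7))
```

Real serving cost also depends on batching, prompt length, and idle time, so treat the result as a first approximation to compare configurations, not as a bill.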
Beautiful terminal dashboard
Every benchmark renders as a clean, rich-formatted table right in your CLI — readable at a glance, copy-paste ready for a report.
Roadmap
From a CLI to your cost autopilot.
The CLI stays free, always. Everything we build next is designed to save your team real money.
The CLI
Free. Runs on your GPU. No signup.
- 1. pip install litmus-lab on any NVIDIA GPU
- 2. Run --backend all: benchmarks HF (FP16, INT8, NF4, FP4, HQQ, Quanto) and vLLM (FP16, BitsAndBytes, FP8) in one shot, plus AWQ/GPTQ on both
- 3. Get a single verdict: which backend, which precision, and why
The Platform
$199/mo · team dashboards · new-model alerts
- 1. Benchmark results auto-upload to your team dashboard
- 2. Get alerted when a newer model beats what you're running in production
- 3. Share reports, track cost across GPU upgrades, predict $/1M tokens at your concurrency
Cost Autopilot
$2K–10K/mo · monitors your live deployment
- 1. A lightweight agent runs next to your vLLM in production
- 2. Streams live GPU utilization, request concurrency, and cost per request
- 3. Surfaces the exact format change to cut your bill, before your cloud invoice does
DeepSeek-V3 matches your current Llama-3.1-70B quality at 58% of the cost.
Re-benchmark it on your own hardware before you switch — one click, no guesswork.
Current spend: $0 / month
Projected savings: $0 / month
Four signals
The numbers that actually decide a deployment.
FP16 → NF4 reclaimed
same model, same GPU
vLLM peak throughput
3.4× over HF FP16
Time to first token (HF)
lower is better
Quality degradation
FP16 → NF4 delta
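The quality signal here is a perplexity delta between the FP16 baseline and the quantized model. As a reminder of what that number means, the sketch below uses the standard definition of perplexity, the exponential of the mean negative log-likelihood per token. How litmus-lab collects the per-token log-probabilities is not shown on this page, so the inputs and function names are placeholders.

```python
import math


def perplexity(token_logprobs):
    """exp(-mean(log p)) over the natural-log probabilities of each token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ppl_delta(baseline_logprobs, quantized_logprobs):
    """Positive values mean the quantized model predicts the text less well."""
    return perplexity(quantized_logprobs) - perplexity(baseline_logprobs)


# Toy inputs: the baseline gives every token p=0.5, the quantized model p=0.25.
baseline = [math.log(0.5)] * 4
quantized = [math.log(0.25)] * 4
print(perplexity(baseline), perplexity(quantized), ppl_delta(baseline, quantized))
```

Both models should be scored on the same text with the same tokenizer, otherwise the delta mixes quantization damage with differences in the evaluation set.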
How it works
From pip install to a deployment verdict.
One command. Nothing leaves your machine.
pip install litmus-lab. No signup, no API key, no telemetry you didn't opt into. It runs entirely on your GPU.
$ pip install 'litmus-lab[all]'
HF and vLLM. Side by side. One shot.
Pass --backend all and litmus-lab runs HF (FP16, INT8, NF4, FP4, HQQ, Quanto, and more) and vLLM (FP16, BitsAndBytes, FP8) in isolated passes — measuring VRAM, throughput, latency and perplexity for each. Add --awq-model / --gptq-model to include AWQ/GPTQ against a pre-quantized checkpoint.
$ litmus-lab --model Qwen/Qwen2.5-7B \
    --prompt "Explain transformers" \
    --backend all
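For readers who want to sanity-check the latency and throughput figures, the measurement itself is simple to reproduce around any token stream. The sketch below is a generic timing wrapper, not litmus-lab's code: `measure_stream` is a hypothetical helper, and it defines throughput end to end (including the wait for the first token), which is one of several reasonable conventions.

```python
import time


def measure_stream(token_iter, clock=time.perf_counter):
    """Consume a token stream and report time to first token and tokens/second.

    `clock` is injectable so the function can be tested with a fake timer.
    """
    start = clock()
    first_token_at = None
    count = 0
    for _ in token_iter:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # timestamp of the first token's arrival
        count += 1
    end = clock()
    if count == 0:
        raise ValueError("stream produced no tokens")
    total = end - start
    return {
        "ttft_s": first_token_at - start,
        "tokens": count,
        "tps": count / total if total > 0 else float("inf"),
    }


def fake_stream(n, delay_s=0.01):
    """Stand-in for a model's streaming output: yields n tokens with a delay."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream(20)))
```

Swap `fake_stream` for a real streaming generator from your backend to compare its numbers against the benchmark output.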
One verdict. No more guessing.
A deterministic engine weighs your measured numbers and outputs a single recommendation — which backend, which precision, and why.
Mode          VRAM       TPS     PPL
HF · FP16     7297 MB    32.7    5.64
HF · NF4      2334 MB    26.0    7.34
vLLM · FP16   12687 MB   111.7   5.65
→ Deploy vLLM · FP16
3.4× faster · PPL delta 0.01
Works with any HuggingFace causal LM
Early access
The CLI is free.
The platform is next.
litmus-lab is available today — install it, run a benchmark, get a verdict. The waitlist is for early access to the web platform: benchmark history, team dashboards, new-model alerts, and cost prediction at scale.
No spam. Just one email when it's ready.