Profiles your workload pattern — chat, RAG, batch, agentic — and ranks compatible engines against your exact distribution.
Drop in any model and any hardware. Our agentic harness picks the optimal engine, tunes every config, writes custom kernels, and stress-tests the result — so your GPUs run at the limit of silicon.
$npx skill add @autoinference/skill
$pip install autoinference
$uv add autoinference
$brew install autoinference/tap/autoinference
$npm install autoinference
$cargo add autoinference
$docker pull autoinference/autoinference
Your deployment is ready.
View deployment at dashboard.autoinference.org/acme/kimi-2.6/a7f3e9c1
Want me to ship a canary first, or roll directly?
Every model has dozens of correct ways to serve it, and only one peak. Engine choice, batch size, KV-cache layout, tensor parallelism, quantization scheme, custom kernels — the search space is massive, and a wrong setting silently leaves 40–80% of your GPU on the floor.
Each stage is owned by a specialized agent. Together they take you from model weights to a hardened, peak-throughput deployment — without you writing a single config line.
Profiles your workload pattern — chat, RAG, batch, agentic — and ranks compatible engines against your exact distribution.
Bayesian + bandit search across batch size, KV-cache paging, tensor-parallel shape, speculative decoding, and quantization.
Writes custom kernels for the hot path of your model. Auto-verified bit-equivalent to reference.
Every deployment ships behind a generated test harness — functional, load, and adversarial security.
You bring the model and the hardware. We bring the rest.
Hugging Face checkpoint, local weights, custom fine-tune, or one of our pre-profiled models.
from autoinference import Deployment
deploy = Deployment(
model="zai-org/GLM-5.1",
quantize="auto",
)
Single GPU, multi-node cluster, AMD, edge Jetson, or a cloud provider. Auto-detects topology, interconnect, memory.
deploy.target(
hardware="8x B300",
interconnect="NVLink-5",
slo={"p99_latency_ms": 200, "min_qps": 120},
)
The agentic harness takes over. You get an OpenAI-compatible endpoint, with full audit trail.
endpoint = deploy.run()
# → view summary at dashboard.autoinference.org/acme/glm-5.1/4b2c8d61
# → tested, hardened, deployed on your cluster
Every CLI run streams to a versioned workspace. The web view renders the same data — audit, share, and trigger re-runs from anywhere. You never need it to ship.
autoinference is published as an MCP server on every major registry. Install it once and your agentic coding tool can deploy, re-tune, and benchmark models on your infrastructure — without you ever opening a terminal.
$ npx autoinference
$ npx skill add @autoinference/skill
autoinference deploy;
agents start the MCP server with autoinference mcp, which exposes
autoinference.deploy, autoinference.benchmark, and
autoinference.summary_url.
Every deployment ships behind a generated test harness. If something would break a real user, it breaks our test first — and the deployment is held until it's fixed.
Bit-equivalence sweeps against the reference implementation. Golden-output regression on thousands of sampled prompts per release.
Ramp tests up to 10,000 concurrent users, traffic patterns from real production traces, multi-hour soak to surface leaks before they bite.
OWASP LLM Top-10 coverage out of the box. Prompt-injection battery, jailbreak corpus, PII leakage probes, and supply-chain SBOM for every kernel artifact.
Production telemetry feeds back into the harness. When workload distribution drifts, re-tuning kicks off in shadow, and a canary swap rolls only on improvement.
Auto Inference (autoinference) is an agentic CLI for production LLM inference optimization. It runs on your own infrastructure (not a hosted API), picks the optimal inference engine for your model and hardware, tunes every config, synthesises custom kernels, and stress-tests the result — so your GPUs run at the limit of silicon.
Auto Inference ranks and tunes across vLLM, SGLang, TensorRT-LLM, tokenspeed, and LMDeploy. The agentic pipeline picks whichever engine is genuinely optimal for the chosen model + hardware combination — e.g. SGLang wins for DeepSeek-family MoE deployments on H200; NVFP4 / fused Triton paths win for Kimi 2.6 on B300.
NVIDIA H100, H200, B200, B300, GB200 (single-node and multi-node, demonstrated up to 32-node × 8-GPU = 256× H200); AMD MI300X and MI355X; Intel Gaudi 3; Google TPU v5p and v6e; and Apple Silicon for local development.
Run npx skill add @autoinference/skill, or npx -y @autoinference/skill, in any MCP-aware client (Claude Code, Codex, Cursor, Antigravity). Per-editor short forms: claude mcp add autoinference, codex mcp add autoinference, cursor mcp add autoinference, antigravity mcp install autoinference.
No. Auto Inference is a CLI tool that runs on your own GPU cluster. There is no api.autoinference.org endpoint. The mental model is wandb for inference: wandb does hyperparameter tuning for training; autoinference does parameter tuning for production inference and gives you a dashboard URL per deployment at dashboard.autoinference.org/<user>/<project>/<hash>.
We're onboarding design-partner teams running real production inference. Tell us your stack — we'll come back with a deployment plan and an early slot.
No spam. We use the email only to reply about beta access.
v0.0.1 is published as placeholders across pip, uv, npm, cargo, go, and docker. Real SDK lands with the design-partner beta.
pip install autoinferenceuv add autoinferencenpm install autoinferencecargo add autoinferencego get github.com/autoinference/autoinferencedocker pull autoinference/autoinferencebrew install autoinference/tap/autoinferencehuggingface.co/autoinference