ALPHA v0.0.1 · MIT-licensed · early access

Peak inference performance.
Zero ML engineering

Drop in any model and any hardware. Our agentic harness picks the optimal engine, tunes every config, writes custom kernels, and stress-tests the result — so your GPUs run at the limit of silicon.

autoinference · install
$npx skill add @autoinference/skill
engine-agnostic
hardware-aware
battle-tested
autoinference v0.0.1-alpha optimizing
> deploy kimi-2.6 on 8× B300
Hardware Probe(2.1s)
8× B300 detected · NVLink-5 mesh · 288 GB HBM3e/GPU
Engine Selection(0.8s)
ranked 7 compatible engines · selected optimal
Kernel Synthesis(2m 38s)
NVFP4 fused RMSNorm+RoPE · 4.2× faster · 0-ulp drift
Config Tuning(4m 12s)
+3.4× tok/s · −62% p99 latency · 94% gpu util
Validation Harness(1m 47s)
load + correctness + security · 6/6 passed

Your deployment is ready.

View deployment at dashboard.autoinference.org/acme/kimi-2.6/a7f3e9c1

Want me to ship a canary first, or roll directly?

refining recommendation...

// tested across the inference stack

engines
vvLLM SSGLang TensorRT-LLM ttokenspeed LLMDeploy
hardware
NVIDIA H100 / H200 NVIDIA B200 / B300 / GB200 AMD MI300X / MI355X Intel Gaudi 3 TPU v5p / v6e Apple Silicon
runtimes
PyTorch CCUDA · ROCm Triton NNCCL · NVSwitch Kubernetes Docker
// the_problem

Modern inference is just spinning up vLLM
a multi-axis optimization problem

Every model has dozens of correct ways to serve it, and only one peak. Engine choice, batch size, KV-cache layout, tensor parallelism, quantization scheme, custom kernels — the search space is massive, and a wrong setting silently leaves 40–80% of your GPU on the floor.

40-80%
of GPU compute wasted in default configs
3-6 weeks
average time for an ML eng team to dial-in one model on one hardware target
2× ↑
cloud spend that disappears when you re-tune for the next model
// the_platform

Multiple agents. One autonomous pipeline

Each stage is owned by a specialized agent. Together they take you from model weights to a hardened, peak-throughput deployment — without you writing a single config line.

[01] engine_selection

Profiles your workload pattern — chat, RAG, batch, agentic — and ranks compatible engines against your exact distribution.

[✓] workload fingerprinting
[✓] engine capability matrix
[✓] cross-engine A/B harness
[02] config_tuning

Bayesian + bandit search across batch size, KV-cache paging, tensor-parallel shape, speculative decoding, and quantization.

[✓] hundreds of hyperparameters
[✓] SLO-aware Pareto front
[✓] closed-loop re-tuning on drift
[03] kernel_synthesis

Writes custom kernels for the hot path of your model. Auto-verified bit-equivalent to reference.

[✓] CUDA / ROCm / Triton / Metal
[✓] 0-ulp output verification
[✓] per-arch tuning
[04] validation

Every deployment ships behind a generated test harness — functional, load, and adversarial security.

[✓] correctness vs reference
[✓] load tests up to 10k concurrent
[✓] OWASP LLM coverage
// how_it_works

Three inputs. One deployment

You bring the model and the hardware. We bring the rest.

[01] pick_a_model

Hugging Face checkpoint, local weights, custom fine-tune, or one of our pre-profiled models.

python ⎘ copy
from autoinference import Deployment

deploy = Deployment(
    model="zai-org/GLM-5.1",
    quantize="auto",
)
[02] state_your_hardware

Single GPU, multi-node cluster, AMD, edge Jetson, or a cloud provider. Auto-detects topology, interconnect, memory.

python ⎘ copy
deploy.target(
    hardware="8x B300",
    interconnect="NVLink-5",
    slo={"p99_latency_ms": 200, "min_qps": 120},
)
[03] hit_deploy

The agentic harness takes over. You get an OpenAI-compatible endpoint, with full audit trail.

python ⎘ copy
endpoint = deploy.run()
# → view summary at dashboard.autoinference.org/acme/glm-5.1/4b2c8d61
# → tested, hardened, deployed on your cluster
// performance

Numbers we're targeting at GA — on your existing hardware

autoinference · bench
$ autoinference bench --vs default
running ~5 min · 1024 prompts · concurrency ramp
throughput
0× tok/s vs default
p99 latency
0% p99 reduction
gpu util
0% steady-state
$/M tokens
0% per-token spend
[✓] passed 6 / 6 SLO targets met · 95% confidence interval
// the_webapp

The CLI is the source of truth. The web is the lens

Every CLI run streams to a versioned workspace. The web view renders the same data — audit, share, and trigger re-runs from anywhere. You never need it to ship.

// mcp_integration

Or skip the CLI — call it from your agent

autoinference is published as an MCP server on every major registry. Install it once and your agentic coding tool can deploy, re-tune, and benchmark models on your infrastructure — without you ever opening a terminal.

// universal install · pick either path
npm direct npm: autoinference
$ npx autoinference
skill registry @autoinference/skill
$ npx skill add @autoinference/skill
Both paths land the same MCP server. The CLI runs as autoinference deploy; agents start the MCP server with autoinference mcp, which exposes autoinference.deploy, autoinference.benchmark, and autoinference.summary_url.
Claude Code
$ claude mcp add autoinference
Codex
$ codex mcp add autoinference
Cursor
$ cursor mcp add autoinference
Antigravity
$ antigravity mcp install autoinference
claude code · agent session · autoinference mcp connected
> deploy kimi-2.6 on our 8× B300 box, optimize for p99 < 50ms
autoinference.deploy (model="kimi-2.6", hardware="8x B300", slo={"p99_ms": 50}) (6m 24s)
SGLang · NVFP4 · EAGLE-3 · +3.4× tok/s · 42ms p99 · 94% util
autoinference.summary_url (0.1s)
dashboard.autoinference.org/acme/kimi-2.6/a7f3e9c1
Deployment is live on your cluster. I picked SGLang over vLLM because Kimi 2.6's MoE all-to-all path is more mature in SGLang on Blackwell. Want me to wire it into your gateway?
// pulled from Anthropic MCP Registry Smithery mcp.so Hugging Face MCP Hub Glama
// validation

Battle-tested before your users ever see it

Every deployment ships behind a generated test harness. If something would break a real user, it breaks our test first — and the deployment is held until it's fixed.

autoinference · test --suite full
$ autoinference test --suite full
[✓] correctness 12,000 / 12,000
[✓] numerical equivalence 0 ulp drift
[✓] load · 10k concurrent p99 = 142ms
[✓] 4-hour soak no mem drift
[✓] OWASP LLM-01..10 147 / 147 blocked
[✓] sbom + signature attested
──────────────────────────────────────────────────────────
PASSED 6 / 6 suites · runtime 4m 18s
$
[01] correctness

Bit-equivalence sweeps against the reference implementation. Golden-output regression on thousands of sampled prompts per release.

0 ulp
max numerical drift
[02] load_and_soak

Ramp tests up to 10,000 concurrent users, traffic patterns from real production traces, multi-hour soak to surface leaks before they bite.

10k
concurrent connections
[03] security

OWASP LLM Top-10 coverage out of the box. Prompt-injection battery, jailbreak corpus, PII leakage probes, and supply-chain SBOM for every kernel artifact.

147
attack patterns covered
[04] continuous_retune

Production telemetry feeds back into the harness. When workload distribution drifts, re-tuning kicks off in shadow, and a canary swap rolls only on improvement.

< 30s
canary rollback window
// faq

Common questions

// what_is_auto_inference

Auto Inference (autoinference) is an agentic CLI for production LLM inference optimization. It runs on your own infrastructure (not a hosted API), picks the optimal inference engine for your model and hardware, tunes every config, synthesises custom kernels, and stress-tests the result — so your GPUs run at the limit of silicon.

// supported_inference_engines

Auto Inference ranks and tunes across vLLM, SGLang, TensorRT-LLM, tokenspeed, and LMDeploy. The agentic pipeline picks whichever engine is genuinely optimal for the chosen model + hardware combination — e.g. SGLang wins for DeepSeek-family MoE deployments on H200; NVFP4 / fused Triton paths win for Kimi 2.6 on B300.

// supported_hardware

NVIDIA H100, H200, B200, B300, GB200 (single-node and multi-node, demonstrated up to 32-node × 8-GPU = 256× H200); AMD MI300X and MI355X; Intel Gaudi 3; Google TPU v5p and v6e; and Apple Silicon for local development.

// install_as_mcp_skill

Run npx skill add @autoinference/skill, or npx -y @autoinference/skill, in any MCP-aware client (Claude Code, Codex, Cursor, Antigravity). Per-editor short forms: claude mcp add autoinference, codex mcp add autoinference, cursor mcp add autoinference, antigravity mcp install autoinference.

// is_it_a_hosted_api

No. Auto Inference is a CLI tool that runs on your own GPU cluster. There is no api.autoinference.org endpoint. The mental model is wandb for inference: wandb does hyperparameter tuning for training; autoinference does parameter tuning for production inference and gives you a dashboard URL per deployment at dashboard.autoinference.org/<user>/<project>/<hash>.

// status: alpha · private beta Q3 2026

Stop tuning. Start shipping

We're onboarding design-partner teams running real production inference. Tell us your stack — we'll come back with a deployment plan and an early slot.

No spam. We use the email only to reply about beta access.

// sdk · placeholder packages (real CLI ships with beta)

Already on every registry

v0.0.1 is published as placeholders across pip, uv, npm, cargo, go, and docker. Real SDK lands with the design-partner beta.

Python
pip install autoinference
uv uv
uv add autoinference
Node
npm install autoinference
Rust
cargo add autoinference
Go
go get github.com/autoinference/autoinference
Docker
docker pull autoinference/autoinference
brew
brew install autoinference/tap/autoinference
HF
huggingface.co/autoinference