ALPHA v0.0.1 · MIT-licensed · early access

Peak inference performance.
Zero ML engineering

Drop in any model and any hardware. Our agentic harness picks the optimal engine, tunes every config, writes custom kernels, and stress-tests the result — so your GPUs run at the limit of silicon.

autoinference · install

$npx skill add @autoinference/skill

join_the_waitlist see_how_it_works

engine-agnostic

hardware-aware

battle-tested

✻ autoinference v0.0.1-alpha ⠋ optimizing

> deploy kimi-2.6 on 8× B300

● Hardware Probe(2.1s)

⎿ 8× B300 detected · NVLink-5 mesh · 288 GB HBM3e/GPU

● Engine Selection(0.8s)

⎿ ranked 7 compatible engines · selected optimal

● Kernel Synthesis(2m 38s)

⎿ NVFP4 fused RMSNorm+RoPE · 4.2× faster · 0-ulp drift

● Config Tuning(4m 12s)

⎿ +3.4× tok/s · −62% p99 latency · 94% gpu util

● Validation Harness(1m 47s)

⎿ load + correctness + security · 6/6 passed

Your deployment is ready.

View deployment at dashboard.autoinference.org/acme/kimi-2.6/a7f3e9c1

Want me to ship a canary first, or roll directly?

✶ refining recommendation...

// tested across the inference stack

engines

vvLLM SSGLang

TensorRT-LLM ttokenspeed LLMDeploy

hardware

NVIDIA H100 / H200

NVIDIA B200 / B300 / GB200

AMD MI300X / MI355X

Intel Gaudi 3

TPU v5p / v6e

Apple Silicon

runtimes

PyTorch CCUDA · ROCm ▲Triton NNCCL · NVSwitch

Kubernetes

Docker

// the_problem

Modern inference is just spinning up vLLM
a multi-axis optimization problem

Every model has dozens of correct ways to serve it, and only one peak. Engine choice, batch size, KV-cache layout, tensor parallelism, quantization scheme, custom kernels — the search space is massive, and a wrong setting silently leaves 40–80% of your GPU on the floor.

40-80%

of GPU compute wasted in default configs

3-6 weeks

average time for an ML eng team to dial-in one model on one hardware target

2× ↑

cloud spend that disappears when you re-tune for the next model

// the_platform

Multiple agents. One autonomous pipeline

Each stage is owned by a specialized agent. Together they take you from model weights to a hardened, peak-throughput deployment — without you writing a single config line.

[01] engine_selection

Profiles your workload pattern — chat, RAG, batch, agentic — and ranks compatible engines against your exact distribution.

[✓] workload fingerprinting

[✓] engine capability matrix

[✓] cross-engine A/B harness

[02] config_tuning

Bayesian + bandit search across batch size, KV-cache paging, tensor-parallel shape, speculative decoding, and quantization.

[✓] hundreds of hyperparameters

[✓] SLO-aware Pareto front

[✓] closed-loop re-tuning on drift

[03] kernel_synthesis

Writes custom kernels for the hot path of your model. Auto-verified bit-equivalent to reference.

[✓] CUDA / ROCm / Triton / Metal

[✓] 0-ulp output verification

[✓] per-arch tuning

[04] validation

Every deployment ships behind a generated test harness — functional, load, and adversarial security.

[✓] correctness vs reference

[✓] load tests up to 10k concurrent

[✓] OWASP LLM coverage

// how_it_works

Three inputs. One deployment

You bring the model and the hardware. We bring the rest.

[01] pick_a_model

Hugging Face checkpoint, local weights, custom fine-tune, or one of our pre-profiled models.

python ⎘ copy

from autoinference import Deployment

deploy = Deployment(
    model="zai-org/GLM-5.1",
    quantize="auto",
)

[02] state_your_hardware

Single GPU, multi-node cluster, AMD, edge Jetson, or a cloud provider. Auto-detects topology, interconnect, memory.

python ⎘ copy

deploy.target(
    hardware="8x B300",
    interconnect="NVLink-5",
    slo={"p99_latency_ms": 200, "min_qps": 120},
)

[03] hit_deploy

The agentic harness takes over. You get an OpenAI-compatible endpoint, with full audit trail.

python ⎘ copy

endpoint = deploy.run()
# → view summary at dashboard.autoinference.org/acme/glm-5.1/4b2c8d61
# → tested, hardened, deployed on your cluster

// performance

Numbers we're targeting at GA — on your existing hardware

autoinference · bench

$ autoinference bench --vs default

running ~5 min · 1024 prompts · concurrency ramp

throughput

0× tok/s vs default

p99 latency

−0% p99 reduction

gpu util

0% steady-state

$/M tokens

−0% per-token spend

[✓] passed 6 / 6 SLO targets met · 95% confidence interval

// the_webapp

The CLI is the source of truth. The web is the lens

Every CLI run streams to a versioned workspace. The web view renders the same data — audit, share, and trigger re-runs from anywhere. You never need it to ship.

dashboard.autoinference.org/acme/deepseek-v4-pro/9e1a5c3d ⌘K

Deployments / deepseek-v4-pro

Live

deepseek-v4-pro · fp8 · 256× H200

prod-us-west · 32 nodes

Tokens/sec (agg.)

184,200

+241% vs baseline

P99 latency

42ms

−62% vs baseline

GPU utilization

94%

+38 pts

$ / 1M tokens

$0.47

−47% spend

Optimization run #1842

live · agent thinking

14:02:11planProfiling workload · MoE w/ MLA + MTP heads

14:02:18chooseEngine: SGLang · TP=8 · EP=32 across 32 nodes

14:02:31tuneSweeping DeepEP all-to-all topology…

14:03:02tuneFlashMLA + paged FP8 KV · 2 MTP heads enabled

14:03:14kernelCompiling DeepGEMM FP8 grouped GEMM…

14:03:51verifyKernel verified bit-equivalent ✓

14:04:09loadtestRamping concurrent users: 1,200 / 2,000

Engine ranking

SGLang 96

TensorRT-LLM 95

tokenspeed 94

vLLM 91

LMDeploy 88

✓OpenAI-compatible REST endpoint, out of the box
✓Streaming, function calling, structured outputs
✓Versioned deployments with one-click rollback
✓Per-tenant rate limits + audit logs
✓Self-host or fully-managed

// mcp_integration

Or skip the CLI — call it from your agent

autoinference is published as an MCP server on every major registry. Install it once and your agentic coding tool can deploy, re-tune, and benchmark models on your infrastructure — without you ever opening a terminal.

// universal install · pick either path

npm direct npm: autoinference

$ npx autoinference

skill registry @autoinference/skill

$ npx skill add @autoinference/skill

Both paths land the same MCP server. The CLI runs as autoinference deploy; agents start the MCP server with autoinference mcp, which exposes autoinference.deploy, autoinference.benchmark, and autoinference.summary_url.

Claude Code

$ claude mcp add autoinference

Codex

$ codex mcp add autoinference

Cursor

$ cursor mcp add autoinference

Antigravity

$ antigravity mcp install autoinference

claude code · agent session · autoinference mcp connected

> deploy kimi-2.6 on our 8× B300 box, optimize for p99 < 50ms

● autoinference.deploy (model="kimi-2.6", hardware="8x B300", slo={"p99_ms": 50}) (6m 24s)

⎿ SGLang · NVFP4 · EAGLE-3 · +3.4× tok/s · 42ms p99 · 94% util

● autoinference.summary_url (0.1s)

⎿ dashboard.autoinference.org/acme/kimi-2.6/a7f3e9c1

Deployment is live on your cluster. I picked SGLang over vLLM because Kimi 2.6's MoE all-to-all path is more mature in SGLang on Blackwell. Want me to wire it into your gateway?

// pulled from Anthropic MCP Registry Smithery mcp.so Hugging Face MCP Hub Glama

// validation

Battle-tested before your users ever see it

Every deployment ships behind a generated test harness. If something would break a real user, it breaks our test first — and the deployment is held until it's fixed.

autoinference · test --suite full

$ autoinference test --suite full

[✓] correctness 12,000 / 12,000

[✓] numerical equivalence 0 ulp drift

[✓] load · 10k concurrent p99 = 142ms

[✓] 4-hour soak no mem drift

[✓] OWASP LLM-01..10 147 / 147 blocked

[✓] sbom + signature attested

──────────────────────────────────────────────────────────

PASSED 6 / 6 suites · runtime 4m 18s

[01] correctness

Bit-equivalence sweeps against the reference implementation. Golden-output regression on thousands of sampled prompts per release.

0 ulp

max numerical drift

[02] load_and_soak

Ramp tests up to 10,000 concurrent users, traffic patterns from real production traces, multi-hour soak to surface leaks before they bite.

10k

concurrent connections

[03] security

OWASP LLM Top-10 coverage out of the box. Prompt-injection battery, jailbreak corpus, PII leakage probes, and supply-chain SBOM for every kernel artifact.

147

attack patterns covered

[04] continuous_retune

Production telemetry feeds back into the harness. When workload distribution drifts, re-tuning kicks off in shadow, and a canary swap rolls only on improvement.

< 30s

canary rollback window

// faq

Common questions

// what_is_auto_inference

Auto Inference (autoinference) is an agentic CLI for production LLM inference optimization. It runs on your own infrastructure (not a hosted API), picks the optimal inference engine for your model and hardware, tunes every config, synthesises custom kernels, and stress-tests the result — so your GPUs run at the limit of silicon.

// supported_inference_engines

Auto Inference ranks and tunes across vLLM, SGLang, TensorRT-LLM, tokenspeed, and LMDeploy. The agentic pipeline picks whichever engine is genuinely optimal for the chosen model + hardware combination — e.g. SGLang wins for DeepSeek-family MoE deployments on H200; NVFP4 / fused Triton paths win for Kimi 2.6 on B300.

// supported_hardware

NVIDIA H100, H200, B200, B300, GB200 (single-node and multi-node, demonstrated up to 32-node × 8-GPU = 256× H200); AMD MI300X and MI355X; Intel Gaudi 3; Google TPU v5p and v6e; and Apple Silicon for local development.

// install_as_mcp_skill

Run npx skill add @autoinference/skill, or npx -y @autoinference/skill, in any MCP-aware client (Claude Code, Codex, Cursor, Antigravity). Per-editor short forms: claude mcp add autoinference, codex mcp add autoinference, cursor mcp add autoinference, antigravity mcp install autoinference.

// is_it_a_hosted_api

No. Auto Inference is a CLI tool that runs on your own GPU cluster. There is no api.autoinference.org endpoint. The mental model is wandb for inference: wandb does hyperparameter tuning for training; autoinference does parameter tuning for production inference and gives you a dashboard URL per deployment at dashboard.autoinference.org/<user>/<project>/<hash>.

// status: alpha · private beta Q3 2026

Stop tuning. Start shipping

We're onboarding design-partner teams running real production inference. Tell us your stack — we'll come back with a deployment plan and an early slot.

No spam. We use the email only to reply about beta access.

github.com/autoinference