Performance Tier

The Performance tier provides prioritized capacity for low, consistent latency. It includes a 99.9% availability SLA and a 99% latency guarantee aligned to your enterprise agreement.

Enterprise only: The Performance tier is only available on enterprise plans. Reach out to our enterprise team to get access.

Best Use Cases

User-facing or production-critical paths where latency consistency matters most.
Workloads that need low p99 time to first token.
If you want to easily burst beyond your limits, you can pass service_tier=auto in order to use your performance tier limits if they are avaiable, and burst into on_demand. This is a great way to balance perforamcne and costs.

Packaging and pricing

Performance is delivered as provisioned throughput: you purchase input and output capacity bundles and pay for that provisioned capacity rather than per-token usage. Reach out to inquire about pricing.
Availability SLA: 99.9% uptime and 99% low latency guarantee. (Details specified in offline agreement)

Model availability

Model ID	Name
`openai/gpt-oss-120b`	GPT-OSS 120B
`openai/gpt-oss-20b`	GPT-OSS 20B
`llama-3.3-70b-versatile`	Llama 3.3 70B Versatile

Requirements

Context length (uncached) must be under 8,192 tokens

Note: If you want automatic context-length handling and tier mapping, use service_tier=auto and we'll route requests to appropriate tiers behind the scenes based on your limits and context size.

Getting Started

Core Features

Tools & Integrations

Compound (Agentic AI)

Guides

Service Tiers

Advanced

Production Readiness

Account and Console

Developer Resources

Legal

Performance Tier

Best Use Cases

Packaging and pricing

Model availability

Requirements

Was this page helpful?

On this page