Service Tiers

Groq offers multiple service tiers so you can tune for latency, throughput, and reliability. Select a tier by passing the service_tier parameter in your request.

  • performance: Our highest tier, providing reliable low latency for the most critical production applications. This tier is available to enterprise users. More info at Performance Tier.
  • on_demand: The default tier when service_tier is omitted. This is the standard tier you already know: the predictable high speeds of Groq's LPU, with occasional queue latency during peak times.
  • flex: Higher throughput, provided on a best-effort basis. You get higher limits but may receive over-capacity errors (see the fallback sketch after the example below). Check out Flex Processing for more info.
  • auto: Pass this if you don't want to think about tiers and want Groq to use the best tier available to you at any given moment.
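
For example, to let Groq route the request to the best tier currently available:
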
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

completion = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    service_tier="auto",  # let Groq choose the best tier available right now
    messages=[{"role": "user", "content": "Summarize the latest release highlights."}],
)

print(completion.choices[0].message.content)
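
Because flex requests are served on a best-effort basis, a client can fall back to on_demand when capacity runs out. The sketch below is one way to do that, not the only one; the APIStatusError handling and the 498 over-capacity status code are assumptions to verify against the Flex Processing docs, and ask_with_fallback is an illustrative helper.

import os

from groq import APIStatusError, Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def ask_with_fallback(prompt: str) -> str:
    try:
        # Try flex first for its higher limits.
        return client.chat.completions.create(
            model="openai/gpt-oss-120b",
            service_tier="flex",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
    except APIStatusError as err:
        # Assumed capacity signals: 429 (rate limited) and 498 (flex over
        # capacity). Verify the exact codes in the Flex Processing docs.
        if err.status_code not in (429, 498):
            raise
        # Fall back to the default tier at standard limits.
        return client.chat.completions.create(
            model="openai/gpt-oss-120b",
            service_tier="on_demand",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content

print(ask_with_fallback("Summarize the latest release highlights."))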

Batch and Asynchronous Workloads

The Batch API has its own processing window and rate limits and does not accept the service_tier parameter. Use synchronous requests when you need explicit tier control; batch jobs run independently of your per-model synchronous limits.
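
For completeness, here is a minimal sketch of submitting a batch job, assuming the JSONL upload flow from Groq's Batch documentation; the batch_input.jsonl filename and custom_id value are illustrative, and note that neither the batch nor its request bodies carries a service_tier field.

import json
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

# One JSONL line per request; there is no service_tier field anywhere.
request = {
    "custom_id": "req-1",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Summarize the latest release highlights."}],
    },
}
with open("batch_input.jsonl", "w") as f:
    f.write(json.dumps(request) + "\n")

# Upload the file, then create the batch with its own completion window.
uploaded = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=uploaded.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)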