Flex Processing is a service tier optimized for high-throughput workloads that prioritize fast inference and can tolerate occasional request failures. It offers 10x higher rate limits than on-demand processing at the same price, and is available on all models to paid customers.
When flex capacity is exhausted, requests fail fast with HTTP status 498 and error code `capacity_exceeded`; add jittered backoff and retries on the client to smooth traffic spikes. To opt in, set `service_tier` to `"flex"` in the request body:

```python
import os

import requests

GROQ_API_KEY = os.environ.get("GROQ_API_KEY")

def main():
    try:
        response = requests.post(
            "https://api.groq.com/openai/v1/chat/completions",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {GROQ_API_KEY}",
            },
            json={
                "service_tier": "flex",  # opt in to the flex tier
                "model": "llama-3.3-70b-versatile",
                "messages": [{"role": "user", "content": "whats 2 + 2"}],
            },
        )
        print(response.json())
    except Exception as e:
        print(f"Error: {str(e)}")

if __name__ == "__main__":
    main()
```
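The fail-fast 498 behavior pairs naturally with client-side retries. Below is a minimal sketch of jittered exponential backoff around the request; the helper name `post_with_backoff` and the `max_retries`/`base_delay` defaults are illustrative choices, not Groq-recommended values:

```python
import random
import time

import requests

def post_with_backoff(url, *, headers, json, max_retries=5, base_delay=1.0):
    """POST with jittered exponential backoff.

    Retries when the flex tier reports capacity_exceeded (HTTP 498) or a
    transient server error (5xx); any other status is returned immediately.
    """
    response = None
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=json)
        if response.status_code != 498 and response.status_code < 500:
            return response
        # Full jitter: sleep a random duration in [0, base_delay * 2**attempt).
        time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    # Out of retries: surface the last failure to the caller.
    response.raise_for_status()
    return response
```

Spreading retries with random jitter keeps a burst of clients from hammering the API in lockstep when capacity frees up.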