Rate Limits

Rate limits act as control measures to regulate how frequently users and applications can access our API within specified timeframes. These limits help ensure service stability, fair access, and protection against misuse so that we can serve reliable and fast inference for all.

Understanding Rate Limits

Rate limits are measured in:

  • RPM: Requests per minute
  • RPD: Requests per day
  • TPM: Tokens per minute
  • TPD: Tokens per day
  • ASH: Audio seconds per hour
  • ASD: Audio seconds per day

Rate limits apply at the organization level, not to individual users. You can hit any limit type, depending on which threshold you reach first.

Example: Let's say your RPM = 50 and your TPM = 200K. If you send 50 requests of only 100 tokens each within a minute, you reach your request limit, even though those 50 requests total far fewer than 200K tokens.
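The arithmetic in the example above can be checked with a short sketch. The function name `limits_reached` is hypothetical, and the sketch assumes every request in the burst carries the same number of tokens (100 tokens each in the example); it is an illustration, not how the API itself counts usage.

```python
def limits_reached(requests_sent, tokens_per_request, rpm, tpm):
    """Return the per-minute limits reached by a burst of identical requests.

    Hypothetical helper for illustration only.
    """
    total_tokens = requests_sent * tokens_per_request
    reached = []
    if requests_sent >= rpm:
        reached.append("RPM")
    if total_tokens >= tpm:
        reached.append("TPM")
    return reached


# 50 requests x 100 tokens = 5,000 tokens, well under the 200K TPM limit,
# but the request count already equals the RPM limit of 50.
print(limits_reached(50, 100, rpm=50, tpm=200_000))  # ['RPM']
```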

Rate Limits

The following is a high-level summary, and there may be exceptions to these limits. You can view the current, exact rate limits for your organization on the limits page in your account settings.

MODEL ID                       RPM  RPD     TPM     TPD      ASH    ASD
deepseek-r1-distill-llama-70b  30   1,000   6,000   -        -      -
llama-3.3-70b-versatile        30   1,000   6,000   100,000  -      -
llama-3.3-70b-specdec          30   1,000   6,000   100,000  -      -
llama-3.2-1b-preview           30   7,000   7,000   500,000  -      -
llama-3.2-3b-preview           30   7,000   7,000   500,000  -      -
llama-3.1-8b-instant           30   14,400  6,000   500,000  -      -
llama3-70b-8192                30   14,400  6,000   500,000  -      -
llama3-8b-8192                 30   14,400  6,000   500,000  -      -
llama-guard-3-8b               30   14,400  15,000  500,000  -      -
mixtral-8x7b-32768             30   14,400  5,000   500,000  -      -
gemma2-9b-it                   30   14,400  15,000  500,000  -      -
whisper-large-v3               20   2,000   -       -        7,200  28,800
whisper-large-v3-turbo         20   2,000   -       -        7,200  28,800
distil-whisper-large-v3-en     20   2,000   -       -        7,200  28,800
llama-3.2-11b-vision-preview   30   7,000   7,000   500,000  -      -
llama-3.2-90b-vision-preview   15   3,500   7,000   250,000  -      -

Rate Limit Headers

In addition to the limits page in your account settings, rate limit information such as remaining requests and tokens is returned in HTTP response headers.

The following headers are set (values are illustrative):

Header                          Value     Notes
retry-after                     2         In seconds
x-ratelimit-limit-requests      14400     Always refers to Requests Per Day (RPD)
x-ratelimit-limit-tokens        18000     Always refers to Tokens Per Minute (TPM)
x-ratelimit-remaining-requests  14370     Always refers to Requests Per Day (RPD)
x-ratelimit-remaining-tokens    17997     Always refers to Tokens Per Minute (TPM)
x-ratelimit-reset-requests      2m59.56s  Always refers to Requests Per Day (RPD)
x-ratelimit-reset-tokens        7.66s     Always refers to Tokens Per Minute (TPM)
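As a rough illustration of reading these headers on the client side, the sketch below converts them into numbers. The function names `parse_duration` and `read_rate_limit_headers` are hypothetical, and the duration format is inferred only from the illustrative values above (such as 2m59.56s and 7.66s); support for "h" and "ms" units is an assumption, so adjust the parser if real responses differ.

```python
import re

# Assumed duration format: one or more <number><unit> parts, e.g. "2m59.56s".
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_duration(value):
    """Convert a string such as '2m59.56s' or '7.66s' to seconds."""
    parts = _DURATION_PART.findall(value)
    # Reject anything that is not fully covered by recognized parts.
    if not parts or "".join(num + unit for num, unit in parts) != value:
        raise ValueError(f"unrecognized duration: {value!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def read_rate_limit_headers(headers):
    """Extract rate limit state from a mapping of header names to string values."""
    h = {name.lower(): value for name, value in headers.items()}
    info = {
        # Request headers refer to RPD; token headers refer to TPM.
        "limit_requests_per_day": int(h["x-ratelimit-limit-requests"]),
        "limit_tokens_per_minute": int(h["x-ratelimit-limit-tokens"]),
        "remaining_requests_per_day": int(h["x-ratelimit-remaining-requests"]),
        "remaining_tokens_per_minute": int(h["x-ratelimit-remaining-tokens"]),
        "reset_requests_seconds": parse_duration(h["x-ratelimit-reset-requests"]),
        "reset_tokens_seconds": parse_duration(h["x-ratelimit-reset-tokens"]),
    }
    # retry-after is only present on 429 responses.
    retry_after = h.get("retry-after")
    info["retry_after_seconds"] = float(retry_after) if retry_after is not None else None
    return info
```

Passing the headers as a plain mapping keeps the sketch independent of any particular HTTP client; most clients expose response headers in a form that can be converted to a dict.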

Handling Rate Limits

When you exceed rate limits, our API returns a 429 Too Many Requests HTTP status code.

Note: retry-after is only set if you hit the rate limit and status code 429 is returned. The other headers are always included.
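One common way to handle a 429 response is to wait for the number of seconds given in retry-after and then try again. The sketch below is a minimal example of that pattern, not an official client: `call_with_retry` and the `send` callable (assumed to return a status code, a dict of lowercase header names, and a body) are hypothetical, as are the default attempt count and fallback wait.

```python
import time


def call_with_retry(send, max_attempts=5, default_wait=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is a hypothetical callable returning (status_code, headers, body),
    where headers is a dict keyed by lowercase header name.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt < max_attempts:
            # retry-after is given in seconds; fall back to an assumed default
            # if the header is missing.
            sleep(float(headers.get("retry-after", default_wait)))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without actually waiting.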