Rate limits act as control measures to regulate how frequently users and applications can access our API within specified timeframes. These limits help ensure service stability, fair access, and protection against misuse so that we can serve reliable and fast inference for all.

Understanding Rate Limits

Rate limits are measured in:

  • RPM: Requests per minute
  • RPD: Requests per day
  • TPM: Tokens per minute
  • TPD: Tokens per day
  • ASH: Audio seconds per hour
  • ASD: Audio seconds per day

Rate limits apply at the organization level, not to individual users. You hit whichever limit you reach first.

Example: Suppose your RPM is 50 and your TPM is 200K. If you send 50 requests of only 100 tokens each within a minute, you reach your RPM limit even though those 50 requests total far fewer than 200K tokens.
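
The example can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `Limits` class and the `first_limit_hit` function are hypothetical names, and it assumes every request in the burst uses the same number of tokens.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    rpm: int  # requests per minute
    tpm: int  # tokens per minute


def first_limit_hit(limits, requests, tokens_per_request):
    """Return "RPM" or "TPM" for the limit a one-minute burst reaches first,
    or None if the burst stays under both limits."""
    total_tokens = requests * tokens_per_request
    # Fraction of each per-minute budget that the burst consumes.
    usage = {
        "RPM": requests / limits.rpm,
        "TPM": total_tokens / limits.tpm,
    }
    name, fraction = max(usage.items(), key=lambda item: item[1])
    return name if fraction >= 1 else None


limits = Limits(rpm=50, tpm=200_000)
# 50 requests of 100 tokens each: 5,000 tokens in total, but all 50 requests used.
print(first_limit_hit(limits, requests=50, tokens_per_request=100))  # RPM
```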

The following is a high-level summary; exceptions to these limits may apply. You can view the current, exact rate limits for your organization on the limits page in your account settings.

MODEL ID                        RPM   RPD      TPM      TPD       ASH     ASD
deepseek-r1-distill-llama-70b   30    1,000    6,000    -         -       -
deepseek-r1-distill-qwen-32b    30    1,000    6,000    -         -       -
distil-whisper-large-v3-en      20    2,000    -        -         7,200   28,800
gemma2-9b-it                    30    14,400   15,000   500,000   -       -
llama-3.1-8b-instant            30    14,400   6,000    500,000   -       -
llama-3.2-1b-preview            30    7,000    7,000    500,000   -       -
llama-3.2-3b-preview            30    7,000    7,000    500,000   -       -
llama-3.2-11b-vision-preview    30    7,000    7,000    500,000   -       -
llama-3.2-90b-vision-preview    15    3,500    7,000    250,000   -       -
llama-3.3-70b-specdec           30    1,000    6,000    100,000   -       -
llama-3.3-70b-versatile         30    1,000    6,000    100,000   -       -
llama-guard-3-8b                30    14,400   15,000   500,000   -       -
llama3-8b-8192                  30    14,400   6,000    500,000   -       -
llama3-70b-8192                 30    14,400   6,000    500,000   -       -
mistral-saba-24b                30    1,000    6,000    -         -       -
qwen-2.5-32b                    30    1,000    6,000    -         -       -
qwen-2.5-coder-32b              30    1,000    6,000    -         -       -
qwen-qwq-32b                    30    1,000    6,000    -         -       -
whisper-large-v3                20    2,000    -        -         7,200   28,800
whisper-large-v3-turbo          20    2,000    -        -         7,200   28,800

In addition to viewing your limits on your account's limits page, you can read rate limit information, such as remaining requests and tokens, from the HTTP response headers.

The following headers are set (values are illustrative):

Header                          Value     Notes
retry-after                     2         In seconds
x-ratelimit-limit-requests      14400     Always refers to Requests Per Day (RPD)
x-ratelimit-limit-tokens        18000     Always refers to Tokens Per Minute (TPM)
x-ratelimit-remaining-requests  14370     Always refers to Requests Per Day (RPD)
x-ratelimit-remaining-tokens    17997     Always refers to Tokens Per Minute (TPM)
x-ratelimit-reset-requests      2m59.56s  Always refers to Requests Per Day (RPD)
x-ratelimit-reset-tokens        7.66s     Always refers to Tokens Per Minute (TPM)
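
A client might turn these headers into numbers before deciding whether to send more work. The following is a minimal sketch, not an official client. The helper names `parse_duration` and `read_rate_limits` are hypothetical, and the duration format is inferred only from the sample values above (`2m59.56s`, `7.66s`); support for `h` and `ms` units is an assumption.

```python
import re

# Assumption: reset values are made of number+unit parts (h, m, s, ms),
# as suggested by the sample values "2m59.56s" and "7.66s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_duration(text):
    """Convert a string such as '2m59.56s' into seconds."""
    parts = _PART.findall(text)
    # Reject input that the pattern does not cover completely.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(num) * _SECONDS[unit] for num, unit in parts)


def read_rate_limits(headers):
    """Summarize the x-ratelimit-* headers of one response as a dict."""
    h = {name.lower(): value for name, value in headers.items()}
    return {
        # Request headers refer to RPD, token headers to TPM.
        "requests_limit": int(h["x-ratelimit-limit-requests"]),
        "requests_remaining": int(h["x-ratelimit-remaining-requests"]),
        "requests_reset_s": parse_duration(h["x-ratelimit-reset-requests"]),
        "tokens_limit": int(h["x-ratelimit-limit-tokens"]),
        "tokens_remaining": int(h["x-ratelimit-remaining-tokens"]),
        "tokens_reset_s": parse_duration(h["x-ratelimit-reset-tokens"]),
    }


# The illustrative values from the table above.
sample = {
    "x-ratelimit-limit-requests": "14400",
    "x-ratelimit-limit-tokens": "18000",
    "x-ratelimit-remaining-requests": "14370",
    "x-ratelimit-remaining-tokens": "17997",
    "x-ratelimit-reset-requests": "2m59.56s",
    "x-ratelimit-reset-tokens": "7.66s",
}
info = read_rate_limits(sample)
print(info["requests_remaining"], info["tokens_remaining"])  # 14370 17997
```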

When you exceed rate limits, our API returns a 429 Too Many Requests HTTP status code.

Note: retry-after is only set if you hit the rate limit and status code 429 is returned. The other headers are always included.
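
One common way to handle a 429 is to wait for the number of seconds given in retry-after and then try again. The sketch below illustrates that pattern under stated assumptions: `call_with_retry` and the `send` callable are hypothetical, `send` is assumed to return a `(status, headers, body)` tuple with lowercase header names, and the exponential fallback delay is a design choice of this example rather than documented API behavior.

```python
import time


def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is a zero-argument callable returning (status, headers, body).
    Returns (status, body) of the first response that is not a 429.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts:
            break
        # retry-after is given in seconds. If it is missing, fall back to
        # exponential backoff (2, 4, 8, ... seconds).
        delay = float(headers.get("retry-after", 2 ** attempt))
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Usage with a stand-in for a real HTTP call: one 429, then a success.
responses = iter([
    (429, {"retry-after": "2"}, None),
    (200, {}, "ok"),
])
waits = []
result = call_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)  # (200, 'ok') [2.0]
```

Passing `sleep` as a parameter keeps the function easy to test without real delays. In production code the default `time.sleep` would be used.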