Model prompts often contain repetitive content, such as system prompts and tool definitions. Prompt caching automatically reuses computation from recent requests when they share a common prefix, delivering significant cost savings and improved response times while maintaining data privacy through volatile-only storage that expires automatically.
Prompt caching works automatically on all your API requests with no code changes required and no additional fees.
1. **Prefix Matching**: When you send a request, the system identifies matching prefixes from recently processed requests stored temporarily in volatile memory. Prefixes can include system prompts, tool definitions, few-shot examples, and more.
2. **Cache Hit**: If a matching prefix is found, the cached computation is reused, dramatically reducing latency and cutting token costs by 50% for the cached portion.
3. **Cache Miss**: If no match exists, your prompt is processed normally, and the prefix is temporarily cached for potential future matches.
4. **Automatic Expiration**: All cached data expires automatically within a few hours, which helps ensure privacy while maintaining the benefits of caching.
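To see the miss-then-hit pattern in practice, you can send two requests that share the same system-prompt prefix and inspect `usage.prompt_tokens_details.cached_tokens` on each. This is a minimal sketch using the `groq-sdk` client shown later on this page; the prompts are illustrative and a cache hit on the second request is possible but not guaranteed:

```javascript
import Groq from "groq-sdk";

const groq = new Groq();

// Illustrative static system prompt that forms the shared prefix.
const SYSTEM_PROMPT =
  "You are a helpful AI assistant that provides detailed explanations about complex topics.";

async function showMissThenHit() {
  // First request: no matching prefix exists yet, so expect cached_tokens = 0.
  const first = await groq.chat.completions.create({
    model: "moonshotai/kimi-k2-instruct",
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: "What is quantum computing?" },
    ],
  });
  console.log("First request cached tokens:",
    first.usage?.prompt_tokens_details?.cached_tokens ?? 0);

  // Second request reuses the same prefix; if it hits the cache,
  // cached_tokens will be greater than zero.
  const second = await groq.chat.completions.create({
    model: "moonshotai/kimi-k2-instruct",
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: "What is quantum entanglement?" },
    ],
  });
  console.log("Second request cached tokens:",
    second.usage?.prompt_tokens_details?.cached_tokens ?? 0);
}

showMissThenHit().catch(console.error);
```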
Groq tries to maximize cache hits, but hits are not guaranteed; the pricing discount applies only to successful cache hits.
Prompt caching is currently only supported for the following models:
| Model ID | Model |
|---|---|
| moonshotai/kimi-k2-instruct | Kimi K2 |
We're starting with a limited selection of models and will roll out prompt caching to more models soon.
Prompt caching is provided at no additional cost. There is a 50% discount for cached input tokens.
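For example, if 1,920 of a request's 2,006 prompt tokens are cache hits (the numbers from the usage example later on this page), the 86 uncached input tokens are billed at the normal input rate, while the 1,920 cached tokens are billed at half that rate.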
Cache hits are only possible for exact prefix matches within a prompt. To realize caching benefits, you need to think strategically about prompt organization:
Place static content like instructions and examples at the beginning of your prompt, and put variable content, such as user-specific information, at the end. This maximizes the length of the reusable prefix across different requests.
If you put variable information (like timestamps or user IDs) at the beginning, even identical system instructions later in the prompt won't benefit from caching because the prefixes won't match.
Structure your prompt with static content first and dynamic content last:
[SYSTEM PROMPT - Static]
[TOOL DEFINITIONS - Static]
[FEW-SHOT EXAMPLES - Static]
[COMMON INSTRUCTIONS - Static]
[USER QUERY - Dynamic]
[SESSION DATA - Dynamic]
This structure maximizes the likelihood that the static prefix portion will match across different requests, enabling cache hits while keeping user-specific content at the end.
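A rough sketch of this layout with the `groq-sdk` client (the system prompt, few-shot example, and `answerUser` helper below are all hypothetical):

```javascript
import Groq from "groq-sdk";

const groq = new Groq();

// Static prefix: system prompt and few-shot examples, defined once and sent
// verbatim at the start of every request. (The content here is illustrative.)
const STATIC_MESSAGES = [
  {
    role: "system",
    content:
      "You are a customer support assistant. Keep answers under 200 words " +
      "and always cite the relevant policy section.",
  },
  { role: "user", content: "Example: How do I reset my password?" },
  {
    role: "assistant",
    content: "Go to Settings > Security > Reset Password. (Policy 4.2)",
  },
];

// Dynamic suffix: the user-specific query and session data go last, so only
// this part differs between requests.
async function answerUser(userQuery, sessionData) {
  const response = await groq.chat.completions.create({
    model: "moonshotai/kimi-k2-instruct",
    messages: [
      ...STATIC_MESSAGES,
      {
        role: "user",
        content: `${userQuery}\n\nSession: ${JSON.stringify(sessionData)}`,
      },
    ],
  });
  return response.choices[0].message.content;
}
```

The multi-turn conversation example below applies the same principle to a growing chat history: the system message and earlier turns form the shared, cacheable prefix, while each new user message is appended at the end.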
import Groq from "groq-sdk";
const groq = new Groq();
async function multiTurnConversation() {
// Initial conversation with system message and first user input
const initialMessages = [
{
role: "system",
content: "You are a helpful AI assistant that provides detailed explanations about complex topics. Always provide comprehensive answers with examples and context."
},
{
role: "user",
content: "What is quantum computing?"
}
];
// First request - creates cache for system message
const firstResponse = await groq.chat.completions.create({
messages: initialMessages,
model: "moonshotai/kimi-k2-instruct"
});
console.log("First response:", firstResponse.choices[0].message.content);
console.log("Usage:", firstResponse.usage);
// Continue conversation - system message and previous context will be cached
const conversationMessages = [
...initialMessages,
firstResponse.choices[0].message,
{
role: "user",
content: "Can you give me a simple example of how quantum superposition works?"
}
];
const secondResponse = await groq.chat.completions.create({
messages: conversationMessages,
model: "moonshotai/kimi-k2-instruct"
});
console.log("Second response:", secondResponse.choices[0].message.content);
console.log("Usage:", secondResponse.usage);
// Continue with third turn
const thirdTurnMessages = [
...conversationMessages,
secondResponse.choices[0].message,
{
role: "user",
content: "How does this relate to quantum entanglement?"
}
];
const thirdResponse = await groq.chat.completions.create({
messages: thirdTurnMessages,
model: "moonshotai/kimi-k2-instruct"
});
console.log("Third response:", thirdResponse.choices[0].message.content);
console.log("Usage:", thirdResponse.usage);
}
multiTurnConversation().catch(console.error);
In this example, we demonstrate how to use prompt caching in a multi-turn conversation.
During each turn, the system automatically caches the longest matching prefix from previous requests. The system message and conversation history that remain unchanged between requests will be cached, while only new user messages and assistant responses need fresh processing.
This approach is useful for maintaining context in ongoing conversations without repeatedly processing the same information.
For the first request:

- `prompt_tokens`: Number of tokens in the system message and first user message
- `cached_tokens`: 0 (no cache hit on first request)

For subsequent requests within the cache lifetime:

- `prompt_tokens`: Total number of tokens in the entire conversation (system message + conversation history + new user message)
- `cached_tokens`: Number of tokens in the system message and previous conversation history that were served from cache

When set up properly, you should see increasing cache efficiency as the conversation grows, with the system message and earlier conversation turns being served from cache while only new content requires processing.
To check how much of your prompt was cached, see the response usage fields.
You can monitor how many tokens are being served from cache by examining the `usage` field in your API response. The response includes detailed token usage information, including how many tokens were cached.
{
"id": "chatcmpl-...",
"model": "moonshotai/kimi-k2-instruct",
"usage": {
"prompt_tokens": 2006,
"completion_tokens": 300,
"total_tokens": 2306,
"prompt_tokens_details": {
"cached_tokens": 1920
},
"completion_tokens_details": {
"reasoning_tokens": 0,
"accepted_prediction_tokens": 0,
"rejected_prediction_tokens": 0
}
},
... other fields
}
- `prompt_tokens`: Total number of tokens in your input prompt
- `cached_tokens`: Number of input tokens that were served from cache (within `prompt_tokens_details`)
- `completion_tokens`: Number of tokens in the model's response
- `total_tokens`: Sum of prompt and completion tokens

In the example above, out of 2,006 prompt tokens, 1,920 tokens (95.7%) were served from cache, resulting in significant cost savings and improved response time.
To calculate your cache hit rate:
Cache Hit Rate = cached_tokens / prompt_tokens × 100%
For the example above: 1920 / 2006 × 100% = 95.7%
A higher cache hit rate indicates better prompt structure optimization, leading to lower latency and greater cost savings.
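As a sketch, you can compute this directly from the `usage` block of a response; the `cacheHitRate` helper below is illustrative and assumes the field names shown in the example above:

```javascript
// Illustrative helper: cache hit rate as a percentage of prompt tokens.
// cached_tokens may be absent when nothing was served from cache.
function cacheHitRate(usage) {
  const cached = usage?.prompt_tokens_details?.cached_tokens ?? 0;
  const prompt = usage?.prompt_tokens ?? 0;
  return prompt > 0 ? (cached / prompt) * 100 : 0;
}

// Using the numbers from the example above: 1920 / 2006 × 100 ≈ 95.7
const rate = cacheHitRate({
  prompt_tokens: 2006,
  prompt_tokens_details: { cached_tokens: 1920 },
});
console.log(`Cache hit rate: ${rate.toFixed(1)}%`); // Cache hit rate: 95.7%
```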
Cache hits require that `tool_choice`, tool usage, and image usage remain consistent between calls. Changes to cached sections, including `tool_choice` and image usage, will invalidate the cache and require a new cache to be created; subsequent calls will then use the new cache.
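One way to keep those sections consistent is to define them once and reuse them verbatim on every call. A minimal sketch, where the `get_weather` tool, the `TOOL_CHOICE` constant, and the `ask` helper are all hypothetical:

```javascript
import Groq from "groq-sdk";

const groq = new Groq();

// Hypothetical tool definition and tool choice, defined once and reused
// verbatim on every call.
const TOOLS = [
  {
    type: "function",
    function: {
      name: "get_weather", // illustrative tool name
      description: "Get the current weather for a city",
      parameters: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"],
      },
    },
  },
];
const TOOL_CHOICE = "auto";

async function ask(messages) {
  return groq.chat.completions.create({
    model: "moonshotai/kimi-k2-instruct",
    tools: TOOLS,             // same tool definitions on every call
    tool_choice: TOOL_CHOICE, // held constant so the prefix keeps matching
    messages,
  });
}
```

Editing `TOOLS` or `TOOL_CHOICE` between calls changes the prefix, so the next request pays the full uncached cost once while a new cache is created.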
All cached data exists only in volatile memory and automatically expires within a few hours. No prompt or response content is ever stored in persistent storage or shared between organizations.
Prompt caching does not affect response quality. It only affects the processing of the input prompt, not the generation of responses: the actual model inference and response generation occur normally, so output quality is identical whether caching is used or not.
Prompt caching is automatically enabled and cannot be manually disabled. This helps customers benefit from reduced costs and latency. Prompts are not stored in persistent storage.
You can track cache usage by examining the `usage` field in your API responses. Cache hits are not guaranteed, but Groq tries to maximize them. See the Tracking Cache Usage section above for detailed information on how to monitor cached tokens and calculate your cache hit rate.
Prompt caching does not increase your costs. It is provided at no additional cost and can reduce your costs by 50% for cached tokens while improving response times.
Cached tokens still count toward your rate limits, but the improved processing speed may allow you to achieve higher effective throughput within your limits.
No manual cache management is available. All cache expiration and cleanup happens automatically.
Prompt caching is not supported for batch requests. Batch requests already receive a 50% discount on all tokens, so caching is not applied to them.