Prompt Caching

Model prompts often contain repetitive content, such as system prompts and tool definitions. Prompt caching automatically reuses computation from recent requests when they share a common prefix, delivering significant cost savings and improved response times while maintaining data privacy through volatile-only storage that expires automatically.


Prompt caching works automatically on all your API requests to supported models, with no code changes required and no additional fees.

How It Works

  1. Prefix Matching: When you send a request, the system checks whether its prefix matches one from a recently processed request held temporarily in volatile memory. A prefix can include system prompts, tool definitions, few-shot examples, and more.

  2. Cache Hit: If a matching prefix is found, the cached computation is reused, dramatically reducing latency and cutting token costs by 50% for the cached portion.

  3. Cache Miss: If no match exists, your prompt is processed normally, with the prefix temporarily cached for potential future matches.

  4. Automatic Expiration: All cached data automatically expires within a few hours, which helps ensure privacy while maintaining the benefits.


Groq tries to maximize cache hits, but hits are not guaranteed. The pricing discount applies only to successful cache hits.
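
For example, two requests that share an identical system message can reuse the same cached prefix. The sketch below uses the SDK shown later on this page; the prompt text is illustrative, the static prefix must meet the model's minimum cacheable length, and a hit on the second call is likely but not guaranteed.

JavaScript
import Groq from "groq-sdk";

const groq = new Groq(); // API key is read from the GROQ_API_KEY environment variable

// A long, static system message shared by both requests forms the cacheable prefix.
const systemMessage = {
  role: "system",
  content: "You are a support assistant. Answer concisely and cite the relevant policy..." // illustrative
};

async function demoPrefixReuse() {
  // First request: cache miss - the prefix is processed normally and cached.
  const first = await groq.chat.completions.create({
    model: "moonshotai/kimi-k2-instruct",
    messages: [systemMessage, { role: "user", content: "How do I reset my password?" }],
  });
  console.log(first.usage?.prompt_tokens_details?.cached_tokens); // typically 0

  // Second request shortly after: the shared prefix can be served from cache.
  const second = await groq.chat.completions.create({
    model: "moonshotai/kimi-k2-instruct",
    messages: [systemMessage, { role: "user", content: "How do I change my email address?" }],
  });
  console.log(second.usage?.prompt_tokens_details?.cached_tokens); // > 0 on a cache hit
}

demoPrefixReuse().catch(console.error);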

Supported Models

Prompt caching is currently only supported for the following models:

Model ID                        Model
moonshotai/kimi-k2-instruct     Kimi K2

We're starting with a limited selection of models and will roll out prompt caching to more models soon.

Pricing

Prompt caching is provided at no additional cost, and cached input tokens are billed at a 50% discount.

Structuring Prompts for Optimal Caching

Cache hits are only possible for exact prefix matches within a prompt. To realize caching benefits, you need to think strategically about prompt organization:

Optimal Prompt Structure

Place static content like instructions and examples at the beginning of your prompt, and put variable content, such as user-specific information, at the end. This maximizes the length of the reusable prefix across different requests.


If you put variable information (like timestamps or user IDs) at the beginning, even identical system instructions later in the prompt won't benefit from caching because the prefixes won't match.


Place static content first:

  • System prompts and instructions
  • Few-shot examples
  • Tool definitions
  • Schema definitions
  • Common context or background information

Place dynamic content last:

  • User-specific queries
  • Variable data
  • Timestamps
  • Session-specific information
  • Unique identifiers

Example Structure

text
[SYSTEM PROMPT - Static]
[TOOL DEFINITIONS - Static]  
[FEW-SHOT EXAMPLES - Static]
[COMMON INSTRUCTIONS - Static]
[USER QUERY - Dynamic]
[SESSION DATA - Dynamic]

This structure maximizes the likelihood that the static prefix portion will match across different requests, enabling cache hits while keeping user-specific content at the end.
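
As a sketch of this structure using the SDK from the example below (the tool definition, prompt text, and helper name are illustrative, not part of the Groq API), the static system message and tool definitions are defined once and reused verbatim, while only the user query changes per request:

JavaScript
import Groq from "groq-sdk";

const groq = new Groq();

// Static content first: identical across requests, so it forms the cacheable prefix.
const staticMessages = [
  {
    role: "system",
    content: "You are a travel assistant. Follow the booking policies below...", // shared, static instructions
  },
  // Few-shot examples, schemas, and other shared context would also go here.
];

// Tool definitions are part of the prefix; keep them identical across requests.
const tools = [
  {
    type: "function",
    function: {
      name: "search_flights", // illustrative tool definition
      description: "Search available flights between two airports",
      parameters: {
        type: "object",
        properties: {
          origin: { type: "string" },
          destination: { type: "string" },
        },
        required: ["origin", "destination"],
      },
    },
  },
];

// Dynamic content last: only the user-specific query differs between requests.
async function ask(userQuery) {
  return groq.chat.completions.create({
    model: "moonshotai/kimi-k2-instruct",
    messages: [...staticMessages, { role: "user", content: userQuery }],
    tools,
  });
}

ask("Find a flight from SFO to JFK tomorrow.").catch(console.error);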

Prompt Caching Examples

import Groq from "groq-sdk";

const groq = new Groq(); // API key is read from the GROQ_API_KEY environment variable

async function multiTurnConversation() {
  // Initial conversation with system message and first user input
  const initialMessages = [
    {
      role: "system",
      content: "You are a helpful AI assistant that provides detailed explanations about complex topics. Always provide comprehensive answers with examples and context."
    },
    {
      role: "user",
      content: "What is quantum computing?"
    }
  ];

  // First request - creates cache for system message
  const firstResponse = await groq.chat.completions.create({
    messages: initialMessages,
    model: "moonshotai/kimi-k2-instruct"
  });

  console.log("First response:", firstResponse.choices[0].message.content);
  console.log("Usage:", firstResponse.usage);

  // Continue conversation - system message and previous context will be cached
  const conversationMessages = [
    ...initialMessages,
    firstResponse.choices[0].message,
    {
      role: "user",
      content: "Can you give me a simple example of how quantum superposition works?"
    }
  ];

  const secondResponse = await groq.chat.completions.create({
    messages: conversationMessages,
    model: "moonshotai/kimi-k2-instruct"
  });

  console.log("Second response:", secondResponse.choices[0].message.content);
  console.log("Usage:", secondResponse.usage);

  // Continue with third turn
  const thirdTurnMessages = [
    ...conversationMessages,
    secondResponse.choices[0].message,
    {
      role: "user",
      content: "How does this relate to quantum entanglement?"
    }
  ];

  const thirdResponse = await groq.chat.completions.create({
    messages: thirdTurnMessages,
    model: "moonshotai/kimi-k2-instruct"
  });

  console.log("Third response:", thirdResponse.choices[0].message.content);
  console.log("Usage:", thirdResponse.usage);
}

multiTurnConversation().catch(console.error);

How Prompt Caching Works in Multi-Turn Conversations

The example above demonstrates how prompt caching works across a multi-turn conversation.


During each turn, the system automatically caches the longest matching prefix from previous requests. The system message and conversation history that remain unchanged between requests will be cached, while only new user messages and assistant responses need fresh processing.


This approach is useful for maintaining context in ongoing conversations without repeatedly processing the same information.


For the first request:

  • prompt_tokens: Number of tokens in the system message and first user message
  • cached_tokens: 0 (no cache hit on first request)

For subsequent requests within the cache lifetime:

  • prompt_tokens: Total number of tokens in the entire conversation (system message + conversation history + new user message)
  • cached_tokens: Number of tokens in the system message and previous conversation history that were served from cache

When set up properly, you should see increasing cache efficiency as the conversation grows, with the system message and earlier conversation turns being served from cache while only new content requires processing.
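
One way to confirm this is to log cache usage per turn with a small helper (a sketch; the function name is illustrative), using the usage fields described under Tracking Cache Usage below:

JavaScript
// Log how much of a turn's prompt was served from cache vs. processed fresh.
function logCacheUsage(turn, response) {
  const usage = response.usage ?? {};
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  const fresh = (usage.prompt_tokens ?? 0) - cached;
  console.log(`Turn ${turn}: ${cached} prompt tokens from cache, ${fresh} processed fresh`);
}

// e.g. inside multiTurnConversation() above:
//   logCacheUsage(1, firstResponse);
//   logCacheUsage(2, secondResponse);
//   logCacheUsage(3, thirdResponse);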

Requirements and Limitations

Caching Requirements

  • Exact Prefix Matching: Cache hits require exact matches of the beginning of your prompt
  • Minimum Prompt Length: The minimum cacheable prompt length varies by model, ranging from 128 to 1024 tokens depending on the specific model used

To check how much of your prompt was cached, see the response usage fields.

What Can Be Cached

  • Complete message arrays including system, user, and assistant messages
  • Tool definitions and function schemas
  • System instructions and prompt templates
  • One-shot and few-shot examples
  • Structured output schemas
  • Large static content like legal documents, research papers, or extensive context that remains constant across multiple queries
  • Image inputs, including image URLs and base64-encoded images

Limitations

  • Exact Matching: Even minor changes in cached portions prevent cache hits and force a new cache to be created
  • No Manual Control: Cache expiration and cleanup are handled automatically; caches cannot be cleared or managed manually

Tracking Cache Usage

You can monitor how many tokens are being served from cache by examining the usage field in your API response. The response includes detailed token usage information, including how many tokens were cached.

Response Usage Structure

JSON
{
  "id": "chatcmpl-...",
  "model": "moonshotai/kimi-k2-instruct",
  "usage": {
    "prompt_tokens": 2006,
    "completion_tokens": 300,
    "total_tokens": 2306,
    "prompt_tokens_details": {
      "cached_tokens": 1920
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  ... other fields
}

Understanding the Fields

  • prompt_tokens: Total number of tokens in your input prompt
  • cached_tokens: Number of input tokens that were served from cache (within prompt_tokens_details)
  • completion_tokens: Number of tokens in the model's response
  • total_tokens: Sum of prompt and completion tokens

In the example above, out of 2,006 prompt tokens, 1,920 tokens (95.7%) were served from cache, resulting in significant cost savings and improved response time.

Calculating Cache Hit Rate

To calculate your cache hit rate:

Cache Hit Rate = cached_tokens / prompt_tokens × 100%

For the example above: 1920 / 2006 × 100% = 95.7%
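
The same calculation can be applied to any response's usage object in code (a sketch; the function name is illustrative):

JavaScript
// Cache hit rate as a percentage of prompt tokens served from cache.
function cacheHitRate(usage) {
  const cached = usage?.prompt_tokens_details?.cached_tokens ?? 0;
  const prompt = usage?.prompt_tokens ?? 0;
  return prompt > 0 ? (cached / prompt) * 100 : 0;
}

// For the usage object shown above: cacheHitRate(response.usage) ≈ 95.7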


A higher cache hit rate indicates better prompt structure optimization leading to lower latency and more cost savings.

Troubleshooting

  • Verify that sections that you want to cache are identical between requests
  • Check that calls are made within the cache lifetime (a few hours). Calls that are too far apart will not benefit from caching.
  • Ensure that tool_choice, tool usage, and image usage remain consistent between calls
  • Check the response usage fields to confirm that your prompt meets the model's minimum cacheable length and that tokens are actually being cached.

Changes to cached sections, including tool_choice and image usage, will invalidate the cache and require a new cache to be created. Subsequent calls will use the new cache.

Frequently Asked Questions

How is data privacy maintained with caching?

All cached data exists only in volatile memory and automatically expires within a few hours. No prompt or response content is ever stored in persistent storage or shared between organizations.

Does caching affect the quality or consistency of responses?

No. Prompt caching only affects the processing of the input prompt, not the generation of responses. The actual model inference and response generation occur normally, maintaining identical output quality whether caching is used or not.

Can I disable prompt caching?

Prompt caching is automatically enabled and cannot be manually disabled. This helps customers benefit from reduced costs and latency. Prompts are not stored in persistent storage.

How do I know if my requests are benefiting from caching?

You can track cache usage by examining the usage field in your API responses. Cache hits are not guaranteed, but Groq tries to maximize them. See the Tracking Cache Usage section above for detailed information on how to monitor cached tokens and calculate your cache hit rate.

Are there any additional costs for using prompt caching?

No. Prompt caching is provided at no additional cost and can help to reduce your costs by 50% for cached tokens while improving response times.

Does caching affect rate limits?

Cached tokens still count toward your rate limits, but the improved processing speed may allow you to achieve higher effective throughput within your limits.

Can I manually clear or refresh caches?

No manual cache management is available. All cache expiration and cleanup happens automatically.

Does the prompt caching discount work with batch requests?

No. Prompt caching is not applied to batch requests; batch requests already receive a 50% discount on all tokens.
