
Chat Completion Models

The Groq Chat Completions API takes a list of messages and returns a model-generated response. The models support both multi-turn conversations and tasks that need only a single exchange.


For details about the parameters, visit the reference page.
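
A multi-turn conversation is carried in the `messages` list: each request includes the earlier turns plus the new user message. The sketch below only builds that list and makes no API call; the helper name `add_turn` is hypothetical and not part of the Groq SDK, and the assistant reply is a placeholder.

```python
from typing import Dict, List

Message = Dict[str, str]


def add_turn(history: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history with one more message appended (hypothetical helper)."""
    return history + [{"role": role, "content": content}]


history: List[Message] = [{"role": "system", "content": "you are a helpful assistant."}]
history = add_turn(history, "user", "Explain the importance of fast language models")
# After each request, append the assistant's reply so the next turn has context.
history = add_turn(history, "assistant", "(model reply goes here)")
history = add_turn(history, "user", "Summarize that in one sentence")

print([message["role"] for message in history])
```

The resulting `history` list is what would be passed as `messages` on the next request.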

JSON mode (beta)

JSON mode is a beta feature that guarantees every chat completion it returns is valid JSON.

Usage:

  • Set "response_format": {"type": "json_object"} in your chat completion request
  • Add a description of the desired JSON structure within the system prompt (see below for example system prompts)

Recommendations for best beta results:

  • Mixtral performs best at generating JSON, followed by Gemma, then Llama
  • Use pretty-printed JSON instead of compact JSON
  • Keep prompts concise

Beta Limitations:

  • Does not support streaming
  • Does not support stop sequences

Error Code:

  • Groq will return a 400 error with an error code of json_validate_failed if JSON generation fails.
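
Because generation can fail, callers may want to tell a json_validate_failed response apart from other 400 errors. The sketch below inspects an already-parsed error body. The nested `{"error": {"code": ...}}` shape and the helper name `is_json_validate_failure` are assumptions for illustration, so check the actual response your request returns.

```python
from typing import Any, Dict


def is_json_validate_failure(status_code: int, body: Dict[str, Any]) -> bool:
    """True when a response looks like a JSON mode generation failure.

    Assumes (not confirmed by this page) that the error code sits at
    body["error"]["code"]; a top-level "code" key is accepted as well.
    """
    if status_code != 400:
        return False
    error = body.get("error")
    code = error.get("code") if isinstance(error, dict) else body.get("code")
    return code == "json_validate_failed"


# Hypothetical error payloads, used only to exercise the helper.
print(is_json_validate_failure(400, {"error": {"code": "json_validate_failed"}}))
print(is_json_validate_failure(400, {"error": {"code": "some_other_code"}}))
```

A caller could use such a check to decide whether to retry the request, for example with a shorter prompt.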

Example system prompts:


You are a legal advisor who summarizes documents in JSON.

You are a data analyst API capable of sentiment analysis that responds in JSON. The JSON schema should include:
{
  "sentiment_analysis": {
    "sentiment": "string (positive, negative, neutral)",
    "confidence_score": "number (0-1)"
    # Include additional fields as required
  }
}
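
A system prompt like the sentiment example can be assembled in code so that the schema is always pretty-printed, in line with the recommendation above. This is a minimal sketch using only the standard library; `build_system_prompt` and the `schema` dictionary are illustrative names, not part of the Groq SDK.

```python
import json

# Illustrative schema description mirroring the sentiment example.
schema = {
    "sentiment_analysis": {
        "sentiment": "string (positive, negative, neutral)",
        "confidence_score": "number (0-1)",
    }
}


def build_system_prompt(task: str, schema: dict) -> str:
    """Combine a task description with a pretty-printed JSON schema."""
    pretty = json.dumps(schema, indent=2)  # indent=2 gives pretty-printed JSON
    return f"{task} that responds in JSON. The JSON schema should include:\n{pretty}"


system_prompt = build_system_prompt(
    "You are a data analyst API capable of sentiment analysis", schema
)
print(system_prompt)
```

The returned string would be used as the content of the system message in a JSON mode request.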

Generating Chat Completions with the groq SDK

Code Overview

pip install groq

Performing a basic Chat Completion

from groq import Groq

client = Groq()

chat_completion = client.chat.completions.create(
    #
    # Required parameters
    #
    messages=[
        # Set an optional system message. This sets the behavior of the
        # assistant and can be used to provide specific instructions for
        # how it should behave throughout the conversation.
        {
            "role": "system",
            "content": "you are a helpful assistant."
        },
        # Set a user message for the assistant to respond to.
        {
            "role": "user",
            "content": "Explain the importance of fast language models",
        }
    ],

    # The language model which will generate the completion.
    model="llama-3.3-70b-versatile",

    #
    # Optional parameters
    #

    # Controls randomness: lowering results in less random completions.
    # As the temperature approaches zero, the model will become deterministic
    # and repetitive.
    temperature=0.5,

    # The maximum number of tokens to generate. The prompt and completion
    # together must fit within the model's context window.
    max_completion_tokens=1024,

    # Controls diversity via nucleus sampling: 0.5 means half of all
    # likelihood-weighted options are considered.
    top_p=1,

    # A stop sequence is a predefined or user-specified text string that
    # signals an AI to stop generating content, ensuring its responses
    # remain focused and concise. Examples include punctuation marks and
    # markers like "[end]".
    stop=None,

    # If set, partial message deltas will be sent.
    stream=False,
)

# Print the completion returned by the LLM.
print(chat_completion.choices[0].message.content)

Streaming a Chat Completion

To stream a completion, set the parameter stream=True. The call then returns an iterator of completion deltas rather than a single, full completion.


from groq import Groq

client = Groq()

stream = client.chat.completions.create(
    #
    # Required parameters
    #
    messages=[
        # Set an optional system message. This sets the behavior of the
        # assistant and can be used to provide specific instructions for
        # how it should behave throughout the conversation.
        {
            "role": "system",
            "content": "you are a helpful assistant."
        },
        # Set a user message for the assistant to respond to.
        {
            "role": "user",
            "content": "Explain the importance of fast language models",
        }
    ],

    # The language model which will generate the completion.
    model="llama-3.3-70b-versatile",

    #
    # Optional parameters
    #

    # Controls randomness: lowering results in less random completions.
    # As the temperature approaches zero, the model will become deterministic
    # and repetitive.
    temperature=0.5,

    # The maximum number of tokens to generate. The prompt and completion
    # together must fit within the model's context window.
    max_completion_tokens=1024,

    # Controls diversity via nucleus sampling: 0.5 means half of all
    # likelihood-weighted options are considered.
    top_p=1,

    # A stop sequence is a predefined or user-specified text string that
    # signals an AI to stop generating content, ensuring its responses
    # remain focused and concise. Examples include punctuation marks and
    # markers like "[end]".
    stop=None,

    # If set, partial message deltas will be sent.
    stream=True,
)

# Print the incremental deltas returned by the LLM.
# delta.content may be None (for example on the final chunk), so fall back
# to an empty string instead of printing "None".
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
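
Streamed deltas can also be accumulated into the full reply instead of being printed. This is a minimal sketch, assuming each chunk exposes `choices[0].delta.content` as in the loop above and that the content may be `None` on some chunks. `collect_stream` and `fake_chunk` are hypothetical helpers; the fake chunks stand in for a real stream so the snippet runs without an API key.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def collect_stream(stream: Iterable) -> str:
    """Join the text of all deltas, skipping chunks without content."""
    parts = []
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:  # content may be None or empty on some chunks
            parts.append(content)
    return "".join(parts)


def fake_chunk(text: Optional[str]) -> SimpleNamespace:
    """Build an object shaped like a streamed chunk (for illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_stream = [fake_chunk("Fast "), fake_chunk("models"), fake_chunk(None)]
print(collect_stream(demo_stream))
```

With a real request, the value returned by `client.chat.completions.create(..., stream=True)` would be passed in place of `demo_stream`.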

Performing a Chat Completion with a stop sequence

from groq import Groq

client = Groq()

chat_completion = client.chat.completions.create(
    #
    # Required parameters
    #
    messages=[
        # Set an optional system message. This sets the behavior of the
        # assistant and can be used to provide specific instructions for
        # how it should behave throughout the conversation.
        {
            "role": "system",
            "content": "you are a helpful assistant."
        },
        # Set a user message for the assistant to respond to.
        {
            "role": "user",
            "content": "Count to 10.  Your response must begin with \"1, \".  example: 1, 2, 3, ...",
        }
    ],

    # The language model which will generate the completion.
    model="llama-3.3-70b-versatile",

    #
    # Optional parameters
    #

    # Controls randomness: lowering results in less random completions.
    # As the temperature approaches zero, the model will become deterministic
    # and repetitive.
    temperature=0.5,

    # The maximum number of tokens to generate. The prompt and completion
    # together must fit within the model's context window.
    max_completion_tokens=1024,

    # Controls diversity via nucleus sampling: 0.5 means half of all
    # likelihood-weighted options are considered.
    top_p=1,

    # A stop sequence is a predefined or user-specified text string that
    # signals an AI to stop generating content, ensuring its responses
    # remain focused and concise. Examples include punctuation marks and
    # markers like "[end]".
    # For this example, we will use ", 6" so that the LLM stops counting at 5.
    # If multiple stop values are needed, a list of strings may be passed:
    # stop=[", 6", ", six", ", Six"]
    stop=", 6",

    # If set, partial message deltas will be sent.
    stream=False,
)

# Print the completion returned by the LLM.
print(chat_completion.choices[0].message.content)
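
The effect of a stop sequence can be illustrated locally: the output is cut at the first occurrence of any stop string, and the stop string itself is not returned. The sketch below is only a client-side illustration of that behavior, not how the service implements it; `apply_stop` is a hypothetical helper.

```python
from typing import List, Optional, Union


def apply_stop(text: str, stop: Optional[Union[str, List[str]]]) -> str:
    """Cut text at the earliest occurrence of any stop sequence.

    The stop sequence itself is not included in the result.
    """
    if stop is None:
        return text
    sequences = [stop] if isinstance(stop, str) else stop
    cut = len(text)
    for sequence in sequences:
        index = text.find(sequence) if sequence else -1  # ignore empty strings
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


full_count = "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
print(apply_stop(full_count, ", 6"))
```

Passing a list such as `[", 6", ", six", ", Six"]` cuts at whichever of the sequences appears first.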

Performing an Async Chat Completion

Use the AsyncGroq client to make requests with asyncio.


import asyncio

from groq import AsyncGroq


async def main():
    client = AsyncGroq()

    chat_completion = await client.chat.completions.create(
        #
        # Required parameters
        #
        messages=[
            # Set an optional system message. This sets the behavior of the
            # assistant and can be used to provide specific instructions for
            # how it should behave throughout the conversation.
            {
                "role": "system",
                "content": "you are a helpful assistant."
            },
            # Set a user message for the assistant to respond to.
            {
                "role": "user",
                "content": "Explain the importance of fast language models",
            }
        ],

        # The language model which will generate the completion.
        model="llama-3.3-70b-versatile",

        #
        # Optional parameters
        #

        # Controls randomness: lowering results in less random completions.
        # As the temperature approaches zero, the model will become
        # deterministic and repetitive.
        temperature=0.5,

        # The maximum number of tokens to generate. The prompt and completion
        # together must fit within the model's context window.
        max_completion_tokens=1024,

        # Controls diversity via nucleus sampling: 0.5 means half of all
        # likelihood-weighted options are considered.
        top_p=1,

        # A stop sequence is a predefined or user-specified text string that
        # signals an AI to stop generating content, ensuring its responses
        # remain focused and concise. Examples include punctuation marks and
        # markers like "[end]".
        stop=None,

        # If set, partial message deltas will be sent.
        stream=False,
    )

    # Print the completion returned by the LLM.
    print(chat_completion.choices[0].message.content)

asyncio.run(main())
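
One reason to use the async client is to run several requests concurrently. The sketch below uses `asyncio.gather` for that. `ask`, `ask_all`, and the stand-in `fake_client` are hypothetical names; the stand-in mimics only the `chat.completions.create` call shape used above so the snippet runs without an API key.

```python
import asyncio
from types import SimpleNamespace
from typing import List


async def ask(client, prompt: str, model: str) -> str:
    """Send one user prompt and return the reply text."""
    completion = await client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
    )
    return completion.choices[0].message.content


async def ask_all(client, prompts: List[str], model: str) -> List[str]:
    """Run all prompts concurrently; results keep the order of the prompts."""
    return list(await asyncio.gather(*(ask(client, p, model) for p in prompts)))


async def _fake_create(messages, model):
    """Stand-in for the API call: echoes the prompt (illustration only)."""
    message = SimpleNamespace(content="echo: " + messages[0]["content"])
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

print(asyncio.run(ask_all(fake_client, ["first", "second"], "demo-model")))
```

With a real client, `fake_client` would be replaced by `AsyncGroq()` and `"demo-model"` by a model name such as the one used above.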

Streaming an Async Chat Completion

import asyncio

from groq import AsyncGroq


async def main():
    client = AsyncGroq()

    stream = await client.chat.completions.create(
        #
        # Required parameters
        #
        messages=[
            # Set an optional system message. This sets the behavior of the
            # assistant and can be used to provide specific instructions for
            # how it should behave throughout the conversation.
            {
                "role": "system",
                "content": "you are a helpful assistant."
            },
            # Set a user message for the assistant to respond to.
            {
                "role": "user",
                "content": "Explain the importance of fast language models",
            }
        ],

        # The language model which will generate the completion.
        model="llama-3.3-70b-versatile",

        #
        # Optional parameters
        #

        # Controls randomness: lowering results in less random completions.
        # As the temperature approaches zero, the model will become
        # deterministic and repetitive.
        temperature=0.5,

        # The maximum number of tokens to generate. The prompt and completion
        # together must fit within the model's context window.
        max_completion_tokens=1024,

        # Controls diversity via nucleus sampling: 0.5 means half of all
        # likelihood-weighted options are considered.
        top_p=1,

        # A stop sequence is a predefined or user-specified text string that
        # signals an AI to stop generating content, ensuring its responses
        # remain focused and concise. Examples include punctuation marks and
        # markers like "[end]".
        stop=None,

        # If set, partial message deltas will be sent.
        stream=True,
    )

    # Print the incremental deltas returned by the LLM.
    # delta.content may be None (for example on the final chunk), so fall
    # back to an empty string instead of printing "None".
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")

asyncio.run(main())

JSON Mode

from typing import List, Optional
import json

from pydantic import BaseModel
from groq import Groq

groq = Groq()


# Data model for LLM to generate
class Ingredient(BaseModel):
    name: str
    quantity: str
    # Default to None so validation still passes when the unit is omitted.
    quantity_unit: Optional[str] = None


class Recipe(BaseModel):
    recipe_name: str
    ingredients: List[Ingredient]
    directions: List[str]


def get_recipe(recipe_name: str) -> Recipe:
    chat_completion = groq.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You are a recipe database that outputs recipes in JSON.\n"
                # Pass the json schema to the model. Pretty printing improves results.
                f" The JSON object must use the schema: {json.dumps(Recipe.model_json_schema(), indent=2)}",
            },
            {
                "role": "user",
                "content": f"Fetch a recipe for {recipe_name}",
            },
        ],
        model="llama3-70b-8192",
        temperature=0,
        # Streaming is not supported in JSON mode
        stream=False,
        # Enable JSON mode by setting the response format
        response_format={"type": "json_object"},
    )
    # Parse the returned JSON string and validate it against the Recipe model.
    return Recipe.model_validate_json(chat_completion.choices[0].message.content)


def print_recipe(recipe: Recipe):
    print("Recipe:", recipe.recipe_name)

    print("\nIngredients:")
    for ingredient in recipe.ingredients:
        print(
            f"- {ingredient.name}: {ingredient.quantity} {ingredient.quantity_unit or ''}"
        )
    print("\nDirections:")
    for step, direction in enumerate(recipe.directions, start=1):
        print(f"{step}. {direction}")


recipe = get_recipe("apple pie")
print_recipe(recipe)