Documentation
Chat Completion Models
The Groq Chat Completions API takes a series of messages and generates a model response. Chat models can handle multi-turn conversations as well as tasks that require only a single interaction.
For details about the parameters, visit the reference page.
JSON mode (beta)
JSON mode is a beta feature that guarantees all chat completions are valid JSON.
Usage:
- Set "response_format": {"type": "json_object"} in your chat completion request
- Add a description of the desired JSON structure within the system prompt (see below for example system prompts)
Recommendations for best beta results:
- Mixtral performs best at generating JSON, followed by Gemma, then Llama
- Use pretty-printed JSON instead of compact JSON
- Keep prompts concise
Beta Limitations:
- Does not support streaming
- Does not support stop sequences
Error Code:
- Groq will return a 400 error with an error code of json_validate_failed if JSON generation fails.
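A minimal sketch of catching this failure from the Python SDK is shown below; it assumes the SDK's BadRequestError exception class and the llama-3.3-70b-versatile model, so adjust both to your setup.

import json

from groq import Groq, BadRequestError

client = Groq()

try:
    completion = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {
                "role": "system",
                "content": "You are a legal advisor who summarizes documents in JSON",
            },
            {
                "role": "user",
                "content": "Summarize this clause: the tenant must give 30 days notice.",
            },
        ],
        # Enable JSON mode by setting the response format.
        response_format={"type": "json_object"},
    )
    print(json.loads(completion.choices[0].message.content))
except BadRequestError as err:
    # A failed generation surfaces as a 400 error with code json_validate_failed.
    print("JSON generation failed:", err)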
Example system prompts:
You are a legal advisor who summarizes documents in JSON
You are a data analyst API capable of sentiment analysis that responds in JSON. The JSON schema should include
{
  "sentiment_analysis": {
    "sentiment": "string (positive, negative, neutral)",
    "confidence_score": "number (0-1)"
    # Include additional fields as required
  }
}
Generating Chat Completions with the groq SDK
Code Overview
pip install groq
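Before making requests, the client needs an API key. The setup sketch below is a minimal illustration: by default the SDK reads the GROQ_API_KEY environment variable, and passing api_key explicitly is optional.

import os

from groq import Groq

# Reads the GROQ_API_KEY environment variable by default.
client = Groq()

# Or pass the key explicitly, e.g. when it is stored elsewhere.
client = Groq(api_key=os.environ.get("GROQ_API_KEY"))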
Performing a basic Chat Completion
from groq import Groq

client = Groq()

chat_completion = client.chat.completions.create(
    #
    # Required parameters
    #
    messages=[
        # Set an optional system message. This sets the behavior of the
        # assistant and can be used to provide specific instructions for
        # how it should behave throughout the conversation.
        {
            "role": "system",
            "content": "you are a helpful assistant."
        },
        # Set a user message for the assistant to respond to.
        {
            "role": "user",
            "content": "Explain the importance of fast language models",
        }
    ],

    # The language model which will generate the completion.
    model="llama-3.3-70b-versatile",

    #
    # Optional parameters
    #

    # Controls randomness: lowering results in less random completions.
    # As the temperature approaches zero, the model will become deterministic
    # and repetitive.
    temperature=0.5,

    # The maximum number of tokens to generate. Requests can use up to
    # 32,768 tokens shared between prompt and completion.
    max_completion_tokens=1024,

    # Controls diversity via nucleus sampling: 0.5 means half of all
    # likelihood-weighted options are considered.
    top_p=1,

    # A stop sequence is a predefined or user-specified text string that
    # signals an AI to stop generating content, ensuring its responses
    # remain focused and concise. Examples include punctuation marks and
    # markers like "[end]".
    stop=None,

    # If set, partial message deltas will be sent.
    stream=False,
)

# Print the completion returned by the LLM.
print(chat_completion.choices[0].message.content)
Streaming a Chat Completion
To stream a completion, simply set the parameter stream=True. The completion function will then return an iterator of completion deltas rather than a single, full completion.
from groq import Groq

client = Groq()

stream = client.chat.completions.create(
    #
    # Required parameters
    #
    messages=[
        # Set an optional system message. This sets the behavior of the
        # assistant and can be used to provide specific instructions for
        # how it should behave throughout the conversation.
        {
            "role": "system",
            "content": "you are a helpful assistant."
        },
        # Set a user message for the assistant to respond to.
        {
            "role": "user",
            "content": "Explain the importance of fast language models",
        }
    ],

    # The language model which will generate the completion.
    model="llama-3.3-70b-versatile",

    #
    # Optional parameters
    #

    # Controls randomness: lowering results in less random completions.
    # As the temperature approaches zero, the model will become deterministic
    # and repetitive.
    temperature=0.5,

    # The maximum number of tokens to generate. Requests can use up to
    # 2048 tokens shared between prompt and completion.
    max_completion_tokens=1024,

    # Controls diversity via nucleus sampling: 0.5 means half of all
    # likelihood-weighted options are considered.
    top_p=1,

    # A stop sequence is a predefined or user-specified text string that
    # signals an AI to stop generating content, ensuring its responses
    # remain focused and concise. Examples include punctuation marks and
    # markers like "[end]".
    stop=None,

    # If set, partial message deltas will be sent.
    stream=True,
)

# Print the incremental deltas returned by the LLM.
for chunk in stream:
    print(chunk.choices[0].delta.content, end="")
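If you need the full response text in addition to the live output, you can accumulate the deltas as you iterate. The loop below is a minimal alternative to the one above; the final chunk's delta.content may be None, so it is guarded.

# Collect the streamed deltas into one string while printing them.
full_text = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="")
        full_text += delta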
Performing a Chat Completion with a stop sequence
from groq import Groq

client = Groq()

chat_completion = client.chat.completions.create(
    #
    # Required parameters
    #
    messages=[
        # Set an optional system message. This sets the behavior of the
        # assistant and can be used to provide specific instructions for
        # how it should behave throughout the conversation.
        {
            "role": "system",
            "content": "you are a helpful assistant."
        },
        # Set a user message for the assistant to respond to.
        {
            "role": "user",
            "content": "Count to 10. Your response must begin with \"1, \". example: 1, 2, 3, ...",
        }
    ],

    # The language model which will generate the completion.
    model="llama-3.3-70b-versatile",

    #
    # Optional parameters
    #

    # Controls randomness: lowering results in less random completions.
    # As the temperature approaches zero, the model will become deterministic
    # and repetitive.
    temperature=0.5,

    # The maximum number of tokens to generate. Requests can use up to
    # 2048 tokens shared between prompt and completion.
    max_completion_tokens=1024,

    # Controls diversity via nucleus sampling: 0.5 means half of all
    # likelihood-weighted options are considered.
    top_p=1,

    # A stop sequence is a predefined or user-specified text string that
    # signals an AI to stop generating content, ensuring its responses
    # remain focused and concise. Examples include punctuation marks and
    # markers like "[end]".
    # For this example, we will use ", 6" so that the LLM stops counting at 5.
    # If multiple stop values are needed, an array of strings may be passed,
    # stop=[", 6", ", six", ", Six"]
    stop=", 6",

    # If set, partial message deltas will be sent.
    stream=False,
)

# Print the completion returned by the LLM.
print(chat_completion.choices[0].message.content)
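With stop=", 6", generation halts as soon as the model produces the stop sequence, so the completion should end at "5"; the stop sequence itself is not included in the returned content.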
Performing an Async Chat Completion
Simply use the async client (AsyncGroq) to enable asyncio.
import asyncio

from groq import AsyncGroq


async def main():
    client = AsyncGroq()

    chat_completion = await client.chat.completions.create(
        #
        # Required parameters
        #
        messages=[
            # Set an optional system message. This sets the behavior of the
            # assistant and can be used to provide specific instructions for
            # how it should behave throughout the conversation.
            {
                "role": "system",
                "content": "you are a helpful assistant."
            },
            # Set a user message for the assistant to respond to.
            {
                "role": "user",
                "content": "Explain the importance of fast language models",
            }
        ],

        # The language model which will generate the completion.
        model="llama-3.3-70b-versatile",

        #
        # Optional parameters
        #

        # Controls randomness: lowering results in less random completions.
        # As the temperature approaches zero, the model will become
        # deterministic and repetitive.
        temperature=0.5,

        # The maximum number of tokens to generate. Requests can use up to
        # 2048 tokens shared between prompt and completion.
        max_completion_tokens=1024,

        # Controls diversity via nucleus sampling: 0.5 means half of all
        # likelihood-weighted options are considered.
        top_p=1,

        # A stop sequence is a predefined or user-specified text string that
        # signals an AI to stop generating content, ensuring its responses
        # remain focused and concise. Examples include punctuation marks and
        # markers like "[end]".
        stop=None,

        # If set, partial message deltas will be sent.
        stream=False,
    )

    # Print the completion returned by the LLM.
    print(chat_completion.choices[0].message.content)

asyncio.run(main())
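The async client also makes it straightforward to issue several completions concurrently. The sketch below is a minimal illustration using asyncio.gather; the prompts are arbitrary examples and the model name matches the one used above.

import asyncio

from groq import AsyncGroq


async def ask(client: AsyncGroq, prompt: str) -> str:
    chat_completion = await client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="llama-3.3-70b-versatile",
    )
    return chat_completion.choices[0].message.content


async def main():
    client = AsyncGroq()
    # Both requests are awaited concurrently rather than one after the other.
    answers = await asyncio.gather(
        ask(client, "Explain the importance of fast language models"),
        ask(client, "Explain what a stop sequence is"),
    )
    for answer in answers:
        print(answer)

asyncio.run(main())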
Streaming an Async Chat Completion
import asyncio

from groq import AsyncGroq


async def main():
    client = AsyncGroq()

    stream = await client.chat.completions.create(
        #
        # Required parameters
        #
        messages=[
            # Set an optional system message. This sets the behavior of the
            # assistant and can be used to provide specific instructions for
            # how it should behave throughout the conversation.
            {
                "role": "system",
                "content": "you are a helpful assistant."
            },
            # Set a user message for the assistant to respond to.
            {
                "role": "user",
                "content": "Explain the importance of fast language models",
            }
        ],

        # The language model which will generate the completion.
        model="llama-3.3-70b-versatile",

        #
        # Optional parameters
        #

        # Controls randomness: lowering results in less random completions.
        # As the temperature approaches zero, the model will become
        # deterministic and repetitive.
        temperature=0.5,

        # The maximum number of tokens to generate. Requests can use up to
        # 2048 tokens shared between prompt and completion.
        max_completion_tokens=1024,

        # Controls diversity via nucleus sampling: 0.5 means half of all
        # likelihood-weighted options are considered.
        top_p=1,

        # A stop sequence is a predefined or user-specified text string that
        # signals an AI to stop generating content, ensuring its responses
        # remain focused and concise. Examples include punctuation marks and
        # markers like "[end]".
        stop=None,

        # If set, partial message deltas will be sent.
        stream=True,
    )

    # Print the incremental deltas returned by the LLM.
    async for chunk in stream:
        print(chunk.choices[0].delta.content, end="")

asyncio.run(main())
JSON Mode
from typing import List, Optional
import json

from pydantic import BaseModel
from groq import Groq

groq = Groq()


# Data model for LLM to generate
class Ingredient(BaseModel):
    name: str
    quantity: str
    quantity_unit: Optional[str]


class Recipe(BaseModel):
    recipe_name: str
    ingredients: List[Ingredient]
    directions: List[str]


def get_recipe(recipe_name: str) -> Recipe:
    chat_completion = groq.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You are a recipe database that outputs recipes in JSON.\n"
                # Pass the json schema to the model. Pretty printing improves results.
                f" The JSON object must use the schema: {json.dumps(Recipe.model_json_schema(), indent=2)}",
            },
            {
                "role": "user",
                "content": f"Fetch a recipe for {recipe_name}",
            },
        ],
        model="llama3-70b-8192",
        temperature=0,
        # Streaming is not supported in JSON mode
        stream=False,
        # Enable JSON mode by setting the response format
        response_format={"type": "json_object"},
    )
    return Recipe.model_validate_json(chat_completion.choices[0].message.content)


def print_recipe(recipe: Recipe):
    print("Recipe:", recipe.recipe_name)

    print("\nIngredients:")
    for ingredient in recipe.ingredients:
        print(
            f"- {ingredient.name}: {ingredient.quantity} {ingredient.quantity_unit or ''}"
        )
    print("\nDirections:")
    for step, direction in enumerate(recipe.directions, start=1):
        print(f"{step}. {direction}")


recipe = get_recipe("apple pie")
print_recipe(recipe)
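If the model fails to produce valid JSON, the request fails with the json_validate_failed error described above; if it returns JSON that parses but does not match the schema, Recipe.model_validate_json raises a pydantic ValidationError, which you may want to catch.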