Groq's Chat Completions API lets you build natural, conversational interactions with Groq's large language models. It processes a series of messages and generates human-like responses for applications such as conversational agents, content generation, task automation, and producing structured data outputs like JSON.
Chat completions allow your applications to have dynamic interactions with Groq's models. You can send messages that include user inputs and system instructions, and receive responses that match the conversational context.
Chat models can handle both multi-turn discussions (conversations with multiple back-and-forth exchanges) and single-turn tasks where you need just one response.
For details about all available parameters, visit the API reference page.
To start using Groq's Chat Completions API, you'll need to install the Groq SDK and set up your API key.
pip install groq
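By default, the SDK reads your API key from the GROQ_API_KEY environment variable, so exporting it is usually all the setup you need; you can also pass the key to the client explicitly. The key value below is a placeholder.

export GROQ_API_KEY="your-api-key-here"

import os
from groq import Groq

# Reads GROQ_API_KEY from the environment by default.
client = Groq()

# Or pass the key explicitly.
client = Groq(api_key=os.environ.get("GROQ_API_KEY"))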
The simplest way to use the Chat Completions API is to send a list of messages and receive a single response. Messages are provided in chronological order, with each message containing a role ("system", "user", or "assistant") and content.
from groq import Groq

client = Groq()

chat_completion = client.chat.completions.create(
    messages=[
        # Set an optional system message. This sets the behavior of the
        # assistant and can be used to provide specific instructions for
        # how it should behave throughout the conversation.
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        # Set a user message for the assistant to respond to.
        {
            "role": "user",
            "content": "Explain the importance of fast language models",
        }
    ],
    # The language model which will generate the completion.
    model="llama-3.3-70b-versatile"
)

# Print the completion returned by the LLM.
print(chat_completion.choices[0].message.content)
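The same call also supports multi-turn conversations: include earlier assistant replies in the messages list so the model sees the full exchange. A minimal sketch, where the assistant's earlier reply and the follow-up question are illustrative:

from groq import Groq

client = Groq()

# Build the conversation history, including the model's earlier reply,
# so the follow-up question is answered in context.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the importance of fast language models"},
    {"role": "assistant", "content": "Fast language models reduce latency, which makes real-time applications like chat and voice assistants feel responsive."},
    {"role": "user", "content": "Summarize that in one sentence."},
]

chat_completion = client.chat.completions.create(
    messages=messages,
    model="llama-3.3-70b-versatile",
)

print(chat_completion.choices[0].message.content)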
For a more responsive user experience, you can stream the model's response in real-time. This allows your application to display the response as it's being generated, rather than waiting for the complete response.
To enable streaming, set the parameter stream=True. The completion function will then return an iterator of completion deltas rather than a single, full completion.
from groq import Groq

client = Groq()

stream = client.chat.completions.create(
    #
    # Required parameters
    #
    messages=[
        # Set an optional system message. This sets the behavior of the
        # assistant and can be used to provide specific instructions for
        # how it should behave throughout the conversation.
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        # Set a user message for the assistant to respond to.
        {
            "role": "user",
            "content": "Explain the importance of fast language models",
        }
    ],
    # The language model which will generate the completion.
    model="llama-3.3-70b-versatile",
    #
    # Optional parameters
    #
    # Controls randomness: lowering results in less random completions.
    # As the temperature approaches zero, the model will become deterministic
    # and repetitive.
    temperature=0.5,
    # The maximum number of tokens to generate. Requests can use up to
    # 2048 tokens shared between prompt and completion.
    max_completion_tokens=1024,
    # Controls diversity via nucleus sampling: 0.5 means half of all
    # likelihood-weighted options are considered.
    top_p=1,
    # A stop sequence is a predefined or user-specified text string that
    # signals an AI to stop generating content, ensuring its responses
    # remain focused and concise. Examples include punctuation marks and
    # markers like "[end]".
    stop=None,
    # If set, partial message deltas will be sent.
    stream=True,
)
# Print the incremental deltas returned by the LLM.
for chunk in stream:
    # The final chunk's delta may carry no content, so guard against None.
    print(chunk.choices[0].delta.content or "", end="")
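If your application needs the complete text as well as the incremental display, you can accumulate the deltas while streaming. A minimal sketch of an alternative consumption loop for the same stream:

# Collect the streamed text while still displaying it incrementally.
response_text = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    response_text += delta
    print(delta, end="")

# The full completion is now available for further processing.
print("\nTotal characters received:", len(response_text))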
Stop sequences allow you to control where the model should stop generating. When the model encounters any of the specified stop sequences, it will halt generation at that point. This is useful when you need responses to end at specific points.
from groq import Groq

client = Groq()

chat_completion = client.chat.completions.create(
    #
    # Required parameters
    #
    messages=[
        # Set an optional system message. This sets the behavior of the
        # assistant and can be used to provide specific instructions for
        # how it should behave throughout the conversation.
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        # Set a user message for the assistant to respond to.
        {
            "role": "user",
            "content": "Count to 10. Your response must begin with \"1, \". example: 1, 2, 3, ...",
        }
    ],
    # The language model which will generate the completion.
    model="llama-3.3-70b-versatile",
    #
    # Optional parameters
    #
    # Controls randomness: lowering results in less random completions.
    # As the temperature approaches zero, the model will become deterministic
    # and repetitive.
    temperature=0.5,
    # The maximum number of tokens to generate. Requests can use up to
    # 2048 tokens shared between prompt and completion.
    max_completion_tokens=1024,
    # Controls diversity via nucleus sampling: 0.5 means half of all
    # likelihood-weighted options are considered.
    top_p=1,
    # A stop sequence is a predefined or user-specified text string that
    # signals an AI to stop generating content, ensuring its responses
    # remain focused and concise. Examples include punctuation marks and
    # markers like "[end]".
    # For this example, we will use ", 6" so that the LLM stops counting at 5.
    # If multiple stop values are needed, an array of strings may be passed,
    # for example: stop=[", 6", ", six", ", Six"]
    stop=", 6",
    # If set, partial message deltas will be sent.
    stream=False,
)

# Print the completion returned by the LLM.
print(chat_completion.choices[0].message.content)
For applications that need to maintain responsiveness while waiting for completions, you can use the asynchronous client. This lets you make non-blocking API calls using Python's asyncio framework.
import asyncio

from groq import AsyncGroq


async def main():
    client = AsyncGroq()

    chat_completion = await client.chat.completions.create(
        #
        # Required parameters
        #
        messages=[
            # Set an optional system message. This sets the behavior of the
            # assistant and can be used to provide specific instructions for
            # how it should behave throughout the conversation.
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            # Set a user message for the assistant to respond to.
            {
                "role": "user",
                "content": "Explain the importance of fast language models",
            }
        ],
        # The language model which will generate the completion.
        model="llama-3.3-70b-versatile",
        #
        # Optional parameters
        #
        # Controls randomness: lowering results in less random completions.
        # As the temperature approaches zero, the model will become
        # deterministic and repetitive.
        temperature=0.5,
        # The maximum number of tokens to generate. Requests can use up to
        # 2048 tokens shared between prompt and completion.
        max_completion_tokens=1024,
        # Controls diversity via nucleus sampling: 0.5 means half of all
        # likelihood-weighted options are considered.
        top_p=1,
        # A stop sequence is a predefined or user-specified text string that
        # signals an AI to stop generating content, ensuring its responses
        # remain focused and concise. Examples include punctuation marks and
        # markers like "[end]".
        stop=None,
        # If set, partial message deltas will be sent.
        stream=False,
    )

    # Print the completion returned by the LLM.
    print(chat_completion.choices[0].message.content)


asyncio.run(main())
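Because the async client is non-blocking, you can also issue several completions concurrently, for example with asyncio.gather. A minimal sketch, where the questions are illustrative:

import asyncio

from groq import AsyncGroq


async def ask(client, question):
    # Each call awaits its own completion without blocking the others.
    chat_completion = await client.chat.completions.create(
        messages=[{"role": "user", "content": question}],
        model="llama-3.3-70b-versatile",
    )
    return chat_completion.choices[0].message.content


async def main():
    client = AsyncGroq()
    questions = [
        "Explain the importance of fast language models",
        "What is low-latency inference?",
    ]
    # Run the requests concurrently and wait for all results.
    answers = await asyncio.gather(*(ask(client, q) for q in questions))
    for question, answer in zip(questions, answers):
        print(f"Q: {question}\nA: {answer}\n")


asyncio.run(main())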
You can combine the benefits of streaming and asynchronous processing by streaming completions asynchronously. This is particularly useful for applications that need to handle multiple concurrent conversations.
import asyncio

from groq import AsyncGroq


async def main():
    client = AsyncGroq()

    stream = await client.chat.completions.create(
        #
        # Required parameters
        #
        messages=[
            # Set an optional system message. This sets the behavior of the
            # assistant and can be used to provide specific instructions for
            # how it should behave throughout the conversation.
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            # Set a user message for the assistant to respond to.
            {
                "role": "user",
                "content": "Explain the importance of fast language models",
            }
        ],
        # The language model which will generate the completion.
        model="llama-3.3-70b-versatile",
        #
        # Optional parameters
        #
        # Controls randomness: lowering results in less random completions.
        # As the temperature approaches zero, the model will become
        # deterministic and repetitive.
        temperature=0.5,
        # The maximum number of tokens to generate. Requests can use up to
        # 2048 tokens shared between prompt and completion.
        max_completion_tokens=1024,
        # Controls diversity via nucleus sampling: 0.5 means half of all
        # likelihood-weighted options are considered.
        top_p=1,
        # A stop sequence is a predefined or user-specified text string that
        # signals an AI to stop generating content, ensuring its responses
        # remain focused and concise. Examples include punctuation marks and
        # markers like "[end]".
        stop=None,
        # If set, partial message deltas will be sent.
        stream=True,
    )
    # Print the incremental deltas returned by the LLM.
    async for chunk in stream:
        # The final chunk's delta may carry no content, so guard against None.
        print(chunk.choices[0].delta.content or "", end="")


asyncio.run(main())
Need reliable, type-safe JSON responses that match your exact schema? Groq's Structured Outputs feature guarantees that model responses strictly conform to your JSON Schema, so you don't need to write your own validation or retry logic.
For complete guides on implementing structured outputs with JSON Schema or using JSON Object Mode, see our structured outputs documentation.
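As a quick illustration, a minimal sketch of JSON Object Mode, which constrains the response to valid JSON; the requested keys are illustrative, and the prompt must explicitly ask for JSON output when this mode is used. See the structured outputs documentation for the full details, including JSON Schema support.

import json

from groq import Groq

client = Groq()

chat_completion = client.chat.completions.create(
    messages=[
        # When using JSON Object Mode, the prompt must ask for JSON output.
        {
            "role": "system",
            "content": "You are a helpful assistant. Respond only with a JSON object containing the keys 'title' and 'summary'."
        },
        {
            "role": "user",
            "content": "Explain the importance of fast language models"
        }
    ],
    model="llama-3.3-70b-versatile",
    # Constrain the response to valid JSON.
    response_format={"type": "json_object"},
)

# The content is a JSON string that can be parsed directly.
result = json.loads(chat_completion.choices[0].message.content)
print(result)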
Key capabilities: