Groq API offers fast, low-latency inference for multimodal models with vision capabilities, which can understand and interpret visual data from images. By analyzing the content of an image, these models can generate human-readable text that provides insights about it.
Groq API supports powerful multimodal models that can be easily integrated into your applications for fast, accurate image processing tasks such as visual question answering, caption generation, and Optical Character Recognition (OCR).
Note: Images are billed at 6,400 tokens per image.
Groq API supports two vision models: llama-3.2-90b-vision-preview and llama-3.2-11b-vision-preview. To use them, call the chat.completions API endpoint (i.e. https://api.groq.com/openai/v1/chat/completions) and set the model parameter to llama-3.2-90b-vision-preview or llama-3.2-11b-vision-preview. See the code examples below.
The following code example passes an image to the model via a URL:
from groq import Groq

client = Groq()
completion = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/f/f2/LPU-v1-die.jpg"
                    }
                }
            ]
        }
    ],
    temperature=1,
    max_completion_tokens=1024,
    top_p=1,
    stream=False,
    stop=None,
)

print(completion.choices[0].message)
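The example above sets stream=False; the same request also works with streaming enabled, so you can print tokens as they arrive rather than waiting for the full completion. A minimal sketch with stream=True, following the same OpenAI-compatible streaming interface:
from groq import Groq

client = Groq()
stream = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/f/f2/LPU-v1-die.jpg"
                    },
                },
            ],
        }
    ],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a token delta; content may be None on the final chunk
    print(chunk.choices[0].delta.content or "", end="")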
To pass a locally saved image, we first need to encode it as a base64 string before passing it as the image_url in our API request, as follows:
from groq import Groq
import base64


# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


# Path to your image
image_path = "sf.jpg"

# Getting the base64 string
base64_image = encode_image(image_path)

client = Groq()

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}",
                    },
                },
            ],
        }
    ],
    model="llama-3.2-11b-vision-preview",
)

print(chat_completion.choices[0].message.content)
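The data:image/jpeg;base64,... prefix above assumes a JPEG. If your local images may be PNGs or other formats, you can build the data URL with the correct MIME type using Python's standard mimetypes module. A minimal sketch (the image/jpeg fallback is our own assumption):
import base64
import mimetypes


def image_to_data_url(image_path):
    # Guess the MIME type from the file extension; fall back to JPEG (assumption)
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        mime_type = "image/jpeg"
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"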
The llama-3.2-90b-vision-preview and llama-3.2-11b-vision-preview models support tool use! The following cURL example defines a get_current_weather tool that the model can leverage to answer a user query that asks about the weather and includes an image of a location (i.e. New York City) from which the model can infer the location:
curl https://api.groq.com/openai/v1/chat/completions -s \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $GROQ_API_KEY" \
-d '{
"model": "llama-3.2-11b-vision-preview",
"messages": [
{
"role": "user",
"content": [{"type": "text", "text": "Whats the weather like in this state?"}, {"type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}]
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto"
}' | jq '.choices[0].message.tool_calls'
The following output from the example above shows how the model inferred the state as New York from the given image and called our example function:
[
{
"id": "call_q0wg",
"function": {
"arguments": "{\"location\": \"New York, NY\",\"unit\": \"fahrenheit\"}",
"name": "get_current_weather"
},
"type": "function"
}
]
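Note that the API returns the tool call but does not execute it; your application runs get_current_weather and sends the result back so the model can compose a final answer. The following Python sketch shows that round trip using the same tool definition as the cURL example; the hard-coded get_current_weather stub is hypothetical, and a real application would query an actual weather service:
import json

from groq import Groq

client = Groq()


# Hypothetical stub -- a real application would query an actual weather service
def get_current_weather(location, unit="fahrenheit"):
    return json.dumps({"location": location, "temperature": "72", "unit": unit})


tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the weather like in this state?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                },
            },
        ],
    }
]

response = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
assistant_message = response.choices[0].message

if assistant_message.tool_calls:
    # Echo the assistant's tool call back into the conversation history
    messages.append(assistant_message)
    for tool_call in assistant_message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        # Run our local function and attach its result as a tool message
        messages.append(
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": get_current_weather(**args),
            }
        )
    # Ask the model for a final answer that incorporates the tool result
    final = client.chat.completions.create(
        model="llama-3.2-11b-vision-preview",
        messages=messages,
    )
    print(final.choices[0].message.content)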
The llama-3.2-90b-vision-preview and llama-3.2-11b-vision-preview models support JSON mode! The following Python example queries the model with an image and text (i.e. "List what you observe in this photo in JSON format.") with response_format set for JSON mode:
from groq import Groq

client = Groq()
completion = client.chat.completions.create(
    model="llama-3.2-90b-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "List what you observe in this photo in JSON format."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/d/da/SF_From_Marin_Highlands3.jpg"
                    }
                }
            ]
        }
    ],
    temperature=1,
    max_completion_tokens=1024,
    top_p=1,
    stream=False,
    response_format={"type": "json_object"},
    stop=None,
)

print(completion.choices[0].message)
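With response_format set to json_object, the message content is a JSON string, so you can parse it with Python's standard json module. Continuing from the example above:
import json

# Parse the JSON string returned by the model into a Python object
data = json.loads(completion.choices[0].message.content)
print(data)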
The llama-3.2-90b-vision-preview and llama-3.2-11b-vision-preview models support multi-turn conversations! The following Python example shows a multi-turn user conversation about an image:
from groq import Groq

client = Groq()
completion = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/d/da/SF_From_Marin_Highlands3.jpg"
                    }
                }
            ]
        },
        {
            "role": "user",
            "content": "Tell me more about the area."
        }
    ],
    temperature=1,
    max_completion_tokens=1024,
    top_p=1,
    stream=False,
    stop=None,
)

print(completion.choices[0].message)
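The example above sends both user turns in a single request. To continue a conversation interactively, append the assistant's reply to the message history before sending the next user turn. A minimal sketch of that pattern:
from groq import Groq

client = Groq()
image_url = "https://upload.wikimedia.org/wikipedia/commons/d/da/SF_From_Marin_Highlands3.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]

# First turn
completion = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=messages,
    max_completion_tokens=1024,
)
reply = completion.choices[0].message.content

# Append the assistant's reply, then the next user turn
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Tell me more about the area."})

# Second turn, with the full history
completion = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=messages,
    max_completion_tokens=1024,
)
print(completion.choices[0].message.content)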
Vision models can be used in a wide range of applications, from visual question answering and caption generation to OCR and beyond. These are just a few ideas to get you started; the possibilities are endless, and we're excited to see what you create with vision models powered by Groq for low latency and fast inference!
Check out our Groq API Cookbook repository on GitHub (and give us a ⭐) for practical examples and tutorials.
We're always looking for contributions. If you have any cool tutorials or guides to share, submit a pull request for review to help our open-source community!