Groq API offers fast, low-latency inference for multimodal models with vision capabilities, which can understand and interpret visual data from images. By analyzing the content of an image, these models can generate human-readable text that provides insights about it.
Groq API supports powerful multimodal models that can be easily integrated into your applications for fast, accurate image processing tasks such as visual question answering, caption generation, and Optical Character Recognition (OCR).
Note: Images are billed at 6,400 tokens per image.
Groq API supports two vision models: llama-3.2-90b-vision-preview and llama-3.2-11b-vision-preview. To use them, call the chat.completions API endpoint (i.e. https://api.groq.com/openai/v1/chat/completions) and set the model parameter to llama-3.2-90b-vision-preview or llama-3.2-11b-vision-preview. See the code examples below.
The following code example passes an image to the model via a URL:
from groq import Groq

client = Groq()
completion = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/f/f2/LPU-v1-die.jpg"
                    }
                }
            ]
        }
    ],
    temperature=1,
    max_completion_tokens=1024,
    top_p=1,
    stream=False,
    stop=None,
)

print(completion.choices[0].message)
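The example above sets stream=False; the same request also works with streaming enabled, so you can print tokens as they arrive rather than waiting for the full completion. A minimal sketch with stream=True, following the same OpenAI-compatible streaming interface:
from groq import Groq

client = Groq()
stream = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/f/f2/LPU-v1-die.jpg"
                    },
                },
            ],
        }
    ],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a token delta; content may be None on the final chunk
    print(chunk.choices[0].delta.content or "", end="")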
To pass a locally saved image, we first need to encode it as a base64 string before passing it as the image_url in our API request, as follows:
from groq import Groq
import base64


# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


# Path to your image
image_path = "sf.jpg"

# Getting the base64 string
base64_image = encode_image(image_path)

client = Groq()

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}",
                    },
                },
            ],
        }
    ],
    model="llama-3.2-11b-vision-preview",
)

print(chat_completion.choices[0].message.content)
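The data:image/jpeg;base64,... prefix above assumes a JPEG. If your local images may be PNGs or other formats, you can build the data URL with the correct MIME type using Python's standard mimetypes module. A minimal sketch (the image/jpeg fallback is our own assumption):
import base64
import mimetypes


def image_to_data_url(image_path):
    # Guess the MIME type from the file extension; fall back to JPEG (assumption)
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        mime_type = "image/jpeg"
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"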
The llama-3.2-90b-vision-preview and llama-3.2-11b-vision-preview models support tool use! The following cURL example defines a get_current_weather tool that the model can leverage to answer a user query that asks about the weather and includes an image of a location (i.e. New York City) from which the model can infer the location:
curl https://api.groq.com/openai/v1/chat/completions -s \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $GROQ_API_KEY" \
-d '{
"model": "llama-3.2-11b-vision-preview",
"messages": [
{
"role": "user",
"content": [{"type": "text", "text": "Whats the weather like in this state?"}, {"type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}]
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto"
}' | jq '.choices[0].message.tool_calls'
The following output from the example above shows how the model inferred the state as New York from the given image and called our example function:
[
{
"id": "call_q0wg",
"function": {
"arguments": "{\"location\": \"New York, NY\",\"unit\": \"fahrenheit\"}",
"name": "get_current_weather"
},
"type": "function"
}
]
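Note that the API returns the tool call but does not execute it; your application runs get_current_weather and sends the result back so the model can compose a final answer. The following Python sketch shows that round trip using the same tool definition as the cURL example; the hard-coded get_current_weather stub is hypothetical, and a real application would query an actual weather service:
import json

from groq import Groq

client = Groq()


# Hypothetical stub -- a real application would query an actual weather service
def get_current_weather(location, unit="fahrenheit"):
    return json.dumps({"location": location, "temperature": "72", "unit": unit})


tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the weather like in this state?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                },
            },
        ],
    }
]

response = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
assistant_message = response.choices[0].message

if assistant_message.tool_calls:
    # Echo the assistant's tool call back into the conversation history
    messages.append(assistant_message)
    for tool_call in assistant_message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        # Run our local function and attach its result as a tool message
        messages.append(
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": get_current_weather(**args),
            }
        )
    # Ask the model for a final answer that incorporates the tool result
    final = client.chat.completions.create(
        model="llama-3.2-11b-vision-preview",
        messages=messages,
    )
    print(final.choices[0].message.content)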
The llama-3.2-90b-vision-preview and llama-3.2-11b-vision-preview models support JSON mode! The following Python example queries the model with an image and text (i.e. "List what you observe in this photo in JSON format.") with response_format set for JSON mode:
from groq import Groq

client = Groq()
completion = client.chat.completions.create(
    model="llama-3.2-90b-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "List what you observe in this photo in JSON format."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/d/da/SF_From_Marin_Highlands3.jpg"
                    }
                }
            ]
        }
    ],
    temperature=1,
    max_completion_tokens=1024,
    top_p=1,
    stream=False,
    response_format={"type": "json_object"},
    stop=None,
)

print(completion.choices[0].message)
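With response_format set to json_object, the message content is a JSON string, so you can parse it with Python's standard json module. Continuing from the example above:
import json

# Parse the JSON string returned by the model into a Python object
data = json.loads(completion.choices[0].message.content)
print(data)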
The llama-3.2-90b-vision-preview and llama-3.2-11b-vision-preview models support multi-turn conversations! The following Python example shows a multi-turn user conversation about an image:
from groq import Groq

client = Groq()
completion = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/d/da/SF_From_Marin_Highlands3.jpg"
                    }
                }
            ]
        },
        {
            "role": "user",
            "content": "Tell me more about the area."
        }
    ],
    temperature=1,
    max_completion_tokens=1024,
    top_p=1,
    stream=False,
    stop=None,
)

print(completion.choices[0].message)
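The example above sends both user turns in a single request. To continue a conversation interactively, append the assistant's reply to the message history before sending the next user turn. A minimal sketch of that pattern:
from groq import Groq

client = Groq()
image_url = "https://upload.wikimedia.org/wikipedia/commons/d/da/SF_From_Marin_Highlands3.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]

# First turn
completion = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=messages,
    max_completion_tokens=1024,
)
reply = completion.choices[0].message.content

# Append the assistant's reply, then the next user turn
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Tell me more about the area."})

# Second turn, with the full history
completion = client.chat.completions.create(
    model="llama-3.2-11b-vision-preview",
    messages=messages,
    max_completion_tokens=1024,
)
print(completion.choices[0].message.content)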
Vision models can be used in a wide range of applications, from visual question answering and caption generation to OCR and beyond. These are just a few ideas to get you started; the possibilities are endless, and we're excited to see what you create with vision models powered by Groq for low latency and fast inference!
Check out our Groq API Cookbook repository on GitHub (and give us a ⭐) for practical examples and tutorials.
We're always looking for contributions. If you have any cool tutorials or guides to share, submit a pull request for review to help our open-source community!