Documentation

Vision

Groq API offers fast inference for multimodal models with vision capabilities for understanding and interpreting visual data from images. By analyzing the content of an image, these models can generate human-readable text that provides insights about the visual data.

Supported Model(s)

Groq API supports powerful multimodal model(s) that can be easily integrated into your applications to provide fast and accurate image processing for tasks such as visual question answering, caption generation, and Optical Character Recognition (OCR):


LLaVA V1.5 7B (Preview)

  • Model ID: llava-v1.5-7b-4096-preview
  • Description: LLaVA (Large Language-and-Vision Assistant) is an open-source, fine-tuned multimodal model that can generate text descriptions of images, achieving impressive performance on multimodal instruction-following tasks and outperforming GPT-4 on certain benchmarks.
  • Context Window: 4,096 tokens

Limitations

  • Preview Model: LLaVA V1.5 7B is currently in preview and should be used only for experimentation.
  • Image Size Limit: The maximum allowed size for a request containing an image URL as input is 20MB. Requests larger than this limit will return a 400 error.
  • Request Size Limit (Base64 Encoded Images): The maximum allowed size for a request containing a base64 encoded image is 4MB. Requests larger than this limit will return a 413 error (see the size-check sketch after this list).
  • Single Image per Request: Only one image can be processed per request. Requests with multiple images will return a 400 error.
  • Single User Message per Request: Multi-turn conversations are not currently supported and only one user message is allowed per request. Requests with multiple user messages will return a 400 error.
  • No System Prompt or Assistant Message: System messages and assistant messages are currently not supported. Requests including system or assistant messages will return a 400 error.
  • No Tool Use: Tool Use is not currently supported. Requests with tool use or function calling will return a 400 error.
  • No JSON Mode: JSON Mode is not currently supported. Requests with JSON Mode enabled will return a 400 error.
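
Because base64 encoding inflates an image to roughly 4/3 of its raw size, it can be useful to check the encoded size client side before sending. The following is a minimal sketch of such a guard; encode_image_checked is a hypothetical helper, not part of the SDK, and note that the 4MB limit applies to the whole request, of which the encoded image is typically the dominant part:


import base64

MAX_REQUEST_BYTES = 4 * 1024 * 1024  # 4MB limit for base64-encoded image requests

def encode_image_checked(image_path):
    # base64 output is ~4/3 the raw file size, so encode first, then check
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    if len(encoded) > MAX_REQUEST_BYTES:
        raise ValueError(
            f"Encoded image is {len(encoded)} bytes; requests over 4MB return a 413 error."
        )
    return encoded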

How to Use Vision

Use Groq API vision features via:

  • GroqChat: Select llava-v1.5-7b-4096-preview as the model and upload your image.
  • GroqCloud Console Playground: Select llava-v1.5-7b-4096-preview as the model and upload your image.
  • Groq API Request: Call the chat.completions API endpoint (i.e., https://api.groq.com/openai/v1/chat/completions) with a user message whose content includes both a text element for your query and an image_url element for your image, and set the model parameter to llava-v1.5-7b-4096-preview. See the code examples below, as well as the raw HTTP sketch following this list.
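
If you prefer not to use the Python SDK, the endpoint can also be called directly over HTTP. The following is a minimal sketch using the requests library; it assumes your API key is available in the GROQ_API_KEY environment variable, and the payload mirrors the SDK examples below:


import os
import requests

# Send the same payload the SDK examples below build, as raw JSON
response = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
    json={
        "model": "llava-v1.5-7b-4096-preview",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {"url": "https://example.com/image.png"},
                    },
                ],
            }
        ],
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])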

How to Pass Images from URLs as Input

The following are code examples for passing your image to the model via a URL:


from groq import Groq

# The Groq client reads your API key from the GROQ_API_KEY environment variable by default
client = Groq()

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.png",
                    },
                },
            ],
        }
    ],
    model="llava-v1.5-7b-4096-preview",
)

print(chat_completion.choices[0].message.content)

How to Pass Locally Saved Images as Input

To pass a locally saved image, first encode it as a base64 string, then pass that string as the image_url (using a data URL) in your API request as follows:


from groq import Groq
import base64


# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "path_to_your_image.jpg"

# Getting the base64 string
base64_image = encode_image(image_path)

client = Groq()

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}",
                    },
                },
            ],
        }
    ],
    model="llava-v1.5-7b-4096-preview",
)

print(chat_completion.choices[0].message.content)
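
The data URL in the example above hard-codes image/jpeg. If your application accepts multiple image formats, you can derive the MIME type from the file extension instead; the following is a minimal sketch using Python's standard mimetypes module (make_data_url is a hypothetical helper, not part of the SDK):


import base64
import mimetypes

def make_data_url(image_path):
    # Guess the MIME type (e.g. image/png, image/jpeg) from the file extension
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Could not determine an image MIME type for {image_path}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

The returned string can be dropped directly into the url field of the image_url content element, e.g. make_data_url("path_to_your_image.jpg").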

Venture Deeper into Vision

Next Steps

Check out our Groq API Cookbook tutorial to learn how to leverage LLaVA powered by Groq.

Use Cases to Explore

The LLaVA vision model can be used in a wide range of applications. Here are some ideas:

  • Accessibility Applications: Develop an application that generates audio descriptions for images: use the LLaVA model to produce a text description, then convert it to audio with one of our audio endpoints.
  • E-commerce Product Description Generation: Create an application that generates product descriptions for e-commerce websites.
  • Education and Research: Develop an application that generates text descriptions for educational images or diagrams.

These are just a few ideas to get you started. The possibilities are endless, and we're excited to see what you create with our LLaVA preview!