
Speech to Text

Groq's Whisper API supports transcription and translation. Use our OpenAI-compatible endpoints to integrate high-quality audio processing directly into your applications.

API Endpoints

  • Transcriptions: Convert audio to text. https://api.groq.com/openai/v1/audio/transcriptions
  • Translations: Translate audio to English text. https://api.groq.com/openai/v1/audio/translations

Supported Models

  • Model ID: whisper-large-v3. This model provides state-of-the-art performance for both transcription and translation tasks.

Audio file limitations

  • File uploads are limited to 25 MB
  • The following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm
  • If a file contains multiple audio tracks, for example a video with dubs, only the first track will be transcribed
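The limits above can be checked on the client before uploading. The following is a minimal sketch, not part of the Groq SDK: the helper name check_audio_file is hypothetical, "25 MB" is assumed to mean 25 * 1024 * 1024 bytes, and the file type is guessed from the extension only, which may not match how the API inspects the file.

```python
import os

# Values taken from the limits listed above.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" read as 25 MiB
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}


def check_audio_file(path, size_bytes=None):
    """Return a list of problems; an empty list means the file looks uploadable.

    Hypothetical helper for illustration. If size_bytes is omitted, the size
    is read from disk.
    """
    problems = []
    # Judge the file type by its extension only (a heuristic).
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes is None:
        size_bytes = os.path.getsize(path)
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(
            f"file is {size_bytes} bytes, over the {MAX_UPLOAD_BYTES}-byte limit"
        )
    return problems
```

A file that fails only the size check is a candidate for the downsampling step described next.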

Whisper downsamples audio to 16,000 Hz mono before transcribing. You can perform this preprocessing client-side to reduce file size and allow longer files to be uploaded to Groq. The following ffmpeg command can be used to reduce file size:


ffmpeg \
  -i <your file> \
  -ar 16000 \
  -ac 1 \
  -map 0:a:0 \
  <output file name>
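If the preprocessing needs to run from a Python script, the same ffmpeg invocation could be wrapped with the standard subprocess module. This is a sketch under stated assumptions: ffmpeg must be installed and on the PATH, the function names are hypothetical, and the explicit stream specifier 0:a:0 is used to select the first audio stream of the first input.

```python
import subprocess


def build_downsample_command(src, dst):
    """Build the ffmpeg argument list for 16 kHz mono output (hypothetical helper)."""
    return [
        "ffmpeg",
        "-i", src,
        "-ar", "16000",   # resample to 16,000 Hz
        "-ac", "1",       # mix down to a single channel
        "-map", "0:a:0",  # keep only the first audio stream of the first input
        dst,
    ]


def downsample(src, dst):
    """Run ffmpeg; raises subprocess.CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(build_downsample_command(src, dst), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.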

Transcription Usage

Transcribe spoken words in audio or video files.


Optional Parameters:

  • prompt: Provide context or specify how to spell unfamiliar words
  • response_format: Define the output response format.
    • Default is "json"
    • Set to "verbose_json" to receive timestamps for audio segments
    • Set to "text" to return a text response
    • The vtt and srt formats are not supported
  • temperature: Specify a value between 0 and 1 to control the transcription output.
  • language: Specify the language for transcription (Whisper auto-detects the language if none is specified)
    • Use ISO 639-1 language codes (e.g., "en" for English, "fr" for French).
    • Specifying a language may improve transcription accuracy and speed
  • timestamp_granularities[] is not supported
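The constraints listed above can be checked locally before a request is sent. The sketch below is illustrative only: build_transcription_options is a hypothetical helper, not part of the Groq SDK, and the language check only verifies the two-letter shape of the code rather than looking it up in the ISO 639-1 list.

```python
ALLOWED_RESPONSE_FORMATS = {"json", "verbose_json", "text"}  # vtt and srt are not supported


def build_transcription_options(prompt=None, response_format="json",
                                temperature=None, language=None):
    """Validate the optional parameters and return only those that were set."""
    if response_format not in ALLOWED_RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if temperature is not None and not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    if language is not None:
        # Shape check only: two lowercase letters, e.g. "en" or "fr".
        if not (len(language) == 2 and language.isalpha() and language.islower()):
            raise ValueError("language should be a two-letter ISO 639-1 code")
    options = {"response_format": response_format}
    if prompt is not None:
        options["prompt"] = prompt
    if temperature is not None:
        options["temperature"] = temperature
    if language is not None:
        options["language"] = language
    return options
```

The resulting dictionary could then be unpacked into the SDK call, for example client.audio.transcriptions.create(file=..., model="whisper-large-v3", **options).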

Code Overview

pip install groq

import os
from groq import Groq

# The client reads the API key from the GROQ_API_KEY environment variable
client = Groq()

# Path to an audio file stored next to this script
filename = os.path.join(os.path.dirname(__file__), "sample_audio.m4a")

with open(filename, "rb") as file:
    transcription = client.audio.transcriptions.create(
        file=(filename, file.read()),  # (file name, file contents)
        model="whisper-large-v3",
        prompt="Specify context or spelling",  # Optional
        response_format="json",  # Optional
        language="en",  # Optional
        temperature=0.0,  # Optional
    )
    print(transcription.text)

Translation Usage

Translate spoken words in audio or video files to English.


Optional Parameters:

  • prompt: Provide context or specify how to spell unfamiliar words
  • response_format: Define the output response format
    • Default is "json"
    • Set to "verbose_json" to receive timestamps for audio segments
    • Set to "text" to return a text response
    • The vtt and srt formats are not supported
  • temperature: Specify a value between 0 and 1 to control the translation output

Code Overview

pip install groq

import os
from groq import Groq

# The client reads the API key from the GROQ_API_KEY environment variable
client = Groq()

# Path to an audio file stored next to this script
filename = os.path.join(os.path.dirname(__file__), "sample_audio.m4a")

with open(filename, "rb") as file:
    translation = client.audio.translations.create(
        file=(filename, file.read()),  # (file name, file contents)
        model="whisper-large-v3",
        prompt="Specify context or spelling",  # Optional
        response_format="json",  # Optional
        temperature=0.0,  # Optional
    )
    print(translation.text)

Prompting Guidelines

The prompt parameter is an optional input that helps the Whisper model understand the context of the audio segments and maintain a consistent writing style. When you provide a prompt, Whisper treats it as a prior transcript from the same audio file and follows its style rather than its content. Unlike with a chat completion prompt, the model will not follow instructions or attempt to execute commands contained in the prompt.


Tips for Using the prompt Parameter:

  • The prompt parameter should be in the same language as the audio file.
  • The prompt parameter is limited to 224 tokens.
  • The prompt parameter can be used to steer the model's output by denoting proper spellings or by emulating a specific writing style or tone.
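One way to apply these tips is to assemble the prompt from a list of names and terms whose spelling matters. The sketch below is an illustration, not an official recipe: build_spelling_prompt is a hypothetical helper, and because an exact token count requires Whisper's tokenizer, it uses a conservative word budget (100 words by default, an assumed value) as a rough stand-in for the 224-token limit.

```python
def build_spelling_prompt(terms, max_words=100):
    """Join terms into a comma-separated prompt without exceeding a word budget.

    max_words is a crude, assumed proxy for the 224-token limit; word counts
    and token counts are not the same thing.
    """
    kept = []
    words_used = 0
    for term in terms:
        n = len(term.split())
        if words_used + n > max_words:
            break  # stop before the budget is exceeded
        kept.append(term)
        words_used += n
    return ", ".join(kept)


# Terms should be written in the same language as the audio.
prompt = build_spelling_prompt(["Groq", "LPU", "Whisper"])
```

The returned string can be passed as the prompt argument of the transcription or translation call.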