Speech to Text

Groq's Whisper API supports both transcription and translation. Use our OpenAI-compatible endpoints to integrate high-quality audio processing directly into your applications.

API Endpoints

  • Transcriptions: Convert audio to text. https://api.groq.com/openai/v1/audio/transcriptions
  • Translations: Translate audio to English text. https://api.groq.com/openai/v1/audio/translations

Supported Models

  • Model ID: whisper-large-v3. This model provides state-of-the-art performance for both transcription and translation tasks.

Audio file limitations

  • File uploads are limited to 25 MB
  • The following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm
  • If a file contains multiple audio tracks, for example a video with dubs, only the first track will be transcribed
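
The limits above can be checked locally before uploading. The following is a minimal sketch, not part of the Groq SDK: the helper name check_audio_file is hypothetical, and it assumes "25 MB" means 25 * 1024 * 1024 bytes, which the list above does not state explicitly.

```python
import os

# Limits taken from the list above.
# Assumption: "25 MB" is interpreted as 25 MiB; adjust if the API counts differently.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}


def check_audio_file(path, size_bytes=None):
    """Return a list of problems with the file; an empty list means none were found."""
    problems = []

    # Compare the extension case-insensitively, without the leading dot.
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")

    # Read the size from disk unless the caller already knows it.
    if size_bytes is None:
        size_bytes = os.path.getsize(path)
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_UPLOAD_BYTES}")

    return problems
```

Calling check_audio_file("sample_audio.m4a") on an existing file returns an empty list when the extension is supported and the file is within the size limit. The check looks only at the file name and size, not at the actual container or the number of audio tracks.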

Whisper downsamples audio to 16,000 Hz mono before transcribing. You can perform this preprocessing client-side to reduce file size, which allows longer recordings to fit within the upload limit. The following ffmpeg command can be used to reduce file size:


ffmpeg \
  -i <your file> \
  -ar 16000 \
  -ac 1 \
  -map 0:a: \
  <output file name>
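
If the preprocessing runs inside a Python application, the same command can be assembled and run with the standard subprocess module. This is a sketch under a few assumptions: ffmpeg is installed and on the PATH, the function names are hypothetical, and the stream selector "0:a:0" (first audio stream of the first input) is chosen here for illustration rather than taken from the command above.

```python
import subprocess


def build_ffmpeg_command(input_path, output_path):
    """Build an ffmpeg argument list that downsamples audio to 16 kHz mono."""
    return [
        "ffmpeg",
        "-i", input_path,
        "-ar", "16000",   # resample to 16,000 Hz
        "-ac", "1",       # mix down to a single channel
        "-map", "0:a:0",  # assumption: keep only the first audio stream
        output_path,
    ]


def downsample(input_path, output_path):
    """Run ffmpeg; raises subprocess.CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_ffmpeg_command(input_path, output_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. A call such as downsample("interview.mp4", "interview.wav") writes the converted audio to the second path.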

Transcription Usage

Transcribe spoken words in audio or video files.


Optional Parameters:

  • prompt: Provide context or specify how to spell unfamiliar words
  • response_format: Define the output response format.
    • Default is "json"
    • Set to "verbose_json" to receive timestamps for audio segments
    • Set to "text" to return a text response
    • The vtt and srt formats are not supported
  • temperature: Specify a value between 0 and 1 to control the transcription output
  • language: Specify the language of the audio (Whisper auto-detects the language if none is specified)
    • Use ISO 639-1 language codes (e.g., "en" for English, "fr" for French)
    • Specifying a language may improve transcription accuracy and speed
  • timestamp_granularities[] is not supported
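
The constraints in this list can be enforced before a request is sent. The helper below is a hypothetical sketch, not part of the Groq SDK: it validates the optional parameters described above and returns only the ones that were set. The language check is a simple shape test (two lowercase ASCII letters), not a lookup against the full ISO 639-1 table.

```python
SUPPORTED_RESPONSE_FORMATS = {"json", "verbose_json", "text"}


def build_transcription_options(prompt=None, response_format="json",
                                temperature=None, language=None):
    """Validate the optional parameters and return them as keyword arguments."""
    if response_format not in SUPPORTED_RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if temperature is not None and not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    if language is not None:
        looks_like_code = (
            len(language) == 2
            and language.isascii()
            and language.isalpha()
            and language.islower()
        )
        if not looks_like_code:
            raise ValueError("language should be a two-letter ISO 639-1 code such as 'en'")

    # Include only the parameters the caller actually provided.
    options = {"response_format": response_format}
    if prompt is not None:
        options["prompt"] = prompt
    if temperature is not None:
        options["temperature"] = temperature
    if language is not None:
        options["language"] = language
    return options
```

The returned dictionary can be unpacked into the request, for example client.audio.transcriptions.create(file=..., model="whisper-large-v3", **options).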

Code Overview

pip install groq

import os
from groq import Groq

# The client reads the API key from the GROQ_API_KEY environment variable.
client = Groq()

# Path to an audio file stored next to this script.
filename = os.path.join(os.path.dirname(__file__), "sample_audio.m4a")

with open(filename, "rb") as file:
    transcription = client.audio.transcriptions.create(
        file=(filename, file.read()),  # (file name, file contents)
        model="whisper-large-v3",
        prompt="Specify context or spelling",  # Optional
        response_format="json",  # Optional
        language="en",  # Optional
        temperature=0.0,  # Optional
    )
    print(transcription.text)
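
When response_format is set to "verbose_json", the response includes timestamps for audio segments. The sketch below shows one way to print them. It assumes the response has been parsed into a plain dict and follows the usual Whisper verbose_json layout, with a "segments" list whose items carry "start", "end" (both in seconds), and "text". That layout is an assumption and is not spelled out on this page. The function name and the sample data are made up for illustration.

```python
def format_segments(response):
    """Return one '[start - end] text' line per segment of a verbose_json-style dict."""
    lines = []
    for segment in response.get("segments", []):
        start, end = segment["start"], segment["end"]
        lines.append(f"[{start:.2f}s - {end:.2f}s] {segment['text'].strip()}")
    return lines


# Made-up sample in the assumed verbose_json shape, for illustration only.
sample_response = {
    "text": "Hello there. How are you?",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello there."},
        {"start": 1.5, "end": 3.25, "text": " How are you?"},
    ],
}

for line in format_segments(sample_response):
    print(line)
```

A response without a "segments" key, such as one requested with the default "json" format, yields an empty list.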

Translation Usage

Translate spoken words in audio or video files to English.


Optional Parameters:

  • prompt: Provide context or specify how to spell unfamiliar words
  • response_format: Define the output response format
    • Default is "json"
    • Set to "verbose_json" to receive timestamps for audio segments
    • Set to "text" to return a text response
    • The vtt and srt formats are not supported
  • temperature: Specify a value between 0 and 1 to control the translation output

Code Overview

pip install groq

import os
from groq import Groq

# The client reads the API key from the GROQ_API_KEY environment variable.
client = Groq()

# Path to an audio file stored next to this script.
filename = os.path.join(os.path.dirname(__file__), "sample_audio.m4a")

with open(filename, "rb") as file:
    translation = client.audio.translations.create(
        file=(filename, file.read()),  # (file name, file contents)
        model="whisper-large-v3",
        prompt="Specify context or spelling",  # Optional
        response_format="json",  # Optional
        temperature=0.0,  # Optional
    )
    print(translation.text)