Speech

Groq API is the fastest speech-to-text solution available, offering OpenAI-compatible endpoints that enable real-time transcriptions and translations. With Groq API, you can integrate high-quality audio processing into your applications at speeds that rival human interaction.

API Endpoints

We support two endpoints:

| Endpoint | Usage | API Endpoint |
|---|---|---|
| Transcriptions | Convert audio to text | https://api.groq.com/openai/v1/audio/transcriptions |
| Translations | Translate audio to English text | https://api.groq.com/openai/v1/audio/translations |
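
Because these endpoints are OpenAI-compatible, you can call them with any HTTP client. Below is a minimal sketch using Python's requests library; it assumes your key is in the GROQ_API_KEY environment variable and that "sample_audio.m4a" is a placeholder for your own file:

import os
import requests

url = "https://api.groq.com/openai/v1/audio/transcriptions"
headers = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}

# Send the audio as multipart form data, with the model as a form field
with open("sample_audio.m4a", "rb") as f:
    response = requests.post(
        url,
        headers=headers,
        files={"file": f},
        data={"model": "whisper-large-v3-turbo"},
    )

print(response.json()["text"])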

Supported Models

| Model ID | Model | Supported Language(s) | Description |
|---|---|---|---|
| whisper-large-v3-turbo | Whisper Large V3 Turbo | Multilingual | A fine-tuned version of a pruned Whisper Large V3 designed for fast, multilingual transcription tasks. |
| distil-whisper-large-v3-en | Distil-Whisper English | English-only | A distilled, or compressed, version of OpenAI's Whisper model, designed to provide faster, lower-cost English speech recognition while maintaining comparable accuracy. |
| whisper-large-v3 | Whisper Large V3 | Multilingual | Provides state-of-the-art performance with high accuracy for multilingual transcription and translation tasks. |

Which Whisper Model Should You Use?

Having more choices is great, but let's try to avoid decision paralysis by breaking down the tradeoffs between models to find the one most suitable for your applications:

  • If your application is error-sensitive and requires multilingual support, use whisper-large-v3.
  • If your application is less sensitive to errors and requires English only, use distil-whisper-large-v3-en.
  • If your application requires multilingual support and you need the best price for performance, use whisper-large-v3-turbo.
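
If it helps to see those tradeoffs as code, here is a minimal sketch that encodes the decision above (the helper and its flags are illustrative, not part of the Groq API):

# Illustrative helper encoding the model tradeoffs described above
def pick_model(multilingual: bool, error_sensitive: bool) -> str:
    if error_sensitive:
        # Accuracy-critical workloads get the full model
        return "whisper-large-v3"
    if multilingual:
        # Best price for performance with multilingual support
        return "whisper-large-v3-turbo"
    # English-only and tolerant of some errors: cheapest and fastest
    return "distil-whisper-large-v3-en"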

The following table breaks down the metrics for each model.

| Model | Cost Per Hour | Language Support | Transcription Support | Translation Support | Real-time Speed Factor | Word Error Rate |
|---|---|---|---|---|---|---|
| whisper-large-v3 | $0.111 | Multilingual | Yes | Yes | 189 | 10.3% |
| whisper-large-v3-turbo | $0.04 | Multilingual | Yes | No | 216 | 12% |
| distil-whisper-large-v3-en | $0.02 | English only | Yes | No | 250 | 13% |
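
As a back-of-the-envelope example of how the per-hour rates translate to per-request cost, the sketch below combines the table's prices with the 10-second minimum billed length described under Audio File Limitations below (the helper is illustrative; consult your billing dashboard for actual charges):

# Rates from the table above, in dollars per hour of audio
RATES_PER_HOUR = {
    "whisper-large-v3": 0.111,
    "whisper-large-v3-turbo": 0.04,
    "distil-whisper-large-v3-en": 0.02,
}

def estimated_cost(model: str, duration_seconds: float) -> float:
    # Requests shorter than 10 seconds are billed as 10 seconds
    billed_seconds = max(duration_seconds, 10)
    return RATES_PER_HOUR[model] * billed_seconds / 3600

# A 90-second clip on whisper-large-v3-turbo: 0.04 * 90 / 3600 = $0.001
print(f"${estimated_cost('whisper-large-v3-turbo', 90):.4f}")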

Audio File Limitations

| Limit | Value |
|---|---|
| Max File Size | 25 MB |
| Minimum File Length | 0.01 seconds |
| Minimum Billed Length | 10 seconds. If you submit a request shorter than this, you will still be billed for 10 seconds. |
| Supported File Types | `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `wav`, `webm` |
| Single Audio Track | Only the first track will be transcribed for files with multiple audio tracks (e.g., a dubbed video). |
| Supported Response Formats | `json`, `verbose_json`, `text` |
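
Before uploading, you may want to validate files against these limits client-side. Here is a minimal sketch (the helper name is ours, and it assumes the 25 MB limit means binary megabytes):

import os

MAX_BYTES = 25 * 1024 * 1024  # 25 MB limit; assumes binary megabytes
SUPPORTED_TYPES = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}

def check_audio_file(path: str) -> None:
    # Fail fast locally rather than letting the API reject the upload
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("File exceeds 25 MB; preprocess it first (see the next section)")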

Preprocessing Audio Files

Our speech-to-text models will downsample audio to 16,000 Hz mono before transcribing. This preprocessing can be performed client-side to reduce file size and allow longer files to be uploaded to Groq. The following ffmpeg command can be used to reduce file size:

ffmpeg \
  -i <your file> \
  -ar 16000 \
  -ac 1 \
  -map 0:a:0 \
  <output file name>
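
If you prefer to run the same preprocessing from Python, here is a minimal sketch that shells out to the ffmpeg command above (it assumes ffmpeg is installed and on your PATH; the file names are placeholders):

import subprocess

def preprocess(input_path: str, output_path: str) -> None:
    # Downsample to 16 kHz mono and keep only the first audio track,
    # mirroring the ffmpeg command above
    subprocess.run(
        ["ffmpeg", "-i", input_path, "-ar", "16000", "-ac", "1",
         "-map", "0:a:0", output_path],
        check=True,
    )

preprocess("sample_audio.m4a", "sample_audio_16k.wav")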

Transcription Endpoint Usage

The transcription endpoint allows you to transcribe spoken words in audio or video files. You can provide optional request parameters to customize the transcription output.

Optional Request Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | None | Provide context or specify how to spell unfamiliar words (limited to 224 tokens). |
| response_format | string | json | Define the output response format. Set to `verbose_json` to receive timestamps for audio segments. Set to `text` to return a text response. |
| temperature | float | None | Specify a value between 0 and 1 to control the transcription output. |
| language | string | None | whisper-large-v3-turbo and whisper-large-v3 only! Specify the language for transcription. Use ISO 639-1 language codes (e.g. "en" for English, "fr" for French). Specifying a language may improve transcription accuracy and speed. |
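
For example, to get segment-level timestamps, set response_format to verbose_json. The sketch below assumes the SDK exposes the segments as a list of dicts with "start", "end", and "text" keys; inspect the response shape in your own environment:

from groq import Groq

client = Groq()

with open("sample_audio.m4a", "rb") as file:  # Replace with your audio file!
    transcription = client.audio.transcriptions.create(
        file=("sample_audio.m4a", file.read()),
        model="whisper-large-v3-turbo",
        response_format="verbose_json",  # Request segment timestamps
    )

# Assumption: each segment is a dict with "start", "end", and "text" keys
for segment in transcription.segments:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")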

Example Usage

The Groq SDK package can be installed using the following command:

pip install groq

The following code snippet demonstrates how to use Groq API to transcribe an audio file in Python:

import os
from groq import Groq

# Initialize the Groq client
client = Groq()

# Specify the path to the audio file
filename = os.path.dirname(__file__) + "/sample_audio.m4a"  # Replace with your audio file!

# Open the audio file
with open(filename, "rb") as file:
    # Create a transcription of the audio file
    transcription = client.audio.transcriptions.create(
        file=(filename, file.read()),  # Required audio file
        model="whisper-large-v3-turbo",  # Required model to use for transcription
        prompt="Specify context or spelling",  # Optional
        response_format="json",  # Optional
        language="en",  # Optional
        temperature=0.0,  # Optional
    )
    # Print the transcription text
    print(transcription.text)

Translation Endpoint Usage

The translation endpoint allows you to translate spoken words in audio or video files to English. You can provide optional request parameters to customize the translation output.

| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | None | Provide context or specify how to spell unfamiliar words (limited to 224 tokens). |
| response_format | string | json | Define the output response format. Set to `verbose_json` to receive timestamps for audio segments. Set to `text` to return a text response. |
| temperature | float | None | Specify a value between 0 and 1 to control the translation output. |

Example Usage

The Groq SDK package can be installed using the following command:

pip install groq

The following code snippet demonstrates how to use Groq API to translate an audio file in Python:

import os
from groq import Groq

# Initialize the Groq client
client = Groq()

# Specify the path to the audio file
filename = os.path.dirname(__file__) + "/sample_audio.m4a"  # Replace with your audio file!

# Open the audio file
with open(filename, "rb") as file:
    # Create a translation of the audio file
    translation = client.audio.translations.create(
        file=(filename, file.read()),  # Required audio file
        model="whisper-large-v3",  # Required model to use for translation
        prompt="Specify context or spelling",  # Optional
        response_format="json",  # Optional
        temperature=0.0,  # Optional
    )
    # Print the translation text
    print(translation.text)

Prompting Guidelines

The prompt parameter is an optional input of max 224 tokens that allows you to provide contextual information to the model, helping it maintain a consistent writing style.

How It Works

When you provide a prompt parameter, the speech-to-text model treats it as a prior transcript and follows its style, rather than adhering to the actual content of the audio segment. This means that the model will not:

  • Attempt to execute commands contained within the prompt
  • Follow instructions present in the prompt

In contrast to chat completion prompts, the prompt parameter is designed solely to provide stylistic guidance and contextual information to the model, rather than triggering specific actions or responses.

Best Practices
  • Provide contextual information about the audio segment, such as the type of conversation, topic, or speakers involved.
  • Write the prompt in the same language as the audio file.
  • Steer the model's output by denoting proper spellings, or use the prompt to emulate a specific writing style or tone.
  • Keep the prompt concise and focused on stylistic guidance.
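
As a concrete illustration, the snippet below passes a short, context-setting prompt to pin down proper-noun spellings (the product and host names are invented placeholders; substitute the terms your audio actually contains):

from groq import Groq

client = Groq()

with open("sample_audio.m4a", "rb") as file:  # Replace with your audio file!
    transcription = client.audio.transcriptions.create(
        file=("sample_audio.m4a", file.read()),
        model="whisper-large-v3",
        # Placeholder prompt: match the audio's language and name the
        # terms the model is likely to misspell
        prompt="An interview about the Acme LPU with host Jane Doe.",
    )
print(transcription.text)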

Use Cases

Groq API offers low latency and fast inference for speech recognition, transcription, and translation, enabling developers to build a wide range of highly accurate, real-time applications, such as:

  • Audio Translation: Translate audio files to break language barriers and facilitate global communication.
  • Customer Service: Create real-time, AI-powered customer service solutions that use speech recognition to route calls, transcribe conversations, and respond to customer inquiries.
  • Automated Speech-to-Text Systems: Implement automated speech-to-text systems in industries like healthcare, finance, and education, where accurate transcription is critical for compliance, record-keeping, and decision-making.
  • Voice-Controlled Interfaces: Develop voice-controlled interfaces for smart homes, cars, and other devices, where fast and accurate speech recognition is essential for user experience and safety.

We can't wait to see what you build! 🚀