Speech
Groq API is the fastest speech-to-text solution available, offering OpenAI-compatible endpoints that enable real-time transcriptions and translations. With Groq API, you can integrate high-quality audio processing into your applications at speeds that rival human interaction.
API Endpoints
We support two endpoints:
| Endpoint | Usage | API Endpoint |
|---|---|---|
| Transcriptions | Convert audio to text | `https://api.groq.com/openai/v1/audio/transcriptions` |
| Translations | Translate audio to English text | `https://api.groq.com/openai/v1/audio/translations` |
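Both endpoints accept standard multipart form uploads. As a minimal sketch, here is a direct HTTP call to the transcriptions endpoint using the `requests` library; it assumes your API key is stored in the `GROQ_API_KEY` environment variable and that `sample_audio.m4a` is a placeholder for your own file:

```python
import os

import requests

# POST the audio file to the transcriptions endpoint as multipart form data
with open("sample_audio.m4a", "rb") as f:
    response = requests.post(
        "https://api.groq.com/openai/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
        files={"file": f},
        data={"model": "whisper-large-v3-turbo"},
    )

print(response.json()["text"])
```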
Supported Models
| Model ID | Model | Supported Language(s) | Description |
|---|---|---|---|
| `whisper-large-v3-turbo` | Whisper Large V3 Turbo | Multilingual | A fine-tuned version of a pruned Whisper Large V3 designed for fast, multilingual transcription tasks. |
| `distil-whisper-large-v3-en` | Distil-Whisper English | English-only | A distilled, or compressed, version of OpenAI's Whisper model, designed to provide faster, lower-cost English speech recognition while maintaining comparable accuracy. |
| `whisper-large-v3` | Whisper Large V3 | Multilingual | Provides state-of-the-art performance with high accuracy for multilingual transcription and translation tasks. |
Which Whisper Model Should You Use?
Having more choices is great, but let's try to avoid decision paralysis by breaking down the tradeoffs between models to find the one most suitable for your applications:
- If your application is error-sensitive and requires multilingual support, use `whisper-large-v3`.
- If your application is less sensitive to errors and requires English only, use `distil-whisper-large-v3-en`.
- If your application requires multilingual support and you need the best price for performance, use `whisper-large-v3-turbo`.
The following table breaks down the metrics for each model.
| Model | Cost Per Hour | Language Support | Transcription Support | Translation Support | Real-time Speed Factor | Word Error Rate |
|---|---|---|---|---|---|---|
| `whisper-large-v3` | $0.111 | Multilingual | Yes | Yes | 189 | 10.3% |
| `whisper-large-v3-turbo` | $0.04 | Multilingual | Yes | No | 216 | 12% |
| `distil-whisper-large-v3-en` | $0.02 | English only | Yes | No | 250 | 13% |
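As a rough worked example: if the real-time speed factor is the ratio of audio duration to processing time, a factor of 216 means one hour of audio (3,600 seconds) is transcribed in about 3600 / 216 ≈ 17 seconds.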
Working with Audio Files
Audio Preprocessing
Our speech-to-text models will downsample audio to 16 kHz mono before transcribing, which is optimal for speech recognition. You can perform this preprocessing client-side if your original file is extremely large and you want to make it smaller without a loss in quality (without chunking, Groq API speech endpoints accept files up to 25 MB). We recommend FLAC for lossless compression.
The following `ffmpeg` command can be used to reduce file size:
```bash
ffmpeg \
  -i <your file> \
  -ar 16000 \
  -ac 1 \
  -map 0:a \
  -c:a flac \
  <output file name>.flac
```
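In this command, `-i` names the input file, `-ar 16000` resamples to 16 kHz, `-ac 1` downmixes to mono, `-map 0:a` keeps only the audio streams, and `-c:a flac` encodes the output with the lossless FLAC codec.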
Working with Larger Audio Files
For audio files that exceed our size limits or require more precise control over transcription, we recommend implementing audio chunking. This process involves:
- Breaking the audio into smaller, overlapping segments
- Processing each segment independently
- Combining the results while handling the overlaps (see the sketch below)
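As a minimal sketch of the splitting step, here is one way to produce overlapping chunks with the pydub library; the chunk length and overlap below are assumptions you should tune for your audio:

```python
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

CHUNK_MS = 10 * 60 * 1000  # 10-minute chunks (assumption; tune for your use case)
OVERLAP_MS = 10 * 1000     # 10 seconds of overlap between adjacent chunks

def chunk_audio(path: str) -> list[AudioSegment]:
    """Split an audio file into overlapping segments."""
    audio = AudioSegment.from_file(path)  # pydub slices are indexed in milliseconds
    chunks, start = [], 0
    while start < len(audio):
        end = min(start + CHUNK_MS, len(audio))
        chunks.append(audio[start:end])
        if end == len(audio):
            break
        start = end - OVERLAP_MS  # step back so neighboring chunks overlap
    return chunks

# Export each chunk as FLAC before sending it to the transcription endpoint
for i, chunk in enumerate(chunk_audio("sample_audio.m4a")):
    chunk.export(f"chunk_{i}.flac", format="flac")
```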
Using the API
The following are optional request parameters you can use in your transcription and translation requests:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | None | Provide context or specify how to spell unfamiliar words (limited to 224 tokens). |
| `response_format` | string | json | Define the output response format. Set to `verbose_json` to receive timestamps for audio segments. Set to `text` to return a text response. |
| `temperature` | float | None | Specify a value between 0 and 1 to control the randomness of the output. |
| `language` | string | None | `whisper-large-v3-turbo` and `whisper-large-v3` only! Specify the language for transcription. Use ISO 639-1 language codes (e.g. "en" for English, "fr" for French). We highly recommend setting the language if you know it, as specifying a language may improve transcription accuracy and speed. |
Example Usage of Transcription Endpoint
The transcription endpoint allows you to transcribe spoken words in audio or video files.
The Groq SDK package can be installed using the following command:
```shell
pip install groq
```
The following code snippet demonstrates how to use Groq API to transcribe an audio file in Python:
```python
import os
from groq import Groq

# Initialize the Groq client
client = Groq()

# Specify the path to the audio file
filename = os.path.dirname(__file__) + "/sample_audio.m4a"  # Replace with your audio file!

# Open the audio file
with open(filename, "rb") as file:
    # Create a transcription of the audio file
    transcription = client.audio.transcriptions.create(
        file=(filename, file.read()),  # Required audio file
        model="whisper-large-v3-turbo",  # Required model to use for transcription
        prompt="Specify context or spelling",  # Optional
        response_format="json",  # Optional
        language="en",  # Optional
        temperature=0.0  # Optional
    )
    # Print the transcription text
    print(transcription.text)
```
Example Usage of Translation Endpoint
The translation endpoint allows you to translate spoken words in audio or video files to English.
The Groq SDK package can be installed using the following command:
```shell
pip install groq
```
The following code snippet demonstrates how to use Groq API to translate an audio file in Python:
```python
import os
from groq import Groq

# Initialize the Groq client
client = Groq()

# Specify the path to the audio file
filename = os.path.dirname(__file__) + "/sample_audio.m4a"  # Replace with your audio file!

# Open the audio file
with open(filename, "rb") as file:
    # Create a translation of the audio file
    translation = client.audio.translations.create(
        file=(filename, file.read()),  # Required audio file
        model="whisper-large-v3",  # Required model to use for translation
        prompt="Specify context or spelling",  # Optional
        response_format="json",  # Optional
        temperature=0.0  # Optional
    )
    # Print the translation text
    print(translation.text)
```
Understanding Metadata Fields
When working with Groq API, setting `response_format` to `verbose_json` outputs each segment of transcribed text with valuable metadata that helps us understand the quality and characteristics of our transcription, including `avg_logprob`, `compression_ratio`, and `no_speech_prob`.
This information can help us debug transcription issues. Let's examine what this metadata tells us using a real example:
```json
{
  "id": 8,
  "seek": 3000,
  "start": 43.92,
  "end": 50.16,
  "text": " document that the functional specification that you started to read through that isn't just the",
  "tokens": [51061, 4166, 300, 264, 11745, 31256],
  "temperature": 0,
  "avg_logprob": -0.097569615,
  "compression_ratio": 1.6637554,
  "no_speech_prob": 0.012814695
}
```
As shown in the above example, we receive timing information as well as quality indicators. Let's gain a better understanding of what each field means:
- `id: 8`: the 9th segment in the transcription (counting begins at 0)
- `seek`: indicates where in the audio file this segment begins (3000 in this case)
- `start` and `end` timestamps: tell us exactly when this segment occurs in the audio (43.92 to 50.16 seconds in our example)
- `avg_logprob` (Average Log Probability): -0.097569615 in our example indicates very high confidence. Values closer to 0 suggest better confidence, while more negative values (like -0.5 or lower) might indicate transcription issues.
- `no_speech_prob` (No Speech Probability): 0.012814695 is very low, suggesting this is definitely speech. Higher values (closer to 1) would indicate potential silence or non-speech audio.
- `compression_ratio`: 1.6637554 is a healthy value, indicating normal speech patterns. Unusual values (very high or low) might suggest issues with speech clarity or word boundaries.
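To inspect these fields programmatically, request `verbose_json` and iterate over the returned segments. A minimal sketch, assuming the response exposes a `segments` list of dictionaries as in the OpenAI-compatible verbose format:

```python
import os
from groq import Groq

client = Groq()
filename = os.path.dirname(__file__) + "/sample_audio.m4a"

with open(filename, "rb") as file:
    transcription = client.audio.transcriptions.create(
        file=(filename, file.read()),
        model="whisper-large-v3-turbo",
        response_format="verbose_json",  # include per-segment metadata
    )

# Print timing and confidence metadata for each segment
for segment in transcription.segments:
    print(segment["start"], segment["end"], segment["avg_logprob"])
```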
Using Metadata for Debugging
When troubleshooting transcription issues, look for these patterns:
- Low Confidence Sections: If `avg_logprob` drops significantly (becomes more negative), check for background noise, multiple speakers talking simultaneously, unclear pronunciation, and strong accents. Consider cleaning up the audio in these sections or adjusting chunk sizes around problematic chunk boundaries.
- Non-Speech Detection: High `no_speech_prob` values might indicate silence periods that could be trimmed, background music or noise, or non-verbal sounds being misinterpreted as speech. Consider noise reduction when preprocessing.
- Unusual Speech Patterns: Unexpected `compression_ratio` values can reveal stuttering or word repetition, a speaker talking unusually fast or slow, or audio quality issues affecting word separation.
Quality Thresholds and Regular Monitoring
We recommend setting acceptable ranges for each of the metadata values reviewed above and flagging segments that fall outside those ranges, so you can identify where to adjust your preprocessing or chunking strategy; a sketch of this check follows.
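The acceptable ranges below are assumptions; tune them against your own audio:

```python
# Acceptable ranges are assumptions; tune them for your own audio
THRESHOLDS = {
    "avg_logprob": lambda v: v > -0.5,             # flag low-confidence segments
    "no_speech_prob": lambda v: v < 0.5,           # flag likely non-speech
    "compression_ratio": lambda v: 1.0 < v < 2.4,  # flag unusual speech patterns
}

def flag_segments(segments: list[dict]) -> list[dict]:
    """Return segments whose metadata falls outside the acceptable ranges."""
    flagged = []
    for seg in segments:
        failed = [name for name, ok in THRESHOLDS.items() if not ok(seg[name])]
        if failed:
            flagged.append({"id": seg["id"], "failed_checks": failed})
    return flagged
```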
By understanding and monitoring these metadata values, you can significantly improve your transcription quality and quickly identify potential issues in your audio processing pipeline.
Prompting Guidelines
The prompt parameter (max 224 tokens) helps provide context and maintain a consistent output style. Unlike chat completion prompts, these prompts only guide style and context, not specific actions.
- Provide relevant context about the audio content, such as the type of conversation, topic, or speakers involved.
- Write the prompt in the same language as the audio file.
- Steer the model's output by denoting proper spellings or by emulating a specific writing style or tone.
- Keep the prompt concise and focused on stylistic guidance.
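For example, a hypothetical prompt that supplies context and pins down spellings might look like this (reusing the `client` and `filename` from the transcription example above):

```python
# Hypothetical prompt combining conversational context with spelling guidance
with open(filename, "rb") as file:
    transcription = client.audio.transcriptions.create(
        file=(filename, file.read()),
        model="whisper-large-v3-turbo",
        prompt=(
            "A technical discussion about the Groq API and Whisper models. "
            "Spell product names as: Groq, Whisper, FLAC."
        ),
    )
```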
We can't wait to see what you build! 🚀