Speech
Groq API is the fastest speech-to-text solution available, offering OpenAI-compatible endpoints that enable real-time transcriptions and translations. With Groq API, you can integrate high-quality audio processing into your applications at speeds that rival human interaction.
API Endpoints
We support two endpoints:
Endpoint | Usage | API Endpoint |
---|---|---|
Transcriptions | Convert audio to text | https://api.groq.com/openai/v1/audio/transcriptions |
Translations | Translate audio to English text | https://api.groq.com/openai/v1/audio/translations |
Supported Models
Model ID | Model | Supported Language(s) | Description |
---|---|---|---|
whisper-large-v3-turbo | Whisper Large V3 Turbo | Multilingual | A fine-tuned version of a pruned Whisper Large V3 designed for fast, multilingual transcription tasks. |
distil-whisper-large-v3-en | Distil-Whisper English | English-only | A distilled, or compressed, version of OpenAI's Whisper model, designed to provide faster, lower cost English speech recognition while maintaining comparable accuracy. |
whisper-large-v3 | Whisper large-v3 | Multilingual | Provides state-of-the-art performance with high accuracy for multilingual transcription and translation tasks. |
Which Whisper Model Should You Use?
Having more choices is great, but let's try to avoid decision paralysis by breaking down the tradeoffs between models to find the one most suitable for your applications:
- If your application is error-sensitive and requires multilingual support, use whisper-large-v3.
- If your application is less sensitive to errors and requires English only, use distil-whisper-large-v3-en.
- If your application requires multilingual support and you need the best price for performance, use whisper-large-v3-turbo.
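The decision logic above can be sketched as a small helper. This is just a sketch: the model IDs come from the table above, while the function name and boolean flags are hypothetical.

```python
def choose_whisper_model(multilingual: bool, error_sensitive: bool) -> str:
    """Pick a Groq Whisper model ID using the tradeoffs described above."""
    if not multilingual:
        # English-only workloads: the distilled model is the cheapest and fastest.
        return "distil-whisper-large-v3-en"
    if error_sensitive:
        # Multilingual and accuracy-critical: use full Whisper Large V3.
        return "whisper-large-v3"
    # Multilingual with the best price for performance.
    return "whisper-large-v3-turbo"
```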
The following table breaks down the metrics for each model.
Model | Cost Per Hour | Language Support | Transcription Support | Translation Support | Real-time Speed Factor | Word Error Rate |
---|---|---|---|---|---|---|
whisper-large-v3 | $0.111 | Multilingual | Yes | Yes | 189 | 10.3% |
whisper-large-v3-turbo | $0.04 | Multilingual | Yes | No | 216 | 12% |
distil-whisper-large-v3-en | $0.02 | English only | Yes | No | 250 | 13% |
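Using the Cost Per Hour column, you can estimate the cost of a job before submitting it. A minimal sketch (the rates are copied from the table above; the helper itself is hypothetical):

```python
# Per-hour rates from the metrics table above (USD).
COST_PER_HOUR = {
    "whisper-large-v3": 0.111,
    "whisper-large-v3-turbo": 0.04,
    "distil-whisper-large-v3-en": 0.02,
}

def estimate_cost(model_id: str, audio_seconds: float) -> float:
    """Estimate the cost in USD of processing audio_seconds of audio."""
    return COST_PER_HOUR[model_id] * audio_seconds / 3600
```

For example, a 30-minute file on whisper-large-v3-turbo costs about $0.02.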
Audio File Limitations
Preprocessing Audio Files
Our speech-to-text models will downsample audio to 16,000 Hz mono before transcribing. This preprocessing can be performed client-side to reduce file size and allow longer files to be uploaded to Groq.
The following ffmpeg command can be used to downsample an audio file and reduce its size:
ffmpeg \
-i <your file> \
-ar 16000 \
-ac 1 \
-map 0:a \
<output file name>
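If you preprocess files programmatically, the same invocation can be built and run from Python. This is a sketch assuming ffmpeg is installed and on your PATH; the function names are hypothetical.

```python
import subprocess

def build_downsample_cmd(input_path: str, output_path: str) -> list[str]:
    """Build the ffmpeg command that downsamples audio to 16,000 Hz mono."""
    return [
        "ffmpeg",
        "-i", input_path,   # input file
        "-ar", "16000",     # resample to 16,000 Hz
        "-ac", "1",         # mix down to a single (mono) channel
        "-map", "0:a",      # keep only the audio streams
        output_path,
    ]

def downsample(input_path: str, output_path: str) -> None:
    """Run ffmpeg, raising CalledProcessError on failure."""
    subprocess.run(build_downsample_cmd(input_path, output_path), check=True)
```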
Transcription Endpoint Usage
The transcription endpoint allows you to transcribe spoken words in audio or video files. You can provide optional request parameters to customize the transcription output.
Optional Request Parameters
Parameter | Type | Default | Description |
---|---|---|---|
prompt | string | None | Provide context or specify how to spell unfamiliar words (limited to 224 tokens). |
response_format | string | json | Define the output response format. Set to verbose_json to receive timestamps for audio segments. Set to text to return a text response. |
temperature | float | None | Specify a value between 0 and 1 to control the randomness of the transcription output. |
language | string | None | whisper-large-v3-turbo and whisper-large-v3 only! Specify the language for transcription. Use ISO 639-1 language codes (e.g. "en" for English, "fr" for French, etc.). Specifying a language may improve transcription accuracy and speed. |
Example Usage
The Groq SDK package can be installed using the following command:
pip install groq
The following code snippet demonstrates how to use Groq API to transcribe an audio file in Python:
import os
from groq import Groq

# Initialize the Groq client
client = Groq()

# Specify the path to the audio file
filename = os.path.dirname(__file__) + "/sample_audio.m4a" # Replace with your audio file!

# Open the audio file
with open(filename, "rb") as file:
    # Create a transcription of the audio file
    transcription = client.audio.transcriptions.create(
        file=(filename, file.read()), # Required audio file
        model="whisper-large-v3-turbo", # Required model to use for transcription
        prompt="Specify context or spelling", # Optional
        response_format="json", # Optional
        language="en", # Optional
        temperature=0.0 # Optional
    )
    # Print the transcription text
    print(transcription.text)
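When you request response_format="verbose_json", the response also carries per-segment timestamps. A small helper for rendering them, sketched here against the start/end/text fields of the standard Whisper verbose_json segment shape (the helper name is hypothetical):

```python
def format_segments(segments: list[dict]) -> str:
    """Render verbose_json segments as '[start-end] text' lines."""
    lines = []
    for seg in segments:
        # Each segment carries start/end times in seconds and the transcribed text.
        lines.append(f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text'].strip()}")
    return "\n".join(lines)
```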
Translation Endpoint Usage
The translation endpoint allows you to translate spoken words in audio or video files to English. You can provide optional request parameters to customize the translation output.
Parameter | Type | Default | Description |
---|---|---|---|
prompt | string | None | Provide context or specify how to spell unfamiliar words (limited to 224 tokens). |
response_format | string | json | Define the output response format. Set to verbose_json to receive timestamps for audio segments. Set to text to return a text response. |
temperature | float | None | Specify a value between 0 and 1 to control the translation output. |
Example Usage
The Groq SDK package can be installed using the following command:
pip install groq
The following code snippet demonstrates how to use Groq API to translate an audio file in Python:
import os
from groq import Groq

# Initialize the Groq client
client = Groq()

# Specify the path to the audio file
filename = os.path.dirname(__file__) + "/sample_audio.m4a" # Replace with your audio file!

# Open the audio file
with open(filename, "rb") as file:
    # Create a translation of the audio file
    translation = client.audio.translations.create(
        file=(filename, file.read()), # Required audio file
        model="whisper-large-v3", # Required model to use for translation
        prompt="Specify context or spelling", # Optional
        response_format="json", # Optional
        temperature=0.0 # Optional
    )
    # Print the translation text
    print(translation.text)
Prompting Guidelines
The prompt parameter is an optional input of max 224 tokens that allows you to provide contextual information to the model, helping it maintain a consistent writing style.
When you provide a prompt parameter, the speech-to-text model treats it as a prior transcript and follows its style, rather than adhering to the actual content of the audio segment. This means that the model will not:
- Attempt to execute commands contained within the prompt
- Follow instructions present in the prompt
In contrast to chat completion prompts, the prompt parameter is designed solely to provide stylistic guidance and contextual information to the model, rather than to trigger specific actions or responses. For best results:
- Provide contextual information about the audio segment, such as the type of conversation, topic, or speakers involved.
- Use the same language as the audio file.
- Steer the model's output by denoting proper spellings, or use the prompt to emulate a specific writing style or tone.
- Keep the prompt concise and focused on stylistic guidance.
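Because the prompt is capped at 224 tokens, a rough pre-check can catch oversized prompts before you send a request. A sketch using the common approximation of about 4 characters per token; this is a heuristic, not the model's real tokenizer, and the function name is hypothetical:

```python
def fits_prompt_budget(prompt: str, max_tokens: int = 224,
                       chars_per_token: float = 4.0) -> bool:
    """Roughly check that a prompt stays within the 224-token limit.

    Uses an approximate characters-per-token ratio; for an exact count,
    run the text through a real tokenizer.
    """
    return len(prompt) / chars_per_token <= max_tokens
```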
Use Cases
Groq API offers low latency and fast inference for speech recognition, transcription, and translation, enabling developers to build a wide range of highly accurate, real-time applications, such as:
- Audio Translation: Translate audio files to break language barriers and facilitate global communication.
- Customer Service: Create real-time, AI-powered customer service solutions that use speech recognition to route calls, transcribe conversations, and respond to customer inquiries.
- Automated Speech-to-Text Systems: Implement automated speech-to-text systems in industries like healthcare, finance, and education, where accurate transcription is critical for compliance, record-keeping, and decision-making.
- Voice-Controlled Interfaces: Develop voice-controlled interfaces for smart homes, cars, and other devices, where fast and accurate speech recognition is essential for user experience and safety.
We can't wait to see what you build! 🚀