whisper-large-v3

Whisper Large v3 is OpenAI's most capable speech recognition model, delivering state-of-the-art accuracy across a wide range of audio conditions and languages. It handles challenging audio scenarios, including background noise, strong accents, and technical terminology, and is the model to choose when transcription accuracy matters most.

Key Technical Specifications

Model Architecture

Built on OpenAI's transformer-based encoder-decoder architecture with 1550M parameters: the encoder consumes log-mel spectrogram features and the decoder emits text tokens. Extensive training on diverse multilingual audio gives the model strong robustness to noise and to varied audio qualities and recording conditions.

Performance Metrics

Whisper Large v3 sets the benchmark for speech recognition accuracy:
  • Short-form transcription: 8.4% WER (industry-leading accuracy)
  • Sequential long-form: 10.0% WER
  • Chunked long-form: 11.0% WER
  • Multilingual support: 99+ languages
  • Model size: 1550M parameters
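Word error rate (WER) counts word-level substitutions, insertions, and deletions against a reference transcript, divided by the reference length. A minimal sketch of how such a score is computed (illustrative only, not the harness used for the benchmarks above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over word sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(wer("the quick brown fox", "the quick brown box"))  # 1 edit / 4 words = 0.25
```

An 8.4% short-form WER means roughly one word in twelve differs from the reference transcript.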

Key Model Details

  • Model Size: 1550M parameters
  • Speed: 189x real-time speed factor (roughly 189 seconds of audio processed per second of wall-clock time)
  • Audio Context: Optimized for 30-second audio segments, with a minimum of 10 seconds per segment
  • Supported Audio: FLAC, MP3, M4A, MPEG, MPGA, OGG, WAV, or WEBM
  • Language: 99+ languages supported
  • Pricing: $0.111 per hour of audio processed
  • Usage: Groq Speech to Text Documentation
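The pricing and speed figures above allow a back-of-the-envelope estimate of cost and turnaround time for a given recording. A small sketch (the rate and speed factor come from the table above; the helper name is illustrative):

```python
PRICE_PER_HOUR = 0.111  # USD per hour of audio, from the pricing line above
SPEED_FACTOR = 189      # audio seconds processed per wall-clock second

def estimate(audio_seconds: float) -> tuple[float, float]:
    """Return (estimated cost in USD, estimated processing time in seconds)."""
    cost = audio_seconds / 3600 * PRICE_PER_HOUR
    processing_time = audio_seconds / SPEED_FACTOR
    return cost, processing_time

cost, secs = estimate(2 * 3600)  # a 2-hour recording
print(f"~${cost:.3f}, ~{secs:.0f}s to transcribe")  # ~$0.222, ~38s to transcribe
```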

Use Cases

High-Accuracy Transcription
Perfect for applications where transcription accuracy is paramount:
  • Legal and medical transcription requiring precision
  • Academic research and interview transcription
  • Professional content creation and journalism

Multilingual Applications
Ideal for global applications requiring broad language support:
  • International conference and meeting transcription
  • Multilingual content processing and analysis
  • Global customer support and communication tools

Challenging Audio Conditions
Excellent for difficult audio scenarios:
  • Noisy environments and poor audio quality
  • Multiple speakers and overlapping speech
  • Technical terminology and specialized vocabulary

Best Practices

  • Prioritize accuracy: Use this model when transcription precision is more important than speed
  • Leverage multilingual capabilities: Take advantage of the model's extensive language support for global applications
  • Handle challenging audio: Rely on this model for difficult audio conditions where other models might struggle
  • Consider context length: For long-form audio, the model works optimally with 30-second segments
  • Use appropriate algorithms: Choose sequential long-form for maximum accuracy, chunked for better speed
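Chunked long-form transcription cuts the audio into fixed windows that can be processed independently, typically with a small overlap so words split at a boundary can be reconciled when the pieces are merged. A sketch of how 30-second windows might be laid out (the overlap value and function name are illustrative, not Groq's internals):

```python
def chunk_spans(duration_s: float, window_s: float = 30.0, overlap_s: float = 5.0):
    """Yield (start, end) spans covering the audio in fixed overlapping windows."""
    stride = window_s - overlap_s
    start = 0.0
    while start < duration_s:
        # Clamp the final window to the end of the audio.
        yield (start, min(start + window_s, duration_s))
        start += stride

print(list(chunk_spans(70.0)))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Sequential long-form instead walks the same windows in order, carrying context from one segment into the next, which is why it tends to be more accurate but slower than the parallel chunked approach.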
