August 20, 2025
When vendors cite accuracy benchmarks, they are usually reporting results on standard academic datasets. Knowing what these datasets contain, and what they leave out, helps you interpret benchmark claims and judge how relevant they are to your use case.
STT accuracy is always relative to the test data. A model that scores 3% word error rate (WER) on one dataset might score 15% on another. The model hasn't changed; the audio being tested has.
Key questions for any benchmark:
For a deeper dive into what WER actually measures, see our WER explainer.
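As a rough illustration of why the score depends so heavily on the test data, here is a minimal sketch of how WER is commonly computed: word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The function names `word_errors` and `wer` are hypothetical, and the normalization (lowercasing and whitespace splitting) is an assumption; published benchmarks usually apply more elaborate text normalization, which also affects reported numbers.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words).

    Assumption: lowercasing and whitespace splitting are the only
    normalization applied. Real evaluations typically do more.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Dynamic-programming Levenshtein distance over words,
    # keeping only the previous row to save memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1], len(ref)


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    errors, ref_len = word_errors(reference, hypothesis)
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return errors / ref_len


perfect = wer("the cat sat on the mat", "the cat sat on the mat")
noisy = wer("the cat sat on the mat", "the cat sit on mat")
```

In this toy example the same reference yields very different scores depending on the transcript it is paired with, which is the same effect that shows up when one model is scored on clean audiobooks versus noisy recordings.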
LibriSpeech is derived from LibriVox audiobook recordings: roughly 1,000 hours of read English speech from public domain audiobooks. The corpus is freely available for research.
LibriSpeech WER indicates performance on clear, carefully articulated English read aloud. It's a useful baseline but represents ideal conditions.
If a provider scores 2% WER on LibriSpeech:
TED-LIUM consists of transcribed TED talks. Multiple versions exist (v1, v2, v3) with increasing size and coverage. It is available from OpenSLR.
TED-LIUM performance indicates handling of non-native accents and varied vocabulary, but still with prepared, clear speech.
Better than LibriSpeech for:
Still not representative of:
The CHiME Challenges are a series of evaluations focused on speech recognition in noisy environments. Multiple editions exist, with increasing difficulty.
CHiME performance indicates robustness to noise and challenging acoustic conditions. This is where the gap between benchmark and reality shrinks. We explore what CHiME benchmarks teach us in detail separately.
Good CHiME performance suggests:
Caveats:
Crowdsourced recordings from volunteers worldwide. Massive language coverage but variable recording quality. Good for language diversity, not for quality assessment.
Multi-speaker dataset with varied British accents. Useful for accent testing within English.
Telephone conversations. Spontaneous speech but dated recording quality. Relevant for call center applications.
Read news articles. Clean audio, formal language. Similar to LibriSpeech in character.
"How did you perform on [X] dataset?" is less useful than "Can you test on audio samples from my application?"
Benchmark scores are directionally useful but no substitute for testing with your actual content. See our benchmarking checklist for running your own evaluation.
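One way to structure such a test is to tag each of your own samples with the condition it represents and report WER per condition rather than a single blended number. The sketch below is a minimal illustration, not a complete evaluation harness: the function names, the condition labels, and the sample transcripts are all hypothetical, and it assumes you already have a reference transcript and a provider's output for each clip.

```python
from collections import defaultdict


def word_edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer_by_condition(samples):
    """Aggregate WER per condition.

    `samples` is an iterable of (condition, reference, hypothesis).
    Errors and reference words are summed per condition before dividing,
    so long utterances weigh more than short ones.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for condition, reference, hypothesis in samples:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[condition] += word_edit_distance(ref, hyp)
        words[condition] += len(ref)
    return {c: errors[c] / words[c] for c in words if words[c] > 0}


# Hypothetical samples: (condition, human reference, provider output).
samples = [
    ("quiet_office", "please schedule the meeting for monday",
     "please schedule the meeting for monday"),
    ("call_center", "i would like to cancel my order",
     "i would like to cancel order"),
    ("call_center", "can you repeat that", "can you repeat this"),
]
results = wer_by_condition(samples)
```

Summing errors and words before dividing is a design choice: it matches how corpus-level WER is usually reported, whereas averaging per-utterance WER would let very short clips dominate the result.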
For an overview of the major providers and how to compare them, see our STT API comparison guide.