How to Benchmark Speech-to-Text for Your App: A Practical Checklist
Vendor benchmarks are marketing. They're run on clean audio, under favorable conditions, with carefully selected phrases. To know how an STT provider will actually perform for your application, you need to run your own benchmarks.
This checklist walks you through the process.
Before you start
- Define what "good" means for your use case
Not all errors are equal. Define:
- What accuracy level is acceptable? (95%? 99%?)
- Which words must never be wrong? (Names? Numbers? Commands?)
- What latency is acceptable?
- What failure modes are tolerable vs. unacceptable?
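One way to make these targets concrete is to write them down as a machine-readable spec your benchmark script can check against. A minimal sketch; every threshold and field name below is a placeholder, not a recommendation:

```python
# Hypothetical acceptance criteria for an STT benchmark.
# All numbers are placeholders; set them from your own product requirements.
ACCEPTANCE_CRITERIA = {
    "max_overall_wer": 0.10,                # 90% word accuracy or better
    "max_critical_term_error_rate": 0.02,   # names, numbers, commands
    "max_latency_p95_seconds": 1.5,         # 95th percentile end-to-end latency
    "max_monthly_cost_usd": 2000,
}

def meets_criteria(results: dict) -> bool:
    """Return True if measured results satisfy every threshold."""
    return (
        results["overall_wer"] <= ACCEPTANCE_CRITERIA["max_overall_wer"]
        and results["critical_term_error_rate"] <= ACCEPTANCE_CRITERIA["max_critical_term_error_rate"]
        and results["latency_p95_seconds"] <= ACCEPTANCE_CRITERIA["max_latency_p95_seconds"]
        and results["monthly_cost_usd"] <= ACCEPTANCE_CRITERIA["max_monthly_cost_usd"]
    )
```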
- Identify your evaluation dimensions
Common dimensions:
- Overall accuracy, measured as word error rate (WER); see our WER explainer for what it means and its limitations
- Latency (time to first word, total transcription time)
- Handling of specific vocabulary
- Performance across speaker types
- Behavior under different audio conditions
- Cost at your volume
- Shortlist providers to test
Don't test everyone. Narrow the field to 3-5 providers based on your requirements.
Our STT API comparison covers the major players and what to look for.
Building your test corpus
- Collect representative audio samples
Your test data should match real usage:
- Source: Actual recordings from your application or very similar contexts
- Speakers: Diverse accents, ages, and speech patterns matching your users
- Conditions: Realistic noise levels, recording equipment, network conditions
- Content: Representative vocabulary, including edge cases
- Aim for sufficient quantity
Recommendations:
- Minimum: 30-50 audio samples
- Better: 100+ samples
- For production readiness: 500+ samples across segments
Research comparing STT services shows that vendor benchmarks often don't reflect real-world performance—your own testing is essential.
- Create ground truth transcripts
For each sample, create a verified correct transcript:
- Use human transcription (not another STT system)
- Include proper punctuation and capitalization
- Standardize formatting (numbers, dates, abbreviations)
- Have a second person verify critical samples
- Organize samples by segment
Group your test data by factors you want to analyze:
- Speaker accent/dialect
- Audio quality/noise level
- Content type
- Domain vocabulary density
For standard reference datasets, see our guide on LibriSpeech, TED-LIUM, and CHiME.
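A simple way to keep the verified transcripts and segment labels from the last two steps together is a manifest file that every later step reads from. One possible layout, sketched below; the file paths, transcripts, and segment labels are invented for illustration:

```python
import json

# Hypothetical test manifest: one entry per audio sample, with its verified
# human transcript and the segment labels you want to break results down by.
manifest = [
    {
        "audio_path": "samples/call_0001.wav",
        "reference": "I'd like to check the balance on account 4417.",
        "segments": {"accent": "us_south", "noise": "quiet", "content": "banking"},
    },
    {
        "audio_path": "samples/call_0002.wav",
        "reference": "Transfer two hundred dollars to savings, please.",
        "segments": {"accent": "indian_english", "noise": "street", "content": "banking"},
    },
]

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```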
Running the benchmark
- Establish consistent test conditions
Control variables that could affect results:
- Same audio files to all providers
- Same time window for all providers, so results are comparable
- Consistent API settings (language, model tier, etc.)
- Document any provider-specific configuration
- Test multiple configurations
For each provider, test:
- Default settings
- Optimized settings for your use case
- With custom vocabulary (if supported)
- Different model tiers (if available)
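In practice this means holding everything constant except one named configuration at a time. A sketch of how the runs might be enumerated; the provider names, setting keys, and the transcribe_file wrapper are all placeholders for whichever SDKs or HTTP clients you actually use:

```python
# Hypothetical configuration matrix: every provider gets the same audio files,
# and each configuration variant is documented explicitly so runs are reproducible.
CONFIGURATIONS = {
    "provider_a": [
        {"name": "default", "settings": {"language": "en-US", "model": "standard"}},
        {"name": "tuned", "settings": {"language": "en-US", "model": "premium",
                                       "custom_vocabulary": ["account", "routing number"]}},
    ],
    "provider_b": [
        {"name": "default", "settings": {"language": "en-US"}},
    ],
}

def run_benchmark(audio_paths, transcribe_file):
    """transcribe_file(provider, path, settings) is whatever wrapper you write
    around each vendor's API; it should return the transcript text."""
    results = []
    for provider, variants in CONFIGURATIONS.items():
        for variant in variants:
            for path in audio_paths:
                text = transcribe_file(provider, path, variant["settings"])
                results.append({"provider": provider, "config": variant["name"],
                                "audio": path, "hypothesis": text})
    return results
```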
- Measure latency properly
Latency benchmarks should capture:
- Time to first transcribed word
- End-to-end processing time
- Consistency (measure variance, not just average)
- Behavior under load
See our comparison of real-time vs. batch architectures for what matters when.
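A minimal sketch of capturing both the latency numbers and their spread, assuming a hypothetical blocking transcribe() wrapper around whichever API you are testing; for streaming APIs you would additionally record the timestamp of the first partial result:

```python
import statistics
import time

def measure_latency(transcribe, audio_path, runs=10):
    """Run the same sample several times and report spread, not just the mean.
    `transcribe` is a placeholder for your own wrapper around the vendor API."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio_path)          # blocking call; returns the transcript
        durations.append(time.perf_counter() - start)
    durations.sort()
    return {
        "mean_s": statistics.mean(durations),
        "stdev_s": statistics.stdev(durations),
        "p95_s": durations[int(0.95 * (len(durations) - 1))],  # approximate p95
    }
```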
- Run multiple iterations
Single runs are noisy. For reliable results:
- Run each sample multiple times
- Test at different times of day
- Test under varying network conditions
- Average results and note variance
Analyzing results
- Calculate WER correctly
Word Error Rate = (Substitutions + Insertions + Deletions) / Words in the reference transcript
Use consistent text normalization:
- Case handling
- Punctuation handling
- Number format
- Contraction expansion
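A self-contained sketch of the calculation with basic normalization applied; libraries such as jiwer do the same thing, but the pure-Python version makes the edit-distance accounting explicit. The normalization rules here are simplified examples, not a complete policy:

```python
import re

def normalize(text: str) -> list[str]:
    """Apply identical normalization to reference and hypothesis:
    lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)   # drop punctuation, keep apostrophes
    return text.split()

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words,
    computed with a standard edit-distance dynamic program."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("Transfer two hundred dollars", "transfer to hundred dollars"))  # 0.25
```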
- Segment your analysis
Don't just look at overall WER. Break down by:
- Speaker demographics
- Audio quality
- Content type
- Specific vocabulary
Patterns in segment performance reveal important limitations. Our piece on bias in speech recognition covers what disparities to watch for.
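With a manifest like the one sketched earlier, the breakdown is a group-by over per-sample results; the field names below follow that hypothetical manifest format:

```python
from collections import defaultdict

def wer_by_segment(per_sample_results, segment_key):
    """per_sample_results: list of dicts with a 'wer' value and a 'segments' mapping,
    e.g. {'wer': 0.12, 'segments': {'accent': 'us_south', 'noise': 'street'}}."""
    buckets = defaultdict(list)
    for r in per_sample_results:
        buckets[r["segments"][segment_key]].append(r["wer"])
    return {label: sum(vals) / len(vals) for label, vals in buckets.items()}

# Usage: wer_by_segment(results, "accent") returns average WER per accent group.
```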
- Analyze error types
Different error types call for different fixes. Categorize:
- Substitutions: What words get confused?
- Deletions: What gets missed?
- Insertions: What gets hallucinated?
This reveals what can be fixed with custom vocabulary vs. what's fundamental.
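One quick way to bucket errors without writing a full aligner is Python's difflib, which aligns the two word sequences and labels each differing span. Its alignment is an approximation of a true Levenshtein alignment, but it is enough to see which words get confused; a sketch:

```python
import difflib
from collections import Counter

def error_breakdown(reference: str, hypothesis: str) -> Counter:
    """Count substitutions, deletions, and insertions by aligning word sequences.
    Apply the same normalization here as in your WER calculation."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    counts = Counter()
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, ref, hyp).get_opcodes():
        if op == "replace":
            subs = min(i2 - i1, j2 - j1)
            counts["substitutions"] += subs
            counts["deletions"] += (i2 - i1) - subs
            counts["insertions"] += (j2 - j1) - subs
        elif op == "delete":
            counts["deletions"] += i2 - i1
        elif op == "insert":
            counts["insertions"] += j2 - j1
    return counts

# Reports 2 substitutions and 2 insertions for this pair:
print(error_breakdown("pay the balance on account 4417",
                      "pay the balance on a count forty seventeen"))
```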
- Factor in cost
Calculate total cost at your expected volume:
- Per-minute/character pricing
- Volume discounts
- Costs for additional features (diarization, punctuation, etc.)
- Infrastructure costs for on-premise
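The arithmetic is straightforward once you have an estimate of monthly audio volume; a sketch with entirely made-up rates (check each vendor's current pricing page):

```python
# All prices below are invented placeholders, not real vendor rates.
PRICING = {
    "provider_a": {"per_minute_usd": 0.024, "diarization_per_minute_usd": 0.004},
    "provider_b": {"per_minute_usd": 0.016, "diarization_per_minute_usd": 0.000},
}

def monthly_cost(provider: str, minutes_per_month: float, use_diarization: bool = False) -> float:
    p = PRICING[provider]
    rate = p["per_minute_usd"] + (p["diarization_per_minute_usd"] if use_diarization else 0.0)
    return minutes_per_month * rate

# e.g. 200,000 minutes/month with diarization:
print(monthly_cost("provider_a", 200_000, use_diarization=True))  # 5600.0
```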
Making a decision
- Create a comparison matrix
| Criterion | Weight | Provider A | Provider B | Provider C |
|---|---|---|---|---|
| Overall WER | 25% | | | |
| Critical vocab accuracy | 20% | | | |
| Latency | 15% | | | |
| Cost | 15% | | | |
| Segment consistency | 15% | | | |
| Integration ease | 10% | | | |
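Once every cell is scored (say, 1-5), the matrix reduces to a weighted sum. A sketch using the weights from the table above, with the provider ratings left as placeholders:

```python
WEIGHTS = {
    "overall_wer": 0.25,
    "critical_vocab_accuracy": 0.20,
    "latency": 0.15,
    "cost": 0.15,
    "segment_consistency": 0.15,
    "integration_ease": 0.10,
}

def weighted_score(scores: dict) -> float:
    """scores: criterion -> 1-5 rating for one provider (higher is better).
    Remember to invert 'lower is better' metrics like WER before scoring."""
    return sum(WEIGHTS[criterion] * scores[criterion] for criterion in WEIGHTS)

# Placeholder ratings for one provider:
provider_a = {"overall_wer": 4, "critical_vocab_accuracy": 3, "latency": 5,
              "cost": 2, "segment_consistency": 4, "integration_ease": 4}
print(round(weighted_score(provider_a), 2))  # 3.65
```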
- Document limitations discovered
Every provider will have weaknesses. Document them clearly so they don't surprise you later.
- Plan for ongoing monitoring
Benchmark results are a snapshot. Plan to:
- Monitor production accuracy over time
- Re-benchmark periodically (providers improve)
- Track user-reported issues
Our requirements template can help structure your decision criteria.