Automatic Speech Recognition (ASR) is the technology that converts spoken audio into written text. Modern ASR systems use neural networks trained on large speech datasets, and their accuracy is commonly measured by Word Error Rate (WER).
What is Automatic Speech Recognition (ASR)?
Automatic Speech Recognition (ASR), also called speech-to-text, is the task of converting an audio recording of human speech into the corresponding written words. It powers dictation, voice assistants, call transcription, video captions, and the voice input layer of many AI applications.
Modern ASR systems are end-to-end neural networks that map an audio signal directly to text. They learn from large collections of audio paired with transcripts, so the same model can handle different speakers, accents, and background noise. OpenAI's Whisper, described in the 2022 paper 'Robust Speech Recognition via Large-Scale Weak Supervision', is a well-known example trained on about 680,000 hours of multilingual audio.
- Converts spoken audio into written text (speech-to-text).
- Modern systems are end-to-end neural networks trained on large speech corpora.
- Accuracy is most often reported as Word Error Rate (WER).
How modern speech-to-text works
A typical pipeline starts by turning the raw waveform into a compact time-frequency representation. The audio is resampled to a fixed rate, then converted into a log-Mel spectrogram, an image-like grid showing how energy across frequency bands changes over time. Whisper, for example, uses 80-channel log-Mel spectrograms computed from 16 kHz audio.
An encoder-decoder Transformer then reads the spectrogram. The encoder produces a sequence of acoustic features, and the decoder generates the transcript one token at a time, attending back to the encoder's features through cross-attention. Training on a very large, weakly supervised dataset (transcripts scraped from the web that are not perfectly clean) makes the model generalize across recording conditions and languages without per-dataset fine-tuning. Earlier ASR architectures used recurrent networks with a CTC loss, but Transformer encoder-decoder models are now common.
- Audio is converted to a log-Mel spectrogram before modeling.
- An encoder reads acoustic features; a decoder generates text with cross-attention.
- Large weakly supervised training data improves resilience to noise and accents.
How accuracy is measured: Word Error Rate
The standard ASR metric is Word Error Rate (WER). It compares the model's transcript to a reference transcript and counts the edits needed to turn one into the other: substitutions (wrong word), deletions (missing word), and insertions (extra word). The total edits are divided by the number of words in the reference, so lower WER is better and 0 means a perfect match.
WER can exceed 100% when there are many insertions, and it treats every word equally, so it does not capture meaning-level errors or punctuation quality on its own. For some languages or use cases, Character Error Rate (CER) is reported instead, using the same edit-distance idea at the character level.
- WER counts substitutions, deletions, and insertions against a reference.
- Lower is better; 0% means a perfect transcript.
- Character Error Rate (CER) applies the same idea per character for some languages.
Where ASR is used and its limits
ASR feeds dictation tools, meeting and call transcription, video subtitles, accessibility features, and the front end of voice agents that pair speech-to-text with a language model and text-to-speech. It is also a common first step in pipelines that summarize or search spoken content.
Accuracy still depends heavily on conditions. Heavy accents, overlapping speakers, domain-specific jargon, far-field microphones, and noisy environments all raise WER. Specialized vocabularies and proper nouns are frequent error sources, which is why domain adaptation, custom dictionaries, and careful evaluation matter in production deployments.
- Dictation, transcription, captions, accessibility, and voice agents.
- Accuracy drops with noise, accents, overlapping speech, and jargon.
- Domain adaptation and evaluation are important for production use.
Key takeaways
- ASR converts speech to text and underpins dictation, captioning, transcription, and voice assistants.
- Modern systems convert audio to log-Mel spectrograms and use encoder-decoder Transformers trained on large speech datasets.
- Word Error Rate (WER) measures accuracy by counting substitutions, deletions, and insertions against a reference; lower is better.
- Accuracy degrades with noise, accents, overlapping speakers, and specialized vocabulary, so domain adaptation matters.
Frequently asked questions
Related terms
Related reading
Sources
Put the idea into practice
MemX is an AI memory app built on these ideas: store anything, skip the folders, and find it again by asking in plain English.
Try MemX Free