Automatic Speech Recognition (ASR)

By Aditya Kumar Jha, Engineer

Automatic Speech Recognition (ASR) is the technology that converts spoken audio into written text. Modern ASR systems use neural networks trained on large speech datasets, and their accuracy is commonly measured by Word Error Rate (WER).

What is Automatic Speech Recognition (ASR)?

Automatic Speech Recognition (ASR), also called speech-to-text, is the task of converting an audio recording of human speech into the corresponding written words. It powers dictation, voice assistants, call transcription, video captions, and the voice input layer of many AI applications.

Most modern ASR systems are end-to-end neural networks that map an audio signal directly to text. They learn from large collections of audio paired with transcripts, which helps a single model cope with different speakers, accents, and background noise. OpenAI's Whisper, described in the 2022 paper 'Robust Speech Recognition via Large-Scale Weak Supervision', is a well-known example trained on about 680,000 hours of multilingual audio.

  • Converts spoken audio into written text (speech-to-text).
  • Modern systems are end-to-end neural networks trained on large speech corpora.
  • Accuracy is most often reported as Word Error Rate (WER).

How modern speech-to-text works

A typical pipeline starts by turning the raw waveform into a compact time-frequency representation. The audio is resampled to a fixed rate, then converted into a log-Mel spectrogram, an image-like grid showing how energy across frequency bands changes over time. Whisper, for example, uses 80-channel log-Mel spectrograms computed from 16 kHz audio.
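The feature-extraction step can be sketched in plain Python. This is a minimal illustration, not Whisper's actual front end: it assumes the audio is already a list of float samples at 16 kHz, uses a 400-sample window with a 160-sample hop (25 ms and 10 ms at 16 kHz), and builds triangular filters with the common 2595 * log10(1 + f/700) mel formula. The function names and the 1e-10 floor are illustrative choices, and production systems use optimized FFT libraries rather than the naive DFT shown here.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale (rows: filters, columns: DFT bins)."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2)
    # n_mels + 2 edge frequencies: each filter spans three consecutive edges.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def log_mel_spectrogram(samples, sample_rate=16000, n_fft=400, hop=160, n_mels=80):
    """Return one list of n_mels log-energies per frame."""
    n_bins = n_fft // 2 + 1
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]  # Hann
    twiddle = [cmath.exp(-2j * math.pi * k / n_fft) for k in range(n_fft)]
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(n_fft)]
        # Naive DFT of the windowed frame, keeping the power in each non-negative bin.
        power = []
        for k in range(n_bins):
            acc = sum(frame[n] * twiddle[(k * n) % n_fft] for n in range(n_fft))
            power.append(abs(acc) ** 2)
        mel_energy = [sum(w * p for w, p in zip(row, power)) for row in bank]
        frames.append([math.log10(max(e, 1e-10)) for e in mel_energy])
    return frames


# A 50 ms, 1 kHz test tone at 16 kHz.
tone = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(800)]
features = log_mel_spectrogram(tone)
print(len(features), len(features[0]))
```

Each row of the result is one time step and each column one mel band, which is the "image-like grid" the encoder consumes. With these settings a few of the lowest filters are narrower than one DFT bin and stay at the floor value; real implementations differ in how they handle that, as well as in mel-scale variant and normalization.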

An encoder-decoder Transformer then reads the spectrogram. The encoder produces a sequence of acoustic features, and the decoder generates the transcript one token at a time, attending back to the encoder's features through cross-attention. Training on a very large, weakly supervised dataset (transcripts scraped from the web that are not perfectly clean) helps the model generalize across recording conditions and languages without per-dataset fine-tuning. Earlier neural ASR architectures often used recurrent networks trained with a Connectionist Temporal Classification (CTC) loss, but Transformer encoder-decoder models are now common.

  • Audio is converted to a log-Mel spectrogram before modeling.
  • An encoder reads acoustic features; a decoder generates text with cross-attention.
  • Large weakly supervised training data improves resilience to noise and accents.
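The cross-attention step mentioned above can be illustrated with a small sketch. It assumes single-head scaled dot-product attention with no learned projections, so the function name `cross_attention` and the toy vectors are illustrative only; real decoders apply learned query, key, and value projections and use several heads per layer.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: one vector per decoder position (the text side).
    keys, values: one vector per encoder frame (the audio side).
    Returns (outputs, weights), where weights[t][j] is how strongly
    decoder position t attends to encoder frame j.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the encoder value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Three toy encoder frames and one decoder query that resembles the second frame.
encoder_frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
outputs, weights = cross_attention([[0.0, 4.0]], encoder_frames, encoder_frames)
print(weights[0])
```

The weights for each decoder position sum to 1, and in this toy case most of the weight falls on the second encoder frame. That is the mechanism by which each generated token can focus on the stretch of audio it corresponds to.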

How accuracy is measured: Word Error Rate

The standard ASR metric is Word Error Rate (WER). It compares the model's transcript to a reference transcript and counts the minimum number of word-level edits needed to turn one into the other: substitutions (wrong word), deletions (missing word), and insertions (extra word). The total number of edits is divided by the number of words in the reference, so lower WER is better and 0 means a perfect match.

WER can exceed 100% when there are many insertions, and it treats every word equally, so it does not capture meaning-level errors or punctuation quality on its own. For some languages or use cases, Character Error Rate (CER) is reported instead, using the same edit-distance idea at the character level.

WER = (S + D + I) / N
Word Error Rate equals substitutions S plus deletions D plus insertions I, divided by the number of words N in the reference transcript.
  • WER counts substitutions, deletions, and insertions against a reference.
  • Lower is better; 0% means a perfect transcript.
  • Character Error Rate (CER) applies the same idea per character for some languages.
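The formula can be turned into a short sketch using a standard edit-distance table. This is an illustrative implementation, not a reference scorer: the function names are made up for this example, words are split on whitespace with no text normalization, and the character-level variant counts spaces as characters, which is only one possible convention. The split between substitutions, deletions, and insertions can vary between equally short alignments, but their sum is always the minimum edit count.

```python
def edit_counts(ref, hyp):
    """Count (substitutions, deletions, insertions) along one minimum-edit alignment.

    ref and hyp are sequences of tokens (words for WER, characters for CER).
    """
    # Each cell holds (total_edits, S, D, I) for a prefix of ref vs. a prefix of hyp.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]  # empty ref: all insertions
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]  # empty hyp: all deletions
        for j in range(1, len(hyp) + 1):
            total, s, d, ins = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (total, s, d, ins)          # match, no edit
            else:
                best = (total + 1, s + 1, d, ins)  # substitution
            total, s, d, ins = prev[j]
            if total + 1 < best[0]:
                best = (total + 1, s, d + 1, ins)  # deletion (reference token missing)
            total, s, d, ins = cur[j - 1]
            if total + 1 < best[0]:
                best = (total + 1, s, d, ins + 1)  # insertion (extra hypothesis token)
            cur.append(best)
        prev = cur
    return prev[-1][1:]


def error_rate(ref_tokens, hyp_tokens):
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    s, d, ins = edit_counts(ref_tokens, hyp_tokens)
    return (s + d + ins) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word Error Rate: (S + D + I) / N over whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character Error Rate: the same computation over characters."""
    return error_rate(list(reference), list(hypothesis))


print(wer("the cat sat on the mat", "the cat sit on mat"))
print(wer("hello", "oh hello there friend"))
```

In the first call the hypothesis has one substitution ("sit" for "sat") and one deletion (the second "the") against six reference words, so WER is 2/6, or about 33%. In the second call three inserted words against a one-word reference give a WER of 3.0, which shows how the metric can exceed 100%.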

Where ASR is used and its limits

ASR feeds dictation tools, meeting and call transcription, video subtitles, accessibility features, and the front end of voice agents that pair speech-to-text with a language model and text-to-speech. It is also a common first step in pipelines that summarize or search spoken content.

Accuracy still depends heavily on conditions. Heavy accents, overlapping speakers, domain-specific jargon, far-field microphones, and noisy environments all raise WER. Specialized vocabularies and proper nouns are frequent error sources, which is why domain adaptation, custom dictionaries, and careful evaluation matter in production deployments.

  • Dictation, transcription, captions, accessibility, and voice agents.
  • Accuracy drops with noise, accents, overlapping speech, and jargon.
  • Domain adaptation and evaluation are important for production use.
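One way to make evaluation more informative is to report WER separately for each recording condition instead of a single overall number. The sketch below is a hypothetical illustration: the condition labels, the sample sentences, and the simple normalization (lowercasing and stripping ASCII punctuation) are all assumptions, and real evaluation sets usually need language-specific normalization. Errors and reference words are pooled per condition before dividing, so long utterances count proportionally more than short ones.

```python
import string
from collections import defaultdict


def normalize(text):
    """Lowercase and drop ASCII punctuation so formatting is not scored as word errors."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()


def word_edits(ref, hyp):
    """Minimum number of word substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]


def wer_by_condition(samples):
    """samples: iterable of (condition, reference, hypothesis) tuples."""
    edits = defaultdict(int)
    words = defaultdict(int)
    for condition, reference, hypothesis in samples:
        ref, hyp = normalize(reference), normalize(hypothesis)
        edits[condition] += word_edits(ref, hyp)
        words[condition] += len(ref)
    return {c: edits[c] / words[c] for c in edits if words[c] > 0}


# Hypothetical evaluation samples, tagged with the condition they were recorded in.
samples = [
    ("clean", "Turn on the lights.", "turn on the lights"),
    ("clean", "What time is it?", "what time is it"),
    ("noisy", "Turn on the lights.", "turn off the light"),
    ("noisy", "What time is it?", "what time is"),
]
print(wer_by_condition(samples))
```

Here the clean recordings score 0.0 and the noisy ones 0.375 (three edits over eight reference words). A breakdown like this shows which conditions are driving errors, which in turn suggests where domain adaptation or a custom vocabulary is most likely to pay off.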

Key takeaways

  • ASR converts speech to text and underpins dictation, captioning, transcription, and voice assistants.
  • Modern systems convert audio to log-Mel spectrograms and use encoder-decoder Transformers trained on large speech datasets.
  • Word Error Rate (WER) measures accuracy by counting substitutions, deletions, and insertions against a reference; lower is better.
  • Accuracy degrades with noise, accents, overlapping speakers, and specialized vocabulary, so domain adaptation matters.

Frequently asked questions

What is automatic speech recognition (ASR)?
Automatic speech recognition, or ASR, is technology that converts spoken audio into written text. Also called speech-to-text, it powers dictation, voice assistants, call transcription, and video captions, typically using neural networks trained on large amounts of audio paired with transcripts.

How is ASR accuracy measured?
The most common metric is Word Error Rate (WER): the minimum number of substitutions, deletions, and insertions needed to match the reference transcript, divided by the reference word count. Lower WER is better, and 0% means a perfect transcription.

How does modern speech-to-text work?
Audio is resampled and converted into a log-Mel spectrogram, a time-frequency representation. An encoder turns it into acoustic features, and a decoder generates the transcript token by token using cross-attention. This is the design used by models like OpenAI's Whisper.

What is Word Error Rate (WER)?
WER is an edit-distance metric for transcripts. It sums substitutions, deletions, and insertions and divides by the number of reference words. It can exceed 100% with many insertions and weights all words equally, so it does not fully capture meaning.

Why does ASR accuracy drop in some conditions?
Accuracy drops with background noise, strong accents, overlapping speakers, far-field microphones, and domain-specific jargon or proper nouns. These conditions often differ from the audio a model saw most during training, which raises Word Error Rate and is why domain adaptation and custom vocabularies help.