
State Space Models (SSM) and Mamba

By Arpit Tripathi, Founder

State space models (SSMs) are sequence models that maintain a compressed hidden state and update it recurrently, so their cost scales linearly with sequence length. Mamba is a selective SSM (built on the S6 layer) whose parameters depend on the input. It matches transformer quality on language modeling at the model sizes tested while avoiding attention's quadratic cost.

What are State Space Models (SSMs) and Mamba?

State space models (SSMs) are a class of sequence models that carry a compressed hidden state through time and update it with a linear recurrence as each input arrives. Borrowed from classical control theory, an SSM maps an input sequence to an output sequence through a continuous-time linear system that is discretized for use on tokens. The key property is that computation and memory scale linearly with sequence length, in contrast to the quadratic cost of self-attention.

Mamba, introduced by Albert Gu and Tri Dao in 2023, is a selective SSM that closed much of the quality gap with transformers on language modeling while keeping linear-time scaling. It builds on the earlier S4 (Structured State Space) model and adds a selection mechanism that makes the SSM parameters depend on the current input. This selective layer is often called S6, and the full architecture is referred to as Mamba.

  • SSMs maintain a hidden state updated by a linear recurrence over the sequence.
  • Computation scales linearly with sequence length, unlike quadratic attention.
  • S4 (2021) made structured SSMs practical for long sequences.
  • Mamba (2023) adds input-dependent selection, the S6 layer, matching transformer quality at the sizes tested.

The state space formulation

A continuous SSM is defined by a state equation, h′(t) = A·h(t) + B·x(t), that evolves a hidden state h(t) from the input x(t), and an output equation, y(t) = C·h(t), that reads the output from that state. For sequence models, the continuous system is discretized with a step size Δ, producing a recurrence with discrete matrices Ā and B̄. When the parameters are the same at every step, this recurrence can also be unrolled into an equivalent convolution, which is what made S4 efficient to train in parallel while still running as a fast recurrence at inference.
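To make the discretization step concrete, here is a minimal sketch for a scalar system. It assumes the zero-order-hold rule, which is one common choice; the text above does not say which rule is used. The function names and numeric values are illustrative only.

```python
import math

def discretize_zoh(a, b, delta):
    """Zero-order-hold discretization of the scalar system h'(t) = a*h(t) + b*x(t)."""
    a_bar = math.exp(delta * a)
    # (exp(delta*a) - 1) / a * b, which tends to delta * b as a -> 0
    b_bar = (a_bar - 1.0) / a * b if a != 0.0 else delta * b
    return a_bar, b_bar

def run_recurrence(a_bar, b_bar, c, xs):
    """Apply h_t = a_bar*h_{t-1} + b_bar*x_t and y_t = c*h_t, starting from h = 0."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys

a, b, c = -1.0, 1.0, 1.0                      # illustrative continuous parameters
a_bar, b_bar = discretize_zoh(a, b, delta=0.1)
ys = run_recurrence(a_bar, b_bar, c, [1.0] * 50)   # constant input for 50 steps
print(a_bar, b_bar, ys[-1])
```

Zero-order hold is exact when the input is held constant within each step, so with a constant unit input the final value should agree with the continuous-time solution 1 − exp(−5) after 50 steps of size 0.1.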

S4 used time-invariant parameters, meaning the same A, B, and C matrices apply at every step. That makes the convolutional view possible but limits the model's ability to selectively remember or ignore tokens based on content. Mamba's contribution is to make B, C, and the discretization step input-dependent, so the state update can focus on relevant tokens and discard irrelevant ones.

h_t = Ā·h_{t−1} + B̄·x_t, y_t = C·h_t
The discretized state space recurrence: the hidden state h_t is updated from the previous state h_{t−1} and the current input x_t, and the output y_t is read linearly from the state.

cost(attention) = O(L²), cost(SSM) = O(L)
Self-attention scales quadratically in sequence length L, while an SSM recurrence scales linearly. This is the core efficiency advantage for long sequences.

  • The continuous system is discretized into a step-wise recurrence.
  • The recurrence can be unrolled into a convolution for parallel training.
  • S4 keeps parameters fixed across steps (time-invariant).
  • Mamba makes B, C, and the step size depend on the input.
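The equivalence between the recurrent and convolutional views can be checked numerically. The sketch below uses a scalar, time-invariant SSM with made-up parameter values; the kernel entries K[k] = C·Ā^k·B̄ come from unrolling the recurrence from a zero initial state.

```python
def ssm_recurrent(a_bar, b_bar, c, xs):
    """Step-by-step recurrence: h_t = a_bar*h_{t-1} + b_bar*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys

def ssm_kernel(a_bar, b_bar, c, length):
    """Convolution kernel obtained by unrolling: K[k] = c * a_bar**k * b_bar."""
    return [c * (a_bar ** k) * b_bar for k in range(length)]

def ssm_convolutional(a_bar, b_bar, c, xs):
    """Causal convolution y_t = sum_{k=0..t} K[k] * x_{t-k}."""
    kernel = ssm_kernel(a_bar, b_bar, c, len(xs))
    return [
        sum(kernel[k] * xs[t - k] for k in range(t + 1))
        for t in range(len(xs))
    ]

xs = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]          # illustrative input sequence
y_rec = ssm_recurrent(0.9, 0.5, 2.0, xs)
y_conv = ssm_convolutional(0.9, 0.5, 2.0, xs)
print(max(abs(p - q) for p, q in zip(y_rec, y_conv)))
```

The direct convolution written here is itself quadratic in length and is only meant to show that the two views agree. The kernel exists only because Ā, B̄, and C are the same at every step, which is why the convolutional view is tied to time-invariant parameters.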

Selection and the Mamba architecture

The defining idea in Mamba is selectivity. In a time-invariant SSM the dynamics are fixed regardless of content, so the model cannot decide, based on the actual token, what to keep in its state. Mamba makes key parameters functions of the input, turning the system into a linear time-varying SSM. This lets the hidden state act more like content-aware memory: relevant information is retained and irrelevant information is filtered out.
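A toy version of this idea is sketched below for a single channel with a small diagonal state. It is a simplification under stated assumptions, not Mamba's actual layer: the weights `w_delta`, `bias_delta`, `w_b`, and `w_c` are hypothetical scalar projections of a scalar token, and the zero-order-hold rule is assumed for discretization.

```python
import math

def softplus(z):
    """Smooth positive function, used here to keep the step size above zero."""
    return math.log1p(math.exp(z))

def selective_recurrence(xs, a, w_delta, bias_delta, w_b, w_c):
    """Toy single-channel selective SSM with a diagonal state.

    a: fixed negative diagonal entries of A (not input-dependent).
    w_delta, bias_delta, w_b, w_c: hypothetical projection weights.
    Returns (outputs, hidden state after each token).
    """
    h = [0.0] * len(a)
    ys, states = [], []
    for x in xs:
        delta = softplus(w_delta * x + bias_delta)   # input-dependent step size
        b = [w * x for w in w_b]                     # input-dependent B
        c = [w * x for w in w_c]                     # input-dependent C
        for n in range(len(a)):
            a_bar = math.exp(delta * a[n])           # assumed zero-order-hold rule
            b_bar = (a_bar - 1.0) / a[n] * b[n]
            h[n] = a_bar * h[n] + b_bar * x
        ys.append(sum(c_n * h_n for c_n, h_n in zip(c, h)))
        states.append(list(h))
    return ys, states

# One informative token followed by three zero "filler" tokens.
ys, states = selective_recurrence(
    xs=[1.0, 0.0, 0.0, 0.0],
    a=[-1.0, -2.0],
    w_delta=12.0, bias_delta=-10.0,
    w_b=[1.0, 0.5], w_c=[1.0, 1.0],
)
print(states[0], states[-1])
```

With these made-up weights, the filler tokens produce a step size close to zero, so Ā stays close to 1 and the state written by the first token is carried forward almost unchanged. A large step size does the opposite and lets the current token overwrite most of the state.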

Making the parameters input-dependent breaks the convolutional shortcut that S4 relied on, because there is no longer a single fixed kernel to convolve with. Mamba therefore introduces a hardware-aware parallel scan algorithm that computes the recurrence efficiently on GPUs without materializing the full expanded state in slow GPU memory. The paper reports that Mamba achieves strong performance across language, audio, and genomics with linear-time scaling and fast inference. A 2024 follow-up, Mamba-2, refined the architecture and connected selective SSMs to a structured form of attention, which also improved training speed.

  • Selection makes B, C, and the step size depend on the input token.
  • Input-dependence turns the SSM time-varying, enabling content-aware memory.
  • A hardware-aware parallel scan keeps training efficient on GPUs.
  • Mamba-2 (2024) refined the design and linked SSMs to structured attention.
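The reason a time-varying recurrence can still be parallelized is that each step h_t = a_t·h_{t−1} + b_t is an affine map, and composing affine maps is associative. The sketch below illustrates that property with a simple Hillis-Steele prefix scan in plain Python. It is not Mamba's fused GPU kernel, and the function names and input pairs are illustrative.

```python
def combine(left, right):
    """Compose two steps: apply (a1, b1) first, then (a2, b2)."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(pairs):
    """Reference implementation: h_t = a_t*h_{t-1} + b_t with h_0 = 0."""
    h, out = 0.0, []
    for a, b in pairs:
        h = a * h + b
        out.append(h)
    return out

def parallel_prefix_scan(pairs):
    """Hillis-Steele inclusive scan over `combine`.

    Runs about log2(L) rounds. Within a round every combine is independent,
    so on parallel hardware the work of one round could run simultaneously.
    """
    items = list(pairs)
    shift = 1
    while shift < len(items):
        items = [
            combine(items[i - shift], items[i]) if i >= shift else items[i]
            for i in range(len(items))
        ]
        shift *= 2
    # Each item is now the composed map for steps 0..t; with h_0 = 0,
    # its offset component equals h_t.
    return [b for _, b in items]

pairs = [(0.9, 1.0), (0.5, -2.0), (1.1, 0.3), (0.2, 4.0), (0.7, 0.0)]
h_seq = sequential_scan(pairs)
h_par = parallel_prefix_scan(pairs)
print(h_seq)
print(h_par)
```

In a selective SSM the pairs would be (Ā_t, B̄_t·x_t) computed from each token, which is why input-dependence rules out a fixed convolution kernel but not a scan.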

Why SSMs matter and where they stand

The main appeal of SSMs is handling very long sequences cheaply. Because cost grows linearly rather than quadratically, SSMs are attractive for long documents, high-resolution audio, and genomic sequences where attention becomes expensive. At inference, the recurrent form keeps a fixed-size state and does not require a growing key-value cache, which lowers memory and can speed up generation.
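The memory difference at inference can be made concrete with some back-of-the-envelope arithmetic. The model shapes below are hypothetical and chosen only for illustration, and the SSM estimate counts just the recurrent state, ignoring smaller buffers a real implementation may also keep.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values cached for every past token, in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2):
    """One (d_inner x d_state) recurrent state per layer, independent of length."""
    return n_layers * d_inner * d_state * bytes_per_value

# Hypothetical model shapes, not taken from any published configuration.
transformer = dict(n_layers=32, n_kv_heads=32, head_dim=128)
ssm = dict(n_layers=64, d_inner=8192, d_state=16)

for seq_len in (1_024, 16_384, 262_144):
    kv = kv_cache_bytes(seq_len, **transformer)
    state = ssm_state_bytes(**ssm)
    print(f"L={seq_len:>7}: KV cache {kv / 2**20:10.1f} MiB, "
          f"SSM state {state / 2**20:6.1f} MiB")
```

The key-value cache grows in proportion to the number of tokens processed, while the SSM state stays the same size no matter how long the sequence gets.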

SSMs are an active research direction rather than a settled replacement for transformers. Many recent systems are hybrids that interleave SSM layers with a few attention layers to combine linear-time efficiency with attention's precise recall. The practical takeaway is that Mamba demonstrated a credible, linear-time alternative to attention, and the field continues to explore where pure SSMs, pure transformers, and hybrids each work best.

  • Linear scaling suits long documents, audio, and genomics.
  • Recurrent inference uses a fixed-size state, with no growing KV cache.
  • Hybrid models mix SSM and attention layers in practice.
  • SSMs are an active alternative to, not a confirmed replacement for, transformers.

Key takeaways

  • State space models process sequences with a recurrent linear update over a compressed hidden state, scaling linearly in sequence length.
  • S4 made structured SSMs practical for long sequences using a convolutional formulation with time-invariant parameters.
  • Mamba adds a selection mechanism (the S6 layer) that makes SSM parameters input-dependent, giving content-aware memory.
  • Selectivity breaks the convolution trick, so Mamba uses a hardware-aware parallel scan to run efficiently on GPUs.
  • SSMs avoid attention's quadratic cost and a growing KV cache, making them attractive for very long sequences; hybrids with attention are common in practice.

Frequently asked questions

What is a state space model?
A state space model is a sequence model that maintains a compressed hidden state and updates it with a linear recurrence as each input arrives. Borrowed from control theory, it maps inputs to outputs while scaling linearly with sequence length, unlike quadratic attention.

How does Mamba differ from a transformer?
A transformer uses self-attention, which compares every token to every other and costs quadratic time. Mamba uses a selective state space recurrence that scales linearly and keeps a fixed-size state at inference, avoiding the growing key-value cache that transformers rely on during generation.

What does selection mean in Mamba?
Selection makes Mamba's SSM parameters depend on the current input rather than being fixed. This input-dependence lets the model decide what to keep in or filter out of its hidden state based on content, giving it content-aware memory that earlier time-invariant SSMs lacked.

How does Mamba differ from S4?
S4 is a structured SSM with time-invariant parameters, allowing an efficient convolutional formulation. Mamba makes the parameters input-dependent (the S6 layer), which adds selectivity but requires a hardware-aware parallel scan since the convolution shortcut no longer applies.

Will SSMs replace transformers?
Not universally. SSMs like Mamba scale linearly and excel on very long sequences, but transformers still offer precise recall. Many systems are hybrids that interleave SSM and attention layers, and the trade-offs remain an active area of research.