Mamba – A Replacement For Transformers?

I am excited to share with you the latest development in linear-time sequence modeling with selective state spaces. Recently, the Mamba paper appeared on arXiv and has caused quite a stir in the community. It showcases strong empirical results and proposes an interesting design. Moreover, the name Mamba carries a certain weight in the field, which adds to the excitement.

To understand the motivation behind Mamba, it is helpful to go back to the distant memories of 2020, when the Long Range Arena was introduced as a benchmark for efficient Transformers. Transformers do not scale well to long sequence lengths, largely due to the quadratic complexity of self-attention, so the work proposed a benchmark specifically focused on evaluating model quality under long-context scenarios. This led to the emergence of efficient alternatives to Transformers, such as state space models.

Key Takeaways

  • Mamba is a state space model architecture that incorporates selective state spaces, building on the foundation laid by prior state space models.
  • Mamba uses selection as a means of compression, making it more efficient than previous methods.
  • Mamba combines the design of prior state space model architectures with the MLP block of Transformers into a single block, leading to a simple and homogeneous architecture design.

Motivation for Mamba

The Long Range Arena Benchmark

In 2020, the Long Range Arena benchmark was introduced to evaluate the quality of models under long-context scenarios. The benchmark focused on tasks such as ListOps, where the model needs to compute the result of nested list operations, and Pathfinder, where it needs to determine whether two circles are connected by a path. The benchmark found that Transformers do not scale well to long sequence lengths due to quadratic self-attention complexity. While other models, such as recurrent neural networks and dilated RNNs, showed promise, they still struggled to match the performance of the vanilla Transformer.

Scaling Issues with Transformers

To tackle the long-range dependency problem, various approaches have been proposed, including the introduction of Legendre Memory Units and of recurrent, convolutional, and continuous-time models with linear state space layers (LSSL). However, these models require a massive amount of memory, making them inefficient in practice.

Efficiently Modeling Long Sequences with Structured State Spaces (S4) proposed a unifying framework for sequence modeling based on a standard state space representation. The authors showed how, starting from an implicit continuous-time state space model, one could discretize to produce a system that can be interpreted as a recurrent network or a convolutional model. However, a naive implementation of these models requires orders of magnitude more memory than comparably sized RNNs or CNNs.
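
To make this concrete, here is a minimal sketch in NumPy (an illustration, not the S4 implementation) of that equivalence: a small continuous-time state space model is discretized with the bilinear transform, and the resulting discrete system is computed once as a recurrence and once as a convolution, producing identical outputs.

```python
import numpy as np

# Toy continuous-time SSM: x'(t) = A x(t) + B u(t), y(t) = C x(t)
rng = np.random.default_rng(0)
N, L, dt = 4, 64, 0.1                          # state size, sequence length, step size
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
u = rng.standard_normal(L)                     # input sequence

# Bilinear discretization of (A, B), as in the S4 line of work
I = np.eye(N)
Ab = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)
Bb = np.linalg.solve(I - dt / 2 * A, dt * B)

# 1) Recurrent view: x_k = Ab x_{k-1} + Bb u_k,  y_k = C x_k
x, y_rec = np.zeros((N, 1)), []
for k in range(L):
    x = Ab @ x + Bb * u[k]
    y_rec.append((C @ x).item())

# 2) Convolutional view: y = K * u with kernel K_l = C Ab^l Bb
K = np.array([(C @ np.linalg.matrix_power(Ab, l) @ Bb).item() for l in range(L)])
y_conv = np.convolve(u, K)[:L]

print(np.allclose(y_rec, y_conv))              # True: the two views agree
```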

Mamba builds on the state space model approach and proposes a few key contributions to tackle the scaling issues with Transformers. First, Mamba introduces a selection mechanism by parameterizing the state space model parameters based on the input, allowing for efficient data selection in an input-dependent manner. Second, Mamba proposes a hardware-aware algorithm that computes the model recurrently with a scan instead of a convolution, leading to a faster implementation than previous methods. Finally, Mamba simplifies prior deep sequence model architectures by combining the design of prior state space model architectures with the MLP block of Transformers into a single block, leading to a simple and homogeneous architecture design.

Overall, the core motivation behind Mamba is to use selection as a means of compression, allowing for efficient modeling of long sequences without sacrificing performance.

Historical Context

Efficient Alternatives to Transformers

In 2020, the Long Range Arena benchmark was introduced to evaluate model quality under long-context scenarios. It was found that Transformers do not scale well to long sequence lengths due to quadratic self-attention complexity. However, efficient alternatives to Transformers were emerging at the time, such as the Legendre Memory Unit, which tackled the long-range dependency problem by constructing a sequence of polynomials that are orthogonal to each other in order to efficiently store a history of the inputs to the system.

Building on that idea and others, the HiPPO recurrent memory framed memory as a technical problem of online function approximation. The authors constructed a simple model based on Legendre polynomials that worked well on tasks like permuted MNIST, outperforming other sequence-model baselines such as the LSTM and dilated RNNs.

Legendre Memory Units

The Legendre Memory Unit tackled the long-range dependency problem in recurrent networks by constructing a sequence of Legendre polynomials that are orthogonal to one another, which allows a fixed-size memory to efficiently store a history of the inputs to the system.

HiPPO Recurrent Memory

HiPPO is a model that uses polynomial orthogonalization to store a history of the inputs to a system. By projecting values onto a basis of these orthogonal polynomials, the input can be reconstructed from the components of the memory vector. This idea can then be baked into a layer where the memory block is implemented with simple linear operations.
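
As a toy illustration of this idea (not the HiPPO derivation itself, which yields an online update rule for the coefficients), the sketch below projects a signal onto the first N Legendre polynomials using NumPy and then reconstructs the history from the resulting fixed-size memory vector.

```python
import numpy as np
from numpy.polynomial import legendre

# A signal history over a window, rescaled to [-1, 1] where Legendre polynomials live
t = np.linspace(-1.0, 1.0, 500)
signal = np.sin(5 * t) + 0.5 * np.cos(13 * t)

# "Memory" = the first N coefficients of the signal in the Legendre basis
N = 16
coeffs = legendre.legfit(t, signal, deg=N - 1)   # project onto the orthogonal basis
reconstruction = legendre.legval(t, coeffs)      # read the history back out

print("max reconstruction error:", np.max(np.abs(signal - reconstruction)))
```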

Overall, these efficient alternatives to Transformers offered promising directions for long sequence lengths by treating memory as a problem of online function approximation and exploiting orthogonal polynomial projections.

State Space Models

Unifying Framework for Sequence Modeling

State space models provide a unifying framework for sequence modeling. This framework is based on a standard state space representation, which allows us to interpret a system as a recurrent network or a convolutional model. However, a naive implementation of these models requires a massive amount of memory, making it difficult to maintain performance for long context scenarios.

Challenges with Memory Consumption

The challenge with memory consumption led to the development of the S4 model, which efficiently models long sequences with structured state spaces. S4 imposes structure on the state matrices so that the model's convolution kernel can be computed efficiently, bringing memory and compute down to levels comparable with other sequence models.

Mamba, in turn, computes the model recurrently with a hardware-aware scan instead of a convolution and simplifies prior deep sequence model architectures by combining the design of prior state space model architectures with the MLP block of Transformers into a single block, leading to a simple and homogeneous architecture design. The core motivation behind Mamba is to use selection as a means of compression, which allows for efficient memory usage while maintaining model quality.

Overall, state space models, including the S4 model and Mamba, provide a promising approach to sequence modeling that balances efficiency and performance.

Introduction of S4 Model

Modeling with Structured State Spaces

The S4 model, introduced in the paper “Efficiently Modeling Long Sequences with Structured State Spaces,” builds on the state space model approach and makes it efficient to train. Mamba adds a few key contributions on top of this foundation. First, the authors introduce a selection mechanism to efficiently select data in an input-dependent manner, which is lacking in previous state space models. The selection mechanism is achieved by parameterizing the state space model parameters based on the input. Second, the authors propose a hardware-aware algorithm that computes the model recurrently with a scan instead of a convolution. This leads to an implementation that is faster than previous methods, both in theory and on modern hardware.

The third contribution is to simplify prior deep sequence model architectures by combining the design of prior state space model architectures with the MLP block of Transformers into a single block. This leads to a simple and homogeneous architecture design called Mamba, which incorporates selective state spaces on top of a state space model foundation. Mamba is related to many previous sequence model architectures, including linear attention, Hungry Hungry Hippos (H3), Hyena, RetNet, and RWKV.

Efficient Training Techniques

To efficiently train state space models using the convolutional representation, the authors of S4 propose a few techniques. First, instead of computing the kernel K̄ directly, they compute its spectrum (the truncated generating function evaluated at the roots of unity) and then recover K̄ by applying an inverse FFT. Second, they use the Woodbury identity to cheaply handle the low-rank correction to the structured state matrix. Third, they show that the remaining diagonal case is equivalent to the computation of a Cauchy kernel, a well-studied problem with stable near-linear-time algorithms.
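
Here is a small sketch of the first of these ideas on a toy dense system (the real S4 algorithm additionally exploits the structured normal-plus-low-rank parameterization, the Woodbury identity, and the Cauchy kernel to make each evaluation cheap; here each evaluation is a plain linear solve, for clarity): the truncated generating function of the kernel is evaluated at the L-th roots of unity, and an inverse FFT recovers K̄, matching the direct computation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 32
Ab = 0.3 * rng.standard_normal((N, N))   # toy discrete state matrix
Bb = rng.standard_normal((N, 1))
Cb = rng.standard_normal((1, N))

# Direct computation of the kernel: K_l = Cb Ab^l Bb
K_direct = np.array([(Cb @ np.linalg.matrix_power(Ab, l) @ Bb).item() for l in range(L)])

# Via the spectrum: at the roots of unity z^L = 1, the truncated generating function
#   K_hat(z) = sum_{l<L} K_l z^l  equals  Ct (I - z Ab)^{-1} Bb  with  Ct = Cb (I - Ab^L),
# so evaluating it at z_k = exp(-2*pi*i*k/L) gives the DFT of K, and an iFFT recovers K.
I = np.eye(N)
Ct = Cb @ (I - np.linalg.matrix_power(Ab, L))
nodes = np.exp(-2j * np.pi * np.arange(L) / L)
K_hat = np.array([(Ct @ np.linalg.solve(I - z * Ab, Bb)).item() for z in nodes])
K_spectral = np.fft.ifft(K_hat).real

print(np.allclose(K_direct, K_spectral))   # True
```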

All of this comes together in Theorem 3, which states that, given any step size Δ, computing the state space model convolution filter K̄ can be reduced to four Cauchy multiplies, requiring only O(N + L) operations (ignoring logarithmic factors) and O(N + L) space, where L is the sequence length and N is the state size.

Empirical Results on Long Range Arena

The S4 architecture does very well on the Long Range Arena, outperforming other models on tasks like ListOps and Pathfinder. The authors note that the Pathfinder task is not too hard when the input is arranged as an image, but it is quite difficult when the image is flattened into a long sequence, as evidenced by the fact that many other models fail on this task.

If you’d like to gain some intuition for how S4 works, the authors recommend The Annotated S4 by Sasha Rush and Sidd Karamcheti. It takes you step by step through the paper and provides snippets of code that implement the mathematics, along with nice visualizations to help you build intuition for how state space models work.

Understanding S4

Annotated S4

S4 is a state space model architecture proposed in the paper “Efficiently Modeling Long Sequences with Structured State Spaces” that aims to efficiently train state space models using the convolutional interpretation. To understand how S4 works, however, it is helpful to first gain some intuition for how state space models work in general. For this purpose, we recommend The Annotated S4 by Sasha Rush and Sidd Karamcheti. The Annotated S4 takes the reader step by step through the paper and provides code snippets that implement the mathematics. It also includes nice visualizations to help build intuition for how state space models work.

Visualizations and Intuition

State space models are a class of models that aim to capture the dynamics of a system by representing it as a set of states that evolve over time. In the context of sequence modeling, the idea is to represent the sequence as a set of hidden states that evolve over time. The key idea behind state space models is that the current state of the system depends on the previous state and the current input. This is captured mathematically using a set of linear equations that relate the current state to the previous state and the current input.
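
Written out, these linear equations typically take the following form (using the notation common in the S4 and Mamba papers), where h_t is the hidden state, x_t is the current input, y_t is the output, and A, B, and C are learned (discretized) matrices:

    h_t = A · h_{t-1} + B · x_t
    y_t = C · h_t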

One way to visualize state space models is to think of them as a set of interconnected boxes. Each box represents a state, and the connections between the boxes represent the linear equations that relate the current state to the previous state and the current input. The input enters the system at the first box, and the output is generated by reading out the values of the last box.

The advantage of state space models is that they can capture long-range dependencies in a sequence using a small, fixed-size set of hidden states. This is in contrast to models like Transformers, which capture long-range dependencies with attention but must keep the entire context around, so their cost grows quadratically with sequence length.

Mamba builds on this state space model approach with a few key contributions. First, it introduces a selection mechanism that allows the model to efficiently select data in an input-dependent manner. Second, it proposes a hardware-aware algorithm that computes the model recurrently with a scan instead of a convolution, leading to a faster implementation. Finally, it simplifies prior deep sequence model architectures by combining the design of prior state space model architectures with the MLP block of Transformers into a single block, resulting in a simple and homogeneous architecture that incorporates selective state spaces.

Mamba: Key Contributions

Selective State Spaces

Mamba is a state space model architecture that builds on the foundation of state space models by incorporating selective state spaces. This approach allows for efficient selection of data in an input-dependent manner by parameterizing the state space model parameters based on the input. This selection mechanism is a significant contribution of Mamba, as previous state space models lacked the ability to efficiently select data in an input-dependent manner.
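
To make the selection mechanism concrete, here is a minimal sketch (the names, shapes, and simplified discretization are illustrative assumptions, not the paper's exact parameterization): the step size and the matrices B and C are all computed from the current input, so each token decides what gets written into and read out of the hidden state. It is written as a plain sequential loop; the next subsection covers how Mamba computes this recurrence efficiently.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)   # numerically stable log(1 + exp(x))

def selective_ssm(u, A, W_delta, W_B, W_C):
    """u: (L, d) inputs; A: (d, n) with negative entries (diagonal state matrix per channel);
    W_delta: (d, d), W_B: (d, n), W_C: (d, n) are projection matrices."""
    L, d = u.shape
    n = A.shape[1]
    h = np.zeros((d, n))                       # one small hidden state per channel
    y = np.zeros((L, d))
    for t in range(L):
        x_t = u[t]                             # (d,)   current input
        delta = softplus(x_t @ W_delta)        # (d,)   input-dependent step size
        B_t = x_t @ W_B                        # (n,)   input-dependent input matrix
        C_t = x_t @ W_C                        # (n,)   input-dependent output matrix
        A_bar = np.exp(delta[:, None] * A)     # (d, n) discretized state transition
        B_bar = delta[:, None] * B_t[None, :]  # (d, n) simplified discretization of B
        h = A_bar * h + B_bar * x_t[:, None]   # selectively write the input into the state
        y[t] = h @ C_t                         # selectively read the state back out
    return y

rng = np.random.default_rng(0)
L, d, n = 10, 8, 4
out = selective_ssm(rng.standard_normal((L, d)),
                    -np.abs(rng.standard_normal((d, n))),
                    rng.standard_normal((d, d)),
                    rng.standard_normal((d, n)),
                    rng.standard_normal((d, n)))
print(out.shape)   # (10, 8)
```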

Hardware-Aware Algorithm

Mamba also introduces a hardware-aware algorithm that computes the model recurrently with a scan instead of a convolution. This leads to an implementation that is faster than previous methods, both in theory (scaling linearly in sequence length, compared to pseudo-linear for convolution-based SSMs) and on modern hardware (up to three times faster on A100 GPUs).
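
The reason a scan helps is that a linear recurrence of the form h_t = a_t · h_{t-1} + b_t composes with an associative operator, so it can be evaluated as a parallel prefix scan. The sketch below illustrates only that algorithmic idea with a simple Hillis-Steele scan in plain Python; the actual Mamba kernel fuses the scan on the GPU and avoids materializing the expanded state in slow memory.

```python
import numpy as np

def combine(left, right):
    # Compose h -> a1*h + b1 followed by h -> a2*h + b2
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def scan_sequential(a, b):
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def scan_parallel(a, b):
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):               # log2(L) rounds, each fully parallelizable
        elems = [combine(elems[i - step], e) if i >= step else e
                 for i, e in enumerate(elems)]
        step *= 2
    return [h for _, h in elems]           # with h_0 = 0, the accumulated b-term is the state

a = [0.9, 0.5, 0.8, 0.7, 0.6, 0.9, 0.4, 0.5]
b = [1.0, 0.2, -0.5, 0.3, 0.1, 0.0, 0.7, -0.2]
print(np.allclose(scan_sequential(a, b), scan_parallel(a, b)))   # True
```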

Simplified Architecture Design

Mamba further simplifies prior deep sequence model architectures by combining the design of prior state space model architectures with the MLP block of Transformers into a single block. This leads to a simple and homogeneous architecture design that incorporates selective state spaces.
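
As a rough sketch of what this combined block might look like (the shapes and names are illustrative, and the short causal convolution, normalization, and residual connection are omitted, so this is not the paper's exact layout): two input projections produce an SSM path and a gating path, which are merged multiplicatively and projected back down.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def mamba_block(x, ssm, W_in, W_gate, W_out):
    """x: (L, d_model) input sequence; ssm: a selective state space function
    mapping (L, d_inner) -> (L, d_inner), e.g. the selective_ssm sketch above."""
    u = silu(x @ W_in)    # SSM path: expand to d_inner and apply a nonlinearity
    z = x @ W_gate        # gating path, playing the role of the MLP branch
    y = ssm(u)            # selective state space model over the sequence
    y = y * silu(z)       # multiplicative gate merges the two paths
    return y @ W_out      # project back down to d_model
```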

Overall, the key contributions of Mamba are its selection mechanism, hardware-aware algorithm, and simplified architecture design. These contributions make Mamba a promising model for efficiently modeling long sequences with structured state spaces.

Comparative Context Compression

Trade-offs in Sequence Modeling

As we delve deeper into the world of sequence modeling, we encounter various trade-offs that different models make when it comes to compressing context into a smaller state. At one end of the spectrum, we have attention-based models, which are highly effective but inefficient as they do not compress context at all. On the other end, we have recurrent models that are efficient due to their finite state, but they make a compromise on effectiveness.

Attention vs. Recurrent Model Efficiency

The efficiency of attention-based models is affected by the quadratic self-attention complexity, which limits their scalability to long sequence lengths. This led to the emergence of efficient alternatives to Transformers, which prompted the creation of the Long Range Arena benchmark to evaluate model quality under long context scenarios. However, the benchmark found that it is quite hard to do much better than the vanilla Transformer when it comes to maintaining performance.

To tackle the long-range dependency problem, a creative approach was introduced in the context of recurrent neural networks: the idea of constructing Legendre polynomials. These polynomials are constructed to be orthogonal to each other, which enables the efficient storage of a history of the inputs to the system.

Building on this idea, the HiPPO (Recurrent Memory with Optimal Polynomial Projections) model was introduced, which worked well on tasks like permuted MNIST, outperforming other sequence-model baselines like the LSTM and dilated RNN. This was followed by work combining recurrent, convolutional, and continuous-time models with linear state space layers, which proposed a unifying framework for sequence modeling based on a standard state space representation.

However, a naive implementation of these models requires a massive amount of memory due to the linear state space layer, which uses orders of magnitude more memory than comparably sized RNNs or CNNs. This led to the introduction of the S4 model, which efficiently models long sequences with structured state spaces.

The S4 architecture does very well on the Long Range Arena, even performing well on the Path-X task, which is difficult because the input is a flattened sequence. To gain intuition for how S4 works, we highly recommend The Annotated S4 by Sasha Rush and Sidd Karamcheti.

The Mamba model builds on the state space model approach with a few key contributions, including a selection mechanism and a hardware-aware algorithm that computes the model recurrently with a scan instead of a convolution. This leads to an implementation that is faster than previous methods, both in theory (scaling linearly in sequence length) and on modern hardware (up to three times faster on A100 GPUs).

In conclusion, the core motivation behind Mamba is to use selection as a means of compression, which enables it to strike a balance between the efficiency of recurrent models and the effectiveness of attention-based models.
