Telegram channel opendatascience - Data Science by ODS.ai 🦜: Technologies

23 June 2025 12:04

If you have worked with LLMs, you know how sharply and spectacularly they get dumber as the context window grows. Mixing something up in the prompt? Easy. Forgetting a chunk of the text? Easy. A large codebase? Forget it. This, by the way, is also the basis of censorship bypasses: a small censoring model is overloaded by a huge request, while the larger primary model still executes it.

The culprit is the attention mechanism, one of the pillars of their ability to “think”. Now an architecture has been proposed that can do without it, designed for gigantic tasks.

They propose to throw out attention. But not completely.

The foundation of a transformer is the self-attention mechanism: each word in the text looks at every other word to grasp the context.

It is like forming neural links between tokens. Very cool, strong, powerful, but it demands enormous computation.

Double the text length and the cost of attention roughly quadruples, so you run out of memory very quickly.
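
To make the quadratic growth concrete, here is a minimal sketch that counts the pairwise scores one full self-attention pass has to compute. The function name `attention_pairs` is invented for illustration, and the count ignores heads, layers and any optimisations, so treat it as back-of-the-envelope arithmetic rather than a benchmark.

```python
def attention_pairs(n_tokens: int) -> int:
    """Query-key scores in one full self-attention pass (one head, one layer)."""
    # Every token is compared with every token, itself included.
    return n_tokens * n_tokens


for n in (1_000, 2_000, 4_000):
    print(f"{n:>6} tokens -> {attention_pairs(n):>12,} scores")

# Doubling the length multiplies the number of scores by four.
ratio = attention_pairs(2_000) / attention_pairs(1_000)
```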

Gemini's context window is currently 1 million tokens (2 million on request), and that is still not enough for real tasks, such as rewriting “War and Peace”. Although real tasks all seem to be about war, without peace.

Instead of word-to-word attention, the architecture combines several other mechanisms:
— Cutting into chunks (for example, 2,048 words each). Each chunk forms a cluster that is processed internally and then builds neural links to other clusters. Hello, “Programming Pearls”; hello, Bentley.
— Blocks based on state-space models (SSM): inside each chunk, these blocks process the words. They work like very smart convolutions. In essence, this is a filter that decides which neural connections to build. These operations run much faster than attention, scaling almost linearly with chunk length.
— Multi-resolution convolution layers: inside each chunk, the SSM is followed by convolution layers with different strides. They let the model capture local patterns at various levels of detail, from links between neighbouring words to links between words slightly farther apart within the chunk. So every cluster is itself made up of clusters.
— Recurrent observer: sitting outside all of this is a marvel with its own attention mechanism, another lightweight model (for example, a GRU or LSTM) that keeps the continuous thread and passes information from one chunk to the next. It receives a summary (an embedding) of the chunk just processed, updates its internal global state and hands that state to the next chunk. This helps maintain coherence across the whole long text.
— External memory with retrieval: a compact representation, a brief summary of the content, is created for every processed chunk and stored in an external memory database. When the model processes a new chunk, it can query this memory to find the representations of the most similar or relevant past chunks. The retrieved information is then added to the current chunk, enriching its context with the distant past without recomputing everything from scratch. This introduces no quadratic operations.
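
The chunk, SSM, convolution and observer steps in the list can be sketched as a toy pipeline. This is only an illustration under loud assumptions: tokens are single floats instead of embedding vectors, `ssm_scan` is a scalar linear recurrence standing in for a real SSM block, `multi_resolution_conv` uses causal averaging windows of several widths as a stand-in for the convolution layers, and `update_global_state` is a fixed-gate blend rather than a real GRU or LSTM cell. All function names and parameters are hypothetical and do not come from the proposed architecture.

```python
from typing import List, Tuple


def split_into_chunks(tokens: List[float], chunk_size: int) -> List[List[float]]:
    """Cut the sequence into consecutive chunks of at most chunk_size items."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def ssm_scan(xs: List[float], a: float = 0.9, b: float = 0.1) -> List[float]:
    """Toy scalar state-space recurrence h_t = a*h_{t-1} + b*x_t (one linear pass)."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out


def multi_resolution_conv(xs: List[float], widths: Tuple[int, ...] = (2, 4, 8)) -> List[float]:
    """Average each position over causal windows of several widths, then mix them."""
    out = []
    for i in range(len(xs)):
        acc = 0.0
        for w in widths:
            window = xs[max(0, i - w + 1): i + 1]
            acc += sum(window) / len(window)
        out.append(acc / len(widths))
    return out


def update_global_state(state: float, summary: float, gate: float = 0.5) -> float:
    """Fixed-gate blend of old state and chunk summary (crude stand-in for a GRU/LSTM)."""
    return (1.0 - gate) * state + gate * summary


def process(tokens: List[float], chunk_size: int = 4) -> Tuple[float, List[float]]:
    """Run every chunk through the local layers and thread a global state between chunks."""
    state, summaries = 0.0, []
    for chunk in split_into_chunks(tokens, chunk_size):
        # The state carried over from earlier chunks is injected into the current one.
        local = ssm_scan([x + state for x in chunk])
        local = multi_resolution_conv(local)
        summary = sum(local) / len(local)
        state = update_global_state(state, summary)
        summaries.append(summary)
    return state, summaries


final_state, summaries = process([float(i) for i in range(10)], chunk_size=4)
print(len(summaries), "chunk summaries, final global state:", final_state)
```

Each token is visited a constant number of times inside its own chunk, and only one small state crosses chunk boundaries, which is where the near-linear behaviour is supposed to come from.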

This is not a total rejection of attention, but a restriction of it.
The recurrent observer still has attention, but it works at a higher level of abstraction, which is cheaper.
You could call it advanced RAG plus hierarchical processing.
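
The RAG-like part, storing a compact representation per chunk and looking up the most similar past chunks, might look roughly like the sketch below. The class `ChunkMemory`, its methods and the three-dimensional example vectors are all made up for illustration; cosine similarity with a brute-force scan is an assumption here, and a real system would more likely use learned embeddings and an approximate nearest-neighbour index.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class ChunkMemory:
    """Hypothetical external memory: one compact embedding per processed chunk."""

    def __init__(self) -> None:
        self._items: List[Tuple[int, Vector]] = []

    def add(self, chunk_id: int, embedding: Vector) -> None:
        self._items.append((chunk_id, embedding))

    def query(self, embedding: Vector, k: int = 2) -> List[int]:
        """Return the ids of the k stored chunks most similar to the query."""
        ranked = sorted(
            self._items,
            key=lambda item: cosine(embedding, item[1]),
            reverse=True,
        )
        return [chunk_id for chunk_id, _ in ranked[:k]]


memory = ChunkMemory()
memory.add(0, [1.0, 0.0, 0.0])
memory.add(1, [0.0, 1.0, 0.0])
memory.add(2, [0.9, 0.1, 0.0])

# The current chunk's embedding is closest to chunks 0 and 2.
nearest = memory.query([1.0, 0.05, 0.0], k=2)
print(nearest)
```

The lookup compares the current chunk with one vector per past chunk, not with every past token, so the cost depends on the number of chunks rather than on the raw text length.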

This contraption should scale with near-linear growth in complexity as the text gets longer.
Starting from a certain size, it is said to outperform transformers, including sparse ones (BigBird, Longformer) and cache-based ones (Transformer-XL), as well as known non-transformer approaches such as retrieval models (REALM, RAG) and non-attention models (RNN, CNN, pure SSMs like S4 and Mamba).

Where it is needed:
— To extract meaning from a large mass of information, for example your entire personal correspondence, because you are tired of chasing links across chats;
— To answer questions over a large body of documentation;
— To work with a large codebase;
— And other ideas will come up.

In short, they removed token-to-token links and thus broke through the quadratic barrier of ordinary attention.
With this architecture you can pick out everything important inside a block with high precision and then hand it over to attention-based LLMs.