Concept

How transformers work.

The neural network design that powers almost every modern language model, and the one idea, attention, that made it work.

The short version

A transformer is a type of neural network introduced by Google researchers in 2017, in a paper titled "Attention Is All You Need". The T in GPT stands for transformer. It is the design that made today's language models possible.

Before it, the leading language models read text one word at a time, in order, which was slow and made it hard to connect words that sat far apart. The transformer reads the whole sequence at once and lets every word look at every other word directly.

Attention is the core idea

For each word, the model scores how relevant every other word is to it, then builds a new representation of that word by blending the others together, weighted by those scores. In "the trophy did not fit in the suitcase because it was too big", attention is what lets the model tie "it" to "the trophy" rather than "the suitcase". Doing this across a whole passage is how the model tracks meaning and context. (Strictly speaking, models work on tokens, which are words or pieces of words, but "word" is close enough for the intuition.)
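To make the "score, then blend" step concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not how any particular model is implemented: the two-number vectors are made up for the example, and the same vector serves as query, key, and value, whereas a real transformer learns separate projections for each role. The function names softmax and attend are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Blend `values`, weighted by how well each key matches `query`."""
    scale = math.sqrt(len(query))  # scaled dot product
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    blended = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, blended

# Made-up vectors for three words (an assumption for illustration only;
# real models learn much larger vectors during training).
words = ["trophy", "suitcase", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

# Ask: from the point of view of "it", how relevant is each word?
weights, blended = attend(vectors[2], vectors, vectors)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

Because the made-up vector for "it" points in nearly the same direction as the one for "trophy", "trophy" receives more weight than "suitcase". In a trained model, that similarity comes out of training rather than being set by hand.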

Everything at once, not word by word

Older designs processed text sequentially: each step had to wait for the one before it, so training could not easily be parallelized. Transformers process all positions of a training sequence together, which maps well onto modern GPUs built for large parallel computations. That change is a major reason it became practical to train on internet-scale text.
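The difference is easiest to see as a difference in dependencies. The sketch below is a toy contrast, not a real recurrent network or a real attention layer: recurrent_states and output_at are hypothetical names, and the arithmetic inside them is a stand-in chosen only to show which values each step needs.

```python
def recurrent_states(inputs, decay=0.5):
    """Older style: step t cannot start until step t-1 has finished."""
    state, states = 0.0, []
    for x in inputs:
        state = decay * state + x  # depends on the previous state
        states.append(state)
    return states

def output_at(position, inputs):
    """Toy stand-in for attention at one position.

    Reads every input, but never another position's output.
    """
    here = inputs[position]
    return sum(here * other for other in inputs)

inputs = [1.0, 2.0, 3.0]

# Positions can be computed in any order, or all at once on parallel hardware.
forward = {p: output_at(p, inputs) for p in range(len(inputs))}
backward = {p: output_at(p, inputs) for p in reversed(range(len(inputs)))}

print(recurrent_states(inputs))  # each value builds on the one before
print(forward == backward)       # order of computation does not matter
```

The recurrent loop has a chain of dependencies that forces one step at a time. The position-wise version has none between outputs, which is the property that lets a GPU work on every position simultaneously.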

Stacked layers build up meaning

A transformer stacks many layers, often dozens, each built around attention. Early layers tend to capture simpler patterns such as grammar; later layers combine those into higher-level meaning. Each layer refines the representation the next one works from.
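Stacking can be sketched as a loop that feeds each layer's output into the next. This is a deliberately stripped-down illustration: the layer here is only attention with its output added back onto its input, and it leaves out the learned weights, feed-forward step, and normalization that real transformer layers also contain. The names self_attention, layer, and run_stack, along with the starting vectors, are assumptions made for the example.

```python
import math

def self_attention(vectors):
    """Each vector becomes a relevance-weighted blend of all the vectors."""
    scale = math.sqrt(len(vectors[0]))
    mixed = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        mixed.append([
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(len(query))
        ])
    return mixed

def layer(vectors):
    """One simplified layer: attention output added back onto its input."""
    mixed = self_attention(vectors)
    return [[x + m for x, m in zip(vec, mix)] for vec, mix in zip(vectors, mixed)]

def run_stack(vectors, num_layers):
    """Feed the output of each layer into the next."""
    for _ in range(num_layers):
        vectors = layer(vectors)
    return vectors

# Made-up starting vectors for three tokens (illustration only).
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
refined = run_stack(tokens, num_layers=3)
print(refined)
```

Each pass keeps the same number of vectors but changes what they contain, so a later layer works with representations that already carry context from the earlier ones.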

The same design does more than text

The architecture is not limited to language. The same idea now powers image models, audio models, and multimodal systems that handle text and images together. Attention turned out to be a general-purpose way to relate the parts of almost any sequence.

An analogy

Imagine reading a sentence and, for every word, instantly drawing arrows to the other words that give it meaning, thicker arrows for stronger connections. Attention is the machine doing exactly that, for every word at once.

Where Berges AI fits

Every model Berges AI serves is a transformer under the hood. The differences you feel between them come from their size, their training, and the layer Berges AI wraps around them, not from a different underlying architecture.

Questions

Things people ask.

Who invented the transformer?

A team at Google Brain and Google Research, in the 2017 paper "Attention Is All You Need". Nearly every major language model since has been built on that design.

What does "attention" actually mean here?

It is a mechanism that lets the model decide, for each word, how much every other word should influence its interpretation. It is math, not literal focus, but "attention" captures the intuition well.

Do I need to understand transformers to use AI?

Not at all. It helps to know that the model relates every word to every other word, which is why context and clear phrasing matter, but you can use these tools well without any of the math.