The Transformer is a magnificent neural network architecture because it is a general-purpose differentiable computer. It is simultaneously:
1) expressive (in the forward pass)
2) optimizable (via backpropagation+gradient descent)
3) efficient (high parallelism compute graph)
(1) because its message-passing-like architecture is both general (completeness: it can express many real-world algorithms) and powerful (efficiency: it can do so in a small number of compute steps); this is an empirical finding.
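To make the message-passing reading concrete, here is a minimal sketch (not code from the post, toy shapes and random weights, names like `Wq`/`Wk`/`Wv` are illustrative) of a single softmax-attention step: every token broadcasts a query, key, and value, and each token's new state is a weighted sum of everyone's values.

```python
# Single-head softmax attention read as message passing between tokens (toy sketch).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    """x: (T, d) token states; Wq/Wk/Wv: (d, d) projection matrices (toy sizes)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv         # each token announces what it wants / has
    scores = q @ k.T / np.sqrt(k.shape[-1])  # pairwise "who should talk to whom"
    weights = softmax(scores, axis=-1)       # rows sum to 1: a soft routing table
    return weights @ v                       # each token aggregates incoming messages

rng = np.random.default_rng(0)
T, d = 4, 8                                  # 4 tokens, 8-dim states (illustrative)
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
print(attention(x, Wq, Wk, Wv).shape)        # (4, 8): updated token states
```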
(2) because of residual connections, layer normalizations, and softmax attention, with no flat, saturating regions where gradients would vanish. The residual connections in particular support the ability to learn short algorithms (think low LOC) fast and first, then gradually extend them longer during training.
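A minimal sketch of that residual wiring, assuming a pre-norm block in PyTorch (my own illustration, not code from the post): the identity path carries `x` through unchanged, and attention and the MLP only add corrections on top, which is what lets short programs be learned first and extended later.

```python
# Pre-norm Transformer block: residual stream + LayerNorm + softmax attention + MLP.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # x flows through untouched on the residual path; each sublayer only
        # adds a (initially small) correction, which keeps gradients healthy.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x

x = torch.randn(2, 10, 64)      # (batch, tokens, channels), toy sizes
print(Block()(x).shape)         # torch.Size([2, 10, 64])
```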
Its success lies in a single architecture that simultaneously satisfies all of these properties. The original "Attention Is All You Need" paper is a bit haphazard and undersells the magnitude of these insights, their history, and their motivations. But there's a lot going on :)
So I probably would have called the paper something like "Transformer: A general-purpose, efficient, optimizable computer" and presented it alongside the Neural Turing Machine, NeuralGPU and friends, then applied it to translation as an example. Something like that, but ok :)
A few people have (correctly) pointed out the hindsight here, which is fair. I don't suspect the authors could have known that 5 years later the architecture would have taken over most of AI ~unchanged, except for a re-shuffling of layernorms. Calls for a follow-up paper :)