Andrej Karpathy
@karpathy
Apr 09, 2023

The Transformer is a magnificent neural network architecture because it is a general-purpose differentiable computer. It is simultaneously:
1) expressive (in the forward pass)
2) optimizable (via backpropagation+gradient descent)
3) efficient (high parallelism compute graph)
(1) because its message-passing-like architecture is both general (i.e. complete) and powerful (i.e. efficient), able to express many real-world algorithms in a small number of compute steps; this is an empirical finding.
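A toy numpy sketch of that message-passing view (illustrative only, every shape and name here is made up for the example): each token broadcasts a key/value, emits a query, and aggregates everyone else's values in a single compute step, so information can hop between any pair of positions at once.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head softmax attention over a sequence X of shape (T, d).

    Every token emits a query, every token broadcasts a key/value "message",
    and each token aggregates all values weighted by query-key affinity --
    so any token can read from any other token in one step.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # (T, d) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (T, T) pairwise affinities
    return softmax(scores, axis=-1) @ V          # (T, d) aggregated messages

# toy usage: 5 tokens, width 8
rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```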
(2) because of residual connections, layer normalizations, and softmax attention, with no flat tails anywhere to starve the gradients. Residual connections in particular support the ability to learn short algorithms (think low LOC) fast and first, then gradually extend them during training.
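One way to picture the "short algorithms first" point, as a toy sketch rather than anything from the paper: with pre-norm residual blocks the network computes x + f1(x) + f2(x) + ..., so when the residual branches start near zero the stack behaves like a very short program, and training can gradually grow it into a longer one.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class ResidualBlock:
    """Toy pre-norm residual block: x -> x + relu(layernorm(x) @ W1) @ W2.

    With W2 initialized near zero the block is ~identity, so a deep stack
    initially acts like a short program; training can then "extend" the
    program by turning on deeper blocks gradually.
    """
    def __init__(self, d, rng, scale=1e-3):
        self.W1 = rng.normal(size=(d, 4 * d)) * d**-0.5
        self.W2 = rng.normal(size=(4 * d, d)) * scale  # near-zero residual branch

    def __call__(self, x):
        return x + np.maximum(layernorm(x) @ self.W1, 0.0) @ self.W2

rng = np.random.default_rng(0)
d, depth = 8, 12
x = rng.normal(size=(3, d))                      # 3 tokens of width d
blocks = [ResidualBlock(d, rng) for _ in range(depth)]
y = x
for blk in blocks:
    y = blk(y)
print(np.abs(y - x).max())  # small: the deep stack still ~computes the identity
```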
(3) because the compute graph is shallow and wide, mapping significantly better to our high-parallelism compute architectures (think GPUs). An earlier attempt that understood the significance and optimized for this property was the Neural GPU paper (arxiv.org).
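To make "shallow and wide" concrete (again just an illustrative sketch, numbers invented): a recurrent net has to loop over the T positions one after another, while a transformer-style layer touches all T positions with a few big matrix multiplies, which is exactly the shape of work GPUs are good at.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 1024, 256
X = rng.normal(size=(T, d)).astype(np.float32)

# RNN-style: T sequential steps, each depending on the previous hidden state,
# so the compute graph is deep and narrow and cannot be parallelized over t.
Wh = rng.normal(size=(d, d)).astype(np.float32) * d**-0.5
h = np.zeros(d, dtype=np.float32)
for t in range(T):
    h = np.tanh(X[t] + h @ Wh)

# Transformer-style: one (T, d) x (d, d) matmul processes every position at
# once, so the graph is shallow (few steps) and wide (lots of parallel work).
W = rng.normal(size=(d, d)).astype(np.float32) * d**-0.5
Y = np.maximum(X @ W, 0.0)   # all T positions computed in parallel
print(Y.shape)               # (1024, 256)
```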
Its success lies in a single architecture that simultaneously satisfies all of these properties. The original Attention Is All You Need paper is a bit haphazard and undersells the magnitude of these insights, their history and motivations. But there's a lot going on :)
So I probably would have called the paper something like "Transformer: A general-purpose, efficient, optimizable computer" and presented it alongside the Neural Turing Machine, the Neural GPU, and friends, then applied it to translation as an example. Something like that, but ok :)
A few people have (correctly) pointed out the hindsight here, which is fair. I don't suspect the authors could have known that, 5 years later, this architecture would have taken over most of AI ~unchanged, except for a re-shuffling of layernorms. Calls for a follow-up paper :)
