The Transformer architecture is a groundbreaking approach to sequence-to-sequence tasks, such as language translation, introduced in the 2017 paper 'Attention Is All You Need'. It builds on previous techniques by replacing recurrent layers with a self-attention mechanism, allowing computations to be parallelized across the sequence and achieving better performance than earlier RNN-based models.
Self-attention compensates for the removal of recurrence by allowing each input word to directly attend to every other word in the sentence. This creates a communication system between input words, preserving meaningful relationships between them regardless of how far apart they are. Attention itself, however, is permutation-invariant: it has no built-in notion of word order, which is why positional information must be added separately, as described below. A minimal sketch of the mechanism follows.
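To make the idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. The projection matrices `W_q`, `W_k`, and `W_v` are random placeholders standing in for learned parameters, and the shapes are illustrative assumptions, not the exact configuration from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute attention weights and return the weighted sum of values.

    Q, K: (seq_len, d_k) query and key matrices; V: (seq_len, d_v) values.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # pairwise similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over each row
    return weights @ V                                     # each output mixes information from all words

# Toy example: 4 "words", model dimension 8.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))                    # word embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                                           # (4, 8): one vector per input word
```

Because every word's output is a weighted combination of all words' values, each position can gather context from the whole sentence in a single layer rather than step by step as in an RNN.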
Positional encodings are distinct vectors assigned to each input position, designed so that each position gets a unique vector and nearby positions get similar ones. They have the same dimensions as the word embeddings, so the two can simply be summed. These encodings are what give the Transformer access to word order: their sinusoidal structure means that the encoding of a position at a fixed offset can be expressed as a simple transformation of the current one, which lets the model learn relative positional patterns that apply anywhere in the sequence.
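Below is a short sketch of the sinusoidal positional encoding and how it is added to the embeddings. The zero-valued `embeddings` array is just a stand-in for real word embeddings, and the sequence length and model dimension are arbitrary example values.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) matrix of sinusoidal position vectors."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angles = positions / (10000 ** (dims / d_model))      # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# The encodings share the embedding dimension, so they are simply added to the
# word embeddings before the first attention layer.
seq_len, d_model = 10, 16
embeddings = np.zeros((seq_len, d_model))                 # placeholder word embeddings
inputs = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(inputs.shape)                                        # (10, 16)
```

Each row of the resulting matrix is unique to its position, while the varying sine and cosine frequencies encode both fine-grained and coarse positional differences.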