LLM Chronicles #5.1: The Transformer Architecture - Donato Capitella

What is the Transformer architecture and how does it build on previous sequence-to-sequence modeling techniques?

The Transformer architecture is a groundbreaking approach to sequence-to-sequence tasks, such as language translation, introduced in the 2017 paper 'Attention Is All You Need'. It builds on earlier encoder-decoder techniques by replacing recurrent layers with a self-attention mechanism, which lets all positions be processed in parallel and yields better performance than previous RNN-based models.
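For a concrete, if simplified, picture of this encoder-decoder stack, the sketch below wires up PyTorch's built-in nn.Transformer module on random token ids; the vocabulary size, batch size and sequence lengths are arbitrary placeholders, not values from the episode.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only (assumptions, not from the episode).
vocab_size, d_model = 1000, 512
embed = nn.Embedding(vocab_size, d_model)

# An encoder-decoder Transformer in the layout of the 2017 paper:
# stacked self-attention + feed-forward layers, no recurrent connections.
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randint(0, vocab_size, (2, 10))  # source token ids (batch=2, len=10)
tgt = torch.randint(0, vocab_size, (2, 7))   # target token ids (batch=2, len=7)

# With no recurrence, every position in src and tgt is processed in parallel.
out = model(embed(src), embed(tgt))
print(out.shape)  # torch.Size([2, 7, 512])
```

Note that nn.Transformer only provides the attention and feed-forward stacks; the caller supplies the word embeddings and the positional encodings discussed below.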

How does self-attention compensate for the removal of recurrence in the Transformer architecture?

Self-attention compensates for the removal of recurrence by allowing each input word to attend directly to every other word in the sentence. This creates a communication mechanism between the input words, so meaningful relationships between them are preserved without having to pass information step by step through a hidden state. Attention itself, however, treats the input as an unordered set, so word order and relative positional information have to be reintroduced separately through positional encodings, which the next question covers.
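A minimal sketch of single-head scaled dot-product self-attention follows, assuming toy projection matrices w_q, w_k, w_v and a random 5-word "sentence"; real Transformers use multiple heads and learned projections, but the core all-to-all attention pattern is the same.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # Project each word vector into a query, a key and a value.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Pairwise similarity between every word and every other word, scaled.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    # Each row is a distribution over the whole sentence: who attends to whom.
    weights = F.softmax(scores, dim=-1)
    # Each output is a weighted mix of all the value vectors in the sentence.
    return weights @ v

d = 16                       # toy embedding size
x = torch.randn(5, d)        # a "sentence" of 5 word embeddings
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([5, 16])
```

Each output row mixes information from every position in the sentence at once, which is what removes the need to propagate information step by step as an RNN would.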

What are positional encodings and how do they help Transformers understand word order?

Positional encodings are distinct vectors assigned to each input position, designed so that every position receives a unique encoding and the difference between encodings depends on how far apart the positions are. They have the same dimensions as the word embeddings, so the two can simply be summed. Because relative offsets are represented consistently wherever they occur in the sequence, these encodings let the Transformer recover word order and recognize positional patterns regardless of where they appear in the input, information that self-attention alone would discard.
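Below is a small sketch of the sinusoidal encoding scheme from 'Attention Is All You Need'; the sequence length and model dimension are arbitrary example values.

```python
import torch

def positional_encoding(max_len, d_model):
    # Sinusoidal encodings from 'Attention Is All You Need':
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    pos = torch.arange(max_len).unsqueeze(1).float()    # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()             # (d_model/2,)
    angles = pos / (10000 ** (i / d_model))             # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)                     # even dimensions
    pe[:, 1::2] = torch.cos(angles)                     # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # torch.Size([50, 512])
```

Since the result has the same shape as the embedded input, it is simply added to the word embeddings before the first attention layer, for example x = word_embeddings + pe[:seq_len].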