The fundamental issue is that the space is now split off into its own token, which puts the model out of the distribution it saw during training and causes arbitrary behavior.
Tokenization is the process of breaking down text into smaller pieces, called tokens, which can be understood and processed by large language models.
Some issues that can arise from tokenization in large language models include difficulty with spelling tasks, simple string processing, non-English languages, and arithmetic. These issues often trace back to the tokenization process and can affect the performance of the model.
The tokenizer in the GPT-2 paper uses a vocabulary of 50,257 possible tokens, and the model has a context size of 1,024 tokens. This is different from a character-level tokenizer, which breaks text down into individual characters and assigns each character a token. The GPT-2 tokenizer instead uses the byte pair encoding algorithm to construct chunks of characters, which are then used as tokens.
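As an illustration, the GPT-2 tokenizer can be loaded and inspected through the tiktoken library (a minimal sketch; the example string is arbitrary):

```python
import tiktoken

# Load the GPT-2 BPE tokenizer and check its vocabulary size.
enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)            # 50257

ids = enc.encode("Hello world")
print(ids)                    # a short list of token ids, not characters
print(enc.decode(ids))        # "Hello world"
```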
The reason we can't use raw Unicode code points directly as a vocabulary is that the vocabulary would be quite large (there are roughly 150,000 code points in Unicode), and the Unicode standard is alive and keeps changing, making it an unstable representation for direct use.
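For intuition, Python's ord() exposes the code point of each character (a small sketch; the example string is arbitrary):

```python
# Every character maps to an integer Unicode code point via ord().
text = "안녕하세요 👋 (hello in Korean!)"
print([ord(ch) for ch in text])
# The emoji alone is code point 128075; the full code space has roughly
# 150,000 assigned code points and keeps growing, which is why raw code
# points make an unstable vocabulary.
```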
The three types of encodings defined by the Unicode Consortium are UTF-8, UTF-16, and UTF-32. UTF-8 is the most common encoding; it takes every Unicode code point and translates it to a byte stream that is between one and four bytes long. UTF-16 and UTF-32 have their own pros and cons, with UTF-32 being fixed length but having downsides such as wasted space.
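A quick way to see the trade-offs is to encode the same string with all three (a minimal sketch):

```python
s = "h€llo 👋"
print(list(s.encode("utf-8")))   # variable length: 1 to 4 bytes per code point
print(list(s.encode("utf-16")))  # 2 or 4 bytes per code point, preceded by a byte-order mark
print(list(s.encode("utf-32")))  # fixed 4 bytes per code point; mostly zero bytes for ASCII
```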
The byte pair encoding (BPE) algorithm is an iterative process that finds the pair of tokens occurring most frequently in a sequence and replaces that pair with a single new token appended to the vocabulary. By repeating this process, the sequence gets compressed and a new vocabulary is created, which can be used to encode and decode arbitrary sequences.
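One iteration of this can be sketched with two helper functions, one counting adjacent pairs and one replacing the chosen pair (the names are conventional, not from a specific library):

```python
def get_stats(ids):
    # Count how often each adjacent pair of tokens occurs in the sequence.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # Replace every occurrence of `pair` in `ids` with the new token id `idx`.
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids
```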
The purpose of encoding text into bytes using UTF-8 when training a tokenizer is to convert the raw text into a sequence of integers in the range 0–255, a format that can be easily processed and manipulated by the training algorithm.
The merges dictionary in the training of a tokenizer maintains a (child1, child2) → new-token mapping, building up a binary tree of merges. At each step it records the most commonly occurring pair of tokens, mints a new integer token for it, and replaces all occurrences of that pair with the newly minted token.
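Building on the helpers above, a training loop might look like the following (the target vocab size and the `tokens` byte sequence are assumptions for the sketch):

```python
# `tokens` is the training text encoded with UTF-8, as a list of ints in 0..255.
vocab_size = 276                       # assumed target: 256 raw bytes + 20 merges
num_merges = vocab_size - 256

ids = list(tokens)                     # working copy of the byte sequence
merges = {}                            # (child1, child2) -> newly minted token id
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)   # most frequently occurring pair
    idx = 256 + i                      # mint a new token id
    ids = merge(ids, pair, idx)        # replace all occurrences of the pair
    merges[pair] = idx
```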
The compression ratio in the training of a tokenizer measures the amount of compression achieved by the tokenizer. It is calculated by dividing the length of the original byte sequence by the length of the token sequence after merging.
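Continuing the sketch above, the ratio is just the two lengths compared:

```python
# Length of the original byte sequence divided by the length after merging.
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")
```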
If there's nothing to merge, every candidate pair maps to float('inf'), and min simply returns the first element of stats. However, this pair is not actually a mergeable pair; it just means the function found no more mergeable pairs. In this case, the implementation breaks out of the loop, signaling that no single pair can be merged anymore.
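The corresponding encode loop, under the same assumptions as the sketches above, could look like this:

```python
def encode(text):
    # Convert the text to raw UTF-8 bytes, then apply the learned merges,
    # always picking the pair with the earliest (lowest) merge index.
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        # Pairs that were never merged map to float("inf"), so if nothing is
        # mergeable, min() just returns some arbitrary first pair.
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                      # no more mergeable pairs
        ids = merge(ids, pair, merges[pair])
    return ids
```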
The purpose of the regex pattern used in the GPT-2 tokenizer code is to enforce rules about which parts of the text will never be merged. The pattern splits the input text into a list of chunks, each of which is then tokenized independently, so the tokenizer only considers merges within each chunk and never across letters, numbers, and punctuation.
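The pattern itself, as published in OpenAI's GPT-2 code, requires the third-party `regex` module for the \p{L} and \p{N} character classes:

```python
import regex as re  # `regex` module, needed for \p{L} and \p{N}

# The splitting pattern from the GPT-2 tokenizer code.
gpt2pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(re.findall(gpt2pat, "Hello've world123 how's are you!!!?"))
# Letters, numbers, punctuation, and whitespace land in separate chunks,
# so BPE merges can never cross those boundaries.
```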
The GPT-2 tokenizer is hardcoded to handle specific kinds of apostrophes and doesn't account for all Unicode apostrophes. Additionally, it doesn't use re.IGNORECASE, which can result in inconsistent tokenization between uppercase and lowercase contractions. This can lead to issues in languages that use apostrophes differently (or not at all), and can result in gnarly and slightly gross tokenization inconsistencies.
In GPT-2, whitespace characters remain unmerged, while in GPT-4, runs of whitespace merge.
The main changes in the GPT-4 tokenizer pattern are case insensitivity (the contraction patterns now match both lowercase and uppercase letters), different handling of whitespace, and numbers being matched only up to three digits at a time.
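The practical effect on whitespace can be seen by comparing the two encodings in tiktoken (a small sketch; the code snippet is arbitrary):

```python
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")         # GPT-2 tokenizer
gpt4 = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer

snippet = "    if x == 3:\n        print(x)"
print(len(gpt2.encode(snippet)))  # each leading space tends to be its own token
print(len(gpt4.encode(snippet)))  # runs of spaces merge, so far fewer tokens
```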
The 'end of text' special token is used to delimit documents in the training set, signaling the end of one document and the beginning of another.
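In tiktoken, for example, special tokens have to be explicitly allowed when encoding raw text (a minimal sketch):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
# By default, encoding text that contains a special token raises an error;
# the token must be explicitly allowed.
ids = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)  # [50256], the id of GPT-2's end-of-text token
```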
The difference between the GPT-4 tokenizer and the tokenizer trained in the example is only in their training sets. The speaker suspects that the GPT-4 tokenizer was trained on data containing a lot of Python code, while the tokenizer in the example saw much less of it. This is evident in the way they handle whitespace.
Tiktoken approaches tokenization by first encoding the code points in the string to bytes using UTF-8 and then merging bytes. SentencePiece, on the other hand, works directly at the level of the code points themselves, merging them and falling back to bytes for rare code points. The speaker finds the tiktoken approach significantly cleaner.
The shrinking factor is not used in the BPE algorithm in SentencePiece training. It applies to a different training algorithm and is therefore irrelevant to the example.
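A minimal SentencePiece BPE training call might look like the following (the input file name and vocabulary size are placeholders; byte_fallback enables the byte fall-back for rare code points mentioned above):

```python
import sentencepiece as spm

# Train a small BPE model; "input.txt" and vocab_size=400 are placeholders.
spm.SentencePieceTrainer.train(
    input="input.txt",
    model_prefix="tok400",
    model_type="bpe",        # BPE training; unigram-specific options are ignored
    vocab_size=400,
    byte_fallback=True,      # fall back to byte tokens for rare code points
)

sp = spm.SentencePieceProcessor()
sp.load("tok400.model")
print(sp.encode("hello 안녕하세요", out_type=str))
```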
There are several reasons why the vocab size can't be infinite in a Transformer model. First, the token embedding table and the linear layer will grow, resulting in more computation. Second, with a large vocab size, each token will come up less frequently in the training data, leading to undertraining of the associated vectors. Third, as the vocab size grows, sequences will shrink, which means that more information is being squished into a single token, making it difficult for the model to process the information appropriately.
To extend the vocab size in a pre-trained Transformer model, you need to resize the embedding by adding rows and initializing the new parameters with small random numbers. You also need to extend the weight inside the linear layer to calculate the probabilities for the new tokens. This is a mild model surgery operation and can be done fairly easily. You can freeze the base model and only train the new parameters to introduce new tokens into the architecture.
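A rough PyTorch sketch of that surgery, assuming a hypothetical model with a `tok_emb` embedding table and an untied `lm_head` linear layer:

```python
import torch.nn as nn

def extend_vocab(model, new_vocab_size, d_model):
    # Grow the token embedding: keep the pretrained rows, randomly init new ones.
    old_emb = model.tok_emb.weight.data            # shape (old_vocab, d_model)
    old_vocab_size = old_emb.shape[0]
    new_emb = nn.Embedding(new_vocab_size, d_model)
    nn.init.normal_(new_emb.weight, std=0.02)      # small random numbers
    new_emb.weight.data[:old_vocab_size] = old_emb
    model.tok_emb = new_emb

    # Grow the output projection the same way so new tokens get logits.
    old_head = model.lm_head.weight.data           # shape (old_vocab, d_model)
    new_head = nn.Linear(d_model, new_vocab_size, bias=False)
    nn.init.normal_(new_head.weight, std=0.02)
    new_head.weight.data[:old_vocab_size] = old_head
    model.lm_head = new_head
    return model
```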
The considerations when designing the vocab size in a Transformer model include the growth of the token embedding table and the linear layer, the risk of undertraining some parameters because each individual token appears less often, the shrinking of sequences as the vocab size grows, and the amount of information being squished into a single token. It is mostly an empirical hyperparameter and is usually in the high tens of thousands or around 100,000 in state-of-the-art architectures today.
The model may not have seen this combination of tokens in its training data, causing it to predict a stop sequence immediately and resulting in no output.
The issue is caused by the tokenization dataset being different from the training dataset of the actual language model. The string 'SolidGoldMagikarp' was merged into a single token during tokenizer training, but it essentially never appeared in the language model's training data, so that token's embedding was never trained, causing undefined behavior when the token is evoked.