This is a continuation of Notes on ML basics and the pursuit of a deeper understanding of LLMs.
Tokenization
Machine learning centers on learning a function from data by building a model that approximates the relationship between input and output. Modern systems typically use a neural network to represent this function. A neural network is a layered structure made of simple computational units that transform the input step by step, allowing the model to learn complex patterns that would be impossible to define with hand-written rules.
The challenge is that a neural network doesn’t work with characters or words, but instead processes floating-point numbers in vectors and matrices. Therefore, the initial step is to convert characters into numbers.
Tokenization is the process of splitting text into meaningful pieces called tokens. In modern language models, tokens are often subword units rather than full words, which provides a balance between efficiency and flexibility and allows the model to handle rare words, new words, misspellings, and languages with complex structure.
Before tokenization is applied, the tokenizer is trained to create a vocabulary, which is the list of all allowed tokens. Each token in this vocabulary is assigned a unique integer ID. During tokenization, the text is broken into these vocabulary tokens, transforming the original text into numerical sequences that a neural network can process.
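Once a vocabulary exists, applying it can be as simple as greedy longest-match lookup. Below is a minimal sketch with a toy vocabulary invented for illustration (real tokenizers learn their vocabularies from data and use more sophisticated matching):

```python
# Toy subword vocabulary: token string -> integer ID (invented for illustration).
vocab = {"un": 0, "believ": 1, "able": 2, "token": 3, "ization": 4}

def tokenize(word, vocab):
    """Split a word into vocabulary tokens, trying the longest piece first."""
    ids, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # longest candidate piece first
            piece = word[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return ids

print(tokenize("unbelievable", vocab))  # -> [0, 1, 2]
print(tokenize("tokenization", vocab))  # -> [3, 4]
```

The output is the numerical sequence the neural network actually sees: integers, not characters.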
Word Embeddings
Early models treated each word as an arbitrary integer, which carried no meaning. Word embeddings solved this by mapping each token to a dense vector stored in an embedding matrix, so each ID retrieves a learned vector that captures semantic relationships. Training methods such as Skip-gram adjust these vectors so that words appearing in similar contexts move closer together in the vector space, giving the model a basic understanding of meaning. However, these vectors are static: a word like "bank" always has the same embedding, regardless of whether the sentence refers to a river bank or a financial bank, which creates a fundamental limitation that later models needed to overcome.
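The mechanics are just table lookup plus vector arithmetic. The sketch below uses random vectors (the sizes and words are invented), so the similarity score is meaningless here; after training, related words like "king" and "queen" would score higher than unrelated pairs:

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = {"king": 0, "queen": 1, "banana": 2}

# Embedding matrix: one row per token. In a real model these rows are learned;
# here they are random, so the similarities below carry no meaning yet.
embedding_matrix = rng.normal(size=(len(vocab), 4))

def embed(token):
    """Look up the dense vector for a token ID."""
    return embedding_matrix[vocab[token]]

def cosine(a, b):
    """Cosine similarity: +1 for same direction, 0 for orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embed("king"), embed("queen")))
print(cosine(embed("king"), embed("banana")))
```

Skip-gram training nudges these rows so that cosine similarity ends up reflecting contextual similarity.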
The challenge was the enormous size of word-level vocabularies. Human languages contain millions of surface forms, making it impossible to store every variation or handle rare and unseen words efficiently. Subword tokenization methods such as byte-pair encoding solve this by breaking words into smaller, frequently occurring units, for example, splitting "interesting" into "interest" and "ing" to keep the vocabulary manageable. Modern language models mostly use subword tokenization.
For example, ChatGPT uses a byte-level Byte-Pair Encoding (BPE) tokenizer called tiktoken, which splits text into subword units learned from data. It works at the byte level to ensure compatibility with all languages and characters.
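The core of BPE training is a simple loop: count adjacent symbol pairs, merge the most frequent pair into a new token, repeat. Here is a toy character-level sketch (real tokenizers like tiktoken work on bytes over huge corpora; the three-word corpus and merge count are invented):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency; words start as character sequences.
words = {tuple("interesting"): 3, tuple("resting"): 2, tuple("rest"): 5}
for _ in range(5):
    words = merge(words, most_frequent_pair(words))
print(words)
```

After a few merges, the frequent substring "rest" has become a single vocabulary token, which is exactly how common subwords earn their own IDs.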
Language Models
Once text has been converted into subword tokens and mapped to vectors, a language model processes these vectors in sequence and learns to estimate P(next token | previous tokens). As it reads, it updates an evolving internal state that combines each new token with everything that came before, allowing it to capture patterns in grammar, meaning, and long-range relationships. This works far better than static embeddings because the model builds context-dependent representations: the interpretation of each token shifts based on its surroundings. By repeatedly practicing next-token prediction over massive text corpora, the model learns the structure and semantics of language, enabling it to generate coherent text and understand ambiguity in ways fixed word vectors never could.
A count-based language model predicts the next word by using how often short word sequences appear in a dataset. It can produce simple continuations and was used in early autocomplete systems. Its major limitations are handling only short contexts, failing on unseen words, and lacking any understanding of meaning, which led to its replacement by neural models.
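A count-based model fits in a few lines: tally how often each word follows each other word, then normalize the counts into probabilities. The toy corpus below is invented for illustration:

```python
from collections import Counter, defaultdict

# A minimal bigram (count-based) language model over a toy corpus.
corpus = "the cat sat on the mat the cat ran".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict(prev):
    """Probability distribution over the next word, given the previous word."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

print(predict("the"))  # "cat" follows "the" twice, "mat" once
```

The limitations described above are visible immediately: `predict` fails on any word it never saw, and it conditions on exactly one word of context.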
In essence, this is the algorithm for building a language model:
1. Collect a large text corpus.
2. Train a tokenizer:
- Learn a vocabulary of subword tokens (e.g., using BPE).
- Assign each token a unique integer ID.
3. Convert all text in the corpus into sequences of token IDs.
4. Initialize an embedding matrix with random numbers.
- Each token ID maps to a learnable vector.
5. Initialize the model's parameters (weights) randomly.
- These define how the model processes sequences of token vectors.
6. For many training steps:
a. Read a sequence of tokens: t1, t2, ..., tL.
b. Convert each token ID into its embedding vector.
c. Process these vectors in order to produce an internal representation.
d. Predict the probability distribution for the next token t(L+1).
e. Compare the prediction to the true next token.
f. Compute the loss (error).
g. Update all parameters using gradient descent to reduce loss.
7. Repeat until the model predicts next tokens accurately across the corpus.
8. The trained model can now:
- Generate text by repeatedly predicting the next token,
- Understand context because its internal representations change
based on all tokens seen so far.
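The steps above can be sketched end to end in a deliberately tiny form. The model below conditions on only the single previous token (real models process the whole preceding sequence), and the four-token corpus, sizes, and learning rate are all invented; it exists only to make steps 4–6 concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"<s>": 0, "the": 1, "cat": 2, "sat": 3}
V, D = len(vocab), 8

E = rng.normal(0, 0.1, (V, D))  # step 4: random embedding matrix
W = rng.normal(0, 0.1, (D, V))  # step 5: random model parameters

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

ids = [0, 1, 2, 3]               # step 3: "<s> the cat sat" as token IDs
losses = []
for step in range(200):          # step 6: training loop
    total = 0.0
    for t in range(len(ids) - 1):
        x = E[ids[t]]                  # 6b: embedding lookup
        p = softmax(x @ W)             # 6c-d: next-token distribution
        y = ids[t + 1]                 # 6e: true next token
        total += -np.log(p[y])         # 6f: cross-entropy loss
        dz = p.copy(); dz[y] -= 1.0    # 6g: gradient w.r.t. the logits
        dx = W @ dz                    #     gradient w.r.t. the embedding
        W -= 0.5 * np.outer(x, dz)     #     gradient-descent updates
        E[ids[t]] -= 0.5 * dx
    losses.append(total)
print(losses[0], losses[-1])     # the loss shrinks as predictions improve
```

The loss falls toward zero because the toy corpus is trivially memorizable; at scale, the same loop run over massive corpora is what produces the context-sensitive behavior described above.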
Evaluation
There are different model evaluation techniques:
- Perplexity. How well a model predicts tokens. The standard mathematical metric used during the training of a language model. It measures the uncertainty of the model when predicting the next token. However, a model can have low perplexity (good at guessing the next word statistically) but still generate repetitive or nonsensical text. It measures confidence, not necessarily quality.
- ROUGE. How well generated text matches a reference text. Measures overlap between the model's output and one or more reference texts. It measures lexical overlap, not true semantic similarity or factual correctness.
- Human evaluation. Having people judge quality directly. Human evaluation is required when automated metrics fail to capture qualities such as coherence or factual accuracy. Two methods are used: Likert ratings, where annotators score outputs on a scale (e.g., −2 to 2) for attributes like coherence or informativeness, and Elo ratings, where annotators compare two outputs and choose the better one. Likert scores suffer from biases and inconsistent interpretation, while Elo provides more stable comparisons by updating model scores through many pairwise judgments.
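The two automatic metrics above reduce to short formulas. Perplexity is the exponential of the average negative log-probability the model assigned to the true tokens; the ROUGE-1 recall shown here is a simplified distinct-unigram version (real ROUGE counts clipped token overlaps and applies normalization). The probabilities and texts are invented for illustration:

```python
import math

# Perplexity: exp of the mean negative log-probability of the true tokens.
# Lower is better; a perplexity of k roughly means the model was as
# uncertain as choosing uniformly among k tokens.
token_probs = [0.5, 0.25, 0.8, 0.1]  # P(true token) at each position (invented)
perplexity = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
print(round(perplexity, 2))  # -> 3.16

# Simplified ROUGE-1 recall: fraction of distinct reference unigrams
# that also appear in the generated text.
reference = "the cat sat on the mat".split()
generated = "a cat sat on a mat".split()
overlap = sum(1 for w in set(reference) if w in generated)
rouge1_recall = overlap / len(set(reference))
print(rouge1_recall)  # -> 0.8 (4 of 5 distinct reference words covered)
```

Both numbers illustrate the caveats in the list: the perplexity never looks at output quality, and the ROUGE score rewards word overlap even when the meaning differs.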
What do we mean by evals?
Evals measure how well a language model performs across different tasks. At the modeling level, perplexity evaluates next-token prediction quality, while ROUGE (and related metrics like BLEU) measure overlap between generated text and reference text in tasks such as summarization or translation. When automatic metrics are insufficient, human evals, using Likert ratings or Elo-style pairwise comparisons, assess qualities like coherence, accuracy, and usefulness. Modern LLM evals extend beyond these basics and use standardized benchmark suites to test reasoning, math, coding, knowledge, and safety.
Common evaluation algorithms and benchmarks:
- Perplexity (next-token prediction quality)
- ROUGE-1 / ROUGE-2 / ROUGE-L (summarization)
- BLEU (machine translation)
- Accuracy / F1 / Exact Match (QA tasks)
- Likert scoring (human rating)
- Elo comparison (pairwise preference)
- MMLU (general knowledge reasoning)
- GSM8K (math word problems)
- HumanEval (code generation)
- BBH / BIG-Bench (reasoning tasks)
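For the QA metrics in the list, Exact Match and token-level F1 are simple enough to sketch directly. The normalization below is reduced to lowercasing and whitespace stripping (standard implementations also strip punctuation and articles), and the example strings are invented:

```python
def exact_match(pred, gold):
    """1-or-0 metric: does the prediction equal the gold answer after normalization?"""
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall between prediction and gold."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision = common / len(p)
    recall = common / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                   # True
print(round(token_f1("in Paris France", "Paris"), 2)) # partial credit: 0.5
```

F1 gives partial credit for answers that contain the gold tokens plus extra words, which is why QA benchmarks usually report both metrics side by side.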