This week, I began reading “The Hundred-Page Language Models Book” by Andriy Burkov to gain a foundational understanding of large language models and modern machine learning. I have only a vague grasp of the various ML-related terms I keep seeing, and it’s time to build up that understanding.
Basics
The central goal in machine learning is to learn a function from data rather than writing explicit rules. Think of this as learning a relationship like “output = function(input).” A model is our approximation of this relationship, and its behavior is controlled by adjustable parameters that are refined during training.
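As a sketch of this idea (the function, parameter values, and inputs below are all made up for illustration), a model is just a function whose behavior depends on its parameters:

```python
# A "model" is just a parameterized function; training is the search
# for parameter values that make its outputs match the data.

def model(x, w, b):
    # Our approximation of the unknown relationship: output = w * x + b
    return w * x + b

# Two different parameter settings give two different behaviors
# for the same input.
guess_one = model(10, w=0.5, b=1.0)   # 6.0
guess_two = model(10, w=2.0, b=3.0)   # 23.0
```

Training, covered below, is the process that moves the parameters from an arbitrary starting point toward values that fit the data.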
Data
All information must be converted into numerical form before the model can process it. Inputs are represented as vectors, which are one-dimensional arrays of numbers. Batches of inputs are represented as matrices, which are two-dimensional arrays. In supervised learning, each input has a corresponding target, which serves as the correct answer during training.
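A small sketch of these representations (the feature names and values are invented for illustration):

```python
# A single input as a vector: [size_sqm, rooms, age_years]
house = [120.0, 3.0, 15.0]

# A batch of inputs as a matrix: one row per example.
batch = [
    [120.0, 3.0, 15.0],
    [85.0, 2.0, 40.0],
]

# In supervised learning, each input is paired with a target
# (here, a hypothetical sale price).
targets = [350_000.0, 210_000.0]
```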
Neural Networks and Deep Learning
A neural network is one way to implement the learned function. It consists of simple computational units arranged in layers that transform the data step by step. The network contains weights and biases, which together form its parameters. These values are usually initialized as small random numbers so the network starts with no knowledge.
Activation functions such as ReLU or the sigmoid are applied inside the network. They introduce nonlinearity, which allows the model to capture complex relationships. Without nonlinear activation functions, multiple layers would collapse into a single linear transformation. Deep learning refers to neural networks with many layers that can learn increasingly abstract representations.
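The “collapse into a single linear transformation” claim can be checked directly in one dimension (the weights and biases below are arbitrary):

```python
# Two stacked linear "layers" with no activation between them.
def linear1(x):
    return 2.0 * x + 1.0          # w1 = 2, b1 = 1

def linear2(x):
    return 3.0 * x - 4.0          # w2 = 3, b2 = -4

def stacked(x):
    # 3 * (2x + 1) - 4 = 6x - 1: still linear.
    return linear2(linear1(x))

def collapsed(x):
    # A single linear layer with exactly the same behavior.
    return 6.0 * x - 1.0

# Inserting a ReLU between the layers breaks the collapse:
# no single w*x + b can reproduce this function.
def relu(x):
    return max(0.0, x)

def with_activation(x):
    return linear2(relu(linear1(x)))
```

For negative inputs, `with_activation` clamps the hidden value to zero, so its output diverges from any single linear layer; that kink is exactly what nonlinearity buys.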
Training
Training is an iterative feedback loop that adjusts parameters to reduce prediction error.
This cycle starts with a forward pass where the input flows through the network and produces a prediction.
The prediction’s quality is assessed by a cost function, which measures how wrong the prediction is by comparing it to the true target.
The primary goal of training is to minimize this cost. Automatic differentiation computes the gradient (how the cost would change if each parameter changed slightly), which guides how each parameter should be updated. Gradient descent uses this gradient to adjust the parameters in a direction that tends to reduce the cost. Repeating these updates across many batches lets the parameters settle into values that yield consistently low error.
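The whole cycle can be shown on a single parameter (a toy sketch, not code from the book; the data is generated from y = 3x, so training should recover w ≈ 3):

```python
# Training data: inputs and targets following y = 3x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0      # parameter, starting with no knowledge
lr = 0.05    # learning rate: size of each update step

for epoch in range(100):
    for x, y in data:
        pred = w * x                 # forward pass
        # cost = (pred - y) ** 2; its derivative w.r.t. w is below
        # (this hand-written gradient is what autodiff would produce).
        grad = 2 * (pred - y) * x
        w -= lr * grad               # gradient descent update
```

After the loop, `w` sits very close to 3: each update nudged the parameter in the direction that reduced the squared error on that example.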
Example
To predict the price of a house, the model receives a vector describing the house such as size, number of rooms, and age. The network transforms this vector using its current parameters to produce a predicted price.
The cost function measures the difference between the predicted price and the actual sale price. Automatic differentiation computes how changes in each parameter would affect this error. Gradient descent then adjusts all parameters slightly toward values that reduce the error.
By repeating this cycle across many examples, the model gradually learns parameter values that allow it to make accurate predictions for new data.
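The house-price walkthrough above can be sketched end to end. Everything here is invented for illustration: the features, the learning rate, and the prices (which are generated from the rule price = 2·size + 0.5·rooms + 0.3 so that training has an exact answer to find), and the “network” is a single linear layer rather than a deep model:

```python
# Each house is a feature vector: [size_in_100sqm, rooms].
houses = [
    [1.2, 3.0],
    [0.8, 2.0],
    [2.0, 4.0],
]
# Targets in hundreds of thousands, from price = 2*size + 0.5*rooms + 0.3.
prices = [4.2, 2.9, 6.3]

w = [0.0, 0.0]   # one weight per feature
b = 0.0          # bias
lr = 0.02

for epoch in range(5000):
    for features, price in zip(houses, prices):
        # Forward pass: weighted sum of features plus bias.
        pred = sum(wi * xi for wi, xi in zip(w, features)) + b
        err = pred - price
        # Gradients of the squared error, one per parameter.
        for i, xi in enumerate(features):
            w[i] -= lr * 2 * err * xi
        b -= lr * 2 * err

# The learned parameters generalize to a house not seen in training.
new_house = [1.0, 2.0]
estimate = sum(wi * xi for wi, xi in zip(w, new_house)) + b
```

Because the prices were generated by a linear rule, this tiny model can fit them almost exactly; real data is noisier, which is part of why deeper, nonlinear networks are used in practice.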
Evolution
The progression toward modern deep learning arose from the need to model relationships far more complex than what linear methods could capture. Early machine learning models were limited to patterns that were essentially linear, which pushed researchers to explore multilayer neural networks capable of richer transformations. It quickly became clear that simply stacking linear layers was ineffective, because a sequence of linear operations still behaves like a single linear function.
The introduction of nonlinear activation functions resolved this limitation and allowed networks to represent highly complex mappings. Combined with the backpropagation training algorithm, growing datasets, improved initialization techniques, and the rise of GPU-based computation, it became possible to train deep networks at scale. Over time, architectures specialized for particular domains emerged, such as convolutional neural networks for vision and transformer models for language, forming the foundation of today’s deep learning practice.