
language games

language

With language modelling, we have a probability space \((\Omega, \mathcal{F}, \mathbb{P})\), where the sample space \(\Omega\) is the set of all tokens, called the vocabulary; the event space \(\mathcal{F}\) is the set of all possible token sequences, called sentences; and the probability measure is \(\mathbb{P}: \mathcal{F} \to [0,1]\). Once we use a random vector to map events over to \(\mathbb{R}^d\), we can forget about the probability space and focus our attention on language models, which are joint probability distributions over all sequences of tokens. By the chain rule of probability (think: word one, then word two given word one, and so on up to word \(n\)), the joint factors into a product of conditionals. Computing the conditional on the entire history is hard, so we make a Markov assumption. https://x.com/karpathy/status/1821257161726685645
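Written out, the chain-rule factorization and the Markov assumption (here truncating each conditioning history to the last \(k\) tokens) look like:

\[
\mathbb{P}(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} \mathbb{P}(w_t \mid w_1, \dots, w_{t-1}) \;\approx\; \prod_{t=1}^{T} \mathbb{P}(w_t \mid w_{t-k}, \dots, w_{t-1}).
\]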

Before we jump into large language models (where \(\mathcal{H}\) is the class of deep neural networks), we will start with n-grams, formalized by Claude Shannon back in 1948 in his seminal paper A Mathematical Theory of Communication, along with our first two loss functions, which pose the optimization problem as maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation.
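To make the two estimators concrete, here is a minimal sketch (not the document's code) of a bigram model on a hypothetical toy corpus: the MLE is just normalized counts, and add-\(\alpha\) smoothing coincides with the MAP estimate under a symmetric Dirichlet\((\alpha + 1)\) prior. All names here are illustrative.

```python
# Minimal bigram sketch: MLE vs. an add-alpha (MAP-style) estimate.
from collections import Counter, defaultdict

corpus = ["the cat sat", "the cat ran", "the dog sat"]  # hypothetical toy data
tokens = [["<s>"] + s.split() + ["</s>"] for s in corpus]

bigram_counts = defaultdict(Counter)
for sent in tokens:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[prev][cur] += 1

vocab = {w for sent in tokens for w in sent}

def p_mle(cur, prev):
    """Maximum likelihood estimate: count(prev, cur) / count(prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total if total else 0.0

def p_map(cur, prev, alpha=1.0):
    """Add-alpha smoothing, i.e. the MAP estimate under a symmetric Dirichlet(alpha+1) prior."""
    total = sum(bigram_counts[prev].values())
    return (bigram_counts[prev][cur] + alpha) / (total + alpha * len(vocab))

print(p_mle("cat", "the"))  # 2/3: "cat" follows "the" in 2 of 3 cases
print(p_map("cat", "the"))  # 0.3: smoothed toward the uniform distribution
```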

book2 on ngram as markov chain

This document covers the foundations of neural language modelling by building an automatic differentiation engine with a tensor frontend to train a multilayer perceptron (MLP) on the Penn Treebank (PTB) dataset. The autodiff engine will be the computable implementation of the math specification. In a future document, we'll be making our framework more performant by adding CUDA and Triton support.
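As a rough preview of what the engine will do, here is a minimal scalar reverse-mode autodiff sketch in the micrograd style; the class name `Value` and its methods are illustrative placeholders, not the framework's actual API.

```python
# A minimal scalar reverse-mode autodiff sketch (micrograd-style).
import math

class Value:
    def __init__(self, data, _parents=(), _local_grads=()):
        self.data = data                  # forward value
        self.grad = 0.0                   # d(output)/d(self), filled in by backward()
        self._parents = _parents          # nodes this value was computed from
        self._local_grads = _local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for parent, local in zip(v._parents, v._local_grads):
                parent.grad += local * v.grad

# Usage: gradients of y = tanh(a * b + c) with respect to a, b, c.
a, b, c = Value(2.0), Value(-3.0), Value(1.0)
y = (a * b + c).tanh()
y.backward()
print(a.grad, b.grad, c.grad)
```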

References