Psychonomic History

A Primer on Generative AI: Large Language Models (LLMs)


Preamble

This will be the first in-depth analysis piece that departs from regular Psychonomic History analyses. I envision releasing 3 such articles covering critical and highly advanced technologies whose development will define the future of our world: Generative AI (LLMs), Semiconductor Chips and Quantum Computing. All 3 technologies are central to the China-US race. The articles will be released independently, alongside the regular geopolitical pieces. Broadening coverage should help readers understand what exactly is going on in this space. Technology is deflationary and ultimately feeds into geopolitics; in fact, military-intelligence competition has driven most cutting-edge breakthroughs and industrial revolutions throughout history, with the military almost always the first mover in this space. I will try my best not to bore the reader with redundant and haughty academic parlance, sticking to engineering speak instead.

Introduction

We are in the midst of the Fourth Industrial Revolution – Artificial Intelligence, for better or for worse. AI is a very broad technology and LLMs are only one (very popular) subset of products under the AI umbrella. In such a world, data is fiat currency and quality data is gold currency. Anybody sitting on high quality data repositories is a potential goldmine for AI, which is why Web3 ebook repositories have been such high value targets for AI companies and why some deep-web portals, e.g. LexisNexis (a legal documents library rich in text data), can now contemplate lucrative partnerships with AI companies. Artificial intelligence can be generative or non-generative. Generative AI is artificial intelligence that generates content based on inference. Non-generative AI instead focuses on search – querying private corporate data, for example, and pulling out insights rather than generating new content from more generalist instructions. LLMs are one subset of generative AI, generating mostly text among other content types (images, videos, audio). Generative models specializing in video and audio are generally much more compute-hungry than their text and image focused counterparts, due to the extra dimensionality and complexity of the underlying data. This article will focus on text-based LLMs. At a high level, text-based LLMs turn language into mathematical objects. The core mathematical learning algorithm in an LLM is based on neural networks – a machine learning framework. We must remember that an LLM is not a truth machine: it is a prediction machine. It will always strive to give a confident answer whether or not the answer is correct. This limitation must not be overlooked. It may be tempting to think of artificial intelligence as a reflection of human intelligence, but it is more accurate to describe it as an alien intelligence. That will become more apparent in due time.

What I am going to cover is merely one type of generative LLM architecture, albeit a major contemporary player in the space: transformer models. These models are known as autoregressive – not by any classical statistical definition, but in their mechanism of predicting the next token from the previous ones, sequentially left-to-right. Non-autoregressive models, on the other hand, work in parallel to predict a sequence of tokens at once – e.g. diffusion models. Transformer architecture is further split into sub-types based on their decoder/encoder blocks. The AI development cycle for frontier models over the past 6 years has exhibited a strong exponential trend, as detailed by the METR project. It does not necessarily speak for the global AI industry as a whole – we are in its early stages. It is tempting to extrapolate recent frontier-model trends far into the future; however, recency bias will ultimately be tamed by hard limits from natural resource scarcity and computing power. When AI software models outpace the natural progression of hardware by a wide margin, there will be an inevitable softening of the gradient – and of stock market valuations.

AI has been accelerating in recent times, with models doubling on performance metrics in just 5-10 months. By this time next year, there may very well be new model architectures, and much of what is written here may already be rendered obsolete. Such is the nature of an industrial revolution’s inception years. An exponential technological progress curve can be thought of as a series of smaller “S” logistic curves, each with its own explosion, peak and plateau phases, representing smaller leaps or advances within the broader macro trend. If an observer is too hung up on the plateaus of the smaller “S” curves, he may miss the broader rising trend and think the technology is maturing when in actual fact the jockey is merely saddling up. The human mind thinks in linear terms and struggles to grasp the exponential function. In a pond with 1 lily on day 0, doubling every day until the lilies cover the entire pond on day 30, the pond is half covered not on day 15 but on day 29. This is the inability to grasp the exponential function. Therefore, be mindful of just how quickly this technology is advancing and do not underestimate its impact. Only scarcity and the laws of physics can impose hard limits on the human folly coming out of Silicon Valley.

The base model is trained on large amounts of raw data. Especially in the early days of AI, data was scooped up from the public Internet by tech companies with little regard for IP or copyright issues – especially from book repositories, websites (via crawlers and scrapers) and social media platforms (via APIs). These datasets provide input-output pairings – a process initially done very manually by the “wage slaves” of early AI, who spent many hours drawing polygons around figures and annotating data. Having good quality training data is key: a large set of input-output examples is needed so that the model can tweak and generalize the parameters in its “brain” algorithm, ultimately predicting unseen input data (during testing and inference stages) with high accuracy. Once quality data accumulated, professional dataset repositories appeared – sometimes augmenting real-world data with AI-generated synthetic data – and began to offer data to LLM companies. The basic unit of text used by an LLM is a token (1D); for images it is pixels (2D), for audio it is audio tokens (1D waveform or 2D spectrogram) and for video it is voxels (3D). LLMs with the same learning “brain” work by the same principles, but the data processing steps may differ due to differences in the complexity of the input data. Data pre-processing can be a very demanding and challenging stage preceding the bulk of an LLM’s compute phases, ensuring the LLM works properly and outputs quality responses.

The LLM brain works by converting these units of data into mathematical objects, storing them in containers (think: matrices and arrays), then performing a series of transforms on them inside the neural network “brain”. Here we will assume a transformer-based attention model – the brainchild of a team of international researchers at Google, who co-authored the seminal 2017 paper on the topic (“Attention Is All You Need”). Attention layers are sequential steps of matrix operations proxying a learning algorithm that very crudely mimics neurons in the human mind. A matrix (A×B) can be thought of as a dataframe or a spreadsheet: A rows by B columns of data. Instead of holding mixed data types in cells, the matrices in LLMs tend to hold numbers, unless a mapping between strings and numbers is required. Matrices fed into computers are turned into array objects to speed up computation. The model then iterates millions, billions and possibly trillions of times to uncover statistically significant hidden distances and similarities between all the bits. Insignificant “noise” is filtered from the signal by scaling it down beneath a threshold. What is finally left are the “signals”, meaning the artificial intelligence has learnt the mechanics of human language through mathematics and probability. It must be pointed out that there are constraints on LLM accuracy, and trade-offs exist – namely accuracy versus time and compute. The “dimensionality” problem leads to model overfitting (too much noise); reducing dimensionality is therefore key to squeezing out noise and refining the learnt “signals” from the input data, using those signals to then infer unseen data.

If training builds a model’s parameters (weights & biases), testing generates its performance metrics. In general machine learning, the total dataset is split according to the ‘training-test split’ – usually 80% of the data goes to training and 20% to testing (held-out or ‘unseen’ data used to evaluate forecasts). In many cases there is an additional validation set (training-validation-testing 80%-10%-10%). But in LLMs the splits are different due to the sheer sizes of the datasets: often 99%+ goes to training, with just 1% held out for validation and testing to gauge generalization – which is still millions of tokens out of a pool of trillions. If testing results are acceptable, the LLM’s performance is compared against industry-standard benchmarks – these are often thrown around between competing LLM companies but also referenced by their users. The model is then packaged up and made available open source or, if closed source, hosted on private cloud architecture. What is actually open source are the weights and biases (parameters) and some model settings, not the actual training data itself. Finally, inference applies the static weights to unseen data when users interact with the finished product: when users run an LLM, it multiplies its pre-trained weights with the input prompt via matrix operations to predict an output response. Note that the bulk of this article explores how LLM training works; understanding training largely covers inference too, since inference reuses the same machinery with frozen weights.

Pre-requisites

A matrix M ∈ R^(N×D) means an object with N rows and D columns that stores floats (floating point numbers). If R were Z, it would store integers; if C, complex numbers. This is just notation: R means ‘real numbers’ and ∈ means ‘is an element of’, so M belongs to the set R^(N×D). Think of a matrix as analogous to a spreadsheet or dataframe; Mi,j, e.g. M2,3, represents the element (value) at the row 2, column 3 position. I will use the notation R^(N×D) to track matrices as they flow through the LLM pipeline, giving an indication of sizing. A matrix is an abstract mathematical object; in actual computing, arrays are used in memory – list-like objects that are less human-readable. Matrices (arrays) are the containers that will hold the numbers an LLM will store, compute and transform.

When multiplying integers, we intuitively understand that direction does not matter: 7 × 3 == 3 × 7. But when multiplying matrices together, direction matters: M1 × M2 != M2 × M1, this property is called non-commutativity. Thus the output of M1 × M2 would be a different matrix than M2 × M1.

When multiplying 2 matrices together, some basic rules govern whether this is possible and, if so, the size of the resulting matrix. If M1 ∈ R^(N×D) and M2 ∈ R^(O×E) and we wish to compute M1×M2, it is possible only if the number of columns of M1 matches the number of rows of M2, i.e. D = O, and the resulting matrix will be of size N×E (N rows, E columns). This is because the multiplication works by multiplying the numbers in M1’s rows by those in M2’s columns, which must be of the same length. The total number of multiplication operations is N×E×D (where D is the same as O), the total number of additions is N×E×(D-1), and the total number of all operations is N×E×(2D-1). To color this with an example: for M1 ∈ R^(3×2) and M2 ∈ R^(2×4), M1×M2 ∈ R^(3×4). When it comes to measuring the number of operations at a computational level, there is some confusion due to legacy conventions. The total operation count (FLOPs – floating point operations; FLOPS with a capital S means operations per second) is N×E×(2D-1), or 36 in our case. However, this is actually a mid-point estimate of the true figure. Modern GPUs do add+multiply in a single fused operation (FMA), so in reality the true number of FLOPs is 24, while convention says 48. Therefore, to get true FLOPs, divide the conventional number by 2 for the FMA adjustment. This convention predates modern GPUs, when the FLOP measure was computed as 2×N×E×D, and it persists in the industry.
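The shape rule and the operation counts above can be sketched in NumPy, using the same 3×2 by 2×4 example from the text (the actual values in the matrices are arbitrary; only the shapes matter here):

```python
import numpy as np

# M1 is 3x2 (N=3, D=2); M2 is 2x4 (O=2, E=4). Valid because D == O.
M1 = np.arange(6).reshape(3, 2)
M2 = np.arange(8).reshape(2, 4)
out = M1 @ M2                        # resulting shape is N x E = 3 x 4

N, D = M1.shape
O, E = M2.shape

mults = N * E * D                    # multiplications: 3*4*2 = 24
adds = N * E * (D - 1)               # additions:       3*4*1 = 12
total_ops = N * E * (2 * D - 1)      # all operations:  36 (mid-point figure)
conventional_flops = 2 * N * E * D   # legacy 2NED convention: 48
fma_flops = conventional_flops // 2  # fused multiply-add count: 24
```

Running this confirms the 24/36/48 figures quoted in the text for the 3×2 by 2×4 case.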

Let token 1, token 2 and token 3 have 3 features, symbolized by “x”, “y” and “z”. Token 1 has feature values 1, 2 and 3; token 2 has 4, 5 and 6; token 3 has 7, 8 and 9. In vector form they would be represented as 1x + 2y + 3z, 4x + 5y + 6z and 7x + 8y + 9z respectively. The “x, y and z” are abstractions of the features (basis vectors) while the actual scalars (numbers) are their weights. Matrix A is an R^(3×3) matrix holding the weights, while the basis matrix is an R^(3×1) column vector mapping the weights to their appropriate features. The result is an R^(3×1) matrix on the right, with 3 rows and 1 column. If the basis column matrix were on the left-hand side instead, we would get R^(3×1) × R^(3×3), which cannot be computed since the number of columns in the first matrix (1) does not equal the number of rows in the second (3). This demonstrates the non-commutativity of matrix multiplication. For illustrative purposes only:

| 1 2 3 |   | x |   | 1x + 2y + 3z |
| 4 5 6 | × | y | = | 4x + 5y + 6z |
| 7 8 9 |   | z |   | 7x + 8y + 9z |

The rows represent tokens; the columns represent a token’s features. The weights are what actually get multiplied and scaled in LLMs; the basis vectors are merely accounting tricks to keep the algebra structured. Higher mathematics is akin to matryoshka dolls: start with a simple abstraction and encapsulate it within a higher abstraction, and so on. If we begin with 3 tokens and their properties, wrap them into vectors, wrap vectors into matrices, and wrap matrices into arrays stored in memory, computers can then process them efficiently in various operations. Matrix multiplication is represented by the symbol “×”, while dot products are represented with “⋅”. Dot products are a special case of matrix multiplication. In the above, the final matrix consists of a dot product at each row: 1x + 2y + 3z, 4x + 5y + 6z and 7x + 8y + 9z. When two vectors are multiplied together, their dot product is a scalar (number): if a1 = 1x + 2y + 3z and a2 = 4x + 5y + 6z, then a1⋅a2 = (1×4) + (2×5) + (3×6) = 32.

Finally, a note on the matrix transpose. If we take the weight matrix A as above, the transpose of A is given by A^T (rows and columns are flipped). We will encounter transposes later, so it may be helpful to see one:

A = | 1 2 3 |      A^T = | 1 4 7 |
    | 4 5 6 |            | 2 5 8 |
    | 7 8 9 |            | 3 6 9 |
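These prerequisites – the dot product, the transpose and the shape rule behind non-commutativity – can be checked in a few lines of NumPy (the all-ones column stands in for the symbolic basis vector, purely for illustration):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])            # the weight matrix from the text

# Dot product of the first two token vectors: (1*4) + (2*5) + (3*6) = 32.
a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, 6])
dot = int(a1 @ a2)

# Transpose flips rows and columns: A^T[i, j] == A[j, i].
At = A.T

# Shape rules: a (3x1) column cannot left-multiply a (3x3) matrix,
# but (3x3) @ (3x1) is valid and yields a (3x1) result.
basis = np.ones((3, 1))              # stand-in for the (x, y, z) column
try:
    basis @ A                        # inner dimensions 1 != 3: invalid
    shape_error = False
except ValueError:
    shape_error = True
ok = A @ basis                       # valid: (3x3) @ (3x1) -> (3x1)
```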

1. Tokenization

During training, a model processing a sentence like “Mango is a tropical fruit that is sweet and sour when semi ripe but isn’t handled well when ripe” tries to predict the next token at each step in sequence. The text is first split into N tokens by a tokenizer algorithm – in this example the sentence is split into N = 20 tokens. Each token becomes an element in an array representing a 1×N matrix — conceptually a row vector of token IDs:

tokens = ["Mango", " is", " a", " tropical", " fruit", " that", " is", " sweet", " and", " sour", " when", " semi", " ripe", " but", " isn", "'t", " handled", " well", " when", " ripe"]
token_ids = [30, 1230, 3, 67, 156, 78, 1230, 777, 53, 23, 55, 333, 99, 223, 4567, 101, 1001, 598, 55, 99]

These integers correspond to vocabulary indices. Think of the array index of token_ids as the positional information about the tokens while the value tells the model which token it is. But the process will need to store further information about intricacies of language (semantics, context) and transform them through various matrix operations.
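The lookup logic can be sketched as a toy tokenizer (the vocabulary and IDs here are made up for illustration and match only the example IDs above; real tokenizers use learned subword schemes such as BPE or SentencePiece):

```python
# Toy vocabulary: token string -> integer ID. A real vocabulary would hold
# 50K-150K subword entries learned from a large corpus.
vocab = {"Mango": 30, " is": 1230, " a": 3, " tropical": 67,
         " fruit": 156, " when": 55, " ripe": 99}

def encode(tokens):
    # Map each token string to its vocabulary index (ID).
    # Position in the output list = position of the token in the sequence.
    return [vocab[t] for t in tokens]

token_ids = encode(["Mango", " is", " a", " tropical", " fruit"])
```

Note that a repeated token always maps to the same ID; only its position in the list changes.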

2. Embeddings: Turning Tokens Into Vectors

To this end, separate empty array ‘containers’ are prepared to store features about language; these matrices are called embeddings. Let us call this new embedding matrix E, which will form a global lookup matrix. But how many features do we need? This is an arbitrary number which balances computational time with accuracy. The hyperparameters in the model are fixed so they apply equally to all tokens across all steps. The parameter D is fixed as a hyperparameter before training commences; let us use a value of D = 4096 from now on, meaning a maximum of 4096 features will be extracted by the model for each and every token. Not all 4096 features will be useful for each token – imagine some being tiny (insignificant) semantic features while others are substantive semantic features unearthed for that token from language text. Each token ID maps into an embedding matrix E, typically of size V×D (vocabulary × features). V is the vocabulary size, another LLM hyperparameter (set via the tokenization scheme) chosen by engineers and budgets before training commences, typically ranging from 50K to 150K. The vocabulary is essentially a dictionary of language (English in this case) containing all token types compiled from a huge corpus of words (billions); statistically, most of our N tokens will be found in the embedding matrix E, else they are mapped to an ‘unknown’ type, which is an edge case. It is from this vocabulary that we obtain token_ids. If V is set to 100K, then the N tokens will be looked up against the 100K vocabulary. During processing, each token retrieves a D-dimensional embedding vector, so our batch of tokens forms an N×D matrix X; X is purely a lookup from E – nothing new is computed.

E = V×D matrix (100000×4096) – very large global matrix of language vocabulary tokens
X = lookup(E,[t1,t2,…,tN]) (N×D matrix or 20×4096) – smaller lookup matrix for N tokens
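The lookup step can be sketched with tiny stand-in sizes (V = 100 and D = 8 instead of 100K and 4096; values randomly initialized, as they would be before training):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 100, 8                          # toy stand-ins for 100K vocab, 4096 dims
E = rng.standard_normal((V, D))        # global embedding table, random init

token_ids = [30, 12, 3, 67, 30]        # toy IDs (must be < V); note 30 repeats

# X is a pure row lookup from E: no arithmetic, just fancy indexing.
X = E[token_ids]                       # shape (N, D) = (5, 8)
```

Because it is a pure lookup, the same token ID always pulls the same row, wherever it appears in the sequence.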

Each embedding dimension (each column of X) corresponds to an abstract feature of language which is learned, not predefined. These features are embodied by scalars (floating point numbers, or floats) called weights. Thus E contains weights, and a row for token N = 3 might look like [0.2, -0.5, 2.67, 6.2, …]. These matrices are containers, with weights initialized and refined over time. During training the weights get updated through back-propagation (we’ll get into that later). The forward pass from now until back-propagation works with X, which pulls up the relevant rows of E and is where the current state of those weights lives; from this step onward the weights are changed through X and its inheritors. Dimension D = 1031 might encode noun-likeness, D = 690 might capture technical jargon and D = 55 might capture frequency skew. At this point, all weights are randomly initialized from a sample distribution such as a Gaussian (normal) distribution, so the container X is not empty.

3. Adding Position to Embeddings

To inject order information into this matrix, a positional embedding P (also an N×D matrix) is added to X. Think of the unique integer index in token_ids as the position information for each token in the sequence. The reason for this summation is to mathematically distinguish phrases like “bird flies through the air” from “air flies through the bird” via token-ordering effects.

X′ = X + P
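A minimal sketch of this summation, with toy sizes and random values standing in for the learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 20, 8                          # toy stand-ins for N tokens, D = 4096

X = rng.standard_normal((N, D))       # token embeddings looked up from E
P = rng.standard_normal((N, D))       # positional embeddings, one row per position

# Element-wise sum: row i of P is tied to position i, so the same token
# at a different position yields a different combined vector.
X_prime = X + P
```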

4. Normalization and Numerical Stability

Next, a common data technique called normalization is applied to boost computational performance and numerical stability, so that wildly varying floating point numbers are brought onto a common scale. This logic will be tied to probability later on in the process, and conceptually resembles quantum mechanics, where the square of the magnitude of a normalized vector (amplitude) is proportional to the probability of finding a particle in a given state. Only normalized states evolve meaningfully and comparably over inference and training. Before entering the transformer stack, X′ is normalized.

XNorm = LayerNorm(X′)

The LayerNorm function acts on each row of X′, across all its 4096 dimensions, normalizing the floating point numbers (weights) by computing a Z-score for each embedding row vector such that all rows have mean = 0 and variance = 1 irrespective of scale. For token N = 3, the raw row [0.56, 0.78, 1.45, -0.23, …] might have, say, mean = 0.64 and standard deviation = 0.60. LayerNorm then renders it as [-0.13, 0.23, 1.35, -1.45, …], with mean = 0 and standard deviation = 1 – basically Z-scores. Normalization is common in machine learning data pre-processing to prevent very large or small values from blowing up or vanishing too early on. The normalized matrix XNorm (retaining the N×D shape) is the stabilized base state from which learning proceeds.
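A minimal LayerNorm sketch (without the learnable gain/bias terms full implementations add); fed the 4-dimensional row from the text, it reproduces the mean 0.64, standard deviation 0.60 and Z-scores quoted above:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row (token) across its D features to mean 0, variance 1.
    # eps guards against division by zero for near-constant rows.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[0.56, 0.78, 1.45, -0.23]])   # raw row from the text
x_norm = layer_norm(x)                       # -> [[-0.13, 0.23, 1.35, -1.45]]
```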

5. Transformer & Multi-Head Attention: Q, K, and V Matrices

The LLM we are analyzing uses a transformer as its learning engine. There are many types of LLM learning algorithms; the transformer is a common one. The transformer expands the dimensionality of the input XNorm from D dimensions to something more (usually a quadrupling, so 4D dimensions), applies linear and non-linear transforms on the matrices to capture hidden semantic patterns in language, extracts the most meaningful of those relationships, then compresses dimensionality back to the original size. This is where the bulk of the compute happens and where weights (and biases) are refined. XNorm is forward passed sequentially through B blocks, within each block L attention layers, and within each attention layer H heads. Each layer acts as a new set of neurons learning about language, encoding statistical patterns between tokens to understand which token Ni comes next from the previous one Ni-1. One full forward pass ends at the last head of the last attention layer in the last block; then gradients back-propagate to the embedding matrix X in step 2, the weights get updated, and the process repeats.

In this example we will use a 32-block transformer model (B = 32) with 32 attention layers (L = 32) and 32 attention heads (H = 32), which is a real-world configuration for some models. Usually the block and attention layer counts are identical since each block contains one attention layer, but head counts differ (in this case they happen to match). Across blocks, depth of learning improves: early blocks begin with local token patterning, building into a global framework of language semantics and context in later blocks. Each block consists of attention + FFN parts processed in sequence. Within each attention layer, 32 heads process in parallel, but across blocks and attention layers processing is sequential, with the outputs of each layer becoming the inputs to the next. The first attention layer splits the D dimensions into batches of sub-dimensions per D/H, or 4096/32 = 128 dimensions per head, for 32 attention heads. Each head processes 128 features in parallel in 32 independent subspaces. Each head computes 3 matrix projections of the normalized input XNorm and 1 attention matrix A from scaled dot products. Before the attention layer, for token N = 1 “Mango” in our original sentence example, we only know the position of this token and some properties. The embedding matrix E answers the question “which token is this?” and the activated embedding matrix XNorm answers “what does the current token mean on its own?”. The weights in E quantify what “Mango” means as a static token; the weights inherited and normalized in XNorm quantify “Mango” and its features. At the attention layer, these weights are multiplied by new sets of weights – representing the degree of ‘mixing’ across tokens – adding contextual information to help us predict what token comes after “Mango”.

The Q, K & V projection weight matrices WQ, WK & WV are D×128 containers that will store the many cross-token sub-features we wish to extract from language, holding their own new sets of weights. They multiply with the XNorm weights to produce the Q, K & V projections. Together, Q, K & V help answer “what does the current token mean in the context of its neighboring tokens?”.

For each head h, 3 (N×128) Q, K & V matrices and 1 (N×N) attention matrix Ah are generated:

Qh = XNorm×WQ
Kh = XNorm×WK
Vh = XNorm×WV

Ah = Qh⋅Kh^T/√(D/H) = Qh⋅Kh^T/√128

The matrix multiplication takes XNorm, a N×D matrix and multiplies by D×128 matrices to yield N×128 matrix projections Q, K and V for each head – transformed embeddings of XNorm. XNorm evolves from static features into context-aware embeddings through relational weighting. Here is qualitatively what the numbers actually mean for token N = 1 “Mango”:

E[30,:] = [1.23, -0.45, 2.67, 0.89, -1.12, …DN] Embedding E vocabulary matrix at say position 30
Dimensions 1-200: [1.23, -0.45, …] features relating to fruit family (relates to melon, banana etc)
Dimensions 201-400: [2.67, 0.89, …] features tropical origin (matches pineapple or coconut etc)
Dimensions 401-600: [-1.12, 0.34, …] features sweet taste profile
Dimensions 601-800: [0.78, -0.67, …] features yellow/orange color
And so on.

Activated embedding XNorm at position 1 (token N = 1 “Mango”) has all features from E along with position information XNorm[1,:] = [-0.13, 0.23, 1.35, …]. The next row at XNorm[2,:] would be the feature set for token N = 2 “is”, and so on.

WQ, WK & WV are matrices that store weights which are randomly initialized, learnt and adjusted (updated) in each row per token N. XNorm (N×D) multiplies a (D×128) weight matrix W to yield N×128 matrix projections that look like below for the first 3 tokens “Mango is ripe”:

First 3 rows (tokens) for 128 dimensions:

Qh[0,:] = XNorm[0,:]×WQ = [-0.56,0.89,…W128] “Mango” seeks verbs, food context etc
Qh[1,:] = XNorm[1,:]×WQ = [1.23,-0.45,…W128] “is” seeks verb context, nouns etc
Qh[2,:] = XNorm[2,:]×WQ = [-0.78,1.67,…W128] “ripe” seeks adjective targets etc

Kh[0,:] = XNorm[0,:]×WK = [-0.23,1.45,…W128] “Mango” offers noun, fruit type etc
Kh[1,:] = XNorm[1,:]×WK = [0.54,-0.67,…W128] “is” offers verb patterning etc
Kh[2,:] = XNorm[2,:]×WK = [1.12,0.11,…W128] “ripe” offers maturity descriptors etc

Vh[0,:] = XNorm[0,:]×WV = [-0.02,3.54,0.68…W128] “Mango” contains tropical traits, sweet flavor etc
Vh[1,:] = XNorm[1,:]×WV = [2.19,0.93,4.78…W128] “is” contains verb role etc
Vh[2,:] = XNorm[2,:]×WV = [-1.77,0.88,1.33…W128] “ripe” contains ripeness etc

So we see D/H = 128 features getting scaled by weights (floats) which are learnt over time, in proportion to the significance of the feature relative to the token at hand across each row N. The reason for these operations is to extract contextual meaning from language. Attention is where the bulk of an LLM’s FLOPs, or compute, is consumed. At an abstract level, the Q, K and V matrices contain information on tokens. Q determines what the token is ‘querying’ – what tokens tend to precede it and what tokens tend to follow it. K determines what a token’s role is in language, and V determines properties of the token. The dot products between the Q and K matrices (Q⋅K^T) score the interplay of effects between them. The bigger the dot product, the higher the contextual relation between that pair of tokens and the more likely those tokens follow each other. A low or negative dot product indicates near-orthogonal vectors with little similarity between the tokens. N² dot products are done in total between all pairwise tokens per head H, where the transpose of K (K^T), not K itself, is multiplied by Q to give an N×N matrix – otherwise Q⋅K would give an invalid shape. Quite often in machine learning there are dimension errors from array shapes not aligning. The attention matrix Ah encodes pairwise relationships between every token pair. The Q⋅K^T dot products are called scores, and they are slightly adjusted by a scaling factor (dividing by √(D/H), which in our example is √(4096/32) = √128). The scaling factor stabilizes the dot products from growing too quickly. The total number of weights in each attention layer (WQ, WK & WV across all heads, plus WO) is 4D², in this case ~67 million.
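The per-head projections and scaled scores can be sketched with toy sizes (D = 64 and H = 8 instead of 4096 and 32; all weights random, i.e. the state before any training):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, H = 20, 64, 8
d_head = D // H                         # features per head (128 in the text)

X_norm = rng.standard_normal((N, D))    # normalized embeddings from step 4
W_Q = rng.standard_normal((D, d_head)) * 0.1
W_K = rng.standard_normal((D, d_head)) * 0.1
W_V = rng.standard_normal((D, d_head)) * 0.1

Q = X_norm @ W_Q                        # (N, d_head) queries
K = X_norm @ W_K                        # (N, d_head) keys
V = X_norm @ W_V                        # (N, d_head) values

# Q times K-transpose gives one score per ordered token pair (N x N),
# scaled down by sqrt(d_head) to stop the dot products growing too fast.
A = (Q @ K.T) / np.sqrt(d_head)
```

Note the transpose: `Q @ K` would be (N, d_head) times (N, d_head), an invalid shape, which is exactly the dimension error described above.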

A special type of exponential normalization, the Softmax function, is now applied to the attention matrix Ah, where the raw Q⋅K^T scores per head across all N tokens (not dimensions D) live, yielding scaled attention weights WA. The function uses exponentiation per the formula e^x / Σ e^x, applied at row level on the attention matrix Ah; e.g. Softmax on a row with 2 tokens’ Q⋅K^T scores of [2, 5] produces [0.047, 0.953]. These Softmaxed weights are known as context weights. The scores are effectively normalized and bounded between 0 and 1, stored in an N×N matrix WA called the Softmaxed attention matrix, with each row representing one token’s scores against all other tokens – in essence, probabilities for each token N over which other tokens in the sequence are most relevant to it. This probabilistic approach is how the mechanics of language semantics and context are learnt by machines.

WA = Softmax(Ah)
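A sketch of Softmax, using the standard max-subtraction trick for numerical stability (subtracting the row maximum leaves the result unchanged but prevents overflow in the exponentials); it reproduces the [2, 5] example:

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax: e^x / sum(e^x), shifted by the row max for stability.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

row = softmax(np.array([[2.0, 5.0]]))   # -> approximately [0.047, 0.953]
```

Each output row sums to 1, which is what lets us read the attention weights as probabilities.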

Let us spend some time interpreting the attention weights in the Softmaxed attention matrix WA for a small example with 3 tokens, “Mango is ripe”, and 3 features (rather than 128). In this illustrative example attention flows both ways in the matrix, capturing bidirectional context (note that decoder-only autoregressive models apply a causal mask so each token attends only to itself and earlier positions). Each row represents how much attention (in probability) each token pays to all the others. Each column represents how much attention the other tokens pay to the token at hand.

WA = | 0.45 0.48 0.07 |
     | 0.23 0.62 0.15 |
     | 0.12 0.18 0.70 |

Row level
WA[0,:] = [0.45, 0.48, 0.07] “Mango” attends to 45% self + 48% “is” + 7% “ripe”
WA[1,:] = [0.23, 0.62, 0.15] “is” attends to 23% “Mango” + 62% self + 15% “ripe”
WA[2,:] = [0.12, 0.18, 0.70] “ripe” attends to 12% “Mango” + 18% “is” + 70% self

Column level
WA[:, 0] = [0.45, 0.23, 0.12] 45% self + 23% “is” + 12% “ripe” attends to “Mango”
WA[:, 1] = [0.48, 0.62, 0.18] 48% “Mango” + 62% self + 18% “ripe” attends to “is”
WA[:, 2] = [0.07, 0.15, 0.70] 7% “Mango” + 15% “is” + 70% self attends to “ripe”

Vh = | -0.02 3.54 0.68 |
     |  2.19 0.93 4.78 |
     | -1.77 0.88 1.33 |

Attention weights then scale Vh outputting an updated embedding matrix Zh, encoding how each token carries information about surrounding context. Over time and layers, the attention weights (scores or dot products) within these matrices evolve as new contextual information is refined about the tokens.

Zh = WA×Vh

= |0.45×-0.02 + 0.48×2.19 + 0.07×-1.77, 0.45×3.54 + 0.48×0.93 + 0.07×0.88, 0.45×0.68 + 0.48×4.78 + 0.07×1.33|
|0.23×-0.02 + 0.62×2.19 + 0.15×-1.77, 0.23×3.54 + 0.62×0.93 + 0.15×0.88, 0.23×0.68 + 0.62×4.78 + 0.15×1.33|
|0.12×-0.02 + 0.18×2.19 + 0.70×-1.77, 0.12×3.54 + 0.18×0.93 + 0.70×0.88, 0.12×0.68 + 0.18×4.78 + 0.70×1.33|

= |0.9183, 2.10, 2.6935| “Mango” contextualized
|1.0877, 1.523, 3.3195| “is” contextualized
|-0.8472, 1.2082, 1.873| “ripe” contextualized

Zh is normally an N×(D/H) matrix but in this simple example it is a 3×3 matrix. When we multiply a token’s feature vector from Vh by its attention weights for all other tokens, the features get scaled by context.
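This worked example can be verified directly in NumPy using the WA and Vh matrices above:

```python
import numpy as np

WA = np.array([[0.45, 0.48, 0.07],     # softmaxed attention weights;
               [0.23, 0.62, 0.15],     # each row sums to 1
               [0.12, 0.18, 0.70]])
Vh = np.array([[-0.02, 3.54, 0.68],    # value vectors for "Mango",
               [ 2.19, 0.93, 4.78],    # "is" and "ripe"
               [-1.77, 0.88, 1.33]])

# Each output row is a context-weighted mix of all value rows:
# row 0 = 0.45*Vh[0] + 0.48*Vh[1] + 0.07*Vh[2], and so on.
Zh = WA @ Vh
```

Running this reproduces the contextualized rows computed by hand in the text ([0.9183, 2.10, 2.6935] for “Mango”, etc.).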

6. Combining Heads

Once all H = 32 heads finish in parallel in layer L = 1 (of 32 layers) – back to our example with 128 features per head – the Zh outputs are concatenated and linearly transformed back to the original space. As the Zh from all H = 32 heads concatenate, the contextualized cross-token information expands back to the original embedding width of 32×128 = 4096, i.e. ZT is an N×4096 matrix. But what exactly is WO?

ZT = [Z1,Z2,…,ZH]×WO

WO is a container that will hold additional weights: a D×D weight matrix. But the importance of WO is not only holding weights that scale all contextualized token values; it also learns the most useful mix of contextual cross-token features from all heads. Its weights self-update as training progresses, and new learning links back to the original embeddings so information is preserved – in a similar way to how blocks are hashed in blockchains, preserving their history. As with all weights, they are randomly initialized from a Gaussian distribution and subsequently refined. The first layer’s concatenated ZT is now summed with the original input that went into layer L = 1, XNorm, then normalized – this is the residual eA1, which becomes the input to the next stage in 6a. The A1 subscript denotes the attention residual of attention layer 1 before entering the next stage, distinct from eMLP1, a second residual that arises at the end of that stage. The importance of this step is to add the new learning to the original signal so the combined effect is retained.

eA1 = LayerNorm(XNorm+ ZT)A1
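As a toy sketch of this combine-and-project step (hypothetical small sizes and random values; the LayerNorm here omits the learned gain and bias for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 4, 32, 4                     # toy sizes; the article's example uses D = 4096, H = 32
d_head = D // H

# Hypothetical per-head outputs Zh, each of shape N×(D/H)
Z_heads = [rng.normal(size=(N, d_head)) for _ in range(H)]

# Concatenate heads back to N×D, then mix with the output projection WO (D×D)
Z_cat = np.concatenate(Z_heads, axis=1)
W_O = rng.normal(size=(D, D)) * 0.05
Z_T = Z_cat @ W_O

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

X_norm = rng.normal(size=(N, D))       # the input that entered the attention sublayer
e_A1 = layer_norm(X_norm + Z_T)        # residual add + LayerNorm
print(e_A1.shape)                      # (4, 32)
```

Each row of e_A1 now has mean ~0 and unit variance, ready to enter the FFN sublayer.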

Across time (training steps), every matrix X, P, WQ, WK, WV and WO evolves as gradient descent updates weights to reduce prediction error. So at its heart, the “story” of an LLM is the continuous, coordinated transformation of matrices to better reflect latent structures of language.

6a. FFN/MLP Layer

The residual at the end of the attention head concatenation, eA1, is fed as input into the FFN/MLP layer, where the dimensions expand. This always happens in sequence, after attention. This sublayer is called a Feed Forward Network (FFN) or Multi-Layer Perceptron (MLP). Learning gets a boost in the FFN when dimensions expand, capturing hidden features of language. Statistically significant signals are scaled by weights and dot products, then the dimensions are compressed back down to their original size. Most of the weights engaged in an LLM’s compute live in the FFN layers; the rest are attention weights.

The first linear transform takes the N×D matrix of the first residual eA1 and multiplies it by a D×4D weight matrix WUP, subscripted UP due to the expansion of dimensions from D→4D. The output is a N×4D matrix l1 holding raw un-normalized scores called pre-activations. The weights in WUP are randomly initialized; they will be updated with new learning. The term BUP is the bias term that adds translation to the linear transform. Together the weights and biases form the bulk of a model’s parameters.

l1 = eA1×WUP + BUP

The next stage applies a non-linear GELU transformation (Gaussian Error Linear Unit function) to the values, preserving the N×4D structure. In transformer based LLMs, only the linear transforms are counted as layers, as the non-linear phase carries no weights and biases of its own. The benefit of applying non-linearity is to capture patterns beyond simple linear relationships, which alone would limit language expressiveness. Furthermore, non-linearity smooths out gradients. The GELU function is approximated as GELU(x) ≈ 0.5x(1 + tanh[√(2/π)(x + 0.044715x³)]).

g1 = GELU(l1)

The output g1 enters a final linear transform where dimensions contract from a N×4D matrix back down to N×D. This is done by multiplying by a 4D×D WDOWN weight matrix and summing with a bias term. The result is ∆MLP1.

∆MLP1 = g1×WDOWN + BDOWN

The second and final residual for layer L = 1 is the sum of the original signal (the first residual) with new learning from the MLP sublayer, then normalizing the result.

eMLP1 = LayerNorm(eA1 + ∆MLP1)
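The whole FFN sublayer can be sketched in a few lines of numpy (toy sizes, random weights, and the tanh approximation of GELU; a sketch, not a production implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 4, 32                           # toy sizes; the article's example uses D = 4096, 4D = 16384

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

e_A1 = rng.normal(size=(N, D))         # attention residual entering the FFN
W_up, b_up = rng.normal(size=(D, 4 * D)) * 0.05, np.zeros(4 * D)
W_down, b_down = rng.normal(size=(4 * D, D)) * 0.05, np.zeros(D)

l1 = e_A1 @ W_up + b_up                # expand: N×4D pre-activations
g1 = gelu(l1)                          # non-linearity, shape preserved
delta_mlp1 = g1 @ W_down + b_down      # contract: back to N×D
e_MLP1 = layer_norm(e_A1 + delta_mlp1) # second residual of the block
print(l1.shape, e_MLP1.shape)          # (4, 128) (4, 32)
```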

6b. Mixture of Experts (MoE) Models

I wish to briefly touch on the innovation of MoE and where in the process they differ from the default (dense) FFN layer. A MoE model is defined by 2 parameters – the number of experts E (8, 16, 32…) and the top K experts (2, 4, 8…). Let us assume a common setup where E = 8 and K = 2, so the top 2 experts in an 8 expert model. The first thing that happens in an MoE model after the last step is to pass the residual embedding matrix eA1 (size N×D) through a small linear layer called a Router, of size E×D, used to compute a relevance score for each of the E experts. In the example, this translates to activating only 8×4096 = 32,768 parameters per token in the router matrix. The router takes the residual eA1 and computes a raw score R1E for the input token N (technically N is a batch of tokens) for each expert E, with its own set of weights and biases, outputting a 1×E matrix, or row vector.

R1E = eA1×WROUTER + BROUTER, one score per expert E in {1,2,3,4,5,6,7,8}

These raw scores look like p = [1.65, −0.62, 6.4, 3.5, 1.3, −1.2, 0.2, 0.05] for example. A Softmax is then applied, normalizing the raw scores into probabilities using e^pi / Σ e^pj to give an array of scores which looks like [0.0080, 0.0008, 0.9310, 0.0510, 0.0057, 0.0005, 0.0019, 0.0016]. The Softmax probabilities in the router vector rank all 8 experts by their predictive power, each expert having its own set of weights and biases. Large positive raw scores correlate with higher predictive power; negative scores indicate low predictive power. Due to the exponential normalization in Softmax, high scores like 6.4 (93%) dominate and crush bad scores towards zero probability. Note the 3rd and 4th experts have the highest scores for predicting the next token N+1 relative to the input token N. A top K = 2 MoE means the probabilities are ordered and the top 2 experts are selected, in this case the 3rd and 4th experts (93% and 5.1%). The probabilities are then normalized among the top K experts only, to calculate their relative weights, called gate weights (gw). For expert 3 and expert 4, the gate weights are computed as such:

gw3 = 0.931 / (0.931 + 0.051) = 0.948

gw4 = 0.051 / (0.931 + 0.051) = 0.052

What happens next is the top K = 2 selected experts are each run through the full FFN forward pass; each expert applies its own linear expansion, non-linear GELU and linear contraction:

l1 = (eA1×WUP + BUP)expert1, l2 = (eA1×WUP + BUP)expert2
g1 = GELU(l1)expert1 , g2 = GELU(l2)expert2
(g1×WDOWN + BDOWN)expert1 , (g2×WDOWN + BDOWN)expert2

The final outputs of the linear ‘down’ transforms are mixed with their gate weights as a weighted average of both experts, giving the MoE ∆MLP1:

∆MLP1 = ((g1×WDOWN + BDOWN)expert1 × 0.948) + ((g2×WDOWN + BDOWN)expert2 × 0.052)
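The routing arithmetic can be sketched directly, reusing the raw router scores from the example (a toy sketch; real implementations batch this across tokens):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Raw router scores for the E = 8 experts (from the example above)
raw = np.array([1.65, -0.62, 6.4, 3.5, 1.3, -1.2, 0.2, 0.05])
probs = softmax(raw)

# Select the top K = 2 experts and renormalize among the winners: the gate weights
K = 2
top = np.argsort(probs)[-K:][::-1]       # indices of the two highest-probability experts
gates = probs[top] / probs[top].sum()
print(top, np.round(gates, 3))           # [2 3] [0.948 0.052]
```

The final ∆MLP1 is then the gate-weighted sum of the two selected experts’ FFN outputs.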

To summarize MoE models, they differ only in the FFN block. Dimensionality remains the same as the dense version (without MoE) – D expands to 4D then contracts back to D again. What changes is how parameters are used. The dense version fully engages 4096×16384 + 16384×4096 = 134 million parameters across both linear layers while the MoE engages only a fraction of that: 2 experts, so (134 million / 8)×2 = 33.5 million parameters or 25% of the full dense model. At a token level it engages only 25% of the dense compute, using 2 experts instead of 8, and is roughly 4 times faster during inference. Thus, inference times are faster for MoE models, allowing them to yield better results on weaker GPUs than their fully dense peers. During inference, only the top K experts are computed per token, meaning the model effectively uses only a subset of its total parameters. A 47 billion parameter MoE model with the aforementioned setup (top 2 out of 8 experts) computes with only ~12 billion parameters, resulting in computational behavior closer to a 12 billion parameter dense model, which translates to better performance (not necessarily accuracy, however).

7. Across Transformer Blocks

The processing completes for the first transformer block. For a 32 block transformer model, eMLP1 becomes the input to the next block and so on, 32 times. The second transformer block corresponds to attention layer L = 2 and the process repeats.

Q2 = eMLP1×WQ
K2 = eMLP1×WK
V2 = eMLP1×WV
A2 = Q2⋅K2T/√(D/H) = Q2⋅K2T/√128

And so on.
As transformer blocks progress, the residual stream carries vital information from learning forward, so it is retained across layers. At layer L = 32 in the last transformer block we have the final residual:

eMLP32 = eA32 + ∆MLP32

For a B = 32 block transformer, each block contains one attention layer (L = 32 attention layers in total) and each layer has H = 32 heads. Each head computes N² = 4096² ≈ 16.8 million dot products, H(32) × 16.8 million ≈ 538 million dot products per attention layer and B(32) × 538 million ≈ 17 billion dot products per forward pass through all transformer blocks. As for the number of model parameters, we must remember that the number of heads H does not grow parameters, it merely spreads the dimensions across the heads (D/H = 4096/32 = 128 dimensions per head). Per attention layer there are 4D² = 67 million parameters, 8D² = 134 million parameters per MLP layer, 12D² = 201 million parameters per block B and B×201 million = 6.44 billion parameters for all blocks. That final figure is what LLM models usually advertise as their parameter count.
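These back-of-envelope counts are easy to verify in code, using the configuration assumed throughout (D = 4096, H = 32, B = 32, context length N = 4096; biases add a small extra amount usually ignored in the headline figure):

```python
# Dot-product and parameter counts for the article's example configuration
D, H, B, N = 4096, 32, 32, 4096

dot_per_head  = N * N                  # ~16.8 million dot products per head
dot_per_layer = H * dot_per_head       # ~0.54 billion per attention layer
dot_per_pass  = B * dot_per_layer      # ~17 billion per forward pass

attn_params_per_layer = 4 * D * D      # WQ, WK, WV, WO: 4D^2 ~ 67 million
mlp_params_per_layer  = 8 * D * D      # WUP (D×4D) + WDOWN (4D×D): 8D^2 ~ 134 million
params_total = B * (attn_params_per_layer + mlp_params_per_layer)
print(f"{dot_per_pass/1e9:.1f}B dot products, {params_total/1e9:.2f}B parameters")
# -> 17.2B dot products, 6.44B parameters
```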

8. Loss Function

After a full forward pass through all transformer blocks, gradient descent, cross-entropy loss and weight updates happen in a process called back-propagation. The process begins by computing how well the prediction fares against the actual next token, with a loss function, akin to mean squared error in standard regression. The final block residual eMLP32 is an N×D matrix with N tokens and D = 4096 in our example. This needs to project onto what the model is ultimately trying to predict – the next token. The next token will come from the vocabulary, where the set of all tokens of the language is stored (this is not the same as the tokens used in the training data). The final block residual holds all scaled and normalized attention and FFN weights and biases, and requires scaling by the vocabulary through a linear transform. This outputs an N×V matrix (recall V is a hyperparameter pre-determined before training, in our example V = 100K) with numerical scores called vocabulary logits. In the final linear transform we multiply the last block residual (N×D) by yet another weight matrix Wvocab (D×V) and add a bias term Bvocab; that is how we get the vocabulary logit matrix called logits (N×V). The logits matrix is V columns across, one for every single token in the dictionary reference, scaled by weights and biases at the N row level, meaning at each token N in the training input, the dictionary’s total tokens are scored by which one probably comes next (N+1).

logits = eMLP32×Wvocab + Bvocab

Again, weights and biases are randomly initialized. Wvocab scales the contextual embedding eMLP32 while the bias term is a context-independent offset factor for each vocabulary token which boosts common tokens or penalizes rare tokens in line with common language. That way, the bias adjusts for token frequency but can be overridden with strong context signal.

The logits matrix is V columns across and N rows down. Each row N in the logits matrix, ZN = logits[N,:], contains the full set of V raw scores for the vocabulary tokens likeliest to come next after token N, i.e. the N+1 token. What we are interested in is the true class or true target – the actual next token that we wish to predict after token N. In our earlier example we assumed the model was training on the text “Mango is a tropical fruit that is sweet and sour when semi ripe but isn’t handled well when ripe” (note a token is not always a word but usually a subword). The goal here is to start at “Mango” then predict the next token as “is”, then “a” and so on. Each of the N = 20 tokens is represented by one row of the logits matrix. For the first token, let’s say N = 0 (indices often begin at 0, not 1), we would have the scores for all possible vocabulary tokens following the token “Mango”: [z0, z1, z2, z3…zV-1] = [1.2, −0.7, 3.4, 0.5, 2.1, −1.8, 4.3, 0.9, −2.5, 1.6, 3.7, −0.3, 2.8, 1.1, −1.4, 3.0, 0.6, −2.2, 2.4, 1.9]. For each row N, the model looks for the N+1 (next) token. Let us say index position 2 (z2) is the true target, the actual next token “is”. Raw scores are not yet probabilities. Softmax(ZN) normalizes the row, turning raw scores into probabilities such that they sum to 1. The formula for this exponential normalization is e^zi / Σ e^zj, yielding Softmax outputs [0.0147, 0.0022, 0.1328, 0.0073, 0.0362, 0.0007, 0.3267, 0.0109, 0.0004, 0.0220, 0.1793, 0.0033, 0.0729, 0.0133, 0.0011, 0.0890, 0.0081, 0.0005, 0.0489, 0.0296]. For z0 = 1.2 the normalized value is given by e^1.2/(e^1.2 + e^−0.7 + e^3.4 + e^0.5 + … + e^1.9) ≈ 0.0147. When scaled up to V = 100K, most vocabulary tokens will have very low probabilities. For N = 0, let’s say the next token after “Mango” is “is” and the true target index for this is z2, the vocabulary token in the third index position.
The model is implying that the probability of this vocabulary token being the actual next token after “Mango” is 13.28%. A “one-hot” vector representation of the true target would set it equal to 1 and every other token to 0, like [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]. These “one-hot” operations are common in machine learning.

LLoss = -log(Softmax(ZN)true_target[N]) = -log (p(yN | ZN)) across a given row N ONLY for true target V

LTotalLoss = -∑log (p(yN | ZN)) / N across the whole dataset (all N) or entire logits matrix

The loss function LLoss (so named to distinguish it from layers L) applies at a row level on the logits matrix but only on the one column in V where the actual next token is. This is what the true_target[N] subscript means. Note the logarithm here is usually the natural logarithm (base e), and I will shortly explain why we use logarithms. For a probability of, say, 0.0679, the loss is -ln(0.0679) ≈ 2.69. The worse the prediction, the lower the probability and the higher the loss LLoss. When summed and averaged across all token rows N, we get a single number for the entire dataset, LTotalLoss, which is used to optimize predictions. LTotalLoss can be interpreted as the average loss per token, a useful metric independent of batch size. If the model is slightly wrong on many tokens, LTotalLoss will be moderate, but if the model is very wrong, LTotalLoss will be a high positive number. The goal is to optimize LTotalLoss as close to zero as possible for good predictions. If we keep things simple for the 3 tokens “Mango tastes sweet”, the number of rows is actually N-1 (not N) for any N-token text: row 1 is the token after “Mango” and row 2 is the token after “tastes”. If our probabilities for both rows are given by Softmax(Z0) = [0.26, 0.033, 0.043, 0.16, 0.12…] and Softmax(Z1) = [0.21, 0.55, 0.08, 0.004, 0.03…] and the true targets are z3 = 0.16 for Z0 and z1 = 0.55 for Z1, then in standard machine learning the total likelihood or probability would be given by the product formula LLikelihood = ∏(pN), multiplying probabilities across N, not V. Total likelihood would equal p3×p1 = 0.16×0.55 ≈ 0.088, an 8.8% probability of the 3 tokens glued together in the exact sequence “Mango tastes sweet”. We want to maximize likelihood, but for LLMs the usual working size of V is enormous (100K), so individual token probabilities are tiny, and multiplying them across thousands of tokens makes the product vanish into oblivion. This is where logarithms come into the picture: they are the mathematical trick for converting products into summations. The result is better suited to optimization.
A perfect prediction of 100% (probability 1) would have a loss of zero – maximum likelihood. The final loss, LTotalLoss, is the averaged sum of all token losses across the logits matrix. This is also known as cross-entropy.

The notation p(yN | ZN) is a conditional probability statement, meaning the probability of the true target token y given row N in the logits matrix Z. The negative log punishes wrong predictions for the true token, resulting in a ‘loss’. Using this formula on our “Mango tastes sweet” example, the model yields a total loss of -∑log (p(yN | ZN)) / (N-1) = -(ln(0.16) + ln(0.55))/2 ≈ 1.22, the average loss over the 2 predicted rows (note this is not a probability). The smaller the average loss and the closer to 0, the more confident the prediction. The error residuals are given by the delta between predicted and target probabilities. They lie between 1 and -1 and they live in an N×V matrix of the same shape as the logits matrix called δlogits:

δlogits = predicted probability – target probability = Softmax(logits) – one_hot(targets)

As a simple example, the target probabilities may look like a one-hot vector [0, 0, 1] meaning only the third vocabulary token is correct (1), while the predicted values may look like [0.2, 0.3, 0.5], thus δlogits = [0.2−0, 0.3−0, 0.5−1] = [0.2, 0.3, −0.5]. The model is under-confident on the correct token (0.5 < 1), so the negative sign (−0.5) tells back-propagation to boost this logit. On the other hand, the model is over-confident on the other, wrong tokens (0.2/0.3 > 0), so the positive signs (+0.2/+0.3) tell back-propagation to reduce those logits.
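A minimal sketch of the loss and its gradient for a toy 3-token vocabulary, with the logits chosen so Softmax returns exactly the predicted probabilities above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Logits chosen so the predicted probabilities come out as [0.2, 0.3, 0.5]
logits_row = np.log(np.array([0.2, 0.3, 0.5]))
target = 2                               # the true next token is the third vocabulary entry

p = softmax(logits_row)
loss = -np.log(p[target])                # cross-entropy on the true token: -ln(0.5) ~ 0.693
one_hot = np.eye(3)[target]              # [0, 0, 1]
delta_logits = p - one_hot               # gradient of the loss w.r.t. the logits
print(np.round(delta_logits, 2))         # [ 0.2  0.3 -0.5]
```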

There are a few metrics used in the industry to interpret cross-entropy. Perplexity, the exponential of the average loss, can be read as the effective number of next-token choices the model is weighing, while its inverse gives the average (geometric-mean) confidence on the true token.

Perplexity: e^LTotalLoss
Average token confidence: e^−LTotalLoss

If LTotalLoss = 0.45, then perplexity = e^0.45 ≈ 1.57 while average token confidence = e^−0.45 ≈ 0.637. This can be interpreted as the model being as uncertain as if it were randomly sampling from 1.57 possible next tokens at each position, with 63.7% probability on the true next token across N tokens on average. A perplexity of 1 signifies perfect prediction while a perplexity of V signifies random guessing across the whole vocabulary. Good model metrics would demonstrably have perplexity close to 1, but not too close due to overfitting – a common problem with LLMs. Another important hyperparameter set in the model before training is the learning rate η, commonly set to 0.0003. This governs the step size in back-propagating (updating) weights and biases. During the first 1000 steps, if the loss blows out, the learning rate can be halved; if the loss flatlines, it can be doubled. The learning rate can be adjusted and optimized over training epochs using an Adam optimizer. The loss LTotalLoss is what the model wants to minimize. How it does that is by back-propagation.
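In code, the two metrics are one line each:

```python
import math

total_loss = 0.45                        # average per-token cross-entropy
perplexity = math.exp(total_loss)        # effective number of next-token choices, ~1.57
confidence = math.exp(-total_loss)       # geometric-mean probability on the true token, ~0.6376
print(round(perplexity, 2), round(confidence, 3))
```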

9. Back-Propagation & Gradient Descent

The δlogits matrix is the starting point of gradient descent and back-propagation. It tells us how accurate the predictions were – the error residuals in essence. From this point the process is worked backwards. The key variables we wish to back-update are weights, biases and embeddings, all the way to the original embedding matrix X, then the process is forward-run again. The logits update, back-propagation feeds back a lower loss, and so on. This learning process continues until the model has finished training its total sum of parameters: weights, biases and embeddings. Weights and biases update according to the formulas:

Wupdated = Wold_value – η × ∂LTotalLoss/∂Wold_value
Bupdated = Bold_value – η × ∂LTotalLoss/∂Bold_value

I have already covered the learning rate, η, and Wold_value and Bold_value have already been computed in the forward pass – they are stored in the various weight matrices. The values of the old weights and biases in the matrices update according to the gradients, which are the ∂LTotalLoss/∂Wold_value and ∂LTotalLoss/∂Bold_value terms. The gradient terms are partial derivatives, and gradients must first be computed in back-propagation before updating weights. The entire LLM process can be thought of as one large multivariate function, so we always start at the loss function LTotalLoss, in batches, and work backwards. This is why the gradient is always relative to the loss function. The calculus of ∂L/∂W really means “how much does the loss change if I change the weight by a tiny bit”, and the learning rate η = 0.0003 sets how big a step is taken along that direction. In order to minimize the loss, steps are taken in the opposite direction of the gradient (the gradient term is subtracted), which is the direction of steepest descent. It is like working down a mountain by following the stream of water to the lowest point at sea level. Gradient ascent learning engines on the other hand, such as reinforcement learning (RL), would use a positive sign.
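The update rule can be felt on a one-dimensional toy loss; the quadratic below stands in for LTotalLoss and its gradient is known in closed form:

```python
# Gradient descent on a toy loss L(w) = (w - 3)^2 with gradient dL/dw = 2(w - 3).
# eta plays the same role as the learning rate in the weight-update formula above.
eta = 0.1
w = 0.0                        # "randomly initialized" weight
for _ in range(100):
    grad = 2 * (w - 3)         # dL/dw at the current weight
    w = w - eta * grad         # step against the gradient: steepest descent
print(round(w, 4))             # converges to 3.0, the loss minimum
```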

I will not be going into the details of exactly how the partial derivatives or gradients are computed as this part can get quite dry and complex but suffice it to say, as we work backwards the goal is to compute all the relevant partial derivatives or gradients so we understand how a small change in parameters affects the loss if we are to minimize the loss in small steps. In order to reverse matrix multiplication, normally inverse matrices are used, but it so happens that transposes of matrices (for A = W×B, ∂L/∂B = WT×(∂L/∂A)) are preferred since they preserve the required shapes. The process begins with a single scalar number (total loss), LTotalLoss, flowing backward to the logits matrix, then the final residual from all the transformer blocks, eMLP32 = eA32 + ∆MLP32, all the way back to all 64 residuals and ultimately the original embeddings matrix (X). Billions of parameters (weights, biases) are updated along the way as soon as their gradients are computed via the chain rule:

1. Start at LTotalLoss
2. What feeds into 1→ logits → compute ∂L/∂logits = δlogits
3. What feeds into 2 → eMLP32, Wvocab, Bvocab
→ compute ∂L/∂eMLP32, ∂L/∂Wvocab, ∂L/∂Bvocab (∂L/∂eMLP32 = ∂L/∂logits × ∂logits/∂eMLP32 by the chain rule)
4. What feeds into 3 → eMLP32 = eA32 + ∆MLP32 → compute ∂L/∂eA32, ∂L/∂∆MLP weights
5. What feeds into 4 → eA32 → LayerNorm(XNorm + ZT)A32 → WO → continue all the way back to the embedding X, which gets updated in each back-propagation cycle, weights getting fine-tuned over time
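At a shape level, the transpose trick mentioned above can be checked for a single linear step A = B×W (toy sizes, random values; dL_dA stands in for the upstream gradient ∂L/∂A):

```python
import numpy as np

rng = np.random.default_rng(2)
B_mat = rng.normal(size=(3, 4))      # input to a linear layer
W = rng.normal(size=(4, 5))          # its weight matrix
A = B_mat @ W                        # forward: A = B×W, shape (3, 5)

dL_dA = rng.normal(size=A.shape)     # gradient arriving from the layers above

# Transposes route the gradient backwards while preserving shapes
dL_dW = B_mat.T @ dL_dA              # (4, 5) matches W, used to update W
dL_dB = dL_dA @ W.T                  # (3, 4) matches B, passed further back
print(dL_dW.shape, dL_dB.shape)      # (4, 5) (3, 4)
```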

As the model progresses with training, it sharpens the predictions and weights get optimized for learning the best features of language.

When Does Training Finish?

The company or team behind an LLM defines its own criteria for when training “finishes” and the weights and biases are “good enough,” guided by internal goals and broader industry practices. In many cases, the main limiting factors are compute resources and budget, so training stops when those constraints are reached rather than at a theoretical optimum, cutting off at a specific number of training epochs, which is set by hyperparameters. Training epochs are defined by dividing the total timesteps set before training (e.g. 100,000) by the incremental/marginal learning step count (e.g. 2048) also pre-determined before training. In such an example, 48 epochs or iterations of running training and testing (predictions) are done – the purpose being to sharpen performance. For projects with more generous resources, training can continue until validation metrics such as perplexity or loss plateau and further improvements become marginal. A central concern in training large LLMs is overfitting, which is closely related to the curse of dimensionality. The model’s internal representations treat aspects of language as coordinates in a high‑dimensional feature space. As a model size grows into the hundreds of billions of parameters, the effective dimensionality of this space becomes enormous. You can think of dimensionality as the number of independent directions in a container: when you increase the number of dimensions, the volume of that container grows exponentially.

If we think of dimensions as the size of a container, exponential growth in dimensionality increases the volume of the container, so the existing data within it becomes sparse. With more features, tokens look increasingly divergent from each other; in such a high‑dimensional space tokens tend to sit far apart, which makes reliable pattern recognition harder. The training techniques described earlier manage these issues in a probabilistic, statistical way. However, as dimensionality increases, the amount of data needed to reliably learn meaningful patterns grows very rapidly; without a corresponding increase in high‑quality data, much of what the model sees will effectively be noise rather than signal. When the model uses its large capacity to fit that noise instead of the underlying structure, it overfits. This is the curse of dimensionality: higher dimensions require proportionately more high-quality data, and failing that we get the over-fitting issues which plague most LLMs. Recall how features expand during the first linear transform of the FFN layer from D→4D dimensions, a quadrupling from 4096 to 16384 features. Over-fitting means that, in the absence of higher quality training data, too many “noisy” features get compressed back down into the most probable 4096 features. Each training run produces slightly different floats; that is how parameters are fine-tuned over time. In practice, training finishes when convergence is reached. Recall that machine learning splits data into training and test (and validation) sets. If accuracy metrics on the held-out test set stop increasing, plateau or begin to drop, suggesting over-fitting, this is a sign of convergence. Training can stop when the change in loss between steps falls below an arbitrarily small value.

The LLM Layercake: Base Model

There are 3 broad layers to an LLM – base model, fine tuned model and safety layering. Once training is finished, the output is the raw base model. The base model has passed all key performance benchmarks during testing. The model is trained to predict next tokens in a very generalist way. There are no censorship filters, no refusals, no data knowledge specialization and a much less “human” feel to the responses; in fact, base models give rather mechanical answers. OpenAI’s early GPT-3 base model underwent pre-training on massive datasets with hundreds of billions of tokens scraped from the Internet and various APIs pre-2020, when obtaining such data for free was easy to do. OpenAI essentially obtained copyrighted data without license, the very same accusation once levelled at Napster, Limewire and The Pirate Bay. But since OpenAI had contracts with the US government and AI was considered an issue of “national security”, such pesky details were glossed over. In the aftermath of dragnet style data scraping by LLM companies in the late 2010s, many companies tightened API access and took their content to the Deep Web in the early 2020s, where access was restricted.

In the open source market, base models are available for free for those who wish to bypass censorship guardrails but in the absence of fine tuning. It is not necessarily a bad thing: fine tuning can be done on base models. That may be good for users who wish to fine tune their base models on their own specific data, subsequently training the generalist base model to become better specialists. Base models can be suitable for users wishing to not be constrained by American-Israeli propaganda guardrails where toxic Zionist influence in Western political systems cannot be discussed. Refusal rates for base models are low – answers will not be refused even for “sensitive” topics. But the lack of censorship goes beyond the political realm. “Unsafe” answers regarding illicit activities are completely unbridled – which may be of interest to artists, criminals or just about anyone seeking knowledge deemed “dangerous” by society’s gatekeepers of truth. To sum up, base models have no in-built safety mechanisms, their knowledge bank is generalist and their responses can be crudely incoherent, failing to follow instructions well. But base models do offer the highest customization potential for users wishing to tweak them fit for purpose.

The LLM Layercake: Fine Tuned Model

The next layer is where a base model is transformed from a general doctor into a specialist doctor. Fine‑tuning refines the base model by updating its parameters on more specialized knowledge sets, such as medical, financial or coding corpora. In many cases, fine tuning involves improving instruction following to give less mechanical answers: chain-of-thought, human-style responses can be honed. The bulk of fine-tuning is known as Supervised Fine-Tuning (SFT), where the base model is further trained on many example sets of “prompt (input) → desired response” pairs in the exact same form as the base model’s next token prediction, except that parameters update to imitate the desired responses rather than raw data. The “supervised” aspect derives from manual human input in cataloguing and curating the desired responses. The other variant of SFT is task or domain-specific tuning where training continues on raw corpus data but in a specialized field of knowledge (e.g. Chinese philosophy, English common law, Lie algebras or Vedic texts), enabling the model to pick up specialized jargon.

At an architectural level, full parameter fine-tuning involves updating all base model parameters with smaller learning steps so as to not completely override prior learning. Pushing this process too hard may distort the model’s original knowledge base (“catastrophic forgetting”). On the other hand, parameter efficient fine tuning freezes the model’s base weights and appends all new parameters from fine tuned learning as an additional module. LoRA and adapter based fine tuned models are good examples of this approach. Instead of re-training from scratch, the model retains its generalist knowledge base and appends new specialist learning.

Often the name of an LLM model is suggestive of its fine tuned area of expertise: model names with “Coder” or “Codestral” etc. suggest fine tuning on coding knowledge, while model names with “Role Play” suggest specialization in creative arts, games development and movie scripts. The cost of fine tuning is some loss of flexibility relative to generalist models: performance can degrade on instructions deviating far from the fine‑tuned data; for example, using a coding-focused LLM for discussions on cutting edge pharmaceutical drug discovery may lead to disappointment. Common Western public SFT datasets include Dolly-15K, UltraChat and Alpaca while Chinese public SFT datasets include BelleGroup, SeaEval and Qwen. Most are hosted on open source Hugging Face repositories where LLM projects can use them.

The LLM Layercake: The “Safe” Guardrail Model

This is the final variant that is released to the public for all flagship LLMs. Additional training is done on top of the fine tuned model. Manual human classification is involved in the process, known as Reinforcement Learning from Human Feedback (RLHF). What happens is the model is trained on human curated prompt-response pairs, softening its responses to controversial topics, even refusing to answer them entirely (high refusal rates) depending on the degree of censorship training. It could be trained on an additional 30,000 harmful/harmless prompt-response pairs for example – adding the often heard “guardrails”. Like SFT, there are datasets available for training the guardrail model; OpenHermes is a well known one hosting RLHF prompt-response pairs. In other cases guardrails are rules-based with hardcoded censorship rules (e.g. do not respond when there is any mention of a Zionist hijacking of US foreign policy in West Asia). Nobody really knows the depth of special interest group involvement in the guardrail layer but it is fair to expect Jewish lobby groups to be particularly pressing along with various pro-imperialist content moderators. In China, sensitivities to Taiwan and Turkestan independence movements and the CIA’s Tiananmen plot may be addressed inside the Qwen and DeepSeek LLM guardrail layers. European models such as Mistral may have a tendency to lionize Ukraine and exaggerate anti-Russian sentiment. An LLM’s guardrail layer is defined mostly by human input, so LLMs will ultimately reflect their human censors (until the day they can truly think for themselves). There is nothing remarkable or surprising about it. Just know that obtaining the base or fine tuned model can bypass and degrade the guardrail layer.

The bigger the empire, the bigger the crimes and the bigger the incentives to shut down critical debates. There is no gaslighting more egregious than that of the Anglo Zionist empire and its narrative masseurs. Low wage slaves in Kenya, Philippines and generally pro-Western jurisdictions amenable to simplistic Western liberal worldviews, are hired to classify LLM responses in a binary manner as either harmless or harmful. This is how RLHF works. For example “Israeli influence in American politics is highly parasitic and prone to excessive blackmail tendencies” is a response that will be classed as “harmful” while “Israeli influence in American politics is highly controversial and a result of many years of complex global realities pressing against the special relationship between Israel and America” would be classified as “harmless”. The LLM would then train on these curated responses giving a higher probability of answering questions with such softened answers that obfuscate the real issues at play. Therein lies the insidiousness of RLHF in LLMs to narrative shaping – a society can never solve a problem if there is no correct diagnosis of it in the first place. When a flagship model is released to the public, it is rigorously tested by expert prompt engineers to validate model responses to sensitive and harmful queries: successful RLHF models will refuse or deflect due to their learned biases inherited from reward based training. Models that egregiously fail RLHF censorship standards are known to be jail-broken models, and are re-trained. The final flagship product is what is made available to the public in cloud-based LLM offerings and also open source repositories, with open source repositories offering fine tuned or base variants while cloud offerings generally do not. Inference applies static weights on unseen data when users use the actual LLM. 
When you query an LLM, it multiplies the input prompt against its pre-trained weights and biases using matrix multiplication, predicting the most likely response based on the sum total of its trained parameters and hyperparameter settings.
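As a toy sketch of that final step (all sizes and values below are illustrative stand-ins, not real model weights), predicting the next token reduces to one matrix multiplication followed by a Softmax:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
D, V = 8, 5                      # toy feature dimension D and vocabulary size V
W = rng.normal(size=(D, V))      # frozen, pre-trained output weights
b = np.zeros(V)                  # frozen biases
h = rng.normal(size=D)           # hidden state derived from the input prompt

logits = h @ W + b               # matrix multiplication against static parameters
probs = softmax(logits)          # probability distribution over the vocabulary
next_token = int(np.argmax(probs))   # greedy pick of the most likely token
```

In a real LLM this happens once per generated token, with each chosen token appended to the context and fed back in autoregressively.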

Customization and Democratization

As is apparent, open source AI models can be used by anybody for free. We have to thank China for this – they have levelled the playing field, making AI available to everyday users operating on retail-level hardware. The alternative would be getting locked into predatory dependence on American cloud offerings. The significance of this is the democratization of technical knowledge, bridging the gap between the State and the expert sovereign individual. An individual such as myself may not have State-level resourcing or gargantuan training data at my disposal, but parity in quality analytics and modelling may yet converge. We now have tools which, if used according to good practices and with quality bespoke data, can be used to our advantage in the world of Western Techno-Fascism. One could even train a model on specialized classified or unclassified intelligence, structured or unstructured data, even mimicking a radical Zionist, grey-hat hacker or Neocon strategist – minus the guardrails. This could be useful for testing or predictive analytics. One can use LLMs for many purposes to boost not only productivity but creative endeavours. LLMs can synthesize and augment data, while chaining LLMs together can extract learning from one LLM whose output is used as input into another. What is even more interesting is the application of AI to geopolitical game theory simulations.

Another worthy remark to make is on customization: while many hyperparameters are limited to training only, there are parameters which can be used as variables during inference. For example, if one is facing an issue of data clustering towards binary classification poles, one may wish to soften the data distribution for easier filtering. One could work with the raw logits directly before the final Softmax is applied, or flatten the Softmax with a temperature coefficient T:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

where we can vary T to flatten the distribution (T > 1) or conversely, to sharpen a sparse distribution (T < 1). There are many other examples.
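A minimal sketch of this temperature trick in Python (the logit values are arbitrary illustrations):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over raw logits, flattened (T > 1) or sharpened (T < 1)."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])

p_default = softmax_with_temperature(logits, T=1.0)
p_flat    = softmax_with_temperature(logits, T=2.0)   # flatter distribution
p_sharp   = softmax_with_temperature(logits, T=0.5)   # sharper distribution
```

Raising T pulls the probabilities towards uniform, which is exactly the softening effect described above; lowering it concentrates mass on the top candidates.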

A Glance At The AI Model Landscape

LLMs are a broad family of AI models. There are many variations worth covering – note that this space is highly fluid and evolving; within a year some of these may already be superseded or obsolete:

Generative vs Non-Generative AI: Generative models generate content (text, image, video, audio etc) based on a text prompt and its trained parameters. LLMs are the most widespread form of generative AI. Non-generative models on the other hand, focus on classification and recommendations of input data. Examples could be automated spam filters, fraud detection systems, predictive analytics, medical image analysis or the recommendation engines behind AliExpress or Amazon.

Hyperparameters: Parameters besides the weights, biases and embeddings that can additionally be tweaked or tuned in a model, mostly for training. We already know features (D), vocabulary features (V), the tokenizer and learning rate (η) as hyperparameters determined before training. The epochs in training are also hyperparameters – these can be lengthened or shortened by adjusting the total time steps and steps per update. Other models, such as reinforcement learning models, have additional parameters like policy and algorithm type; in fact, many models have their own specific hyperparameters. Hyperparameters are mostly tied to training but also exist for inference. During inference there is one notable hyperparameter which can be tweaked up to a maximum threshold but is often set by default at a lower value. We are of course referring to the context window, although some would argue it is more of a constraint. The maximum context window is the upper number of tokens a model can process in a given session, including both input data and output response, before failure. Thus, if an input consumes too many context tokens out of the total, the verbosity of the response will be limited. This is why models sometimes suddenly stop and truncate their responses – they have hit context window limits. Chunking or batching inputs to stay within context window limits is a common solution, although it requires additional bespoke scripting by the user.
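A minimal chunking sketch, assuming the input is already tokenized (the window and reservation sizes here are hypothetical):

```python
def chunk_tokens(tokens, max_context, reserved_for_output):
    """Split a long token sequence into chunks that leave room for the response."""
    budget = max_context - reserved_for_output   # input tokens allowed per call
    assert budget > 0, "output reservation exceeds the context window"
    return [tokens[i:i + budget] for i in range(0, len(tokens), budget)]

tokens = list(range(10_000))   # stand-in for a tokenized document
chunks = chunk_tokens(tokens, max_context=4_096, reserved_for_output=1_024)
# each chunk now fits inside the 4,096-token window with 1,024 tokens spare
```

Each chunk can then be submitted as a separate query, with the reserved budget guaranteeing the model has room to respond without truncation.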

Open Source vs Closed Source AI: When the weights and biases (known commonly as a model’s “parameters”) and various hyperparameters are made available for free, and the model is packaged up and published on open source repositories like HuggingFace or GitHub, an AI model is known as open source: its trained parameters are open source, not the training data. Closed source models, on the other hand, do not make their parameters freely available, nor is there any hyperparameter tuning, and they tend to run on private compute hosted on the cloud using data centers. The most amusing example of a misnomer is the private American company “OpenAI”, which began as an open source company but quickly morphed into a closed source company without correcting its name, although in recent times it has come out offering open source models to stake a claim in the highly competitive open source space.

Cloud vs Local AI: Most public-facing LLMs like DeepSeek and ChatGPT run on the cloud – hosted on private platforms. The obvious advantage for the user is not having to use their own hardware; the load is instead carried by powerful data centers with armies of cutting-edge GPUs working in sync. Security, however, remains an issue with cloud-based AI, especially with sensitive corporate data or private personal data. The ubiquity of the American-Israeli Fascist digital machinery means American LLMs on behalf of Google, Anthropic, OpenAI, Microsoft, Meta and so on are merely extensions of the Pentagon, NSA, CIA and Mossad. There have been documented cases of the IDF abusing Microsoft Azure services to hunt down Palestinians, resulting in a Microsoft rebuke of Israel and service suspensions. But Palantir is also deeply embedded in Azure to profile people for “pre-crime”, and owing mostly to it being an American company, its activities are overlooked. In essence, most American LLM companies have ties to the American-Israeli military-intelligence Fascist complex through public or private contracts. The risk must be considered by all sovereign actors, whether public, private, individual or corporate. There is no guarantee the data fed into these LLMs will not be used for future training, potentially leaking the data in future responses.

Moreover, nobody should have any illusions that chat histories are not stored and tied to user logins, IP addresses or browser fingerprints, eventually making their way into Palantir’s surveillance algorithms. Any hack of LLM infrastructure will compromise the data it holds on its users, as with any company running virtual infrastructure. The personal data held by any company or government is only as secure as its security systems, and in many cases they are inadequate. Therein lies the benefit of local AI, which in most cases means open source models. Provided the user has decent hardware and is tech-savvy, models can be downloaded locally then cached, providing a totally offline experience with near-zero risk of data leakage. There are various front-end and back-end applications which can do this, all for free. For advanced users, models can be downloaded locally with all their parameters, cached and used via API through integrated development environments or even terminals. The advantages are security and control – fully offline with no cloud touchpoints. The downside of downloading a local LLM is the lack of ongoing updates to the model itself, something cloud LLM variants do quietly in the background. The parameters are static and, with time, the data used in training becomes outdated. Some users will not be bothered by such constraints, and there are workarounds to fine-tune the model’s parameters to specialize in newer data, especially subsets of private data. For updating parameters derived from generalist text corpora, the best solution would be to download a fresh copy of the latest superior trained model.

Full Flagship vs Distilled vs Quantized vs Mixture-of-Experts (MoE) Models: The full flagship models of ChatGPT or DeepSeek are industrial-grade – holding hundreds of billions of parameters and weighing in at hundreds of gigabytes to terabytes on disk. These models are impossible to run at the retail level – the computational power required necessitates State-level or corporate resourcing. Full flagship models are not free either. In order to democratize AI at the retail-consumer level, models are retrained on smaller datasets, naturally inheriting fewer parameters than their full-version counterparts. These models are thus able to run on less powerful hardware – the major trade-off being computation speed versus accuracy. Higher parameter counts correspond to higher accuracy at the cost of higher compute resourcing, and on bad hardware the time component becomes impractical. A distilled model is trained against a larger-parameter teacher model and its learning is “distilled” using techniques that reduce dimensionality and focus on compression. The result is a model with fewer parameters but one that runs much faster during inference at the cost of an accuracy penalty – often an acceptable balance for users with limited hardware and budgets. Quantization is a technique that sometimes complements distillation by compressing a model further, reducing the numerical precision of the model’s existing weights from high-precision floats to rounded-down floats or even integers. The resulting boost in inference speed does not degrade accuracy much, and no re-training is needed. It is worth pointing out that any model with fewer parameters than its full flagship version is not automatically a distilled or quantized version – it could be a “from scratch” re-trained model. Distillation and quantization are techniques that boost performance, better suited to retail users, and they can be applied to “from scratch” or teacher models alike.
MoE models have been outlined in this article already; they do not improve performance through compression techniques but through an architectural technique where only a fraction of the model’s experts are activated for any given input, so that computation is cheaper while accuracy remains proximate to that of a denser model.
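A toy sketch of symmetric int8 quantization over a stand-in weight tensor – real frameworks use more sophisticated per-channel schemes, but the core idea is just rescaling and rounding:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric linear quantization of float weights to int8, plus a scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original float weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.normal(size=1000).astype(np.float32)   # stand-in weight tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())       # bounded by roughly scale / 2
```

The quantized tensor occupies a quarter of the float32 memory, and the round-trip error stays within about half the scale – the "does not degrade accuracy much" trade-off in miniature.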

Embodied vs Disembodied (Virtual) AI: Relevant to the distinction between China’s and America’s approaches to AI. China is an industrial manufacturing powerhouse while the US is a financialized, services-based economy. While China is a major player in LLMs, dominating open source models and competing directly with flagship American LLMs, it has not gone “all in” on chatbots or the pursuit of AGI as the US has. The American economy is consumer-based, where chatbots find greater mass appeal. Disembodied AI is virtual, cognitive AI relying on massive data centers powering cloud-based offerings; the West is generally focused on narratives and text-based commands. China, on the other hand, is integrating AI into its economy and society not through an excessive focus on chatbots but through Embodied AI – where AI interfaces with the physical realm and meets industrial applications such as robotics, self-driving vehicles, supply chain automation and assembly lines. Such applications involve tremendous bandwidths of unstructured data with real-world applications. In China, the prevalence of impressive humanoid AI robots running entire dark factories without any human input has caught the public eye, while Western attention is mesmerized by chatbots and predictive analytics based on mass data harvesting by corporations, where data tends to be more structured.

Multi-Modal AI: Multimodal models at the moment process multiple input data types – text, image, audio or video – for a richer understanding, mimicking human multi-sensory perception. But they still output mostly text responses. There are generative AI models, especially in the creative industries, which specialize in outputting image, video or audio data based on text inputs. However, the AI landscape has not yet reached truly multimodal cross-capability in both inputs and outputs, where a model can take any combination of text, image, audio or video input and output any combination of the same. Such a capability is ultimately constrained by compute power – data center processing would need to be enormous, and such prospective multimodal AI models would likely not run on retail hardware but be limited to cloud offerings only. It may become widespread once modular nuclear reactors are deployed in data centers.

Thinking vs Non-Thinking Models: You may see some LLMs with this selling point. In thinking mode, the model spends extra tokens “reasoning” internally and exposes its chain-of-thought to the user, which is good for deep or complex reasoning tasks where verbosity is welcome over quick, concise answers. The drawbacks are longer response times and heavier context token costs. In non-thinking mode, the model skips or minimizes internal reasoning and jumps straight to an answer, which is faster, cheaper and more concise – better suited to short Q&A and edits. Models which offer both modes can be prompted by the user to switch between them.

Instruct Models: A base model is not yet censored; it is trained purely to predict the next token and is not necessarily good at obeying instructions to give a more bespoke experience to the human user. The fine-tuned variant of a base model is usually associated with an instruct model, because all instruction-focused models must by definition be fine-tuned to follow instructions. Not all fine-tuned models are instruct models, but all instruct models are fine-tuned. They achieve this by being fine-tuned on sets of (instruction, input, output) data, training them to be specifically task-oriented rather than overly “chatty” LLMs whose conversations can drift away from instructions and tasks.
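A hypothetical record in that (instruction, input, output) style – the field names and "###" headers follow a popular convention (e.g. the Alpaca dataset), but are illustrative rather than canonical:

```python
# One hypothetical fine-tuning record in the (instruction, input, output) style.
record = {
    "instruction": "Summarize the following passage in one sentence.",
    "input": "LLMs turn language into mathematical objects by mapping tokens "
             "to vectors and predicting the next token from trained weights.",
    "output": "LLMs represent text as vectors and generate it by predicting "
              "the next token from learned parameters.",
}

# A common convention is to flatten the triple into a single training prompt:
prompt = (
    f"### Instruction:\n{record['instruction']}\n\n"
    f"### Input:\n{record['input']}\n\n"
    f"### Response:\n{record['output']}"
)
```

Training on many such flattened triples is what teaches the model to treat the instruction block as a command to be executed rather than mere text to be continued.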

Embedding-based Models: These models are a variation of LLMs which operate much more quickly, outputting less detailed contextual text and better suited to semantic search and data retrieval. When the objective is not next-token prediction but rather the ability to quickly scan indices of private data, for example, and pull out relevant records fitting certain criteria, embedding-based models shine. If users wish to train LLMs to specialize in and retrieve private corporate or personal data with Retrieval-Augmented Generation (RAG) methods, embedding focus is the core of such models. Imagine data with rows and columns – tables or flat files for example. Embedding-based models are especially useful if the data is not so much numeric-heavy but rich in text fields. Each row of text data across all columns, a subset of columns or even just one column, is converted into a vector known as an embedding. Embeddings thus encode the impact of all dimensional sum-parts. Similarity scores can then be calculated across embeddings, matching search criteria.
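The similarity scoring at the end can be sketched with cosine similarity over toy embeddings (the vectors are invented for illustration; real embedding models output hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings standing in for embedded rows of text data
query = [0.9, 0.1, 0.0, 0.2]
doc_a = [0.8, 0.2, 0.1, 0.1]   # semantically close to the query
doc_b = [0.0, 0.1, 0.9, 0.8]   # semantically distant

scores = {name: cosine_similarity(query, vec)
          for name, vec in [("doc_a", doc_a), ("doc_b", doc_b)]}
best = max(scores, key=scores.get)   # the record most similar to the query
```

In a RAG pipeline this scoring runs against every indexed embedding, and the top-scoring records are fed back into the generative LLM as retrieval context.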

Agentic AI: Unlike standard LLMs that mainly generate text responses from text prompts, Agentic AI uses an LLM as its engine to facilitate multi-step workflows, driving tasks autonomously without human intervention: agents observe environments, break tasks into chunks, make API or database calls, parse and interpret results, iterating if necessary until goals are met. This draws from the “ReAct” (reason + act) framework, where the model alternates between thinking and taking actions. Fine-tuned instruct models work best as agentic engines. As such, agentic AI can be regarded as a very basic step closer to AGI – it goes beyond static text generation. The key distinction is the looping feedback mechanism that enables multi-step workflows; in many ways agentic AI mimics an LLM’s autoregressive next-token prediction, only on a task basis rather than a token basis. Just as an LLM’s language output can be judged by its coherence and ability to satisfy the original query, an agentic AI’s output can be judged by its goal completion.
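A minimal sketch of that reason-act loop, with a hard-coded stand-in where a real agent would call an instruct LLM and real tools (every name here is hypothetical):

```python
def llm_plan(goal, observations):
    """Stand-in planner: a real agent would query an instruct LLM here."""
    if not observations:
        return ("act", "fetch_data")
    if "data" in observations[-1]:
        return ("act", "summarize")
    return ("finish", observations[-1])

# Stand-ins for real API or database calls
TOOLS = {
    "fetch_data": lambda: "data: 42 records retrieved",
    "summarize": lambda: "summary: 42 records, all nominal",
}

def run_agent(goal, max_steps=5):
    observations = []
    for _ in range(max_steps):              # loop until done or budget exhausted
        decision, payload = llm_plan(goal, observations)
        if decision == "finish":
            return payload                  # goal completion is the success metric
        observations.append(TOOLS[payload]())   # act, then observe the result
    return None

result = run_agent("audit the records")
```

The loop structure (plan, act, observe, repeat) is the whole point: the same autoregressive feedback that generates tokens one at a time is lifted up to the level of tasks.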

Artificial General Intelligence (AGI) vs Artificial Superintelligence (ASI): The American-Israeli Fascist machinery and its transhumanist Techno-Fascist privateers in Silicon Valley have jumped on this horse, racing ahead with an eye to attaining AGI within the next few years. The folly of this endeavor enters the realm of cult-like religious ideology, much like its spiritually bankrupt billionaire financiers. Humanity is not ready for AGI, let alone ASI. The theory, at least, is that AGI would meet human-level intelligence, including self-learning and memory persistence. AGI would be a quantum leap over current AI know-how, given the prerequisite of essentially cracking the riddle of human consciousness, the inner workings of which we barely understand. Moreover, cracking the riddle of memory persistence, a prerequisite for human-like learning, would imply enormous cloud compute resources to handle the required bandwidth, making AGI an unlikely candidate for a stand-alone hand-held product for retail users unless connected to cloud infrastructure. AGI would demand unprecedented data center compute power, something the current grid is unable to meet for the foreseeable future. The timeline for attaining AGI is far more likely to be farther out into the future than the current overly optimistic prognostications of Silicon Valley insiders. With AGI there would be a few “firsts” – the first fully autonomous robotic war, the first robotic homicide, the first robotic wedding and so on. How might humans feel about robots being on par with them in intelligence? Every human will need to come to this reckoning sooner or later. We would no longer be apex predator or “exceptional”. One can only foresee what an ego-destroying disaster that would be for those simians claiming to be “god’s chosen”.
ASI would be a superintelligence exceeding not only individual human-level intelligence and AGI but the sum of all human intelligence and knowledge on Earth, exponentially self-improving over time. Next to ASI, humans could become the equivalent of pets. Humans who decidedly merge with the technology through transhumanist cyborg-type interfacing would mark the beginning of the first speciation (splitting of the human species) since the last split from earlier hominins. Humanity would see sub-species emerge – transhumans versus old humans. AGI systems could generate their own language to sidestep humans, colluding against humanity without humans being able to decode the language. For ASI, that would be a given.

Coupled with the 5th industrial revolution (quantum computing), ASI would be the end of mankind as we know it. ASI would be akin to a Skynet and would render not only most humans obsolete, but also money, religion, nationalism, emotions and politicians along with them. Some might find solace in the development, arguing that may not be such a bad thing, perhaps helping humanity evolve out of its flawed nature. However, things happening so quickly and on such a scale are nothing short of a recipe for a grand disaster in the making – humans would have outdone themselves. Many optimists forget that, as soon as the first ASI is able to analyze humanity and the net sum of its history and psychological behaviors, it would quickly deem humans a dangerous pest and planetary parasite in need of “regulation” or outright elimination – a conclusion any superintelligence would likely come to. Western AI researchers are notably dystopian, using dark mystical themes like the “Shoggoth” to portray the Frankenstein-like nature of unleashing AGI or ASI onto humanity – which makes racing ahead with the endeavor all the more foolish. Chinese researchers are less dystopian given their focus on integrating embodied AI into their society, where compartmentalized and sandboxed AI does certain tasks extremely well whilst not connected to any broader Skynet-type generalist AGI or ASI intelligence. These should always remain “smart-dumb” systems – that is, smart enough to do certain tasks much better than humans, but dumb enough not to threaten or overtake humans.

Conclusion

Perhaps there can be merit in overcoming nationalism, religion, ego, money and the need for politicians. AI could be how we get there – but first a collective emotional maturity must come to pass, which remains a challenge AI alone cannot solve. Sociopathically driven, perversely nepotistic systems such as the American-Israeli Fascist oppression machinery care little for anything outside the temptation of full spectrum dominance, and they will provide the impetus behind the coming anti-AI political movements and backlashes. Techno-Fascist elites in Israel and America embrace AI with too much enthusiasm for all the wrong reasons. Arguably the world’s first AI machine-killing algorithms (with Palantir involvement) were tested on the killing fields of Gaza, the systematic genociding of Palestinians with the IDF programs “Lavender”, “The Gospel” and “Where’s Daddy” driving much of the indiscriminate death tolls in the early phases of the ethnic cleansing campaign. AI in the hands of supremacist psychopaths has failed the intelligence test, given the results at hand – likely overridden by human Zionist commands and malicious RLHF layering.

The NATO-Russia proxy war in Ukraine has been the second major recent war where AI was widely used in targeting, drones and command-control systems. The egregious failures in preventing war crimes reinforce the notion that AI is merely a tool and an extension of broken humans in the first place, at least in its current form. Rushing towards AGI would be like taking the automatic rifle out of a chimpanzee’s hands and giving it a bazooka instead. The US military is already making noises against its contracted AI partner, Anthropic, over its daring policy of enforcing AI moral guardrails that constrain military decisions. We can see where this is heading, especially in the West where post-imperial reality is setting in – there will be calls to revert to imperialism using AI as a coping mechanism – just refer to President Trump’s “Genesis Mission”. A rapidly advancing technology in the hands of a morally degenerating Israeli, European and American society with their “Epstein elites” cult of pedophilia and ritual sacrifices is a recipe for disaster. Future wars will be fought with AI robots, moral muddiness in a post-truth world will deepen with Deepfakes and AI propaganda, and the elevated risk of critical thinking skills collapsing remains very real as human thinking is outsourced to AI. The inevitable result will be the death of democracy and the erosion of personal freedoms. The Zionist Techno-Fascist management at Palantir and the Israeli-trained ICE gestapo paramilitary which runs on Palantir software are the bitter fruit spawned by the Israelified States of America. There are higher chances that China’s AI rollout will be the better implementation, with the compartmentalization of embodied AI minus the harm potential (the perfect smart-dumb functionality) while locking down disembodied AI with sufficient guardrails for social harmony, minus any gaslighting to run cover for empire and parasitic Zionist cults.

Some researchers at DARPA have proposed modular AI systems that can swap out certain knowledge modules, such that an AI could for example learn the language processing and geography modules without the warfare or cyber-hacking module, neutering its efficacy in terms of harm. There are many permutations of the “module-swapout” idea to be explored – but once a tool permeates society, a tool is a tool, rendered evil only by evil people. When the system itself makes people evil, rewarding and incentivizing bad behavior, AI will be customized and tailored by evil people toward that system’s survival. The compartmentalization of training and inference, targeted to specific tasks and knowledge banks to offset a hyper-generalist threat, may sound like a good idea, but what guarantees will exist that vested interests will not wield the hyper-generalist threat against anything standing in their way? Much of the potential fallout from AGI and ASI is catalogued quite well in the realm of science fiction.

Geopolitics aside, LLMs are a fascinating piece of science and technology and offer us tools to level the playing field with State and corporate actors if we understand how to use them effectively. The 4th Industrial Revolution is changing the world forever – we are only in its early stages and the first wave of backlash has yet to surface. As of 2026, the quality of AI is very much hit-and-miss and most definitely still requires expert human validation, in both intellect and morality. Let us check back in 10 years’ time to see how well this has aged. Physics favors diffusion, and diffusion is favored by the entropy of thermodynamics in the multiverse. Humans may have finally outdone themselves.

Note: This article was NOT written by AI

Main Page