Technical Explorer

Transformer Architecture

An interactive deep dive into the inner workings of Transformer models. Visualize self-attention, embeddings, and mathematical flow in real time. Recommended for Desktop View.

01

Representation

See how Tokenization and Position Encodings map discrete tokens to vectors in a high-dimensional space.

02

Self-Attention

Inspect the Query, Key, and Value matrices to understand how the model captures linguistic relationships.

03

Refinement

Follow the data through MLP Layers and Residual Paths that build abstract understanding across blocks.

What Exactly is a Transformer?

Introduced in the seminal 2017 paper "Attention Is All You Need", the Transformer architecture initiated a fundamental paradigm shift in Artificial Intelligence. It now powers large language models such as OpenAI's GPT, Meta's Llama, and Google's Gemini. Its versatility extends beyond text: today, Transformers are used for audio synthesis, image recognition, and even protein structure prediction.

At their core, text-based Transformer models operate on next-token prediction. Given an input sequence (your prompt), they compute a probability for every token in the vocabulary (a word or subword) being the next one. Generation proceeds by choosing a token from that distribution, appending it to the sequence, and repeating. This predictive power is driven by the self-attention mechanism, which lets the network process an entire sequence at once and capture long-range linguistic dependencies.

This visualizer is powered by a live GPT-2 (small) model with 124 million parameters, running entirely in your browser. While much smaller than modern frontier models, it shares the same core architecture, which makes it a practical educational baseline.
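
The 124 million figure can be roughly reproduced from the layer sizes quoted later in this article. The sketch below is a back-of-the-envelope count, not an official breakdown. It assumes a maximum context length of 1,024 positions, bias terms on every linear layer, and an output layer that reuses the token-embedding matrix; none of these are stated in the article, and the helper names are illustrative.

```python
# Rough parameter count for GPT-2 (small), using the sizes quoted in this article.
VOCAB = 50_257    # vocabulary size (from the article)
D_MODEL = 768     # embedding width (from the article)
D_MLP = 3_072     # MLP hidden width (from the article)
N_BLOCKS = 12     # number of Transformer blocks (from the article)
N_CTX = 1_024     # assumption: maximum context length, not stated in the article


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(width: int) -> int:
    """One learned scale and one learned shift per dimension."""
    return 2 * width


def count_parameters() -> int:
    # Token embeddings plus learned position embeddings.
    embeddings = VOCAB * D_MODEL + N_CTX * D_MODEL
    # Attention: one fused Q/K/V projection and one output projection.
    attention = linear(D_MODEL, 3 * D_MODEL) + linear(D_MODEL, D_MODEL)
    # MLP: expand to D_MLP, then project back to D_MODEL.
    mlp = linear(D_MODEL, D_MLP) + linear(D_MLP, D_MODEL)
    block = attention + mlp + 2 * layer_norm(D_MODEL)
    # Assumption: the output projection reuses the token-embedding matrix,
    # so it adds no parameters of its own.
    return embeddings + N_BLOCKS * block + layer_norm(D_MODEL)


if __name__ == "__main__":
    print(f"{count_parameters():,}")
```

Under these assumptions the total comes to roughly 124 million, with about two thirds of it in the 12 blocks and most of the rest in the token-embedding table.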

Anatomy of the Architecture

Every generative Transformer relies on three distinct processing phases:

1. The Embedding Layer

Before the model can interpret text, the text must be mapped into numerical space. This happens in four steps:

  • Tokenization: Breaking the raw input text into sub-word tokens. GPT-2 uses a fixed vocabulary of 50,257 tokens.
  • Token Embedding: Mapping each token ID to a dense vector (768 dimensions for GPT-2). These vectors are learned during training, so tokens used in similar contexts tend to end up close to one another in this space.
  • Positional Encoding: Because Transformers process tokens simultaneously (not sequentially like RNNs), they inject positional signals into the embeddings so the model can tell token order. GPT-2 learns one vector per position.
  • Final Embedding: The token vector and positional vector are added element-wise to produce the final input representation.
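
The last three steps can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the tables are filled with random numbers instead of learned weights, the sizes are shrunk so the result is easy to inspect, the token IDs are made up (no real tokenizer is involved), and the names `make_table` and `embed` are hypothetical.

```python
import random
from typing import List

Vector = List[float]

D_MODEL = 8          # toy width; the article quotes 768 for GPT-2
VOCAB_SIZE = 20      # toy vocabulary; the article quotes 50,257 for GPT-2
MAX_POSITIONS = 16   # hypothetical maximum sequence length


def make_table(rows: int, width: int, rng: random.Random) -> List[Vector]:
    """Random stand-in for a learned embedding table."""
    return [[rng.gauss(0.0, 0.02) for _ in range(width)] for _ in range(rows)]


def embed(token_ids: List[int],
          token_table: List[Vector],
          position_table: List[Vector]) -> List[Vector]:
    """Add each token's vector to the vector for the position it occupies."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    token_table = make_table(VOCAB_SIZE, D_MODEL, rng)
    position_table = make_table(MAX_POSITIONS, D_MODEL, rng)

    # Made-up token IDs: the same token (3) appears at positions 0 and 2.
    vectors = embed([3, 7, 3], token_table, position_table)
    print(len(vectors), len(vectors[0]))
    print("same token, same vector?", vectors[0] == vectors[2])
```

Because the positional vector differs at each index, the two occurrences of token 3 receive different final embeddings, which is how later layers can tell them apart.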

2. The Transformer Block

The core computational engine. Models stack these blocks (GPT-2 small has 12). Within each block, representations undergo two major operations:

Multi-Head Self-Attention

This mechanism allows every token to evaluate its relationship against every other token. To do this, the network derives three vectors per token:

  • Query (Q): What the token is currently "looking for."
  • Key (K): What other tokens "offer" to match against queries.
  • Value (V): The actual substance or contextual payload of the token.

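By computing the dot product of each Query with the Keys and dividing by the square root of the head dimension, the model produces attention scores. A mask is applied to hide future tokens (ensuring the model can't cheat by looking ahead), and the scores are normalized with Softmax into weights that sum to 1. Each token's output is then the weighted sum of the Value vectors. Splitting this process across multiple "heads" (GPT-2 small uses 12 per block, each 64 dimensions wide) allows the model to analyze different syntactic and semantic properties in parallel.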
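
For a single head, the computation can be sketched as follows. This is a minimal illustration in plain Python, not the visualizer's implementation: it takes the Q, K, and V vectors as given (in a real model they come from learned projections), and it applies the causal mask by only scoring positions up to the current token, which is equivalent to setting the masked scores to negative infinity before Softmax. The function names are illustrative.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product attention with a causal mask."""
    d_head = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Token i may only attend to positions 0..i (the causal mask).
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_head)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the visible Value vectors.
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    for row in causal_attention(q, k, v):
        print(row)
```

The first token can only see itself, so its output is exactly its own Value vector. The second token blends both Values, leaning toward the one whose Key best matches its Query.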

Multi-Layer Perceptron (MLP)

After attention has mixed information across the sequence, the MLP layer refines each token's representation individually. It projects the 768 dimensions up to 3072, applies a non-linear activation (GELU in GPT-2) to capture complex patterns, and then projects back down to the original 768 dimensions.
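
The expand, activate, and compress pattern can be sketched like this. It is a toy version under stated assumptions: the widths are 4 and 16 instead of 768 and 3072, the weights are random rather than learned, and the activation is the tanh approximation of GELU, which is assumed here to match GPT-2. All function names are illustrative.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def gelu(x: float) -> float:
    """Tanh approximation of GELU (assumed to be the variant GPT-2 uses)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully connected layer; weights has one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mlp(x: Vector, w_up: Matrix, b_up: Vector,
        w_down: Matrix, b_down: Vector) -> Vector:
    """Expand one token's vector, apply the non-linearity, project back."""
    hidden = [gelu(h) for h in linear(x, w_up, b_up)]
    return linear(hidden, w_down, b_down)


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Random stand-in for learned weights."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    d_model, d_hidden = 4, 16  # toy sizes; the article quotes 768 and 3072
    w_up = random_matrix(d_hidden, d_model, rng)
    w_down = random_matrix(d_model, d_hidden, rng)
    token = [0.1, -0.2, 0.3, 0.4]
    print(mlp(token, w_up, [0.0] * d_hidden, w_down, [0.0] * d_model))
```

Note that `mlp` takes a single token vector: unlike attention, this step never mixes information between positions.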

3. Output Probabilities

After the final block, a linear layer projects each representation onto the entire vocabulary. The model outputs a raw score (logit) for each of the 50,257 possible next tokens. A Softmax function then normalizes these logits into a probability distribution: every value lies between 0 and 1, and together they sum to 1.
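
A minimal sketch of this final step, assuming a tiny vocabulary of four tokens and a three-dimensional hidden state so the numbers can be checked by hand. The weight matrix is invented for illustration, and `next_token_probabilities` is a hypothetical helper name.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(logits: Vector) -> Vector:
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probabilities(hidden: Vector, output_weights: Matrix) -> Vector:
    """Project the last token's hidden state to one logit per vocabulary entry."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]
    return softmax(logits)


if __name__ == "__main__":
    hidden = [1.0, 0.0, -1.0]   # made-up final hidden state
    output_weights = [          # one made-up row per vocabulary token
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [1.0, 1.0, 1.0],
    ]
    probs = next_token_probabilities(hidden, output_weights)
    print([round(p, 3) for p in probs])
    print("sum:", sum(probs))
```

Here the logits are 1, 0, -1, and 0, so token 0 receives the highest probability and tokens 1 and 3 tie.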

During generation, you can manipulate this outcome using inference parameters:

  • Temperature: Divides the logits by a temperature value before Softmax. Values < 1 sharpen the distribution (making the output more deterministic and literal), while values > 1 flatten it (increasing randomness and creativity).
  • Top-K: Hard-caps the candidate pool to only the K most likely tokens.
  • Top-P (Nucleus): Dynamically selects the smallest pool of tokens whose cumulative probability exceeds the threshold P.
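
The three parameters above can be sketched as small functions over a list of logits. This is an illustrative version, not the visualizer's code: function names are hypothetical, the filters renormalize the surviving probabilities so they sum to 1 again, and the top-p loop stops as soon as the cumulative probability reaches the threshold.

```python
import math
import random
from typing import Dict, List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply Softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs: List[float], k: int) -> Dict[int, float]:
    """Keep the k most likely token IDs and renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in ranked)
    return {i: probs[i] / kept for i in ranked}


def top_p_filter(probs: List[float], p: float) -> Dict[int, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in ranked:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return {i: probs[i] / cumulative for i in chosen}


def sample(distribution: Dict[int, float], rng: random.Random) -> int:
    """Draw one token ID according to the filtered probabilities."""
    ids = list(distribution)
    return rng.choices(ids, weights=[distribution[i] for i in ids], k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
    for t in (0.5, 1.0, 2.0):
        print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
    probs = softmax_with_temperature(logits, 1.0)
    print(top_k_filter(probs, 2))
    print(top_p_filter(probs, 0.9))
    print(sample(top_k_filter(probs, 2), random.Random(0)))
```

Lower temperatures push more probability onto the top token, while top-k and top-p shrink the candidate pool before the final random draw.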

Auxiliary Stabilizers

For training stability and speed, the architecture utilizes three critical support structures:

  • Layer Normalization: Rescales each token's activation vector to zero mean and unit variance, then applies a learned scale and shift. Keeping activations in a consistent range stabilizes training and typically speeds up convergence.
  • Dropout: A regularization technique that randomly zeroes a fraction of activations during training to discourage rote memorization (overfitting). It is disabled during live inference.
  • Residual Connections: Shortcut paths that wrap around the Attention and MLP layers by adding each layer's input to its output. They help the original input signal (and error gradients during training) pass through a deep network without degrading.
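
The three stabilizers fit together in one step, sketched below for a single token vector. This is a simplified illustration with hypothetical function names. It assumes the normalization is applied before each sublayer (the ordering GPT-2 is assumed to use), omits the learned scale and shift of Layer Normalization, and uses "inverted" dropout, which rescales the surviving activations during training.

```python
import math
import random
from typing import Callable, List, Optional

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Rescale one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def dropout(x: Vector, rate: float, training: bool, rng: random.Random) -> Vector:
    """Zero a random fraction of activations while training; no-op otherwise."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def residual_step(x: Vector,
                  sublayer: Callable[[Vector], Vector],
                  rate: float = 0.1,
                  training: bool = False,
                  rng: Optional[random.Random] = None) -> Vector:
    """Compute x + dropout(sublayer(layer_norm(x))) for one token vector."""
    rng = rng or random.Random(0)
    update = dropout(sublayer(layer_norm(x)), rate, training, rng)
    return [xi + ui for xi, ui in zip(x, update)]


if __name__ == "__main__":
    token = [1.0, 2.0, 3.0, 4.0]
    # Stand-in sublayer; in the model this would be attention or the MLP.
    halve = lambda h: [0.5 * v for v in h]
    print(layer_norm(token))
    print(residual_step(token, halve))
```

If the sublayer contributes nothing, `residual_step` returns its input unchanged, which is the sense in which the original signal can pass through many blocks intact.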

Source Experiment by: Aeree Cho, Grace C. Kim, Alexander Karpekov, Alec Helbling, Jay Wang, Seongmin Lee, Benjamin Hoover, and Polo Chau (Polo Club of Data Science, Georgia Tech)