LLM Fundamentals — Transformer Stack to AI Systems

Source: [[raw/20 Most Important AI Concepts Explained in Just 20 Minute]]

Full stack: from raw text → tokens → vectors → transformer → LLM → training → inference → systems.


1. Foundation

Neural Networks

Layers of connected neurons. Input layer → hidden layers → output layer.

  • Each layer refines the representation: pixels → edges → shapes → meaning
  • Weights = importance scores on each connection. Training = adjusting weights until outputs are accurate
  • Modern LLMs: billions of weights

Transfer Learning

Don't train from scratch. Take a model trained on broad data → fine-tune on specific task.

  • Foundation model learns general patterns once (expensive: data + compute + $$$)
  • Developers adapt it for specific tasks (cheap: small focused dataset)
  • This is how most modern AI works. You're building on top, not from zero.

2. Transformer Stack

Tokenization

Text → tokens (model's "alphabet"). Tokens ≠ words.

"playing" → ["play", "ing"]   # subword split
"dog"     → ["dog"]           # common word stays whole

Why not full words? Language is messy — new words, spelling mistakes, mixed languages. Fixed token vocabulary = model can handle any text via familiar fragments.

Practical: ~4 chars per token average for English. "Hello world" ≈ 2-3 tokens. Code is token-dense.
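The subword idea can be sketched with a toy greedy longest-match tokenizer. The vocabulary below is made up for illustration; real tokenizers (BPE, as in OpenAI's tiktoken) learn their vocabulary from data:

```python
# Toy subword tokenizer: greedy longest-match against a fixed vocabulary.
# Illustrative only -- real BPE tokenizers learn merges from a corpus.
VOCAB = {"play", "ing", "dog", "jump", "ed", "un", "happy"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest known fragments, left to right."""
    tokens, i = [], 0
    while i < len(word):
        # try the longest remaining substring first
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character falls back to itself
            i += 1
    return tokens

print(tokenize("playing"))  # ['play', 'ing']
print(tokenize("dog"))      # ['dog']
```

Any text can be covered this way: known fragments where possible, single characters as a last resort, so there is never an "unknown word".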

Embeddings

Token → vector (list of numbers representing meaning).

  • Similar words = nearby in high-dimensional space
  • doctor ↔ nurse are close; doctor ↔ mountain are far
  • Relationships are geometry: king - man + woman ≈ queen

Model doesn't understand language — it understands distance and direction in vector space.
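"Distance in vector space" is usually cosine similarity. A minimal sketch with made-up 3-d vectors (real embeddings have hundreds to thousands of dimensions):

```python
import math

# Toy 3-d embeddings, invented for illustration only.
emb = {
    "doctor":   [0.90, 0.80, 0.10],
    "nurse":    [0.85, 0.75, 0.20],
    "mountain": [0.10, 0.20, 0.90],
}

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine(emb["doctor"], emb["nurse"]))     # high: similar meaning
print(cosine(emb["doctor"], emb["mountain"]))  # low: unrelated
```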

Attention

Context-dependent meaning. The word "Apple" means different things in "I ate an Apple" vs "I invested in Apple."

  • Attention lets each token look at ALL other tokens and weigh which ones matter most
  • "She bought shares in Apple" → "Apple" attends to "shares" and "bought" → company, not fruit
  • Self-attention: tokens attend to each other within the same sequence
  • Before attention: models read left-to-right sequentially, missed long-range relationships

"Attention is All You Need" (2017) — the paper that changed everything.
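The core computation is scaled dot-product attention. A pure-Python sketch for a single query over toy 2-d vectors (real models use learned query/key/value projections and many heads):

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query: weight each value
    by how well its key matches the query."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights

# Toy example: "Apple" (query) attends over "shares", "bought", "the"
out, weights = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.1], [0.9, 0.0], [0.0, 1.0]],
    values=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
)
print(weights)  # largest weight on the key most aligned with the query
```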

Transformer

Stack of attention + processing layers. Introduced 2017.

Text → Tokenize → Embed → [Attention Layer × N] → Output
                           (each layer refines)

Key advantage over older architectures (RNNs):

  • Processes all tokens in parallel (GPUs love this)
  • No sequential bottleneck → scales to massive models
  • Handles long-range dependencies naturally

Powers: GPT, Claude, Gemini, Llama, Mistral — all transformers.


3. Large Language Models

LLM Training Objective

Predict the next token. That's it.

Input: "The capital of France is"
Target: "Paris"

Trained on trillions of tokens, the model learns grammar, facts, reasoning patterns, code — all from next-token prediction at massive scale.

"Large" = number of parameters. GPT-4 ≈ 1T+ params. Training cost = millions of dollars.

Context Window

Maximum tokens the model can process in one interaction (input + output combined).

Model          | Context window
Early GPT      | ~4K tokens
GPT-4 Turbo    | 128K tokens
Claude 3.7     | 200K tokens
Gemini 1.5 Pro | 1M tokens

Bigger context = more memory + compute + cost = slower.

"Lost in the middle" problem: Models attend more to content at the beginning and end of the context. Information buried in the middle is underweighted. Put critical instructions at start or end, not in the middle of a long prompt.
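A quick pre-flight budget check using the ~4 chars/token heuristic from above (`fits_in_context` is an illustrative helper, not a library function; use the model's real tokenizer for exact counts):

```python
# Rough token-budget check before sending a prompt.
# Assumes ~4 characters per token for English text -- a crude estimate.
def fits_in_context(prompt: str, max_output_tokens: int, context_window: int) -> bool:
    est_prompt_tokens = len(prompt) / 4
    return est_prompt_tokens + max_output_tokens <= context_window

print(fits_in_context("Summarize: " + "x" * 2000, 500, 4096))   # True
print(fits_in_context("Summarize: " + "x" * 20000, 500, 4096))  # False
```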

Temperature

Controls how the model samples the next token from the probability distribution.

Temperature      | Behavior                               | Use for
~0 (0.1-0.3)     | Near-greedy: picks highest-prob tokens | Code generation, factual Q&A, structured output
~0.7             | Balanced                               | General chat, summarization
~1.0+            | Explores lower-prob tokens             | Creative writing, brainstorming, diversity
Very high (>1.5) | Often incoherent                       | Experimentation only
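Mechanically, temperature divides the logits before softmax, then a token is sampled from the resulting distribution. A minimal sketch:

```python
import math
import random

def sample(logits, temperature=1.0):
    """Sample a token index after temperature scaling.
    Lower temperature sharpens the distribution toward the top token."""
    scaled = [l / max(temperature, 1e-6) for l in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.1]   # e.g. "Paris", "Lyon", "banana"
print(sample(logits, 0.1))  # near-zero temperature: almost always index 0
```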

Hallucination

Model generates confident, fluent, wrong output.

Why: The model is a next-token predictor, not a truth-verifier. If a false statement "looks like" a natural continuation, it generates it with full confidence.

Mitigations:

  • RAG — ground responses in retrieved real documents
  • Citations — force model to reference source material
  • Low temperature — reduces randomness, slightly reduces hallucination
  • Self-consistency — generate multiple responses, pick majority answer
  • Human-in-the-loop — for high-stakes outputs

Cannot be fully eliminated. Always verify model output for facts, code correctness, APIs.
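The self-consistency mitigation is just a majority vote over multiple sampled responses. Sketch:

```python
from collections import Counter

def self_consistent_answer(answers: list[str]) -> str:
    """Majority vote across several sampled responses to the same prompt."""
    return Counter(answers).most_common(1)[0][0]

# e.g. five sampled completions for the same math question
print(self_consistent_answer(["408", "408", "406", "408", "418"]))  # 408
```

The idea: independent sampling errors tend to scatter, while the correct answer recurs, so the mode is more reliable than any single sample.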


4. Training Optimization

Fine-Tuning

Take pretrained model → continue training on smaller, domain-specific dataset.

When: You need consistent behavior on a specific domain (medical, legal, your company's tone).

Cost: High — must load entire model + training data. Multi-GPU setup required for large models.

Alternative: If you just need factual grounding, use RAG instead. Fine-tuning is for behavior/style, RAG is for knowledge.

RLHF (Reinforcement Learning from Human Feedback)

What turns "next token predictor" into a helpful, safe assistant.

1. Generate multiple responses to same prompt
2. Humans rank: which response is better?
3. Train a reward model on human preferences
4. Fine-tune LLM using RL to maximize reward model score

Without RLHF: model is fluent but not helpful, safe, or instruction-following. With RLHF: model learns a sense of preference — what good answers look like.

Used by: OpenAI (InstructGPT → ChatGPT), Anthropic (Constitutional AI variant), most commercial models.

LoRA (Low-Rank Adaptation)

Fine-tuning without updating all parameters.

Problem: Fine-tuning huge models = update billions of params = expensive.

LoRA insight: Weight updates during fine-tuning can be approximated with low-rank matrices — much smaller than the full weight matrix.

Original weights: W (frozen)
LoRA adapter:     W + A×B  (A, B are tiny matrices — fraction of W's size)

Benefits:

  • Fine-tune on a single GPU instead of a cluster
  • Store multiple LoRA adapters (one per task), switch at runtime
  • No degradation on original model capabilities (base weights frozen)

Common in: open-source fine-tuning (Llama, Mistral), multi-task model serving.
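The W + A×B idea in miniature, with made-up rank-1 matrices and pure-Python matmul for clarity (real LoRA lives inside each attention/MLP layer and is trained with gradient descent):

```python
# LoRA forward pass sketch: y = x @ (W + A @ B).
# W is frozen; only the tiny A (d x r) and B (r x d) would be trained.
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

d, r = 4, 1                          # hidden size 4, rank 1
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base
A = [[0.1] for _ in range(d)]        # d x r adapter half
B = [[0.2, 0.0, 0.0, 0.0]]           # r x d adapter half

# Adapter has 2*d*r = 8 trainable numbers vs d*d = 16 in W; the gap
# grows enormously at real sizes (d in the thousands, r of 8-64).
W_eff = matmul(A, B)
W_eff = matadd(W, W_eff)             # effective weight at inference

x = [[1.0, 2.0, 3.0, 4.0]]
print(matmul(x, W_eff))
```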

Quantization

Reduce model size by lowering numerical precision of weights.

Precision   | Bits per weight | Size
FP32        | 32 bits         | Full
FP16/BF16   | 16 bits         | 2x smaller
INT8        | 8 bits          | 4x smaller
INT4 (GGUF) | 4 bits          | 8x smaller

Why it works: Small precision loss (often <5% quality drop) for massive memory savings.

Practical: When you see people running LLaMA 70B on a desktop GPU — they're using 4-bit quantization. Without it, 70B model needs ~140GB VRAM.
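A sketch of the simplest scheme, symmetric INT8: store one float scale per tensor plus small integers, and reconstruct approximately at load time (production quantizers like GGUF/GPTQ use per-block scales and cleverer rounding):

```python
# Symmetric INT8 quantization: 4-byte floats -> 1-byte ints + one scale.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]   # each value fits in int8
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.03, -0.51, 0.27, 0.94]
q, scale = quantize_int8(w)
approx = dequantize(q, scale)
print(q)       # small integers, 1 byte each instead of 4
print(approx)  # close to the originals, not exact -- that's the trade
```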


5. Prompting & Reasoning

Prompt Engineering

Shape the input to get better output. Same question, different prompt = very different results.

Bad prompt       | Better prompt
"explain APIs"   | "explain how REST APIs handle authentication with a JWT example. Format: concept, code, gotcha"
"fix this code"  | "you're a senior Python engineer. This function has a bug. Identify the bug and fix it. Don't change the logic."

Patterns:

  • Role assignment: "You are a senior SRE..."
  • Few-shot examples: Show 2-3 input→output pairs before the actual query
  • Format constraints: "Answer in JSON with keys: issue, fix, example"
  • Instruction at start AND end — for long prompts, repeat key instruction at end (combats lost-in-middle)
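These patterns can be combined in a simple prompt builder (`few_shot_prompt` and all strings below are illustrative, not a library API):

```python
# Prompt builder: few-shot examples, plus the key instruction repeated
# near the end to combat the lost-in-the-middle effect.
def few_shot_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    parts = [instruction]
    for inp, out in examples:
        parts.append(f"Input: {inp}\nOutput: {out}")
    parts.append(instruction)  # repeated instruction near the end
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

print(few_shot_prompt(
    "Classify sentiment as positive or negative.",
    [("great movie", "positive"), ("waste of time", "negative")],
    "loved every minute",
))
```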

Chain of Thought (CoT)

Prompt the model to reason step-by-step instead of jumping to the answer.

Bad:  "What is 17 × 24?" → "408" (might be wrong)

CoT:  "What is 17 × 24? Think step by step."
      → "17 × 20 = 340. 17 × 4 = 68. 340 + 68 = 408." ✓

Why it works: Forces the model to use the context window as "scratch space" instead of pattern-matching to a fast answer. Particularly powerful for: math, multi-step reasoning, logical problems.

Zero-shot CoT: Just add "Think step by step." to any prompt.


6. AI Systems

RAG (Retrieval-Augmented Generation)

Ground LLM responses in real, retrieved documents.

User query
  → embed query → search vector DB → fetch top-K relevant chunks
  → inject chunks into context → LLM generates answer citing them

Why: Sharply reduces hallucination for knowledge-grounded tasks. Knowledge lives in the documents, not the model weights, so updating knowledge means updating documents, not retraining.
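The pipeline in miniature. `embed()` here is a toy bag-of-words stand-in for a real embedding model, and the documents are invented:

```python
import math

DOCS = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Passwords must be at least 12 characters.",
]
VOCAB = sorted({w.strip(".?").lower() for d in DOCS for w in d.split()})

def embed(text):
    """Toy embedding: word counts over a fixed vocabulary."""
    words = [w.strip(".?").lower() for w in text.split()]
    return [words.count(v) for v in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    """Rank documents by similarity to the query; return the top k."""
    q = embed(query)
    return sorted(DOCS, key=lambda d: cosine(embed(d), q), reverse=True)[:k]

query = "how long do refunds take?"
chunks = retrieve(query)
prompt = "Answer using only this context:\n" + "\n".join(chunks) + f"\n\nQ: {query}"
print(chunks[0])  # the refunds document ranks first
```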

When RAG vs Fine-tuning:

  • RAG: You need up-to-date facts, domain knowledge, citations
  • Fine-tune: You need a specific behavior, style, or format the model doesn't naturally have

See: [[System Design/Problem Designs/RAG & LLM System]] — full architecture

Vector Database

Semantic search over embedded documents.

Documents → chunk → embed → store vectors
Query     → embed → nearest-neighbor search → top-K chunks returned

Finds semantically similar content, not just keyword matches.

Common: Pinecone, Weaviate, Qdrant, ChromaDB, pgvector (PostgreSQL extension).

Key operation: Approximate Nearest Neighbor (ANN) search. Exact search costs O(N×D) per query (compare against every vector); ANN indexes (e.g. HNSW graphs) trade a small amount of recall for roughly O(log N) lookups.

AI Agents

LLMs that take actions, not just respond.

Observe state → decide action → execute tool → observe new state → repeat

Tools: web search, code execution, API calls, file read/write, DB queries.

The reliability problem: Each step has failure probability. In a 10-step chain, even 95% per-step reliability = 60% end-to-end success. Building reliable agents requires: retry logic, validation steps, fallback strategies, human checkpoints for high-stakes decisions.
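The compounding math, plus the effect of one retry per step (assuming independent failures):

```python
# End-to-end reliability of a multi-step agent chain.
def chain_success(p_step: float, n_steps: int) -> float:
    """Success probability when every step must succeed."""
    return p_step ** n_steps

print(round(chain_success(0.95, 10), 2))  # 0.6

# One retry per step squares the per-step failure rate
# (for independent failures): 5% failure -> 0.25% failure.
p_retry = 1 - (1 - 0.95) ** 2             # 0.9975
print(round(chain_success(p_retry, 10), 2))
```

This is why retries and validation steps buy so much: shrinking per-step failure compounds just as aggressively in your favor.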

Current state: Good for bounded, well-defined tasks. Unreliable for open-ended long-horizon tasks.

Diffusion Models

Image generation via learned denoising.

Training: real image → gradually add noise until pure static → train model to reverse it
Inference: start with noise → denoise step-by-step guided by prompt → image emerges

Not just images — diffusion now used for video (Sora), audio, 3D content, protein structure prediction.


Interview Talking Points

"What's the difference between fine-tuning and RAG?" → Fine-tuning changes model behavior/style (baked in weights). RAG changes what the model knows at inference time (injected context). For factual grounding: RAG. For consistent persona/format: fine-tune.

"Why does hallucination happen?" → Next-token prediction has no truth verification. Model generates what's statistically plausible, not necessarily true. Mitigate with RAG (external ground truth) or constitutional constraints.

"What is LoRA used for?" → Parameter-efficient fine-tuning. Keeps base model frozen, adds tiny trainable adapters. Enables fine-tuning on consumer hardware; easy to swap adapters per task.

"What is temperature in an LLM?" → Controls randomness in token sampling. Low = deterministic/precise (code, facts). High = creative/diverse (stories, brainstorming). Set to 0 for fully reproducible output.


Related

  • [[AI & ML/AI & ML Topics]] — index
  • [[AI & ML/Langchain]] — LangChain chains, agents, RAG pipelines
  • [[AI & ML/MCP]] — Model Context Protocol for tools/resources
  • [[System Design/Problem Designs/RAG & LLM System]] — RAG architecture deep dive
  • [[synthesis/LLM & AI Stack]] — AI stack positioning for interviews