LLM & AI Stack
Cross-domain synthesis: LangChain + MCP + RAG patterns + AWS AI services. Primary moat for targeting AI companies. LLM: update when new AI/ML content is ingested or work experience notes are added.
The Stack
Application layer → LangChain agents, LCEL, tool use
Protocol layer → MCP (Model Context Protocol) — LLM ↔ tools
RAG layer → Chunking → embedding → vector store → retrieval → generation
ML platform → AWS SageMaker, Bedrock, MLflow
Fundamentals → Attention, transformers, fine-tuning, eval
RAG — Production Reality
Most candidates know the toy RAG diagram. The moat is knowing what breaks in production.
Standard RAG Pipeline
Documents → Chunk → Embed → Vector Store
Query → Embed → ANN Search → Top-K chunks → LLM → Answer
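A minimal sketch of this pipeline in LangChain, assuming OpenAI models and an in-memory FAISS index; the model name, chunk sizes, and sample document are illustrative, not the AskTGE configuration:

```python
# Ingest: chunk -> embed -> index; Query: retrieve -> prompt -> generate
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = [Document(page_content="Refunds are processed within 14 days of purchase.")]

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
store = FAISS.from_documents(chunks, OpenAIEmbeddings())
retriever = store.as_retriever(search_kwargs={"k": 4})  # top-k ANN search

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

def format_docs(hits):
    return "\n\n".join(d.page_content for d in hits)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)
print(chain.invoke("What is the refund window?"))
```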
What actually breaks (from AskTGE work)
| Problem | Root cause | Fix |
|---|---|---|
| Irrelevant chunks retrieved | Chunk size too large, loses context | Smaller chunks + parent document retrieval |
| Missing context across chunks | Naive fixed-size chunking splits semantic units | Sentence-boundary chunking, sliding window overlap |
| Citation hallucination | LLM cites chunks that don't support claim | Source ID in prompt + strict citation template |
| Slow retrieval at scale | Full scan ANN index | HNSW index, filtering pre/post retrieval |
| SSE streaming bugs | Event protocol mismatch (`final_response` vs `complete`) | Strict event schema, integration tests |
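A sketch of the first two fixes combined (smaller chunks for retrieval precision, parent chunks for generation context) via LangChain's `ParentDocumentRetriever`; the Chroma store, embedding choice, and chunk sizes are illustrative assumptions:

```python
# Search small child chunks, return the larger parent chunk to the LLM.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="children", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),  # holds the full parent chunks
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=40),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=1000),
)
retriever.add_documents(docs)  # `docs` as in the pipeline sketch above
hits = retriever.invoke("refund policy")  # matched on children, parents returned
```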
See blog draft: [[Private/Project_2026/Blog/RAG in Production Draft]]
RAG System Design
[[System Design/Problem Designs/RAG & LLM System]] — ingestion pipeline, vector DB, retrieval strategies, evaluation.
LangChain
[[AI & ML/Langchain]] — full notes.
Core Components
- Chain (legacy): sequential steps, `LLMChain`, `SequentialChain`
- LCEL: pipe operator `|`, composable, streaming-native: `prompt | llm | parser` (sketch below)
- Agents: LLM decides which tool to call, observes output, loops until done
- Tools: Python functions wrapped with `@tool` or `StructuredTool`
- Memory: `ConversationBufferMemory`, `ConversationSummaryMemory`
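A minimal LCEL sketch, assuming an OpenAI chat model; the same chain serves `.invoke` and `.stream` without changes:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI  # assumption: OpenAI backend

prompt = ChatPromptTemplate.from_template("Summarize in one line: {text}")
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

print(chain.invoke({"text": "LCEL composes runnables with the | operator."}))
for token in chain.stream({"text": "Streaming works the same way."}):
    print(token, end="", flush=True)
```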
Agent Types (know the tradeoffs)
| Agent | How | When |
|---|---|---|
| ReAct | Thought → Action → Observation loop | General tool use |
| OpenAI Functions | Structured JSON tool calls (native) | When using GPT-3.5/4 |
| Plan-and-Execute | Plan all steps, then execute | Long multi-step tasks |
| Conversational | Memory + tool use | Chatbot with tools |
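A ReAct sketch using the LangChain 0.1-style `create_react_agent` API; the hub prompt is the stock example and `word_count` is a toy tool, not project code:

```python
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""  # the LLM reads this to decide when to call
    return len(text.split())

prompt = hub.pull("hwchase17/react")  # standard ReAct prompt template
agent = create_react_agent(ChatOpenAI(model="gpt-4o-mini"), [word_count], prompt)
executor = AgentExecutor(agent=agent, tools=[word_count], verbose=True)
executor.invoke({"input": "How many words are in 'the quick brown fox'?"})
```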
Production tips
- Use LCEL over legacy chains — streaming, tracing, composable
- LangSmith for tracing in production (W08 plan)
- Tool descriptions matter more than prompts — LLM uses description to decide when to call
- Async chains with `ainvoke` for concurrent tool calls (sketch below)
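A minimal async sketch reusing `chain` from the LCEL example above; `ainvoke` plus `asyncio.gather` runs the calls concurrently instead of serially:

```python
import asyncio

async def main():
    # Both requests are in flight at the same time.
    a, b = await asyncio.gather(
        chain.ainvoke({"text": "first request"}),
        chain.ainvoke({"text": "second request"}),
    )
    print(a, b)

asyncio.run(main())
```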
MCP — Model Context Protocol
[[AI & ML/MCP]] — full notes.
What it is
Standardized protocol for LLMs to interact with external tools, data sources, and systems. Anthropic-designed. Claude Code uses it natively.
Architecture
Host (Claude/app) ←→ MCP Server ←→ Tools/Resources/Prompts
- Tools: functions LLM can call (read file, run SQL, search web)
- Resources: data LLM can read (files, DB tables, API responses)
- Prompts: reusable prompt templates exposed by server
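A minimal server sketch with the official Python SDK's `FastMCP`; the server name, tool, and `notes://` resource scheme are illustrative:

```python
from pathlib import Path

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")
NOTES_DIR = Path("notes")  # assumption: a local notes directory exists

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

@mcp.resource("notes://{name}")
def read_note(name: str) -> str:
    """Expose a local note file as a readable resource."""
    return (NOTES_DIR / f"{name}.md").read_text()

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; the host spawns this process
```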
Why it matters for interviews
MCP is where the industry is moving — from bespoke tool integrations to standardized protocol. Knowing it signals you're building with current LLM infrastructure, not last year's patterns.
Talking point: "I've used MCP in Claude Code to integrate with GitHub, Gmail, and local tools. The protocol separates concerns cleanly — the LLM doesn't need to know how a tool works, just what it does and what to pass it."
AWS AI Services
[[AWS/ML Associate Exam/06 - AWS AI Services]] — full notes.
Bedrock
- Managed LLMs (Claude, Llama, Mistral, Titan)
- Knowledge Bases — managed RAG (S3 → embeddings → OpenSearch Serverless)
- Agents — ReAct agents with AWS tool integrations
- Guardrails — content filtering, PII detection
- Evaluation — automated model evaluation
Use case positioning: "If a client needs RAG without managing vector DB infra, Bedrock Knowledge Bases + Agents is the fastest path to production."
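A sketch of that path with boto3's `retrieve_and_generate`; the knowledge base ID and model ARN are placeholders:

```python
import boto3

client = boto3.client("bedrock-agent-runtime")
resp = client.retrieve_and_generate(
    input={"text": "What is our refund policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB123EXAMPLE",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)
print(resp["output"]["text"])        # generated answer
for c in resp.get("citations", []):  # source attributions from retrieval
    print(c["retrievedReferences"])
```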
SageMaker
- Training jobs, tuning jobs, processing jobs
- Endpoints (real-time) + batch transform
- Pipelines — MLOps workflow orchestration
- Feature Store — online + offline feature serving
- JumpStart — foundation model fine-tuning
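A sketch of calling a deployed real-time endpoint; the endpoint name and payload schema are placeholders for whatever model is hosted:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")
resp = runtime.invoke_endpoint(
    EndpointName="my-endpoint",      # placeholder
    ContentType="application/json",
    Body=json.dumps({"inputs": "example payload"}),
)
print(json.loads(resp["Body"].read()))
```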
Key AI Services (MLA-C01 scope)
| Service | What it does |
|---|---|
| Rekognition | Image/video analysis, face detection |
| Comprehend | NLP: sentiment, entities, key phrases, PII |
| Transcribe | Speech-to-text, speaker diarization |
| Polly | Text-to-speech |
| Textract | Document OCR, form/table extraction |
| Kendra | Enterprise search with ML ranking |
| Lex | Conversational AI (chatbot) |
| Translate | Neural machine translation |
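One concrete call to ground the table, assuming credentials and region are configured; Comprehend sentiment plus PII detection on a single string:

```python
import boto3

comprehend = boto3.client("comprehend")
text = "Jane Doe at jane@example.com loved the product."

print(comprehend.detect_sentiment(Text=text, LanguageCode="en")["Sentiment"])
print(comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"])
```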
ML Fundamentals (for AI company interviews)
Transformer / Attention (know the intuition)
- Self-attention: each token attends to all others, learns context
- Q, K, V: Query (what I'm looking for), Key (what I have), Value (what I return)
- Multi-head: parallel attention heads capture different relationships
- Positional encoding: adds position info since attention is permutation-invariant
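A toy single-head NumPy version to back the intuition; dimensions are arbitrary:

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how well each query matches each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V  # context-weighted sum of values

tokens = np.random.randn(4, 8)  # 4 tokens, d_model = 8
Wq, Wk, Wv = (np.random.randn(8, 8) for _ in range(3))
out = attention(tokens @ Wq, tokens @ Wk, tokens @ Wv)  # (4, 8): one context vector per token
```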
Fine-tuning vs RAG vs Prompt Engineering
| Approach | When | Cost | Freshness |
|---|---|---|---|
| Prompt engineering | Knowledge already in model, style changes | $0 | Stale (training cutoff) |
| RAG | External/private/fresh knowledge | Medium | Fresh (retrieval) |
| Fine-tuning | Behavior change, style, domain vocab | High | Stale (training cutoff) |
| Full pre-training | New domain from scratch | Very high | As fresh as data |
Rule of thumb: RAG first. Fine-tune only when RAG fails on behavior (not knowledge).
Evaluation Metrics
- RAGAS: faithfulness (no hallucination), answer relevance, context precision/recall
- Classification: precision, recall, F1, AUC-ROC
- Generation: BLEU, ROUGE, human eval, LLM-as-judge
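A RAGAS sketch using the classic column names (question/answer/contexts/ground_truth); newer RAGAS releases rename these, so treat the API shape as an assumption:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are processed within 14 days."],
    "contexts": [["Refunds are processed within 14 days of purchase."]],
    "ground_truth": ["14 days"],
})
print(evaluate(data, metrics=[faithfulness, answer_relevancy,
                              context_precision, context_recall]))
```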
AI Company Interview Talking Points
If asked "Tell me about a production ML/LLM system you built": AskTGE (use as war story, keep code private):
- RAG + multiagent pipeline, Python + LangChain
- Production issues: SSE streaming, citation hallucination, latency at scale
- What I'd do differently: structured outputs, RAGAS eval, Bedrock for managed infra
If asked "What's your view on RAG vs fine-tuning?": Use the table above. Emphasize: RAG first for knowledge, fine-tune for behavior. Production experience shows most teams overestimate the need for fine-tuning.
If asked "How do you evaluate an LLM system?": RAGAS metrics + LLM-as-judge + human eval on samples + latency/cost monitoring.
Related
- [[AI & ML/Langchain]] — LangChain full notes
- [[AI & ML/MCP]] — MCP full notes
- [[System Design/Problem Designs/RAG & LLM System]] — RAG system design
- [[System Design/Problem Designs/ML Feature Store]] — feature store design
- [[AWS/ML Associate Exam/00 - Index & Exam Guide]] — AWS ML cert path
- [[AWS/ML Associate Exam/06 - AWS AI Services]] — Bedrock, Rekognition, etc.
- [[synthesis/Tech Stack Overview]] — positioning guide