LLM & AI Stack
Cross-domain synthesis: LangChain + MCP + RAG patterns + AWS AI services. Primary moat for targeting AI companies. LLM: update when new AI/ML content is ingested or work experience notes are added.
The Stack
Application layer → LangChain agents, LCEL, tool use
Protocol layer → MCP (Model Context Protocol) — LLM ↔ tools
RAG layer → Chunking → embedding → vector store → retrieval → generation
ML platform → AWS SageMaker, Bedrock, MLflow
Fundamentals → Attention, transformers, fine-tuning, eval
RAG — Production Reality
Most candidates know the toy RAG diagram. The moat is knowing what breaks in production.
Standard RAG Pipeline
Documents → Chunk → Embed → Vector Store
Query → Embed → ANN Search → Top-K chunks → LLM → Answer
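A minimal sketch of this pipeline in LangChain, assuming OpenAI models and an in-memory FAISS index; the model name, chunk sizes, and sample document are illustrative, not the AskTGE configuration:

```python
# Ingest: chunk -> embed -> index; Query: retrieve -> prompt -> generate
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = [Document(page_content="Refunds are processed within 14 days of purchase.")]

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
store = FAISS.from_documents(chunks, OpenAIEmbeddings())
retriever = store.as_retriever(search_kwargs={"k": 4})  # top-k ANN search

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

def format_docs(hits):
    return "\n\n".join(d.page_content for d in hits)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)
print(chain.invoke("What is the refund window?"))
```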
What actually breaks (from AskTGE work)
| Problem | Root cause | Fix |
|---|---|---|
| Irrelevant chunks retrieved | Chunk size too large, loses context | Smaller chunks + parent document retrieval |
| Missing context across chunks | Naive fixed-size chunking splits semantic units | Sentence-boundary chunking, sliding window overlap |
| Citation hallucination | LLM cites chunks that don't support claim | Source ID in prompt + strict citation template |
| Slow retrieval at scale | Full scan ANN index | HNSW index, filtering pre/post retrieval |
| SSE streaming bugs | Event protocol mismatch (`final_response` vs `complete`) | Strict event schema, integration tests |
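A sketch of the first two fixes combined (smaller chunks for retrieval precision, parent chunks for generation context) via LangChain's `ParentDocumentRetriever`; the Chroma store, embedding choice, and chunk sizes are illustrative assumptions:

```python
# Search small child chunks, return the larger parent chunk to the LLM.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="children", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),  # holds the full parent chunks
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=40),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=1000),
)
retriever.add_documents(docs)  # `docs` as in the pipeline sketch above
hits = retriever.invoke("refund policy")  # matched on children, parents returned
```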
See blog draft: [[Private/Project_2026/Blog/RAG in Production Draft]]
RAG System Design
[[System Design/Problem Designs/RAG & LLM System]] — ingestion pipeline, vector DB, retrieval strategies, evaluation.
LangChain
[[AI & ML/Langchain]] — full notes.
Core Components
- Chain (legacy): sequential steps, `LLMChain`, `SequentialChain`
- LCEL: pipe operator `|`, composable, streaming-native: `prompt | llm | parser` (sketch below)
- Agents: LLM decides which tool to call, observes output, loops until done
- Tools: Python functions wrapped with `@tool` or `StructuredTool`
- Memory: `ConversationBufferMemory`, `ConversationSummaryMemory`
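A minimal LCEL sketch, assuming an OpenAI chat model; the same chain serves `.invoke` and `.stream` without changes:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI  # assumption: OpenAI backend

prompt = ChatPromptTemplate.from_template("Summarize in one line: {text}")
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

print(chain.invoke({"text": "LCEL composes runnables with the | operator."}))
for token in chain.stream({"text": "Streaming works the same way."}):
    print(token, end="", flush=True)
```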
Agent Types (know the tradeoffs)
| Agent | How | When |
|---|---|---|
| ReAct | Thought → Action → Observation loop | General tool use |
| OpenAI Functions | Structured JSON tool calls (native) | When using GPT-3.5/4 |
| Plan-and-Execute | Plan all steps, then execute | Long multi-step tasks |
| Conversational | Memory + tool use | Chatbot with tools |
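A ReAct sketch using the LangChain 0.1-style `create_react_agent` API; the hub prompt is the stock example and `word_count` is a toy tool, not project code:

```python
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""  # the LLM reads this to decide when to call
    return len(text.split())

prompt = hub.pull("hwchase17/react")  # standard ReAct prompt template
agent = create_react_agent(ChatOpenAI(model="gpt-4o-mini"), [word_count], prompt)
executor = AgentExecutor(agent=agent, tools=[word_count], verbose=True)
executor.invoke({"input": "How many words are in 'the quick brown fox'?"})
```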
Production tips
- Use LCEL over legacy chains — streaming, tracing, composable
- LangSmith for tracing in production (W08 plan)
- Tool descriptions matter more than prompts — LLM uses description to decide when to call
- Async chains with `ainvoke` for concurrent tool calls (sketch below)
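A minimal async sketch reusing `chain` from the LCEL example above; `ainvoke` plus `asyncio.gather` runs the calls concurrently instead of serially:

```python
import asyncio

async def main():
    # Both requests are in flight at the same time.
    a, b = await asyncio.gather(
        chain.ainvoke({"text": "first request"}),
        chain.ainvoke({"text": "second request"}),
    )
    print(a, b)

asyncio.run(main())
```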
MCP — Model Context Protocol
[[AI & ML/MCP]] — full notes.
What it is
Standardized protocol for LLMs to interact with external tools, data sources, and systems. Anthropic-designed. Claude Code uses it natively.
Architecture
Host (Claude/app) ←→ MCP Server ←→ Tools/Resources/Prompts
- Tools: functions LLM can call (read file, run SQL, search web)
- Resources: data LLM can read (files, DB tables, API responses)
- Prompts: reusable prompt templates exposed by server
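A minimal server sketch with the official Python SDK's `FastMCP`; the server name, tool, and `notes://` resource scheme are illustrative:

```python
from pathlib import Path

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")
NOTES_DIR = Path("notes")  # assumption: a local notes directory exists

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

@mcp.resource("notes://{name}")
def read_note(name: str) -> str:
    """Expose a local note file as a readable resource."""
    return (NOTES_DIR / f"{name}.md").read_text()

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; the host spawns this process
```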
Why it matters for interviews
MCP is where the industry is moving — from bespoke tool integrations to standardized protocol. Knowing it signals you're building with current LLM infrastructure, not last year's patterns.
Talking point: "I've used MCP in Claude Code to integrate with GitHub, Gmail, and local tools. The protocol separates concerns cleanly — the LLM doesn't need to know how a tool works, just what it does and what to pass it."
AWS AI Services
[[AWS/ML Associate Exam/06 - AWS AI Services]] — full notes.
Bedrock
- Managed LLMs (Claude, Llama, Mistral, Titan)
- Knowledge Bases — managed RAG (S3 → embeddings → OpenSearch Serverless)
- Agents — ReAct agents with AWS tool integrations
- Guardrails — content filtering, PII detection
- Evaluation — automated model evaluation
Use case positioning: "If a client needs RAG without managing vector DB infra, Bedrock Knowledge Bases + Agents is the fastest path to production."
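A sketch of that path with boto3's `retrieve_and_generate`; the knowledge base ID and model ARN are placeholders:

```python
import boto3

client = boto3.client("bedrock-agent-runtime")
resp = client.retrieve_and_generate(
    input={"text": "What is our refund policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB123EXAMPLE",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)
print(resp["output"]["text"])        # generated answer
for c in resp.get("citations", []):  # source attributions from retrieval
    print(c["retrievedReferences"])
```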
SageMaker
- Training jobs, tuning jobs, processing jobs
- Endpoints (real-time) + batch transform
- Pipelines — MLOps workflow orchestration
- Feature Store — online + offline feature serving
- JumpStart — foundation model fine-tuning
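A sketch of calling a deployed real-time endpoint; the endpoint name and payload schema are placeholders for whatever model is hosted:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")
resp = runtime.invoke_endpoint(
    EndpointName="my-endpoint",      # placeholder
    ContentType="application/json",
    Body=json.dumps({"inputs": "example payload"}),
)
print(json.loads(resp["Body"].read()))
```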
Key AI Services (MLA-C01 scope)
| Service | What it does |
|---|---|
| Rekognition | Image/video analysis, face detection |
| Comprehend | NLP: sentiment, entities, key phrases, PII |
| Transcribe | Speech-to-text, speaker diarization |
| Polly | Text-to-speech |
| Textract | Document OCR, form/table extraction |
| Kendra | Enterprise search with ML ranking |
| Lex | Conversational AI (chatbot) |
| Translate | Neural machine translation |
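One concrete call to ground the table, assuming credentials and region are configured; Comprehend sentiment plus PII detection on a single string:

```python
import boto3

comprehend = boto3.client("comprehend")
text = "Jane Doe at jane@example.com loved the product."

print(comprehend.detect_sentiment(Text=text, LanguageCode="en")["Sentiment"])
print(comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"])
```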
ML Fundamentals (for AI company interviews)
Transformer / Attention (know the intuition)
- Self-attention: each token attends to all others, learns context
- Q, K, V: Query (what I'm looking for), Key (what I have), Value (what I return)
- Multi-head: parallel attention heads capture different relationships
- Positional encoding: adds position info since attention is permutation-invariant
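A toy single-head NumPy version to back the intuition; dimensions are arbitrary:

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how well each query matches each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V  # context-weighted sum of values

tokens = np.random.randn(4, 8)  # 4 tokens, d_model = 8
Wq, Wk, Wv = (np.random.randn(8, 8) for _ in range(3))
out = attention(tokens @ Wq, tokens @ Wk, tokens @ Wv)  # (4, 8): one context vector per token
```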
Fine-tuning vs RAG vs Prompt Engineering
| Approach | When | Cost | Freshness |
|---|---|---|---|
| Prompt engineering | Knowledge already in model, style changes | $0 | Stale (training cutoff) |
| RAG | External/private/fresh knowledge | Medium | Fresh (retrieval) |
| Fine-tuning | Behavior change, style, domain vocab | High | Stale (training cutoff) |
| Full pre-training | New domain from scratch | Very high | As fresh as data |
Rule of thumb: RAG first. Fine-tune only when RAG fails on behavior (not knowledge).
Evaluation Metrics
- RAGAS: faithfulness (no hallucination), answer relevance, context precision/recall
- Classification: precision, recall, F1, AUC-ROC
- Generation: BLEU, ROUGE, human eval, LLM-as-judge
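A RAGAS sketch using the classic column names (question/answer/contexts/ground_truth); newer RAGAS releases rename these, so treat the API shape as an assumption:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are processed within 14 days."],
    "contexts": [["Refunds are processed within 14 days of purchase."]],
    "ground_truth": ["14 days"],
})
print(evaluate(data, metrics=[faithfulness, answer_relevancy,
                              context_precision, context_recall]))
```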
AI Company Interview Talking Points
If asked "Tell me about a production ML/LLM system you built": AskTGE (use as war story, keep code private):
- RAG + multiagent pipeline, Python + LangChain
- Production issues: SSE streaming, citation hallucination, latency at scale
- What I'd do differently: structured outputs, RAGAS eval, Bedrock for managed infra
If asked "What's your view on RAG vs fine-tuning?": Use the table above. Emphasize: RAG first for knowledge, fine-tune for behavior. Production experience shows most teams overestimate the need for fine-tuning.
If asked "How do you evaluate an LLM system?": RAGAS metrics + LLM-as-judge + human eval on samples + latency/cost monitoring.
Related
- [[AI & ML/Langchain]] — LangChain full notes
- [[AI & ML/MCP]] — MCP full notes
- [[System Design/Problem Designs/RAG & LLM System]] — RAG system design
- [[System Design/Problem Designs/ML Feature Store]] — feature store design
- [[AWS/ML Associate Exam/00 - Index & Exam Guide]] — AWS ML cert path
- [[AWS/ML Associate Exam/06 - AWS AI Services]] — Bedrock, Rekognition, etc.
- [[synthesis/Tech Stack Overview]] — positioning guide