
LLM & AI Stack

Cross-domain synthesis: LangChain + MCP + RAG patterns + AWS AI services. Primary moat for targeting AI companies. LLM: update when new AI/ML content is ingested or work-experience notes are added.


The Stack

Application layer    →  LangChain agents, LCEL, tool use
Protocol layer       →  MCP (Model Context Protocol) — LLM ↔ tools
RAG layer            →  Chunking → embedding → vector store → retrieval → generation
ML platform          →  AWS SageMaker, Bedrock, MLflow
Fundamentals         →  Attention, transformers, fine-tuning, eval

RAG — Production Reality

Most candidates know the toy RAG diagram. The moat is knowing what breaks in production.

Standard RAG Pipeline

Documents → Chunk → Embed → Vector Store
Query → Embed → ANN Search → Top-K chunks → LLM → Answer
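
A minimal sketch of the query side, with brute-force cosine similarity standing in for the ANN index (swap in HNSW/FAISS at scale); the embedding step is assumed done upstream and vectors are assumed unit-normalized:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, chunk_vecs: np.ndarray,
             chunks: list[str], k: int = 4) -> list[str]:
    """Brute-force cosine top-k. With unit-normalized vectors,
    the dot product is the cosine similarity."""
    scores = chunk_vecs @ query_vec
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def build_prompt(query: str, context: list[str]) -> str:
    """Number each chunk so the LLM can cite sources by ID."""
    sources = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(context))
    return (f"Answer using only the sources below; cite by [id].\n\n"
            f"{sources}\n\nQuestion: {query}")
```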

What actually breaks (from AskTGE work)

| Problem | Root cause | Fix |
| --- | --- | --- |
| Irrelevant chunks retrieved | Chunk size too large, loses context | Smaller chunks + parent document retrieval |
| Missing context across chunks | Naive fixed-size chunking splits semantic units | Sentence-boundary chunking, sliding-window overlap |
| Citation hallucination | LLM cites chunks that don't support the claim | Source IDs in prompt + strict citation template |
| Slow retrieval at scale | Full-scan ANN index | HNSW index, pre-/post-retrieval filtering |
| SSE streaming bugs | Event protocol mismatch (final_response vs complete) | Strict event schema, integration tests |
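
A sketch of the sentence-boundary + sliding-window fix from the table above; the regex splitter is deliberately naive and stands in for a real sentence tokenizer (spaCy, nltk):

```python
import re

def sent_split(text: str) -> list[str]:
    # Naive splitter on end-of-sentence punctuation; use a real
    # sentence tokenizer in production.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk(text: str, max_chars: int = 800, overlap_sents: int = 2) -> list[str]:
    """Pack whole sentences into chunks; repeat the last N sentences
    of each chunk at the start of the next so context spans boundaries."""
    chunks, buf = [], []
    for s in sent_split(text):
        if buf and len(" ".join(buf)) + len(s) > max_chars:
            chunks.append(" ".join(buf))
            buf = buf[-overlap_sents:]  # sliding-window overlap
        buf.append(s)
    if buf:
        chunks.append(" ".join(buf))
    return chunks
```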

See blog draft: [[Private/Project_2026/Blog/RAG in Production Draft]]

RAG System Design

[[System Design/Problem Designs/RAG & LLM System]] — ingestion pipeline, vector DB, retrieval strategies, evaluation.


LangChain

[[AI & ML/Langchain]] — full notes.

Core Components

  • Chain (legacy): sequential steps, LLMChain, SequentialChain
  • LCEL: pipe operator |, composable, streaming-native — prompt | llm | parser (sketch after this list)
  • Agents: the LLM decides which tool to call, observes the output, and loops until done
  • Tools: Python functions wrapped with @tool or StructuredTool
  • Memory: ConversationBufferMemory, ConversationSummaryMemory
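
A minimal LCEL + tool sketch, assuming langchain-openai is installed; the model name is illustrative and any chat model slots in:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI  # assumption: any chat model works here

# LCEL: compose runnables with | — streaming and async come for free.
prompt = ChatPromptTemplate.from_template("Summarize in one sentence:\n\n{text}")
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()
answer = chain.invoke({"text": "LCEL chains are composable runnables."})

# Tools: the docstring becomes the description the LLM reads
# when deciding whether to call the tool.
@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())
```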

Agent Types (know the tradeoffs)

| Agent | How | When |
| --- | --- | --- |
| ReAct | Thought → Action → Observation loop | General tool use |
| OpenAI Functions | Structured JSON tool calls (native) | When using GPT-3.5/4 |
| Plan-and-Execute | Plan all steps, then execute | Long multi-step tasks |
| Conversational | Memory + tool use | Chatbot with tools |

Production tips

  • Use LCEL over legacy chains — streaming, tracing, and composability come built in
  • LangSmith for tracing in production (W08 plan)
  • Tool descriptions matter more than prompts — the LLM uses the description to decide when to call the tool
  • Async chains with ainvoke for concurrent tool calls (sketch below)
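
A sketch of the async fan-out pattern; chain is any LCEL runnable (for example the prompt | llm | parser chain above):

```python
import asyncio

async def fan_out(chain, inputs: list[dict]) -> list:
    # Each ainvoke awaits the model call without blocking the others,
    # so independent requests run concurrently.
    return await asyncio.gather(*(chain.ainvoke(i) for i in inputs))

# results = asyncio.run(fan_out(chain, [{"text": "a"}, {"text": "b"}]))
```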

MCP — Model Context Protocol

[[AI & ML/MCP]] — full notes.

What it is

Standardized protocol for LLMs to interact with external tools, data sources, and systems. Anthropic-designed. Claude Code uses it natively.

Architecture

Host (Claude/app) ←→ MCP Server ←→ Tools/Resources/Prompts
  • Tools: functions LLM can call (read file, run SQL, search web)
  • Resources: data LLM can read (files, DB tables, API responses)
  • Prompts: reusable prompt templates exposed by server
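
A minimal server sketch using FastMCP from the official Python SDK (assuming the mcp package is installed; the tool and resource bodies are stubs for illustration):

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def run_sql(query: str) -> str:
    """Tool: a function the LLM can call. (Stubbed here.)"""
    return f"pretend result set for: {query}"

@mcp.resource("notes://{name}")
def read_note(name: str) -> str:
    """Resource: data the LLM can read, addressed by URI template."""
    return f"contents of note {name}"

if __name__ == "__main__":
    mcp.run()  # serves the protocol over stdio by default
```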

Why it matters for interviews

MCP is where the industry is moving — from bespoke tool integrations to a standardized protocol. Knowing it signals you're building with current LLM infrastructure, not last year's patterns.

Talking point: "I've used MCP in Claude Code to integrate with GitHub, Gmail, and local tools. The protocol separates concerns cleanly — the LLM doesn't need to know how a tool works, just what it does and what to pass it."


AWS AI Services

[[AWS/ML Associate Exam/06 - AWS AI Services]] — full notes.

Bedrock

  • Managed LLMs (Claude, Llama, Mistral, Titan)
  • Knowledge Bases — managed RAG (S3 → embeddings → OpenSearch Serverless)
  • Agents — ReAct agents with AWS tool integrations
  • Guardrails — content filtering, PII detection
  • Evaluation — automated model evaluation

Use case positioning: "If a client needs RAG without managing vector DB infra, Bedrock Knowledge Bases + Agents is the fastest path to production."
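
A minimal invocation sketch via the Converse API (assuming boto3 credentials and region are configured; the model ID is illustrative):

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "One-line summary of RAG?"}]}],
    inferenceConfig={"maxTokens": 200},
)
print(response["output"]["message"]["content"][0]["text"])
```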

SageMaker

  • Training jobs, tuning jobs, processing jobs
  • Endpoints (real-time) + batch transform
  • Pipelines — MLOps workflow orchestration
  • Feature Store — online + offline feature serving
  • JumpStart — foundation model fine-tuning
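
A sketch of real-time inference against a deployed endpoint (the endpoint name is an assumption, and the JSON payload shape depends on the serving container):

```python
import boto3
import json

smr = boto3.client("sagemaker-runtime")

response = smr.invoke_endpoint(
    EndpointName="my-endpoint",          # assumed already deployed
    ContentType="application/json",
    Body=json.dumps({"inputs": "example payload"}),
)
print(json.loads(response["Body"].read()))
```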

Key AI Services (MLA-C01 scope)

| Service | What it does |
| --- | --- |
| Rekognition | Image/video analysis, face detection |
| Comprehend | NLP: sentiment, entities, key phrases, PII |
| Transcribe | Speech-to-text, speaker diarization |
| Polly | Text-to-speech |
| Textract | Document OCR, form/table extraction |
| Kendra | Enterprise search with ML ranking |
| Lex | Conversational AI (chatbots) |
| Translate | Neural machine translation |

ML Fundamentals (for AI company interviews)

Transformer / Attention (know the intuition)

  • Self-attention: each token attends to every other token and learns context (numpy sketch after this list)
  • Q, K, V: Query (what I'm looking for), Key (what I have), Value (what I return)
  • Multi-head: parallel attention heads capture different relationships
  • Positional encoding: adds position information, since attention itself is permutation-invariant
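
A minimal numpy sketch of scaled dot-product self-attention; the learned Q/K/V projection matrices and the multi-head split are omitted for clarity:

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how much each query attends to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V               # weighted sum of values

# 4 tokens, d_model = 8; self-attention uses the same x for Q, K, V
# (before the learned projections).
x = np.random.randn(4, 8)
out = attention(x, x, x)
```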

Fine-tuning vs RAG vs Prompt Engineering

| Approach | When | Cost | Freshness |
| --- | --- | --- | --- |
| Prompt engineering | Knowledge already in model, style changes | ~$0 | Stale (training cutoff) |
| RAG | External/private/fresh knowledge | Medium | Fresh (retrieval) |
| Fine-tuning | Behavior change, style, domain vocab | High | Stale (training cutoff) |
| Full pre-training | New domain from scratch | Very high | As fresh as the data |

Rule of thumb: RAG first. Fine-tune only when the failure is behavioral (style, format, domain voice), not missing knowledge.

Evaluation Metrics

  • RAGAS: faithfulness (no hallucination), answer relevance, context precision/recall
  • Classification: precision, recall, F1, AUC-ROC (quick sketch after this list)
  • Generation: BLEU, ROUGE, human eval, LLM-as-judge
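
A quick sanity-check sketch of the classification metrics with scikit-learn (toy labels, just to pin down the definitions):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print(precision_score(y_true, y_pred))  # 0.75: of predicted positives, how many are right
print(recall_score(y_true, y_pred))     # 0.75: of actual positives, how many were found
print(f1_score(y_true, y_pred))         # 0.75: harmonic mean of precision and recall
```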

AI Company Interview Talking Points

If asked "Tell me about a production ML/LLM system you built": AskTGE (use as war story, keep code private):

  • RAG + multiagent pipeline, Python + LangChain
  • Production issues: SSE streaming, citation hallucination, latency at scale
  • What I'd do differently: structured outputs, RAGAS eval, Bedrock for managed infra

If asked "What's your view on RAG vs fine-tuning?": Use the table above. Emphasize: RAG first for knowledge, fine-tune for behavior. Production experience shows most teams overestimate the need for fine-tuning.

If asked "How do you evaluate an LLM system?": RAGAS metrics + LLM-as-judge + human eval on samples + latency/cost monitoring.


Related

  • [[AI & ML/Langchain]] — LangChain full notes
  • [[AI & ML/MCP]] — MCP full notes
  • [[System Design/Problem Designs/RAG & LLM System]] — RAG system design
  • [[System Design/Problem Designs/ML Feature Store]] — feature store design
  • [[AWS/ML Associate Exam/00 - Index & Exam Guide]] — AWS ML cert path
  • [[AWS/ML Associate Exam/06 - AWS AI Services]] — Bedrock, Rekognition, etc.
  • [[synthesis/Tech Stack Overview]] — positioning guide