Module 14: Generative AI & Agents
The Frontier of Modern AI
Generative AI has moved from research curiosity to industrial reality faster than any previous AI technology. Large Language Models can write code, analyze legal documents, generate synthetic training data, and power autonomous agents that browse the web and execute multi-step tasks. This module provides a deep technical understanding of the systems you will be building with for the next decade—LLM architectures, prompt engineering as a precise discipline, intelligent agents, and Retrieval-Augmented Generation.
🧬 LLMs Deep Dive — Architecture and Scale
Modern LLMs are decoder-only Transformer architectures (GPT style) trained with causal language modeling at scales previously unimaginable. Understanding their architecture enables you to use them effectively and debug failures intelligently.
Tokenization — The Interface to Language:
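LLMs don't process characters or words; they process tokens. Tokenization splits text into subword units using algorithms such as Byte Pair Encoding (BPE), implemented in tools like tiktoken and SentencePiece. In English, a token averages roughly 4 characters. GPT-4's tokenizer has a vocabulary of ~100,000 tokens. Understanding tokenization is critical because token counts are not obvious from the text: a common word is usually a single token, while a brand name like "ChatGPT" or a rare word like "Supercalifragilistic" is typically split into several pieces (the exact count depends on the tokenizer, so measure it rather than guess). Non-English text often tokenizes less efficiently: a Chinese character may be one token or several, and some special characters and emoji consume multiple tokens each.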
import tiktoken

# Load the tokenizer that matches the target model
enc = tiktoken.encoding_for_model("gpt-4")
text = "Understanding tokenization is critical for LLM engineering."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")

def estimate_cost(text, model="gpt-4", cost_per_1k=0.03):
    """Estimate input cost in USD. cost_per_1k is an example rate; check current pricing."""
    encoding = tiktoken.encoding_for_model(model)  # use the tokenizer for the requested model
    num_tokens = len(encoding.encode(text))
    return (num_tokens / 1000) * cost_per_1k

cost = estimate_cost("Your 10,000 word document text here...")
print(f"Estimated input cost: ${cost:.4f}")
Scaling Laws — Why Bigger Models Are Better: Kaplan et al. (2020) and Hoffmann et al. (2022, Chinchilla) established empirical power-law relationships between model loss, parameter count, and training data size. The Chinchilla result: for a fixed compute budget, training is roughly compute-optimal at about 20 tokens of training data per parameter, so a 70B-parameter model should train on ~1.4T tokens. The practical consequence is that parameters and data should be scaled together; a larger model trained on too little data can underperform a smaller, well-trained one. These laws inform how frontier AI labs allocate their compute budgets.
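The 20-tokens-per-parameter rule of thumb is easy to turn into arithmetic. The sketch below is illustrative only: the function names are hypothetical, the ratio of 20 is the approximate figure quoted above, and the training-compute estimate uses the common approximation of about 6 FLOPs per parameter per token, which is an assumption not stated in this module.

```python
import math

def chinchilla_optimal_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training tokens for a model of num_params parameters."""
    return num_params * tokens_per_param

def approx_training_flops(num_params: float, num_tokens: float) -> float:
    """Rough training cost, assuming ~6 FLOPs per parameter per token (an approximation)."""
    return 6.0 * num_params * num_tokens

params = 70e9  # a 70B-parameter model
tokens = chinchilla_optimal_tokens(params)
flops = approx_training_flops(params, tokens)
print(f"Tokens: {tokens:.2e}")
print(f"Approximate training FLOPs: {flops:.2e}")
```

For the 70B model this reproduces the ~1.4T token figure from the text; the FLOPs number is an order-of-magnitude estimate, not a budget.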
Context Windows — The Working Memory of LLMs: The context window is the maximum number of tokens an LLM can process simultaneously. GPT-4 Turbo: 128K tokens. Claude 3: 200K tokens. This matters enormously for applications that process long documents, conversations, and code repositories. Longer context doesn't mean equal attention—models often struggle to recall information from the middle of very long contexts (the "lost in the middle" problem).
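A practical consequence is that applications have to budget tokens explicitly. The following is a minimal sketch of keeping only the most recent messages that fit a context window. The helper names are hypothetical, and the 4-characters-per-token estimate is a rough heuristic for English; a real system would count tokens with the model's own tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; replace with a real tokenizer for accurate counts."""
    return max(1, round(len(text) / chars_per_token))

def fit_to_budget(messages: list[str], context_window: int, reserve_for_output: int) -> list[str]:
    """Keep the newest messages whose estimated tokens fit within the input budget."""
    budget = context_window - reserve_for_output
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest so recent context is preferred
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = ["a" * 400, "b" * 400, "c" * 400]  # about 100 estimated tokens each
trimmed = fit_to_budget(history, context_window=350, reserve_for_output=100)
print(f"Kept {len(trimmed)} of {len(history)} messages")
```

Dropping the oldest messages first is only one policy; summarizing older turns or retrieving relevant ones are alternatives when early context matters.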
Using LLM APIs Effectively:
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system="You are an expert Python code reviewer. Be precise and constructive.",
    messages=[
        {
            "role": "user",
            "content": "Review this function for bugs and style:\n\ndef calc_avg(lst):\n    return sum(lst)/len(lst)"
        }
    ]
)
print(message.content[0].text)
print(f"Input tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")
🎯 Prompt Engineering — The Science of Instruction
Prompt engineering is the discipline of designing inputs to language models that reliably elicit desired outputs. It is not "magic words"—it is a systematic engineering practice with reproducible techniques backed by empirical research.
Zero-Shot Prompting: Ask the model to perform a task without providing examples. This works when the task is within the model's training distribution and the instruction is unambiguous. It is the simplest prompting strategy and a sensible baseline before adding examples.
zero_shot_prompt = """Classify the sentiment of the following customer review.
Classify as: POSITIVE, NEGATIVE, or NEUTRAL.
Respond with only the classification label.
Review: "The product arrived three days late and the packaging was damaged,
but the item itself works perfectly and customer support was excellent."
"""
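Few-Shot Prompting: Provide 3–5 examples of the desired input-output mapping before asking the model to generalize to a new input. This often improves performance substantially on tasks requiring specific output formats, domain-specific reasoning, or unusual classification schemes. The prompt below gives one example per class and ends with an unlabeled query for the model to complete.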
few_shot_prompt = """Classify SQL query complexity.
Query: SELECT * FROM users WHERE id = 1
Complexity: LOW
Query: SELECT u.name, COUNT(o.id) FROM users u LEFT JOIN orders o ON u.id = o.user_id GROUP BY u.id HAVING COUNT(o.id) > 5
Complexity: MEDIUM
Query: WITH ranked AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) as rn FROM employees), stats AS (SELECT dept, AVG(salary) avg_sal FROM employees GROUP BY dept) SELECT r.*, s.avg_sal FROM ranked r JOIN stats s ON r.dept = s.dept WHERE r.rn <= 3
Complexity: HIGH
Query: SELECT name, age FROM customers WHERE country = 'Ghana' ORDER BY age DESC LIMIT 10
Complexity:"""
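In practice few-shot prompts are usually assembled from a list of labeled examples rather than written by hand. The sketch below shows one way to do that; `build_few_shot_prompt` and its parameters are hypothetical names chosen for illustration, not part of any library.

```python
def build_few_shot_prompt(instruction, examples, query,
                          input_label="Query", output_label="Complexity"):
    """Assemble instruction, labeled examples, and an unlabeled final query into one prompt."""
    lines = [instruction, ""]
    for example_input, example_output in examples:
        lines.append(f"{input_label}: {example_input}")
        lines.append(f"{output_label}: {example_output}")
        lines.append("")  # blank line between examples
    # End with the new input and an empty label for the model to complete
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")
    return "\n".join(lines)

examples = [
    ("SELECT * FROM users WHERE id = 1", "LOW"),
    ("SELECT dept, COUNT(*) FROM employees GROUP BY dept HAVING COUNT(*) > 5", "MEDIUM"),
]
prompt = build_few_shot_prompt(
    "Classify SQL query complexity.",
    examples,
    "SELECT name, age FROM customers ORDER BY age DESC LIMIT 10",
)
print(prompt)
```

Keeping examples as data makes it straightforward to swap them, balance classes, or select the most relevant examples per query.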
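Chain-of-Thought (CoT) Prompting: Instruct the model to show its reasoning step by step before producing the final answer. On complex reasoning tasks (math, logic, multi-step planning), CoT can improve accuracy substantially; gains of 30–50% have been reported on some benchmarks, though the effect varies with the task and the model. Writing out intermediate steps lets each step condition on the ones before it, which makes errors less likely but does not guarantee a logically consistent answer. Simply adding "Think step by step" often improves results.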
cot_prompt = """Solve this problem. Think through each step carefully before giving the final answer.
Problem: A store has 150 products. 40% are electronics, 30% are clothing, and the rest are home goods.
Electronics have a 25% profit margin, clothing has 40%, and home goods have 15%.
If total revenue is $500,000 and revenue is split across categories in proportion to product count, what is the total profit?
Let's think step by step:"""
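When evaluating chain-of-thought outputs it helps to have a reference answer computed independently. The sketch below works the problem in code. It assumes revenue is split across categories in proportion to product count, which the problem needs in order to be well-posed; the variable names are illustrative.

```python
total_revenue = 500_000

# Category shares in percent; home goods is "the rest"
share_pct = {"electronics": 40, "clothing": 30}
share_pct["home_goods"] = 100 - sum(share_pct.values())

margin_pct = {"electronics": 25, "clothing": 40, "home_goods": 15}

# Profit per category = revenue * share * margin (both given in percent)
profit_by_category = {
    name: total_revenue * share_pct[name] * margin_pct[name] / 10_000
    for name in share_pct
}
total_profit = sum(profit_by_category.values())

for name, profit in profit_by_category.items():
    print(f"{name}: ${profit:,.0f}")
print(f"Total profit: ${total_profit:,.0f}")
```

Under that assumption the steps are $50,000 for electronics, $60,000 for clothing, and $22,500 for home goods, for a total of $132,500. A model's step-by-step answer can be checked against each intermediate value, not just the final one.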
Structured Output Prompting: Request JSON, XML, or another structured format when the output feeds downstream processing. Specify the exact schema, instruct the model to return nothing but the structured object, and validate the response in code before using it.
import json

structured_prompt = """Analyze the following product review and extract key information.
Return ONLY a valid JSON object with this exact schema, no other text:
{
  "sentiment": "positive|negative|neutral",
  "score": 1-5,
  "key_issues": ["list of specific problems mentioned"],
  "key_praises": ["list of specific positives mentioned"],
  "would_recommend": true|false,
  "urgency": "high|medium|low"
}
Review: "Been using this for 3 months. Battery life is great (easily 2 days) and the camera
is stunning for the price. However, the software has bugs - it crashed twice last week and
the Bluetooth keeps disconnecting from my car. Really frustrated by these software issues
but the hardware is solid."
"""

# `client` is the Anthropic client created earlier; the call arguments are elided here
response = client.messages.create(...)
try:
    analysis = json.loads(response.content[0].text)
    print(f"Sentiment: {analysis['sentiment']}")
    print(f"Issues: {analysis['key_issues']}")
except (json.JSONDecodeError, KeyError) as err:
    # Invalid JSON or a missing field: tighten the prompt or retry
    print(f"Model output did not match the schema ({err!r}); tighten the prompt or retry")
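Parsing with `json.loads` only confirms that the output is JSON, not that it follows the schema. The sketch below shows one way to validate the fields named in the schema above using only the standard library. `parse_review_analysis` is a hypothetical helper, and tolerating extra text around the JSON object is a design assumption rather than a requirement.

```python
import json

ALLOWED_SENTIMENTS = ("positive", "negative", "neutral")
ALLOWED_URGENCY = ("high", "medium", "low")

def parse_review_analysis(raw: str) -> dict:
    """Extract the JSON object from model output and check it against the expected schema."""
    # Tolerate stray text before or after the object by slicing from first "{" to last "}"
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a subclass of ValueError

    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        raise ValueError("sentiment must be positive, negative, or neutral")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError("score must be an integer from 1 to 5")
    for key in ("key_issues", "key_praises"):
        if not isinstance(data.get(key), list):
            raise ValueError(f"{key} must be a list")
    if not isinstance(data.get("would_recommend"), bool):
        raise ValueError("would_recommend must be true or false")
    if data.get("urgency") not in ALLOWED_URGENCY:
        raise ValueError("urgency must be high, medium, or low")
    return data

sample_output = (
    'Here is the analysis: {"sentiment": "negative", "score": 2, '
    '"key_issues": ["crashes", "Bluetooth disconnects"], "key_praises": ["battery life"], '
    '"would_recommend": false, "urgency": "medium"}'
)
print(parse_review_analysis(sample_output)["key_issues"])
```

Because every failure surfaces as a `ValueError`, the caller can catch one exception type and decide whether to retry the request or fall back.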
System Prompts — Defining Model Behavior: The system prompt establishes the model's persona, constraints, response format, and behavioral guidelines. Well-engineered system prompts are a competitive moat—they encode significant product logic and can dramatically change model behavior:
system_prompt = """You are a senior Python code reviewer at a fintech company.
Your responsibilities:
- Identify bugs, security vulnerabilities, and performance issues
- Suggest improvements following PEP 8 and modern Python idioms
- Check for proper error handling and input validation
- Flag any security concerns (SQL injection, hardcoded credentials, etc.)
Response format:
1. Critical Issues (must fix before deployment)
2. Important Improvements (should fix)
3. Minor Suggestions (nice to have)
4. Overall Assessment (1-10 rating with justification)
Be direct and specific. Include corrected code snippets for each issue."""
🤖 Intelligent Agents — The ReAct Framework
LLMs become dramatically more powerful when combined with the ability to use tools. An agent is an LLM that can reason about a task, choose and execute tools (web search, code execution, API calls, database queries), observe results, and iterate until the task is complete. This is the architecture powering AI assistants, autonomous research systems, and software engineering agents.
The ReAct Framework (Reason + Act): ReAct interleaves reasoning traces with tool-use actions. The model produces Thought (reasoning about what to do next) → Action (tool call) → Observation (tool output) → Thought → Action... until reaching the Final Answer. This loop enables complex multi-step problem solving that single-shot prompting cannot achieve.
from anthropic import Anthropic
import json

client = Anthropic()

# Tool definitions: each entry gives the model a name, a description, and a JSON Schema for its arguments
tools = [
    {
        "name": "web_search",
        "description": "Search the web for current information on any topic",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "python_executor",
        "description": "Execute Python code and return the output",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to execute"}
            },
            "required": ["code"]
        }
    }
]
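
def execute_tool(tool_name, tool_input):
    """Dispatch a tool call and return its result as a string."""
    if tool_name == "web_search":
        # Placeholder: replace with a call to a real search API
        return f"[Search results for '{tool_input['query']}': ...]"
    elif tool_name == "python_executor":
        # WARNING: exec() runs arbitrary model-generated code; sandbox this in production
        try:
            exec_globals = {}
            exec(tool_input['code'], exec_globals)
            # The executed code is expected to store its answer in a variable named `result`
            return str(exec_globals.get('result', 'Code executed successfully'))
        except Exception as e:
            return f"Error: {str(e)}"
    # Unknown tool names return an error string so the model can recover
    return f"Error: unknown tool '{tool_name}'"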

def run_agent(user_task, max_iterations=10):
    messages = [{"role": "user", "content": user_task}]
    for iteration in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )
        # Any stop reason other than "tool_use" means the model is done (or was cut off)
        if response.stop_reason != "tool_use":
            return next((b.text for b in response.content if b.type == "text"), "")
        # Echo the assistant turn, including its tool_use blocks, back into the history
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                print(f"[Agent] Using tool: {block.name}({block.input})")
                result = execute_tool(block.name, block.input)
                print(f"[Agent] Tool result: {result[:200]}...")
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,  # ties the result to the originating tool call
                    "content": result
                })
        # Tool results are sent back to the model as a user turn
        messages.append({"role": "user", "content": tool_results})
    return "Agent reached maximum iterations without completing task"

result = run_agent("Research the current AI chip market and calculate the market share percentages for the top 3 vendors. Show your calculations.")
print(result)
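To see the Thought → Action → Observation loop in isolation, it can help to run it without any API at all. The sketch below replaces the LLM with a scripted stand-in so the control flow is deterministic. Everything in it is hypothetical and for illustration only: `scripted_model`, the `lookup` and `add` tools, and the sales figures are made up.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tools; a real agent would call a search API or a sandboxed interpreter
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup": lambda key: {"widgets_sold": "120", "gadgets_sold": "80"}.get(key, "unknown"),
    "add": lambda expr: str(sum(int(part) for part in expr.split("+"))),
}

def scripted_model(observations: List[str]) -> dict:
    """Stand-in for an LLM: chooses the next step from the observations gathered so far."""
    if len(observations) == 0:
        return {"thought": "I need widget sales.", "action": "lookup", "input": "widgets_sold"}
    if len(observations) == 1:
        return {"thought": "Now I need gadget sales.", "action": "lookup", "input": "gadgets_sold"}
    if len(observations) == 2:
        return {"thought": "Add the two numbers.", "action": "add",
                "input": f"{observations[0]}+{observations[1]}"}
    return {"thought": "I have the total.", "final": f"Total units sold: {observations[2]}"}

def react_loop(model: Callable[[List[str]], dict], max_steps: int = 5) -> Tuple[str, List[str]]:
    """Alternate model steps and tool calls until the model emits a final answer."""
    observations: List[str] = []
    trace: List[str] = []
    for _ in range(max_steps):
        step = model(observations)
        trace.append(f"Thought: {step['thought']}")
        if "final" in step:
            trace.append(f"Final Answer: {step['final']}")
            return step["final"], trace
        observation = TOOLS[step["action"]](step["input"])
        trace.append(f"Action: {step['action']}({step['input']})")
        trace.append(f"Observation: {observation}")
        observations.append(observation)
    return "max steps reached", trace

answer, trace = react_loop(scripted_model)
print("\n".join(trace))
```

The structure mirrors `run_agent`: a bounded loop, a dispatch step for tools, and an exit condition when the model stops requesting actions. Swapping `scripted_model` for a real model call is the only conceptual change.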
📚 Retrieval-Augmented Generation (RAG)
LLMs have a critical limitation: they only know what was in their training data (knowledge cutoff) and they cannot access your private documents, internal databases, or real-time information. RAG solves this by retrieving relevant information from an external knowledge base at inference time and injecting it into the prompt context. This enables LLMs to answer questions about your specific documents, data, and knowledge without expensive fine-tuning.
The RAG Pipeline:
- Indexing (Offline): Load documents → chunk into segments (commonly 512–1024 tokens, with some overlap so that sentences spanning a boundary are not lost) → embed each chunk using an embedding model → store embeddings in a vector database.
- Retrieval (Online): Embed the user's query → search vector database for nearest neighbors (most semantically similar chunks) using cosine similarity or approximate nearest neighbor algorithms.
- Augmented Generation: Inject retrieved chunks into the LLM prompt → LLM answers the question using both its parametric knowledge and the retrieved context.
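The three stages above can be sketched end to end with only the standard library. This is a toy illustration, not a production design: word-count vectors stand in for a learned embedding model, a Python list stands in for the vector database, and the function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (a real system uses a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (exhaustive search, no index)."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Inject the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

chunks = [
    "The refund policy allows returns within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days within the country.",
]
query = "What is the refund policy for a purchase?"
top_chunks = retrieve(query, chunks, k=1)
print(build_prompt(query, top_chunks))
```

Word overlap misses paraphrases ("money back" versus "refund"), which is exactly the gap that learned embeddings and approximate nearest neighbor indexes are meant to close.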
# Note: LangChain import paths change between releases; adjust these to your installed version.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain_anthropic import ChatAnthropic

def build_rag_system(documents_path):
    # 1. Load every PDF under documents_path
    loader = DirectoryLoader(documents_path, glob="**/*.pdf")
    documents = loader.load()
    print(f"Loaded {len(documents)} documents")

    # 2. Chunk. Sizes here are measured in characters (the splitter's default), not tokens
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", "! ", "? ", " "]
    )
    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")

    # 3. Embed each chunk and store the vectors in a local Chroma database
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./chroma_db"
    )
    # Recent Chroma versions persist automatically, so this call may be unnecessary or deprecated
    vectorstore.persist()
    print(f"Indexed {len(chunks)} chunks into vector store")

    # 4. Wire retrieval into generation: the top 5 chunks are "stuffed" into one prompt
    llm = ChatAnthropic(model="claude-sonnet-4-6", max_tokens=2048)
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
        return_source_documents=True,
        verbose=True
    )
    return qa_chain

qa_system = build_rag_system("./company_documents/")
query = "What is our Q3 revenue target and which regions are underperforming?"
# On older LangChain releases, call qa_system({"query": query}) instead
result = qa_system.invoke({"query": query})
print(f"Answer: {result['result']}")
print("\nSources used:")
for doc in result['source_documents']:
    print(f"  - {doc.metadata.get('source', 'Unknown')}")
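The `chunk_size` and `chunk_overlap` parameters are easier to reason about with a stripped-down version of chunking. The sketch below is a plain fixed-size sliding window over characters. It is a simplification for illustration and not how `RecursiveCharacterTextSplitter` works internally, since that splitter also tries to break on the listed separators; `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

pieces = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(pieces)
```

With a window of 4 and an overlap of 2, each chunk repeats the last two characters of the previous one. At realistic sizes, that repetition is what keeps a sentence straddling a boundary retrievable from at least one chunk.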