Artificial Intelligence | VoidX Academy

10. Natural Language Processing and Large Language Models

Module 10: NLP & LLMs

Teaching Machines to Read and Write

Natural Language Processing (NLP) is the branch of AI concerned with enabling computers to understand, interpret, and generate human language. Language is arguably the most complex signal humans produce: it is context-dependent, ambiguous, hierarchical, and culturally situated. For decades, NLP relied on hand-crafted rules and statistical models. The Transformer architecture, introduced in 2017, changed that by making it practical to train far larger models on far more text. Today, large language models built on Transformers have turned NLP from a specialized research field into the foundation of AI products used by billions of people.

📝 Text Preprocessing — Preparing Language for Models

Raw text cannot be fed directly to mathematical models. It must be converted into numerical representations through a preprocessing pipeline.

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

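# 'punkt_tab' is needed by word_tokenize in recent NLTK releases; older releases use 'punkt'.
nltk.download(['punkt', 'punkt_tab', 'stopwords', 'wordnet'])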

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)       # keep only letters and spaces
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

text = "The neural networks are learning representations from massive datasets!!!"
processed = preprocess_text(text)
print(processed)
# lemmatize() treats every token as a noun by default, so 'networks' becomes
# 'network' but 'learning' is left unchanged (pass pos='v' to get 'learn').
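The function above stops at a list of string tokens, which is still not numerical. One further step is to map each token to an integer ID. The following is a minimal sketch of that step, not a definitive implementation: the helper names `build_vocab` and `encode`, the reserved `<pad>` and `<unk>` entries, and the tiny corpus are all assumptions made for illustration.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1):
    """Map each token to an integer ID; IDs 0 and 1 are reserved."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {"<pad>": 0, "<unk>": 1}  # hypothetical special tokens
    # Most frequent tokens get the smallest IDs; ties are broken alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token with its ID, falling back to <unk> for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

# Toy corpus of already-preprocessed documents (illustrative data only).
corpus = [
    ["neural", "network", "learning"],
    ["network", "training", "data"],
]
vocab = build_vocab(corpus)
ids = encode(["neural", "network", "transformer"], vocab)
print(vocab)
print(ids)  # [5, 2, 1]  ('transformer' is unseen, so it maps to <unk>)
```

Integer IDs like these are what an embedding layer or a count-based vectorizer indexes into.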

🔢 Text Representations — From Words to Vectors

Models need numbers, not words. The evolution of text representations tells the story of modern NLP:

Bag of Words (BoW): Represent text as a vector of word counts. Ignores word order entirely. Simple, but often a surprisingly strong baseline for straightforward classification tasks such as spam or topic detection. Produces sparse, high-dimensional vectors (one dimension per vocabulary word).

TF-IDF (Term Frequency-Inverse Document Frequency): Weights words by how frequently they appear in a document (TF) relative to how common they are across all documents (IDF). Common words like "the" get low weights; rare, distinctive words get high weights. Better than raw counts for information retrieval and classification.
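Both count-based representations can be sketched in a few lines of plain Python. This example assumes the textbook formulation, tf = count / document length and idf = log(N / df); libraries such as scikit-learn apply smoothed variants, so their numbers will differ. The three documents are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]

# Bag of Words: one count per vocabulary word, word order discarded.
vocab = sorted({w for doc in tokenized for w in doc})
bow = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

def tf_idf(doc_tokens, all_docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(doc_tokens)
    n_docs = len(all_docs)
    scores = {}
    for word, count in counts.items():
        tf = count / len(doc_tokens)
        df = sum(1 for d in all_docs if word in d)  # documents containing the word
        scores[word] = tf * math.log(n_docs / df)
    return scores

scores = tf_idf(tokenized[0], tokenized)
print(vocab)
print(bow[0])
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:>4}: {score:.3f}")
```

In the first document, "the" appears in every document, so its idf is log(1) = 0 and its score vanishes, while "mat" appears in only one document and receives the highest weight.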

Word Embeddings (Word2Vec, GloVe): Dense vectors (typically 50–300 dimensions) where semantically similar words have similar vectors. "King - Man + Woman ≈ Queen" is the famous example: simple vector arithmetic captures some semantic relationships. Both methods are trained on large unlabeled text corpora. Word2Vec uses prediction objectives (predict the surrounding words from a word, or a word from its context), while GloVe fits vectors to global word co-occurrence statistics.

import gensim.downloader as api

word2vec = api.load('word2vec-google-news-300')

print(word2vec.most_similar('king', topn=5))
print(word2vec.similarity('cat', 'dog'))
print(word2vec.similarity('cat', 'car'))

result = word2vec.most_similar(positive=['king', 'woman'], negative=['man'])
print(result[0])  # ('queen', 0.71...)
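The `similarity` call above returns the cosine similarity between two word vectors, and the analogy query ranks candidate words by cosine similarity to a combination of the input vectors. The sketch below shows the idea in plain Python. The 3-dimensional vectors are hypothetical values hand-picked for illustration, not real Word2Vec embeddings, and gensim's own implementation differs in details such as vector normalization.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings"; rough reading of the axes:
# [royalty, maleness, femaleness].
toy = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "prince": [0.8, 0.9, 0.1],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "apple":  [0.05, 0.1, 0.1],
}

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank every word that was not part of the query by similarity to the target.
query_words = {"king", "man", "woman"}
candidates = {w: v for w, v in toy.items() if w not in query_words}
best = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))
print(best)  # queen
```

Excluding the query words from the candidates mirrors what analogy queries typically do; otherwise the nearest vector is often one of the inputs.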

🔁 The Transformer Architecture — The Foundation of Modern AI

The Transformer, introduced by Vaswani et al. (2017) in "Attention Is All You Need," is the architecture underlying GPT, BERT, Claude, Gemini, and virtually every modern AI system working with language, audio, or video. Understanding it is non-negotiable for modern AI engineers.

The Core Innovation — Self-Attention:

Previous sequence models (RNNs, LSTMs) processed text sequentially, one token at a time, which prevented parallelization across positions during training and made it hard to relate words far apart in a sequence. Self-attention lets each token attend directly to every other token in the sequence in a single step, computing a weighted representation based on relevance.

How Attention Works:

For each token, compute three vectors: Query (Q), Key (K), and Value (V) via learned weight matrices.
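Compute attention scores: score(Q, K) = Q·K^T / √d_k, where d_k is the key dimension. Dividing by √d_k keeps the dot products from growing with the dimension, which would otherwise push the softmax into saturated regions with very small gradients.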
Apply softmax to scores to get attention weights summing to 1.
Compute the weighted sum of Value vectors: output = softmax(Q·K^T/√d_k) · V
The output for each token is now a context-aware representation that incorporates information from all other tokens, weighted by relevance.
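The steps above can be traced by hand on a tiny example. This sketch uses plain Python lists so every number is visible. It assumes Q, K, and V are given directly (hand-picked illustrative values) rather than produced by learned weight matrices, and it handles a single head with no batch dimension.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Softmax turns the scores into weights that sum to 1.
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three tokens with d_k = 2 (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
for i, (w, o) in enumerate(zip(weights, outputs)):
    print(f"token {i}: weights={[round(x, 3) for x in w]} output={[round(x, 3) for x in o]}")
```

For the first token, the query [1, 0] has the same dot product with the first and third keys and a dot product of zero with the second, so the first and third values receive equal, larger weights.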

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
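        # embed_dim is split evenly across heads, so it must divide cleanly.
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads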
        
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
    
    def forward(self, x, mask=None):
        B, T, C = x.shape   # batch, sequence length, embed dim
        
        Q = self.q_proj(x).reshape(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        K = self.k_proj(x).reshape(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        V = self.v_proj(x).reshape(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        
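        # Scaled dot-product scores, shape (B, num_heads, T, T)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            # mask must broadcast to (B, num_heads, T, T); positions where mask == 0 are hidden.
            # A row that is masked at every position would yield NaN after the softmax.
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)  # each row sums to 1 over the keys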
        
        attended = torch.matmul(weights, V)
        attended = attended.transpose(1, 2).reshape(B, T, C)
        return self.out_proj(attended)

Multi-Head Attention: Rather than computing a single attention function, Transformers run multiple attention heads in parallel, each learning to attend to different types of relationships (syntax, semantics, coreference). Their outputs are concatenated and projected.
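To see how the heads are split and recombined, it can help to follow only the tensor shapes through the `SelfAttention.forward` method above. The helper below is a hypothetical bookkeeping function written for this illustration (no real tensors are involved), and the sizes in the usage line are arbitrary example values.

```python
def multi_head_shapes(batch, seq_len, embed_dim, num_heads):
    """Trace tensor shapes through multi-head attention without allocating tensors."""
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads
    return {
        "input x":             (batch, seq_len, embed_dim),
        "Q, K, V after split": (batch, num_heads, seq_len, head_dim),
        "scores = Q @ K^T":    (batch, num_heads, seq_len, seq_len),
        "weights @ V":         (batch, num_heads, seq_len, head_dim),
        "after concatenation": (batch, seq_len, embed_dim),
    }

shapes = multi_head_shapes(batch=2, seq_len=10, embed_dim=512, num_heads=8)
for name, shape in shapes.items():
    print(f"{name:<22} {shape}")
```

Each head works on a slice of width embed_dim / num_heads, so adding heads changes how the embedding is partitioned rather than how large it is.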

The Full Transformer Block:

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attention = SelfAttention(embed_dim, num_heads)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
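        # Pre-norm variant: LayerNorm is applied before each sub-layer.
        # (The original 2017 paper normalizes after the residual addition instead.)
        x = x + self.dropout(self.attention(self.norm1(x), mask))  # residual + attention
        x = x + self.dropout(self.feed_forward(self.norm2(x)))     # residual + FFN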
        return x
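A quick way to get a feel for the block is to count its learnable parameters. The sketch below assumes exactly the layers defined in `TransformerBlock` and `SelfAttention` above (four biased linear projections, a two-layer feed-forward network, two LayerNorms). The function names are hypothetical, and the sizes 768 and 3072 are example values matching BERT-base's hidden and feed-forward widths.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def transformer_block_params(embed_dim, ff_dim):
    """Count learnable parameters in one block laid out like TransformerBlock above."""
    attention = 4 * linear_params(embed_dim, embed_dim)  # q, k, v and output projections
    feed_forward = linear_params(embed_dim, ff_dim) + linear_params(ff_dim, embed_dim)
    layer_norms = 2 * (2 * embed_dim)                    # scale and shift per LayerNorm
    return {
        "attention": attention,
        "feed_forward": feed_forward,
        "layer_norms": layer_norms,
        "total": attention + feed_forward + layer_norms,
    }

counts = transformer_block_params(embed_dim=768, ff_dim=3072)
for name, n in counts.items():
    print(f"{name:<13} {n:>10,}")
# total is 7,087,872 for these sizes
```

The number of heads does not appear in the count: splitting into heads only partitions the same projection matrices. With these sizes, the feed-forward network holds roughly twice as many parameters as the attention layer.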

🤗 Hugging Face — NLP in Practice

from transformers import pipeline, AutoTokenizer, AutoModel
import torch

# With no model argument, pipeline() downloads a default checkpoint for the task.
sentiment = pipeline("sentiment-analysis")
result = sentiment("The product exceeded all my expectations!")
print(result)

qa = pipeline("question-answering")
context = "The Transformer architecture was introduced by Google in 2017 in the paper 'Attention Is All You Need'."
result = qa(question="When was the Transformer introduced?", context=context)
print(result)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Understanding transformers is key to modern AI."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

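# Use the final hidden state of the [CLS] token (position 0) as a simple sentence vector.
# Without fine-tuning this is only a rough representation; mean pooling over tokens is a common alternative.
sentence_embedding = outputs.last_hidden_state[:, 0, :]
print(f"Sentence embedding shape: {sentence_embedding.shape}")  # torch.Size([1, 768])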

🧬 Foundations of Large Language Models (LLMs)

Large Language Models are Transformer-based models with billions of parameters, trained on massive text corpora. They represent the current frontier of AI capability. Understanding their architecture and training is essential for modern AI engineers.

The Pretraining Objective — Next Token Prediction: GPT-style models are trained on a deceptively simple objective: given the previous tokens in a sequence, predict the next token. This is called autoregressive or causal language modeling. Trained on hundreds of billions to trillions of tokens of text, much of it from the internet, the model must implicitly learn grammar, facts, reasoning patterns, programming, mathematics, and world knowledge, because all of these help predict the next token accurately.
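The objective can be made concrete with a toy sequence. This sketch shows how one sentence yields several (context, next token) training pairs and how the loss at a single position is the negative log-probability of the correct next token. The probability table is a hypothetical model output invented for illustration, not the result of any real model.

```python
import math

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Every prefix of the sequence is a context; the token that follows is its target.
pairs = [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]
for context, target in pairs:
    print(f"{' '.join(context):<20} -> {target}")

def cross_entropy(predicted_probs, target):
    """Negative log-probability the model assigned to the correct next token."""
    return -math.log(predicted_probs[target])

# Hypothetical model output for the context "the cat" over a three-word vocabulary.
probs = {"sat": 0.7, "ran": 0.2, "mat": 0.1}
loss_if_sat = cross_entropy(probs, "sat")  # the true next token was likely under the model
loss_if_mat = cross_entropy(probs, "mat")  # the true next token was unlikely under the model
print(f"loss when target is 'sat': {loss_if_sat:.3f}")
print(f"loss when target is 'mat': {loss_if_mat:.3f}")
```

Training averages this loss over every position in every sequence, so a six-token sentence contributes five prediction problems.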

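Emergent Capabilities: LLMs exhibit capabilities they were not explicitly trained for. Chain-of-thought reasoning, in-context learning (learning from examples in the prompt), instruction following, and code generation became markedly stronger as models grew in parameters, data, and compute. Average loss improves smoothly and fairly predictably with scale, yet specific abilities can appear to switch on abruptly; how much of that abruptness is real and how much is an artifact of the evaluation metric is still debated. This emergence is both fascinating and poorly understood.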

RLHF — Aligning LLMs with Human Preferences: A raw pretrained LLM simply continues the statistical patterns of its training data, so its output can be harmful, biased, or unhelpful. Reinforcement Learning from Human Feedback (RLHF) fine-tunes the model toward being helpful, harmless, and honest, in three steps. First, human raters compare and rank candidate responses to the same prompt. Second, a reward model is trained to predict those preferences. Third, the LLM is fine-tuned with reinforcement learning to maximize the reward model's score. RLHF and closely related preference-based methods are a central part of how assistants such as ChatGPT, Claude, and Gemini are aligned.
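The text does not say which loss the reward model is trained with. One common choice, assumed here, is a pairwise (Bradley-Terry style) loss: for each comparison, the reward of the response the rater preferred should exceed the reward of the one they rejected. The sketch below works on scalar rewards that are hypothetical numbers; in practice they would come from a neural reward model.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Hypothetical reward-model scores for three human comparisons.
comparisons = [
    ("ranks the pair correctly", 2.0, 0.0),
    ("cannot tell them apart",   0.0, 0.0),
    ("ranks the pair backwards", 0.0, 2.0),
]
for label, r_chosen, r_rejected in comparisons:
    loss = pairwise_preference_loss(r_chosen, r_rejected)
    print(f"{label:<26} loss = {loss:.3f}")
```

The loss shrinks as the preferred response's score pulls ahead, equals log 2 when the two scores tie, and grows when the ranking is reversed, which is the signal that pushes the reward model toward the raters' ordering.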
