Artificial Intelligence | VoidX Academy

9. Computer Vision

Module 09: Computer Vision

Teaching Machines to See

Computer vision is the field of AI concerned with enabling machines to interpret and understand visual information from the world. For most of history, interpreting images was something only biological vision could do well. In 2012, a convolutional neural network called AlexNet won the ImageNet classification challenge by a wide margin and set off the deep learning era in vision. Today, CV systems perform quality control in factories, help diagnose disease from medical images, support navigation in autonomous vehicles, and power face unlock on your phone, and on some narrow benchmark tasks they match or exceed human accuracy.

🖼️ Image Representation and Preprocessing

A digital image is a matrix of pixel values. A grayscale image is a 2D matrix (height × width). A color image is a 3D tensor (height × width × 3) where the three channels represent Red, Green, and Blue intensity (0–255 for uint8, or 0.0–1.0 when normalized).

from PIL import Image
import torchvision.transforms as T
import torch

img = Image.open('cat.jpg').convert('RGB')  # force 3 channels so Normalize below also works for grayscale/RGBA files
print(f"Size: {img.size}, Mode: {img.mode}")  # (width, height), 'RGB'

transform = T.Compose([
    T.Resize((224, 224)),               # resize to standard input size
    T.RandomHorizontalFlip(p=0.5),      # augmentation: flip randomly
    T.RandomRotation(degrees=15),        # augmentation: rotate ±15°
    T.ColorJitter(brightness=0.2,        # augmentation: vary colors
                  contrast=0.2, 
                  saturation=0.2),
    T.ToTensor(),                        # PIL → torch.Tensor (H,W,C) → (C,H,W), scales 0-255 to 0.0-1.0
    T.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet statistics
                std=[0.229, 0.224, 0.225])     # normalize to match pretrained model expectations
])

tensor = transform(img)
print(f"Tensor shape: {tensor.shape}")  # torch.Size([3, 224, 224])
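
To make the array layout concrete, here is a minimal NumPy sketch that builds a tiny synthetic image by hand (so no file is needed) and repeats the scaling, channel reordering, and normalization steps that ToTensor and Normalize perform in the pipeline above. The 2×2 image and the variable names are illustrative assumptions; the mean and std values are the ImageNet statistics already used in this section.

```python
import numpy as np

# A tiny synthetic 2x2 RGB image in (height, width, channels) layout, uint8 0-255.
# The pixel values are made up purely for illustration.
img_hwc = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [255, 255, 255]]],
    dtype=np.uint8,
)

# A grayscale version is a plain 2D matrix (height, width): average the channels.
gray = img_hwc.mean(axis=2)

# Scale to 0.0-1.0 and move channels first: (H, W, C) -> (C, H, W).
scaled = img_hwc.astype(np.float32) / 255.0
img_chw = scaled.transpose(2, 0, 1)

# Per-channel normalization with the ImageNet statistics used above.
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(3, 1, 1)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(3, 1, 1)
normalized = (img_chw - mean) / std

print(img_hwc.shape, gray.shape, img_chw.shape)  # (2, 2, 3) (2, 2) (3, 2, 2)
print(normalized[0, 0, 0])  # red channel of the top-left pixel after normalization
```

Note that normalized values are no longer confined to 0.0-1.0; they are centered around zero, which is what ImageNet-pretrained models expect.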

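Data Augmentation: Artificially increases the effective size and diversity of training data by applying random transformations. It is critical for CV because it reduces overfitting and improves robustness to real-world variation (different lighting, angles, scales). Common label-preserving augmentations: horizontal flip, random crop, color jitter, rotation, Gaussian blur. CutMix and Mixup go a step further and blend two images, so their labels must be blended in the same proportion. Apply random augmentations to the training set only; validation and test images should get just the deterministic resize, tensor conversion, and normalization.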
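
As a rough illustration of the difference between an augmentation that leaves the label alone and one that mixes labels, the sketch below implements a horizontal flip and a simplified Mixup in NumPy. The function names, the fixed mixing weight, and the tiny random arrays are assumptions for the example; in practice the mixing weight is usually sampled randomly, and library implementations handle the details.

```python
import numpy as np

def horizontal_flip(img_hwc):
    """Mirror an (H, W, C) image left to right; the label is unchanged."""
    return img_hwc[:, ::-1, :]

def mixup(img_a, label_a, img_b, label_b, lam):
    """Blend two images and their one-hot labels with weight lam in [0, 1]."""
    mixed_img = lam * img_a + (1.0 - lam) * img_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_img, mixed_label

rng = np.random.default_rng(0)
cat = rng.random((4, 4, 3))          # stand-in for a "cat" image, values in [0, 1)
dog = rng.random((4, 4, 3))          # stand-in for a "dog" image
cat_label = np.array([1.0, 0.0])     # one-hot: [cat, dog]
dog_label = np.array([0.0, 1.0])

flipped = horizontal_flip(cat)       # still a cat: label stays [1, 0]
mixed, mixed_label = mixup(cat, cat_label, dog, dog_label, lam=0.7)

print(flipped.shape)   # (4, 4, 3)
print(mixed_label)     # [0.7 0.3]
```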

🔬 Convolutional Neural Networks (CNNs)

Standard fully-connected layers don't scale to images. A 224×224 color image has 50,176 pixels and, with three channels, 150,528 input values. Connecting each to even 1,000 hidden units requires about 150 million parameters for a single layer, which is wasteful to compute and highly prone to overfitting.
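
The arithmetic can be checked in a few lines. The sketch below counts the weights and biases of one fully-connected layer on a flattened 224×224 RGB image and compares it with a single 3×3 convolution producing 32 feature maps. The layer sizes are illustrative choices and the helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully-connected layer."""
    return n_inputs * n_units + n_units

def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a square 2D convolution."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

height, width, channels = 224, 224, 3
pixels = height * width                 # 50,176 pixels
input_values = pixels * channels        # 150,528 values across R, G, B

fc = dense_params(input_values, 1000)   # one hidden layer with 1,000 units
conv = conv2d_params(3, 32, 3)          # one 3x3 conv layer, 3 -> 32 channels

print(f"{pixels:,} pixels, {input_values:,} input values")
print(f"fully-connected: {fc:,} parameters")    # 150,529,000
print(f"3x3 convolution: {conv:,} parameters")  # 896
```

The convolution's parameter count does not depend on the image size at all, which is the practical payoff of the ideas described next.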

CNNs solve this with three key ideas:

Local Connectivity: Each neuron connects only to a small local region of the input (the receptive field), not the entire input. Edges, textures, and shapes are local—this is a valid inductive bias for vision.
Weight Sharing: The same filter (kernel) is applied at every location across the entire image. A filter that detects horizontal edges detects them everywhere. This reduces parameters by orders of magnitude and makes the layer translation-equivariant: shifting the input shifts the feature map by the same amount. Pooling layers then add a degree of translation invariance.
Hierarchical Features: Early layers detect low-level features (edges, colors, textures). Middle layers detect mid-level features (corners, curves, patterns). Deep layers detect high-level features (faces, objects, scenes). This hierarchy loosely mirrors how the biological visual cortex processes images.
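
A minimal NumPy sketch of local connectivity and weight sharing follows. It slides one small kernel over a grayscale image (a naive loop, far slower than a real library implementation) and then checks that shifting the input shifts the response by the same amount. The function name, the toy image, and the simple edge kernel are assumptions made for the illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D cross-correlation (what deep learning libraries call convolution); no padding, stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Local connectivity: each output sees only a kh x kw patch.
            # Weight sharing: the same kernel is reused at every (i, j).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 6x6 image: dark top half, bright bottom half -> one horizontal edge.
image = np.zeros((6, 6))
image[3:, :] = 1.0

# A simple 2x1 kernel that responds where brightness increases downward.
kernel = np.array([[-1.0], [1.0]])

response = conv2d_valid(image, kernel)

# Shift the image content down by one row (the edge moves from row 3 to row 4).
shifted = np.zeros((6, 6))
shifted[4:, :] = 1.0
shifted_response = conv2d_valid(shifted, kernel)

print(response[:, 0])          # [0. 0. 1. 0. 0.]
print(shifted_response[:, 0])  # [0. 0. 0. 1. 0.]
```

The kernel has only two weights regardless of image size, and the peak in the response moves down by exactly one row when the edge does.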

import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # (B, 32, 224, 224)
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                            # (B, 32, 112, 112)
            
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # (B, 64, 112, 112)
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                            # (B, 64, 56, 56)
            
            nn.Conv2d(64, 128, kernel_size=3, padding=1), # (B, 128, 56, 56)
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4)),                  # (B, 128, 4, 4)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )
    
    def forward(self, x):
        return self.classifier(self.features(x))
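
The shape comments in SimpleCNN can be reproduced with the standard output-size formula for convolution and pooling layers, floor((n + 2 * padding - kernel) / stride) + 1. The sketch below traces the spatial size of a 224×224 input through the feature extractor. The helper name is hypothetical, and the final adaptive pooling step is simply set to its target size of 4.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

size = 224
trace = [size]

for block in range(3):
    size = out_size(size, kernel=3, stride=1, padding=1)   # 3x3 conv with padding=1 keeps the size
    trace.append(size)
    if block < 2:
        size = out_size(size, kernel=2, stride=2)          # 2x2 max pool halves the size
        trace.append(size)

size = 4                   # AdaptiveAvgPool2d((4, 4)) always outputs 4x4
trace.append(size)

flattened = 128 * size * size   # channels * height * width fed to the first Linear layer

print(trace)       # [224, 224, 112, 112, 56, 56, 4]
print(flattened)   # 2048
```

Because the last pooling layer is adaptive, the classifier's input size stays at 2048 even if the input resolution changes.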

🚀 Transfer Learning — Standing on Giants

Training a CNN from scratch on ImageNet (about 1.4M images, 1000 classes) can take days to weeks of GPU time, depending on the hardware. Transfer learning circumvents this by using weights pretrained on ImageNet as the starting point. These weights encode rich visual representations (edges, textures, object parts) learned from more than a million diverse images. You fine-tune the pretrained network on your specific dataset, which requires far less data and compute.

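import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 5  # placeholder: set to the number of classes in your dataset

# Load ImageNet-pretrained weights (the older pretrained=True flag is deprecated since torchvision 0.13)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Stage 1: freeze the backbone so only the new head is trained
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head; newly created layers have requires_grad=True by default
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, num_classes)
)

params_to_update = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(params_to_update, lr=1e-3)

# ... train the head for a few epochs here ...

# Stage 2: unfreeze the last residual stage and fine-tune it with a much lower learning rate
for param in model.layer4.parameters():
    param.requires_grad = True
optimizer.add_param_group({'params': model.layer4.parameters(), 'lr': 1e-5})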

Transfer learning strategy: Freeze all layers → train only the new head → unfreeze the last few blocks → fine-tune with a low learning rate. This prevents "catastrophic forgetting"—the pretrained representations being destroyed by learning your specific task.
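
The two-stage schedule can be sanity-checked by counting trainable parameters at each step. The sketch below uses a tiny stand-in network instead of ResNet-50 so that it runs without downloading weights. The module layout, the helper count_trainable, and the layer sizes are assumptions for illustration, while the learning rates mirror the ones used above.

```python
import torch
import torch.nn as nn

def count_trainable(module):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Hypothetical stand-in for a pretrained network: a "backbone" plus a new "head".
backbone = nn.Sequential(nn.Linear(8, 4), nn.ReLU())
head = nn.Linear(4, 2)
model = nn.Sequential(backbone, head)

# Stage 1: freeze the backbone and optimize only the head.
for p in backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
stage1_trainable = count_trainable(model)

# Stage 2: unfreeze the backbone and give it a much smaller learning rate.
for p in backbone.parameters():
    p.requires_grad = True
optimizer.add_param_group({"params": backbone.parameters(), "lr": 1e-5})
stage2_trainable = count_trainable(model)

print(stage1_trainable, stage2_trainable)          # 10 46
print([g["lr"] for g in optimizer.param_groups])   # [0.001, 1e-05]
```

Printing these counts before each training stage is a cheap way to confirm that the layers you meant to freeze are actually frozen.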

📦 Object Detection — Beyond Classification

Image classification says "this image contains a cat." Object detection says "there is a cat at coordinates (x=120, y=85, w=150, h=200) with 94% confidence, and a dog at (x=300, y=100, w=200, h=180) with 87% confidence." It localizes and classifies multiple objects simultaneously.
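
Detectors report boxes in more than one convention, and the overlap between a predicted box and a reference box is usually scored with intersection over union (IoU). The following sketch converts the (x, y, w, h) boxes from the example above into corner format and computes IoU. It assumes (x, y) is the top-left corner, which the text does not specify, and the helper names are hypothetical.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, w, h), with (x, y) assumed top-left, to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

cat = xywh_to_xyxy((120, 85, 150, 200))   # (120, 85, 270, 285)
dog = xywh_to_xyxy((300, 100, 200, 180))  # (300, 100, 500, 280)

print(cat, dog)
print(iou(cat, dog))   # 0.0, the two boxes do not overlap
print(iou(cat, cat))   # 1.0, a box overlaps itself completely
```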

Key Architectures:

YOLO (You Only Look Once): Single-pass detection. The entire image is processed once to predict all boxes and classes simultaneously, which makes it fast enough for real-time inference. YOLOv8 from Ultralytics is a widely used choice for real-time object detection in production applications.
Faster R-CNN: Two-stage detector. A Region Proposal Network first proposes candidate regions, then a second stage classifies each region and refines its box. Traditionally more accurate but slower than single-pass detectors, so it is used when accuracy matters more than speed.
DETR (Detection Transformer): Applies Transformers to object detection. It treats detection as set prediction with learned object queries, which removes the need for anchor boxes and non-maximum suppression. DETR and its successors reach competitive accuracy on standard benchmarks such as COCO.

from ultralytics import YOLO

model = YOLO('yolov8n.pt')   # nano model — fastest
results = model('image.jpg')

for result in results:
    boxes = result.boxes
    for box in boxes:
        cls = model.names[int(box.cls)]
        conf = float(box.conf)
        coords = box.xyxy[0].tolist()  # [x1, y1, x2, y2]
        print(f"{cls}: {conf:.2%} at {[int(c) for c in coords]}")

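# Fine-tune on your own data; custom_dataset.yaml is a dataset config listing image paths and class names
model.train(data='custom_dataset.yaml', epochs=100, imgsz=640, batch=16)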
