9. Computer Vision
Teaching Machines to See
Computer vision is the field of AI concerned with enabling machines to interpret and understand visual information from the world. For most of human history, interpreting images was something only biological vision could do. In 2012, a convolutional neural network called AlexNet won the ImageNet classification challenge by a wide margin and set off the deep learning era in vision. Today, CV systems perform quality control in factories, help diagnose disease from medical images, navigate autonomous vehicles, and power face unlock on your phone, in some narrow tasks matching or exceeding human accuracy.
🖼️ Image Representation and Preprocessing
A digital image is a matrix of pixel values. A grayscale image is a 2D matrix (height × width). A color image is a 3D tensor (height × width × 3) where the three channels represent Red, Green, and Blue intensity (0–255 for uint8, or 0.0–1.0 when normalized).
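As a minimal sketch of these shapes and value ranges, the snippet below builds tiny placeholder images with NumPy. The 4×6 size and the all-red fill are arbitrary choices for illustration, not values from the lesson.

```python
import numpy as np

# Grayscale image: 2D array of shape (height, width), uint8 values in 0..255.
gray = np.zeros((4, 6), dtype=np.uint8)

# Color image: 3D array of shape (height, width, 3), channels ordered R, G, B.
color = np.zeros((4, 6, 3), dtype=np.uint8)
color[..., 0] = 255  # set the red channel to full intensity everywhere

# Normalize uint8 values to floats in [0.0, 1.0].
color_norm = color.astype(np.float32) / 255.0

print(gray.shape, color.shape)   # (4, 6) (4, 6, 3)
print(color_norm[0, 0])          # red pixel after normalization
```

Note that some libraries store images as (channels, height, width) instead, so it is worth checking the layout a given framework expects.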
Data Augmentation: Artificially increases the effective size and diversity of training data by applying random transformations that keep the label valid. It is critical for CV because it reduces overfitting and improves robustness to real-world variation (different lighting, angles, scales). Common augmentations: horizontal flip, random crop, color jitter, rotation, Gaussian blur, and the mixing methods CutMix and Mixup, which blend two images and blend their labels in the same proportion.
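Two of the simpler augmentations, horizontal flip and random crop, can be sketched in a few lines of NumPy. The function names, the 32×32 input, and the 24×24 crop size are assumptions made for this example; production pipelines usually rely on a dedicated augmentation library instead.

```python
import numpy as np

def random_horizontal_flip(image, rng, p=0.5):
    """Flip the image left-to-right with probability p."""
    if rng.random() < p:
        return image[:, ::-1]
    return image

def random_crop(image, rng, crop_h, crop_w):
    """Cut a crop_h x crop_w window at a random position."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

def augment(image, rng):
    """Apply flip, then crop. The class label is left unchanged."""
    image = random_horizontal_flip(image, rng)
    return random_crop(image, rng, crop_h=24, crop_w=24)

rng = np.random.default_rng(seed=0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # random stand-in image
augmented = augment(image, rng)
print(augmented.shape)  # (24, 24, 3)
```

Because the transformations are sampled fresh on every call, the same training image yields a different variant each epoch.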
🔬 Convolutional Neural Networks (CNNs)
Standard fully-connected layers don't scale to images. A 224×224 color image has 50,176 pixels and therefore 150,528 input values (224 × 224 × 3). Connecting each input to even 1,000 hidden units requires about 150 million parameters for a single layer, which is wasteful to compute and highly prone to overfitting.
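The parameter count can be checked with a few lines of arithmetic. The variable names are illustrative, and the bias term is an added assumption (one bias per hidden unit).

```python
height, width, channels = 224, 224, 3
hidden_units = 1000

pixels = height * width                 # spatial positions in the image
input_values = pixels * channels        # one value per pixel per color channel
fc_weights = input_values * hidden_units
fc_params = fc_weights + hidden_units   # plus one bias per hidden unit

print(pixels)        # 50176
print(input_values)  # 150528
print(fc_params)     # 150529000
```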
CNNs solve this with three key ideas:
- Local Connectivity: Each neuron connects only to a small local region of the input (the receptive field), not the entire input. Edges, textures, and shapes are local—this is a valid inductive bias for vision.
- Weight Sharing: The same filter (kernel) is applied at every location across the entire image. A filter that detects horizontal edges detects them everywhere. This reduces parameters by orders of magnitude and makes the layer translation-equivariant: shifting the input shifts the feature map by the same amount. Pooling and global aggregation later in the network turn this into approximate translation invariance.
- Hierarchical Features: Early layers detect low-level features (edges, colors, textures). Middle layers detect mid-level features (corners, curves, patterns). Deep layers detect high-level features (faces, objects, scenes). This hierarchy loosely mirrors how the biological visual cortex is thought to process images.
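Local connectivity and weight sharing can be illustrated with a deliberately naive convolution written in NumPy. This is a sketch for intuition only: the function name `conv2d_valid`, the hand-written edge kernel, and the 8×8 test image are assumptions for the example, and real frameworks use far faster implementations with learned kernels.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one kernel over a 2D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            # Local connectivity: each output sees only a kh x kw patch.
            patch = image[i:i + kh, j:j + kw]
            # Weight sharing: the same kernel is reused at every (i, j).
            out[i, j] = np.sum(patch * kernel)
    return out

# Hand-written kernel that responds to a dark-to-bright step going downward.
edge_kernel = np.array([[-1, -1, -1],
                        [ 0,  0,  0],
                        [ 1,  1,  1]], dtype=np.float64)

image = np.zeros((8, 8))
image[4:, :] = 1.0            # bright bottom half: horizontal edge at row 4

shifted = np.zeros((8, 8))
shifted[5:, :] = 1.0          # the same edge, moved down by one row

response = conv2d_valid(image, edge_kernel)
response_shifted = conv2d_valid(shifted, edge_kernel)

print(edge_kernel.size)       # 9 weights, regardless of image size
print(response[:, 0])         # nonzero only in rows next to the edge
print(response_shifted[:, 0]) # same pattern, moved down by one row
```

The kernel has 9 weights whether the image is 8×8 or 224×224, and moving the edge in the input moves the response by the same amount, which is the equivariance property that weight sharing provides.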
🚀 Transfer Learning — Standing on Giants
Training a CNN from scratch on ImageNet (about 1.4M images, 1,000 classes) can take days to weeks of GPU time, depending on the model and hardware. Transfer learning circumvents this by using weights pretrained on ImageNet as the starting point. These weights encode rich visual representations (edges, textures, object parts) learned from over a million diverse images. You fine-tune the pretrained network on your specific dataset, which requires far less data and compute.
Transfer learning strategy: Replace the classification head and freeze all pretrained layers → train only the new head → unfreeze the last few blocks → fine-tune with a low learning rate. This limits "catastrophic forgetting", where large early gradient updates destroy the pretrained representations while the network learns your specific task.
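The staged strategy can be sketched in PyTorch as follows. To keep the example self-contained, the backbone here is a tiny, randomly initialized stand-in (a hypothetical placeholder, not a real pretrained network); in practice you would load pretrained weights from a model library. The class count, learning rates, and choice of SGD are likewise assumptions for illustration.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pretrained backbone (random weights here).
backbone = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),   # block 0
    nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),  # block 1
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
num_classes = 5                      # assumed size of the new task
head = nn.Linear(16, num_classes)    # new, task-specific head
model = nn.Sequential(backbone, head)

def count_trainable(module):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Stage 1: freeze every backbone layer and train only the new head.
for p in backbone.parameters():
    p.requires_grad = False
stage1_optimizer = torch.optim.SGD(head.parameters(), lr=1e-2)
trainable_stage1 = count_trainable(model)
# ... run the usual training loop for a few epochs here ...

# Stage 2: unfreeze the last block and fine-tune with a low learning rate.
for p in backbone[1].parameters():
    p.requires_grad = True
stage2_optimizer = torch.optim.SGD(
    [
        {"params": backbone[1].parameters(), "lr": 1e-4},  # gentle updates
        {"params": head.parameters(), "lr": 1e-3},
    ],
    momentum=0.9,
)
trainable_stage2 = count_trainable(model)

print(trainable_stage1, trainable_stage2)
```

Giving the unfrozen block a smaller learning rate than the head is one common design choice; a single low rate for all trainable parameters is also used.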
📦 Object Detection — Beyond Classification
Image classification says "this image contains a cat." Object detection says "there is a cat at coordinates (x=120, y=85, w=150, h=200) with 94% confidence, and a dog at (x=300, y=100, w=200, h=180) with 87% confidence." It localizes and classifies multiple objects simultaneously.
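A detector's output can be thought of as a list of records like the ones in this example. The sketch below uses a hypothetical `Detection` dataclass and assumes that (x, y) is the top-left corner of the box; some formats use the box center instead, so the convention should always be checked.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    x: float           # assumed: left edge of the box, in pixels
    y: float           # assumed: top edge of the box, in pixels
    w: float           # box width
    h: float           # box height
    confidence: float  # score in [0, 1]

    def corners(self):
        """Return (x_min, y_min, x_max, y_max) under the top-left assumption."""
        return (self.x, self.y, self.x + self.w, self.y + self.h)

def filter_by_confidence(detections, threshold):
    """Keep only detections scoring at or above the threshold."""
    return [d for d in detections if d.confidence >= threshold]

detections = [
    Detection("cat", 120, 85, 150, 200, 0.94),
    Detection("dog", 300, 100, 200, 180, 0.87),
]

confident = filter_by_confidence(detections, threshold=0.90)
print([d.label for d in confident])   # ['cat']
print(detections[0].corners())        # (120, 85, 270, 285)
```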
Key Architectures:
- YOLO (You Only Look Once): Single-pass detection in which the entire image is processed once to predict all boxes and classes simultaneously. Fast enough for real-time inference on suitable hardware. YOLOv8 is a widely used member of the family for real-time object detection in production applications.
- Faster R-CNN: Two-stage detector: a Region Proposal Network first proposes candidate regions, then a second stage classifies each region and refines its box. Typically more accurate but slower than single-stage detectors such as YOLO. Used when accuracy matters more than speed.
- DETR (Detection Transformer): Applies Transformers to object detection, treating it as set prediction with learned object queries. DETR-style models are competitive with, and on some benchmarks ahead of, CNN-based detectors.