7. Unsupervised Learning
Finding Structure Without Labels
Most data in the world is unlabeled. Getting humans to label data is expensive, slow, and sometimes impossible: you cannot label every network packet for intrusion detection, every astronomical object for classification, or every customer behavior pattern for segmentation. Unsupervised learning discovers structure, patterns, and representations in data without requiring a single label. It is also increasingly the engine behind foundation models: GPT's pretraining was a form of self-supervised learning on billions of text tokens, in which the training signal (the next token) comes from the text itself rather than from human annotation.
🎯 K-Means Clustering
K-Means partitions data into K clusters by alternating two steps: assign each point to the nearest cluster center (centroid), then recompute each centroid as the mean of its assigned points. The loop repeats until the assignments stop changing. It is simple, fast, and widely used for customer segmentation, image compression, and data summarization.
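The assign-then-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `kmeans`, the random-point initialization, the iteration cap, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic data for illustration: two well-separated 2-D blobs.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

On data this cleanly separated, each blob should end up in its own cluster. Real data is rarely this forgiving, which is why library implementations add smarter initialization and multiple restarts.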
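Choosing K with the Elbow Method: Plot inertia (the within-cluster sum of squared distances) for a range of K, for example 1 to 10. Inertia tends to keep falling as K grows, so look for the "elbow", the point where it stops decreasing sharply, as a reasonable choice of K rather than a guaranteed optimum. The silhouette score provides a second evaluation metric: it ranges from -1 to 1, is defined only for K of at least 2, and values close to 1 indicate tight, well-separated clusters.
K-Means Limitations: K-Means assumes roughly spherical clusters of similar size, so it fails on elongated, non-convex, or very differently sized clusters. It is also sensitive to initialization and to outliers. Run it with multiple random initializations and keep the best result (for example, n_init=10 in scikit-learn).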
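As a rough sketch of this procedure, the snippet below fits scikit-learn's `KMeans` for K from 1 to 10 and records inertia and silhouette score for each. The three-blob synthetic dataset, the random seeds, and the variable names are assumptions chosen for illustration; in practice you would plot `inertias` against K and inspect the curve by eye.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data for illustration: three 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 0], [3, 6]])
X = np.vstack([rng.normal(c, 0.6, size=(60, 2)) for c in centers])

inertias, silhouettes = {}, {}
for k in range(1, 11):
    # n_init=10 reruns K-Means from 10 random starts and keeps the best fit.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    # The silhouette score needs at least two clusters.
    if k >= 2:
        silhouettes[k] = silhouette_score(X, model.labels_)

# K with the highest silhouette score among those tried.
best_k = max(silhouettes, key=silhouettes.get)

for k in inertias:
    print(k, round(inertias[k], 1), round(silhouettes.get(k, float("nan")), 3))
print("best K by silhouette:", best_k)
```

The two metrics answer slightly different questions: inertia only measures compactness, while silhouette also rewards separation, so it is worth checking that they point to a similar K.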
🌲 Hierarchical Clustering
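Hierarchical clustering builds a tree (dendrogram) of nested clusters, either by progressively merging the closest clusters starting from individual points (agglomerative) or by recursively splitting one all-inclusive cluster (divisive). Unlike K-Means, you don't need to specify K in advance: you can cut the dendrogram at any level to produce the number of clusters you want.
Linkage Methods (how the distance between two clusters is measured when deciding which to merge):
- Ward: merges the pair of clusters that gives the smallest increase in within-cluster variance. Best for compact clusters.
- Complete: uses the maximum distance between any two points in the two clusters. Good for well-separated groups.
- Average: uses the mean distance over all pairs of points in the two clusters.
- Single: uses the minimum distance between any two points in the two clusters. Tends to produce chains.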
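To make the "build once, cut anywhere" idea concrete, here is a small sketch using SciPy's `linkage` and `fcluster`. The synthetic three-blob data and the choice to cut at 2 and 3 clusters are assumptions for the example; the four linkage names are the ones listed above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data for illustration: three compact 2-D blobs of 30 points each.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([0, 0], 0.4, size=(30, 2)),
    rng.normal([4, 0], 0.4, size=(30, 2)),
    rng.normal([2, 4], 0.4, size=(30, 2)),
])

# Build the agglomerative merge tree once with Ward linkage.
# Z has one row per merge: the two clusters merged, their distance, and the new size.
Z = linkage(X, method="ward")

# Cut the same tree at different levels to get different numbers of clusters.
labels_2 = fcluster(Z, t=2, criterion="maxclust")
labels_3 = fcluster(Z, t=3, criterion="maxclust")

# Compare cluster sizes across linkage methods for a 3-cluster cut.
for method in ["ward", "complete", "average", "single"]:
    labels = fcluster(linkage(X, method=method), t=3, criterion="maxclust")
    # fcluster labels start at 1, so drop the empty count for label 0.
    print(method, np.bincount(labels)[1:])
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the tree itself, which is often the most useful way to decide where to cut.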
📉 Principal Component Analysis (PCA)
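PCA is one of the most widely used dimensionality reduction techniques in AI. It projects high-dimensional data onto a smaller set of orthogonal directions (principal components) that capture maximum variance. It can reduce noise, remove redundant correlated features, enable visualization of high-dimensional data, and speed up downstream model training.
How PCA Works: Center the data by subtracting each feature's mean, compute the covariance matrix of the features, and find its eigenvectors and eigenvalues. The eigenvectors are the principal components (directions of maximum variance), and each eigenvalue is the variance along its component. Sorting by eigenvalue in descending order and projecting the data onto the top components gives the reduced representation. The first principal component explains the most variance; each subsequent component is orthogonal to all previous ones and explains the most remaining variance. Because PCA is driven by variance, features on very different scales are usually standardized first.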
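The covariance-and-eigenvector recipe can be written out directly in NumPy. The sketch below is illustrative only: the function name `pca`, the synthetic data (three features, two of them nearly identical), and the choice of two components are assumptions, and library implementations typically use an SVD instead for numerical stability.

```python
import numpy as np

def pca(X, n_components):
    """Minimal PCA sketch via eigen-decomposition of the covariance matrix."""
    # Center each feature so the projection is taken around the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (columns are features).
    cov = np.cov(X_centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reorder so the direction with the largest variance comes first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    # Fraction of total variance captured by each kept component.
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return X_centered @ components, components, explained_ratio

# Synthetic data for illustration: the second feature is almost a copy of the first.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = np.column_stack([
    a,
    a + 0.05 * rng.normal(size=500),
    0.3 * rng.normal(size=500),
])
Z, components, ratio = pca(X, n_components=2)
print("explained variance ratio:", ratio)
```

Because the first two features are strongly correlated, most of the variance should land on the first component, which is the "removes redundant correlated features" effect in miniature.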
🕵️ Anomaly Detection
Anomaly detection identifies unusual patterns that don't conform to expected behavior. Critical for fraud detection, intrusion detection, equipment failure prediction, and quality control—precisely the use cases where labeled examples of anomalies are scarce (because anomalies, by definition, rarely occur).
Key Algorithms:
- Isolation Forest: Builds an ensemble of trees that partition the data using randomly chosen features and split values. Anomalies are isolated with fewer splits (shorter average path length) because they are few and different from the majority. Fast, effective, and scales well to large, high-dimensional datasets.
- Local Outlier Factor (LOF): Compares the local density of each point to that of its nearest neighbors. Points whose density is much lower than their neighbors' are anomalies. Well suited to data whose clusters have different densities, where a single global distance threshold would fail.
- Autoencoder-based Detection (Deep Learning): Train an autoencoder to reconstruct normal data. Anomalies have high reconstruction error because the model never learned to represent them. Powerful for complex, high-dimensional data like network traffic and sensor readings.
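The first two algorithms above are available in scikit-learn, and a short sketch shows how they are typically called. Everything about the data here is an assumption for illustration: a synthetic Gaussian cloud of "normal" points plus three hand-placed outliers, and a `contamination` value of 0.01 picked to roughly match that ratio. On real data the expected anomaly rate is usually unknown and has to be estimated or tuned.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data for illustration: 300 normal points plus 3 injected outliers.
rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(300, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])
X = np.vstack([normal, outliers])

# Isolation Forest: points that are isolated in few random splits score as anomalies.
# contamination is the assumed fraction of anomalies (an assumption, not a known value).
iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
iso_pred = iso.fit_predict(X)  # -1 = anomaly, 1 = normal

# LOF: compares each point's local density with that of its 20 nearest neighbors.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
lof_pred = lof.fit_predict(X)  # -1 = anomaly, 1 = normal

print("Isolation Forest flagged:", np.where(iso_pred == -1)[0])
print("LOF flagged:", np.where(lof_pred == -1)[0])
```

Both detectors return -1 for points they consider anomalous. Comparing the two sets of flagged indices is a cheap sanity check, since points flagged by both methods are usually the strongest candidates for review.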