6. Supervised Learning Algorithms
The Algorithm Arsenal
Supervised learning has produced an arsenal of algorithms, each with distinct strengths, weaknesses, and appropriate use cases. A common engineering mistake is to jump directly to deep learning for every problem. Classical ML algorithms (linear regression, decision trees, random forests, SVMs) are typically faster to train, easier to interpret, and less data-hungry, and they often match or outperform deep learning on tabular data problems. Know all the tools in the arsenal and choose the right one for each job.
📈 Linear Regression
The simplest and most interpretable regression model. Assumes the relationship between features and the target is linear. Despite its simplicity, linear regression is used in production across finance, economics, and science because its coefficients are directly interpretable: each coefficient is the expected change in the prediction for a one-unit change in that feature, with the other features held fixed.
The Model: y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b where w are weights (learned), x are features, and b is the bias term.
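As a minimal sketch of the formula above, the prediction is a weighted sum of the features plus the bias. The feature names and the weight values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def predict(weights, features, bias):
    """Return y = w1*x1 + ... + wn*xn + b for a single example."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical model with two features: [square_metres, bedrooms].
weights = [2.0, 10.0]
bias = 50.0

# 2.0 * 100 + 10.0 * 3 + 50.0
price = predict(weights, [100.0, 3.0], bias)
```

Reading the weights directly: in this made-up model, one extra bedroom adds 10.0 to the prediction when the floor area stays the same.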
Regularization: Ridge (L2) penalizes large weights, shrinking them toward zero without eliminating them. Lasso (L1) shrinks some weights to exactly zero, performing automatic feature selection. Use Ridge when all features are potentially relevant. Use Lasso when you believe only a subset of the features is truly predictive.
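The difference between the two penalties is easiest to see with a single feature and no bias term, where both have a closed-form solution. The sketch below assumes the objectives written in the comments (the scaling of the penalty strength `lam` is an assumption and varies between libraries), and the data is made up.

```python
def soft_threshold(value, lam):
    """Move value toward zero by lam, stopping at exactly zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def fit_one_feature(xs, ys, lam=0.0, penalty="none"):
    """Fit y ~ w * x for one feature, with an optional penalty on w.

    Assumed objectives:
      ridge: 0.5 * sum((y - w*x)^2) + 0.5 * lam * w^2
      lasso: 0.5 * sum((y - w*x)^2) + lam * |w|
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    if penalty == "ridge":
        return sxy / (sxx + lam)
    if penalty == "lasso":
        return soft_threshold(sxy, lam) / sxx
    return sxy / sxx


# Made-up data with a weak linear relationship.
xs = [-1.0, 0.0, 1.0]
ys = [-0.2, 0.0, 0.2]

w_plain = fit_one_feature(xs, ys)                        # unregularized fit
w_ridge = fit_one_feature(xs, ys, lam=1.0, penalty="ridge")  # smaller, nonzero
w_lasso = fit_one_feature(xs, ys, lam=1.0, penalty="lasso")  # exactly zero
```

With this penalty strength, Ridge keeps the weak feature with a reduced weight, while Lasso drops it entirely.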
🎯 Logistic Regression
Despite the name, Logistic Regression is a classification algorithm. It models the probability that an example belongs to a class, outputting a value between 0 and 1 via the sigmoid function. If the probability exceeds a threshold (usually 0.5), predict class 1; otherwise predict class 0.
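A minimal sketch of that prediction step, assuming the weights and bias have already been learned. The function names and the example values are illustrative, not taken from any particular library.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_class(weights, features, bias, threshold=0.5):
    """Return (predicted_class, probability_of_class_1)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    predicted = 1 if probability > threshold else 0
    return predicted, probability


# Hypothetical learned parameters for a two-feature model.
label, prob = predict_class([1.5, -0.5], [2.0, 1.0], bias=-1.0)
```

Raising the threshold above 0.5 makes the model more conservative about predicting class 1; lowering it does the opposite.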
🌳 Decision Trees
Decision trees learn a sequence of if-then rules that partition the feature space. Highly interpretable—you can visualize and explain exactly why the model made each prediction. Prone to overfitting without depth constraints.
Key Hyperparameters: max_depth controls overfitting—deeper trees memorize training data. min_samples_split requires a minimum number of examples before splitting a node. criterion is the splitting criterion (gini impurity or entropy/information gain).
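To make the two splitting criteria concrete, here is a small sketch that computes Gini impurity and entropy for a list of class labels, plus the weighted impurity of a candidate split. The helper names are illustrative, and the labels are assumed to be non-empty.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def split_impurity(left, right, measure=gini):
    """Size-weighted impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return (len(left) / n) * measure(left) + (len(right) / n) * measure(right)


parent = ["yes", "yes", "no", "no"]
# A split that separates the classes perfectly leaves zero impurity.
perfect = split_impurity(["yes", "yes"], ["no", "no"])
# Information gain is the drop in entropy from parent to children.
gain = entropy(parent) - split_impurity(["yes", "yes"], ["no", "no"], entropy)
```

A tree builder evaluates many candidate splits this way and keeps the one with the lowest weighted impurity (equivalently, the highest gain).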
🌲 Random Forest
An ensemble of decision trees trained on random subsets of the data (bootstrap sampling) and features. Predictions are aggregated by majority vote (classification) or averaging (regression). Random forests usually generalize better than a single decision tree and are among the most reliably effective algorithms for tabular data.
Why Random Forests Work: Individual trees are high-variance (they overfit easily). Averaging many weakly correlated trees reduces variance while keeping bias low. The random feature selection at each split helps decorrelate the trees: different trees make different types of errors, which partially cancel out when averaged.
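The two ingredients described above, bootstrap sampling and aggregation, can be sketched in a few lines. This is only the scaffolding around the trees, not a tree learner; the function names and the example predictions are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; some repeat, some are left out."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Aggregate class predictions from many trees (classification)."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Aggregate numeric predictions from many trees (regression)."""
    return sum(predictions) / len(predictions)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))
sample = bootstrap_sample(rows, rng)  # training set for one tree

# Hypothetical outputs of three trees for the same query.
forest_class = majority_vote(["spam", "ham", "spam"])
forest_value = average([1.0, 2.0, 3.0])
```

Each tree in the forest would be trained on its own bootstrap sample, which is one reason the trees disagree with each other in useful ways.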
Gradient Boosting (XGBoost, LightGBM, CatBoost): Instead of training trees independently (as a random forest does), gradient boosting trains trees sequentially, and each new tree corrects the errors of the ensemble so far. Well-tuned gradient boosting is often more accurate than a random forest, but training is inherently sequential and the results are more sensitive to hyperparameters. XGBoost and LightGBM have been among the most frequently used algorithms in winning Kaggle solutions for tabular data.
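The sequential idea can be sketched with one-split trees (stumps) on a single feature and squared-error loss, where "correcting the errors" means fitting each new stump to the current residuals. This is a teaching sketch under those assumptions, not how XGBoost, LightGBM, or CatBoost are implemented; the data and default settings are made up.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on one feature, by squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 3.1, 2.9, 5.0, 5.2]
mean_y = sum(ys) / len(ys)

baseline_error = mse(lambda x: mean_y, xs, ys)  # predict the mean everywhere
boosted_error = mse(fit_boosted(xs, ys), xs, ys)
```

The learning rate scales down each stump's contribution, which is one of the hyperparameters that makes boosting more sensitive to tuning than a random forest.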
⚔️ Support Vector Machines (SVM)
SVMs find the decision boundary (hyperplane) that maximizes the margin between classes. The examples closest to the boundary are called support vectors—they define the boundary. SVMs work well in high-dimensional spaces and are effective when the number of features exceeds the number of samples.
The Kernel Trick: The RBF (Radial Basis Function) kernel implicitly maps data into an infinite-dimensional space where classes that are not linearly separable in the original features can become separable. This lets SVMs find nonlinear decision boundaries without explicitly computing the high-dimensional transformation. Elegant mathematics; computationally expensive for large datasets.
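The kernel itself is a short formula, k(x, z) = exp(-gamma * ||x - z||^2), which is why the high-dimensional mapping never has to be computed. The sketch below evaluates it for two points; `gamma` is the usual width parameter, and its default value here is an arbitrary choice for illustration.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Similarity between two points: 1.0 when identical, near 0 when far apart."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])    # identical points
near = rbf_kernel([0.0, 0.0], [1.0, 0.0])    # squared distance 1
far = rbf_kernel([0.0, 0.0], [10.0, 0.0])    # squared distance 100
```

An SVM with this kernel evaluates the function between the query point and each support vector, which is where the cost on large datasets comes from.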
📍 K-Nearest Neighbors (KNN)
KNN makes predictions by finding the K most similar training examples to the query point and taking a majority vote (classification) or average (regression) of their labels. No training phase—all computation happens at prediction time. Simple and surprisingly effective, but scales poorly with dataset size.
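A minimal classification sketch of the procedure above, assuming Euclidean distance as the similarity measure and Python 3.8+ for `math.dist`. The points and labels are made up.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    # All the work happens here, at prediction time: one distance per example.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

low = knn_classify(train_points, train_labels, (0.5, 0.5))
high = knn_classify(train_points, train_labels, (5.5, 5.5))
```

Because every prediction scans the whole training set, the cost grows linearly with the number of stored examples, which is the scaling problem mentioned above.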
📊 Model Evaluation Metrics — The Complete Reference
Choosing the wrong evaluation metric is one of the most common production AI failures. Accuracy is almost never sufficient.
- Accuracy: (Correct predictions) / (Total predictions). Misleading when classes are imbalanced. On a dataset where only 0.3% of transactions are fraudulent, a model that always predicts "not fraud" achieves 99.7% accuracy and is completely useless.
- Precision: Of all the times the model predicted "positive," what fraction was actually positive? Precision = TP / (TP + FP). Optimize for precision when false positives are costly (spam filters: a legitimate email sent to the spam folder may never be read).
- Recall (Sensitivity): Of all the actual positives, what fraction did the model catch? Recall = TP / (TP + FN). Optimize for recall when false negatives are costly (cancer detection—missing a true cancer case is catastrophic).
- F1 Score: Harmonic mean of precision and recall. F1 = 2 × (Precision × Recall) / (Precision + Recall). Use when you need a single metric balancing both. Especially useful for imbalanced classes.
- ROC-AUC: Area Under the Receiver Operating Characteristic curve. Measures discriminative ability across all possible thresholds. AUC=1.0 is perfect; AUC=0.5 is random guessing. Threshold-independent—useful for ranking problems.
- RMSE (Root Mean Squared Error): For regression. Square root of the average squared error. In the same units as the target. Penalizes large errors heavily.
- MAE (Mean Absolute Error): For regression. Average of absolute errors. More robust to outliers than RMSE. Easier to interpret (average error in target units).
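The formulas in this list can be sketched directly in plain Python. The block below is illustrative rather than a replacement for a tested metrics library: reporting an undefined ratio (0/0) as 0.0 is an assumption, ROC-AUC is computed through its pairwise-ranking interpretation, and all data is made up.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_report(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    # Assumption: ratios with a zero denominator are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# 1,000 hypothetical transactions, 3 of them fraudulent (label 1).
fraud_truth = [1] * 3 + [0] * 997
always_negative = [0] * 1000
fraud_report = classification_report(fraud_truth, always_negative)

# One large error among otherwise perfect regression predictions.
regression_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 6.0])
regression_mae = mae([1.0, 2.0, 3.0], [1.0, 2.0, 6.0])
```

In the fraud example the accuracy looks excellent while recall is zero, which is the failure mode the accuracy entry warns about. In the regression example the single large error pulls RMSE well above MAE.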