Artificial Intelligence | VoidX Academy

6. Supervised Learning Algorithms

Module 06: Supervised Learning

The Algorithm Arsenal

Supervised learning has produced an arsenal of algorithms, each with distinct strengths, weaknesses, and appropriate use cases. A common engineering mistake is to jump directly to deep learning for every problem. Classical ML algorithms (linear regression, decision trees, random forests, SVMs) are typically faster to train, easier to interpret, and less data-hungry, and they often match or outperform deep learning on tabular data. Know all the tools in the arsenal and choose the right one for each job.

📈 Linear Regression

The simplest and most interpretable regression model. It assumes the relationship between the features and the target is linear. Despite its simplicity, linear regression is used in production across finance, economics, and science because each coefficient can be read directly as the change in the prediction per unit change in that feature, with the other features held fixed.

The Model: y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b where w are weights (learned), x are features, and b is the bias term.
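
To make the formula concrete, here is a minimal sketch that evaluates the model by hand for a single example. The weights, bias, and house-price framing are hypothetical values chosen for illustration; they are not learned from any dataset.

```python
def predict(weights, bias, features):
    """Return w1*x1 + ... + wn*xn + b for one example."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical house-price model (illustrative numbers, not fitted).
weights = [150.0, 10000.0]   # price per square foot, price per bedroom
bias = 50000.0
house = [1000.0, 3.0]        # 1000 sq ft, 3 bedrooms

price = predict(weights, bias, house)
print(price)
```

Adding one bedroom to the same house changes the prediction by exactly the bedroom weight, which is what makes the coefficients easy to read.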

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

print("Coefficients:", dict(zip(feature_names, model.coef_)))
print(f"R² Score: {r2_score(y_test, predictions):.4f}")
print(f"RMSE: {mean_squared_error(y_test, predictions)**0.5:.4f}")

ridge = Ridge(alpha=1.0)        # L2 regularization
lasso = Lasso(alpha=1.0)        # L1 regularization (produces sparse weights)

Regularization: Ridge (L2) penalizes large weights, shrinking them toward zero. Lasso (L1) shrinks some weights to exactly zero, performing automatic feature selection. Use Ridge when all features are potentially relevant. Use Lasso when you believe only a subset of features are truly predictive.
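
The difference between the two penalties is easiest to see in a simplified one-weight setting. The sketch below assumes we minimize 0.5*(w - z)^2 plus a penalty, where z is the unregularized weight: with the penalty 0.5*alpha*w^2 (ridge) the solution is z / (1 + alpha), and with alpha*|w| (lasso) it is the soft-thresholding rule. This is an illustration only; scikit-learn scales its objectives differently, so these alpha values are not directly comparable to Ridge(alpha=...) or Lasso(alpha=...).

```python
def ridge_shrink(z, alpha):
    """Minimizer of 0.5*(w - z)**2 + 0.5*alpha*w**2: proportional shrinkage."""
    return z / (1.0 + alpha)


def lasso_shrink(z, alpha):
    """Minimizer of 0.5*(w - z)**2 + alpha*abs(w): soft thresholding."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0   # small weights are set to exactly zero


unregularized = [3.0, 0.5, -0.2, -2.0]   # hypothetical unregularized weights
alpha = 1.0

ridge_weights = [ridge_shrink(z, alpha) for z in unregularized]
lasso_weights = [lasso_shrink(z, alpha) for z in unregularized]

print("ridge:", ridge_weights)
print("lasso:", lasso_weights)
```

Ridge shrinks every weight but leaves all of them nonzero, while lasso zeroes the two small weights and keeps only the large ones.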

🎯 Logistic Regression

Despite the name, Logistic Regression is a classification algorithm. It models the probability that an example belongs to a class, outputting a value between 0 and 1 via the sigmoid function. If the probability exceeds a threshold (usually 0.5), predict class 1; otherwise predict class 0.
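
The prediction rule described above can be sketched in a few lines of plain Python. The weights and feature values are hypothetical; in practice they would come from a fitted model.

```python
import math


def sigmoid(z):
    """Squash a real-valued score into a probability between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_class(weights, bias, features, threshold=0.5):
    """Return (label, probability) for one example."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(score)
    label = 1 if probability > threshold else 0
    return label, probability


# Hypothetical weights and example: score = 2*2 + (-1)*1 + 0 = 3
label, probability = predict_class([2.0, -1.0], 0.0, [2.0, 1.0])
print(label, round(probability, 4))
```

A score of 0 maps to a probability of exactly 0.5, so with the default threshold the sign of the linear score decides the class.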

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

model = LogisticRegression(max_iter=1000, C=1.0)  # C = 1/regularization strength
model.fit(X_train, y_train)

predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, predictions))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, predictions))

print("\nFeature Importance (coefficients):")
for feat, coef in zip(feature_names, model.coef_[0]):
    print(f"  {feat}: {coef:.4f}")

🌳 Decision Trees

Decision trees learn a sequence of if-then rules that partition the feature space. Highly interpretable—you can visualize and explain exactly why the model made each prediction. Prone to overfitting without depth constraints.

from sklearn.tree import DecisionTreeClassifier, export_text

model = DecisionTreeClassifier(max_depth=5, min_samples_split=10)
model.fit(X_train, y_train)

print(export_text(model, feature_names=feature_names))

importances = dict(zip(feature_names, model.feature_importances_))
sorted_importances = sorted(importances.items(), key=lambda x: x[1], reverse=True)
print("Feature Importances:", sorted_importances[:5])

Key Hyperparameters: max_depth controls overfitting—deeper trees memorize training data. min_samples_split requires a minimum number of examples before splitting a node. criterion is the splitting criterion (gini impurity or entropy/information gain).
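
The two splitting criteria are simple functions of the class proportions in a node. The sketch below computes them by hand and scores two candidate splits by their size-weighted impurity; the label lists are made up for illustration, and the helper names are not part of scikit-learn.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1/p) over classes."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count)
               for count in Counter(labels).values())


def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


parent = [0, 0, 0, 1, 1, 1]
print(gini(parent), entropy(parent))                 # maximally mixed node
print(weighted_impurity([0, 0, 0], [1, 1, 1], gini)) # perfect split
print(weighted_impurity([0, 0, 1], [0, 1, 1], gini)) # weaker split
```

A tree learner evaluates candidate splits like these and keeps the one with the lowest weighted impurity (equivalently, the largest impurity reduction relative to the parent).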

🌲 Random Forest

An ensemble of decision trees trained on random subsets of the data (bootstrap sampling) and features. Predictions are aggregated by majority vote (classification) or averaging (regression). Random forests almost universally outperform single decision trees and are one of the most reliably effective algorithms for tabular data.

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb

rf = RandomForestClassifier(
    n_estimators=300,       # more trees = better (up to a point)
    max_features='sqrt',    # features considered per split
    max_depth=None,         # trees grow until pure
    n_jobs=-1,              # use all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)

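xgb_model = xgb.XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=20   # constructor argument (XGBoost >= 1.6); fit() no longer accepts it in 2.0+
)
# X_val, y_val: a held-out validation split used only to decide when to stop
xgb_model.fit(X_train, y_train, eval_set=[(X_val, y_val)])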

Why Random Forests Work: Individual trees are high-variance (overfit easily). Averaging many uncorrelated trees reduces variance while maintaining low bias. The random feature selection at each split ensures trees are decorrelated—different trees make different types of errors, which cancel out when averaged.
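
The effect of decorrelation can be quantified with a standard result: the average of n identically distributed predictions, each with variance s and pairwise correlation rho, has variance rho*s + (1 - rho)*s/n. The sketch below evaluates that formula for a few hypothetical settings; treating trees as identically distributed with a single shared correlation is an idealization.

```python
def ensemble_variance(tree_variance, correlation, n_trees):
    """Variance of the average of n equally correlated predictions."""
    return (correlation * tree_variance
            + (1.0 - correlation) * tree_variance / n_trees)


# Hypothetical single-tree variance of 1.0 and a forest of 100 trees.
for rho in (0.0, 0.5, 1.0):
    print(rho, ensemble_variance(1.0, rho, 100))
```

With uncorrelated trees the variance falls by a factor of 100. With fully correlated trees averaging does nothing, and at a correlation of 0.5 the variance cannot drop below 0.5 no matter how many trees are added. That floor is why random feature selection matters.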

Gradient Boosting (XGBoost, LightGBM, CatBoost): Instead of training trees independently (random forest), gradient boosting trains trees sequentially: each new tree is fit to correct the errors of the ensemble so far. It is often more accurate than a random forest on tabular data, but training is inherently sequential and results are more sensitive to hyperparameters. XGBoost and LightGBM have been among the most frequently used algorithms in winning Kaggle solutions on tabular data.
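
The sequential idea can be sketched for the simplest case: squared-error regression on one feature, with depth-1 trees (stumps) fit to the current residuals. This is a teaching sketch under those assumptions, not how XGBoost or LightGBM are implemented; the data and function names are made up for illustration.

```python
def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


def fit_stump(xs, residuals):
    """Pick the threshold whose left/right means best fit the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for x, p in zip(xs, predictions)]
        history.append(mse(ys, predictions))   # training error after this round
    return base, stumps, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]   # hypothetical step-like target

base, stumps, history = fit_boosting(xs, ys)
baseline_mse = mse(ys, [base] * len(ys))
print(baseline_mse, history[0], history[-1])
```

Each round lowers the training error because the new stump is fit to whatever the current ensemble still gets wrong. The learning rate scales down each correction, which is one reason boosting is sensitive to its hyperparameters.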

⚔️ Support Vector Machines (SVM)

SVMs find the decision boundary (hyperplane) that maximizes the margin between classes. The examples closest to the boundary are called support vectors—they define the boundary. SVMs work well in high-dimensional spaces and are effective when the number of features exceeds the number of samples.

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),   # SVM requires scaling
    ('svm', SVC(kernel='rbf', C=10, gamma='scale', probability=True))
])

svm_pipeline.fit(X_train, y_train)
predictions = svm_pipeline.predict(X_test)

The Kernel Trick: The RBF (Radial Basis Function) kernel implicitly maps data into an infinite-dimensional space where the classes can often be separated by a hyperplane. This lets SVMs find nonlinear decision boundaries without explicitly computing the high-dimensional transformation. Elegant mathematics, but computationally expensive for large datasets, because kernel SVM training time grows faster than linearly with the number of samples.
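
The kernel itself is only a similarity function of two points: k(a, b) = exp(-gamma * ||a - b||^2). The sketch below computes it directly; the points and the gamma value are arbitrary choices for illustration.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF similarity: 1.0 for identical points, decaying toward 0 with distance."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


origin = [0.0, 0.0]
print(rbf_kernel(origin, origin, gamma=0.5))        # identical points
print(rbf_kernel(origin, [1.0, 0.0], gamma=0.5))    # nearby point
print(rbf_kernel(origin, [3.0, 4.0], gamma=0.5))    # distant point
```

The SVM only ever needs these pairwise values, never the coordinates in the implicit feature space. A larger gamma makes similarity fall off faster, which produces more flexible and more overfit-prone boundaries.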

📍 K-Nearest Neighbors (KNN)

KNN makes predictions by finding the K most similar training examples to the query point and taking a majority vote (classification) or average (regression) of their labels. No training phase—all computation happens at prediction time. Simple and surprisingly effective, but scales poorly with dataset size.
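
The whole algorithm fits in a few lines, which makes the trade-off visible: there is nothing to train, but every prediction scans the full training set. This is a minimal sketch using Euclidean distance on a made-up two-cluster dataset.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Distance to every training point: this full scan is the prediction-time cost.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated clusters.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.2, 1.5)))
print(knn_predict(points, labels, (6.5, 6.5)))
```

Because the vote depends on raw distances, a feature with a large numeric range would dominate the result, which is why KNN inputs are normally standardized first.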

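from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# KNN is distance-based, so features must be on comparable scales.
# Scaling inside the pipeline refits the scaler on each CV training fold.
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
param_grid = {'knn__n_neighbors': [3, 5, 7, 11, 15], 'knn__weights': ['uniform', 'distance']}

grid_search = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")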

📊 Model Evaluation Metrics — The Complete Reference

Choosing the wrong evaluation metric is one of the most common production AI failures. Accuracy is almost never sufficient.

Accuracy: (Correct predictions) / (Total predictions). Misleading when classes are imbalanced. On a dataset where only 0.3% of transactions are fraudulent, a model that always predicts "not fraud" achieves 99.7% accuracy and is completely useless.
Precision: Of all the times the model predicted "positive," what fraction was actually positive? Precision = TP / (TP + FP). Optimize for precision when false positives are costly (spam filters—legitimate email in spam folder is very bad).
Recall (Sensitivity): Of all the actual positives, what fraction did the model catch? Recall = TP / (TP + FN). Optimize for recall when false negatives are costly (cancer detection—missing a true cancer case is catastrophic).
F1 Score: Harmonic mean of precision and recall. F1 = 2 × (Precision × Recall) / (Precision + Recall). Use when you need a single metric balancing both. Especially useful for imbalanced classes.
ROC-AUC: Area Under the Receiver Operating Characteristic curve. Measures discriminative ability across all possible thresholds. AUC=1.0 is perfect; AUC=0.5 is random guessing. Threshold-independent—useful for ranking problems.
RMSE (Root Mean Squared Error): For regression. Square root of the average squared error. In the same units as the target. Penalizes large errors heavily.
MAE (Mean Absolute Error): For regression. Average of absolute errors. More robust to outliers than RMSE. Easier to interpret (average error in target units).
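
The formulas above can be checked by hand on small examples. The sketch below uses made-up data: 1000 transactions with 3 frauds for the classification metrics, and four predictions with one large miss for the regression metrics. Returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical fraud data: 3 frauds followed by 997 legitimate transactions.
y_true = [1] * 3 + [0] * 997
always_negative = [0] * 1000                   # never flags anything
flagging_model = [1, 1, 0] + [1, 1] + [0] * 995  # catches 2 frauds, 2 false alarms

lazy = classification_metrics(y_true, always_negative)
useful = classification_metrics(y_true, flagging_model)
print(lazy)
print(useful)

# Hypothetical regression targets with one large error.
regression_true = [10.0, 10.0, 10.0, 10.0]
regression_pred = [9.0, 11.0, 10.0, 20.0]
print(mae(regression_true, regression_pred), rmse(regression_true, regression_pred))
```

Both classifiers have identical accuracy, yet one catches two of the three frauds and the other catches none; only precision, recall, and F1 tell them apart. In the regression example the single large miss pulls RMSE well above MAE.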

from sklearn.metrics import (classification_report, roc_auc_score,
                             average_precision_score, confusion_matrix)
import matplotlib.pyplot as plt
import seaborn as sns

# model: any fitted binary classifier that exposes predict_proba
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the positive class
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Avg Precision: {average_precision_score(y_test, y_prob):.4f}")

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Pred 0','Pred 1'],
            yticklabels=['True 0','True 1'])
plt.show()
