Data Science | VoidX Academy

9. Machine Learning Core

Module 09: ML Core

Supervised vs Unsupervised, Training, and Evaluation

Machine learning is the practice of building systems that learn patterns from data to make predictions or find structure. The gap between a model that works in a notebook and one that works in production comes down largely to evaluation: choosing the right metrics, avoiding data leakage, understanding generalization, and communicating uncertainty. This module covers the foundational ML framework that applies across algorithms and applications.

🗺️ The ML Algorithm Map

The right algorithm choice depends on: the type of target variable, the size and dimensionality of your data, interpretability requirements, and the prediction task.

Supervised Learning: You have labeled data (X → y). Regression (predict a number), Classification (predict a category).
Unsupervised Learning: No labels. Clustering (find natural groups), Dimensionality Reduction (compress features), Anomaly Detection (find outliers).
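Semi-Supervised: A small amount of labeled data + a large amount of unlabeled data. A related but distinct idea is self-supervised learning, where the labels are derived from the data itself; LLM pretraining (predicting the next token) is the best-known example.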
Reinforcement Learning: Agent learns optimal actions through rewards. Not covered in this track but used in recommendation systems and game AI.
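
To make the first two categories concrete, here is a minimal sketch on synthetic data (not the course dataset). The blob centers, sample counts, and variable names are assumptions chosen for illustration. The classifier is trained with the labels; the clustering model sees only the features and has to find the groups itself.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, adjusted_rand_score
from sklearn.model_selection import train_test_split

# Synthetic, well-separated 2-D data (illustrative only)
X, y = make_blobs(
    n_samples=300, centers=[[-3, -3], [3, 3]], cluster_std=1.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Supervised: the model is given y and learns the mapping X -> y
clf = LogisticRegression().fit(X_train, y_train)
supervised_accuracy = accuracy_score(y_test, clf.predict(X_test))

# Unsupervised: the model never sees y; it only groups similar rows of X
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Cluster ids are arbitrary, so compare groupings with a label-permutation-invariant score
cluster_agreement = adjusted_rand_score(y, kmeans.labels_)

print(f'Supervised test accuracy:         {supervised_accuracy:.3f}')
print(f'Clustering agreement with labels: {cluster_agreement:.3f}')
```

The adjusted Rand index is used for the clustering result because cluster 0 and cluster 1 carry no inherent meaning; plain accuracy would depend on which id the algorithm happened to assign to which group.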

⚙️ The Training Pipeline

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

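# Load data and separate features from the target
# Assumes the remaining columns are numeric; encode categorical columns before scaling
df = pd.read_csv('churn_dataset.csv')
X = df.drop(columns='churned')
y = df['churned']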

# CRITICAL: Split FIRST, then fit any transformers on training data only
# This prevents data leakage from test set statistics into the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y  # maintain class distribution in both sets
)
print(f'Train: {X_train.shape}, Test: {X_test.shape}')
print(f'Train churn rate: {y_train.mean():.3f}, Test: {y_test.mean():.3f}')

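# Use a Pipeline to prevent leakage: inside cross-validation the scaler is
# re-fit on each training fold only, never on the held-out fold.
# Tree ensembles do not need scaled inputs; the scaler is kept here to show
# the pattern, which matters for linear models, SVMs, and k-NN.
model_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1))
])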

# Cross-validation for reliable performance estimate
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model_pipeline, X_train, y_train,
                            cv=cv, scoring='roc_auc', n_jobs=-1)
print(f'\n5-Fold CV ROC-AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}')

# Final fit and evaluation
model_pipeline.fit(X_train, y_train)
y_pred = model_pipeline.predict(X_test)
y_prob = model_pipeline.predict_proba(X_test)[:, 1]

print('\n=== TEST SET EVALUATION ===')
print(classification_report(y_test, y_pred, target_names=['Retained', 'Churned']))
print(f'ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}')
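
Why the "split first, fit transformers inside the pipeline" rule matters can be shown with a small experiment. The sketch below is an illustration under stated assumptions, not part of the churn workflow: the features are pure noise and the labels are unrelated to them, so an honest estimate should sit near 50% accuracy. Feature selection (`SelectKBest`) stands in for any data-dependent preprocessing step; the array sizes and `k=20` are arbitrary choices.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # pure noise features
y = np.repeat([0, 1], 50)          # labels carry no relationship to X

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# LEAKY: features are selected using ALL rows, including those that
# later serve as validation folds, so the folds are no longer unseen.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=cv
)

# HONEST: selection lives inside the pipeline, so it is re-fit on the
# training portion of every fold and never sees the held-out rows.
honest_pipeline = Pipeline([
    ('select', SelectKBest(f_classif, k=20)),
    ('model', LogisticRegression(max_iter=1000)),
])
honest_scores = cross_val_score(honest_pipeline, X, y, cv=cv)

print(f'Leaky CV accuracy:  {leaky_scores.mean():.3f}')
print(f'Honest CV accuracy: {honest_scores.mean():.3f}')
```

The leaky estimate comes out well above chance even though there is nothing to learn, while the pipeline version stays close to it. The same mechanism applies, usually less dramatically, to scalers, imputers, and encoders fitted before the split.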

📊 The Evaluation Metrics Playbook

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, confusion_matrix,
    mean_squared_error, mean_absolute_error, r2_score
)
import numpy as np

# Classification Metrics (binary labels encoded as 0/1, with 1 as the positive class)
def evaluate_classifier(y_true, y_pred, y_prob, positive_label='Churn'):
    # labels=[0, 1] guarantees a 2x2 matrix even if one class is missing
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # Specificity is undefined when there are no actual negatives
    specificity = tn / (tn + fp) if (tn + fp) > 0 else float('nan')

    print(f'=== CLASSIFICATION EVALUATION (positive class: {positive_label}) ===')
    print(f'Accuracy:          {accuracy_score(y_true, y_pred):.4f}')
    print(f'Precision:         {precision_score(y_true, y_pred):.4f}  (of predicted positives, how many are correct?)')
    print(f'Recall/Sensitivity:{recall_score(y_true, y_pred):.4f}  (of actual positives, how many were found?)')
    print(f'Specificity:       {specificity:.4f}  (of actual negatives, how many were correctly identified?)')
    print(f'F1 Score:          {f1_score(y_true, y_pred):.4f}  (harmonic mean of precision and recall)')
    print(f'ROC-AUC:           {roc_auc_score(y_true, y_prob):.4f}  (ranking quality, threshold-independent)')
    print(f'PR-AUC:            {average_precision_score(y_true, y_prob):.4f}  (average precision; more informative than ROC-AUC on imbalanced data)')
    print('\nConfusion Matrix:')
    print(f'  True Positives:  {tp} | False Positives: {fp}')
    print(f'  False Negatives: {fn} | True Negatives:  {tn}')

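# Regression Metrics
def evaluate_regressor(y_true, y_pred):
    # Convert to arrays so pandas index alignment cannot silently reorder the errors
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    # MAPE is undefined where the true value is 0, so those rows are excluded
    nonzero = y_true != 0
    mape = np.mean(np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])) * 100

    print('=== REGRESSION EVALUATION ===')
    print(f'RMSE: {np.sqrt(mse):.4f}  (in same units as target, penalizes large errors)')
    print(f'MAE:  {mae:.4f}  (in same units, more robust to outliers than RMSE)')
    print(f'MAPE: {mape:.2f}%  (percentage error, computed over nonzero targets only)')
    print(f'R²:   {r2:.4f}  (proportion of variance explained by model)')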
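
Two of the notes in the metric printouts above are easier to trust with numbers in front of you. The following sketch uses small hand-made arrays (assumed values, not real churn or regression data) to show, first, why accuracy misleads on imbalanced classes and, second, how a single large error moves RMSE more than MAE.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, recall_score, f1_score, average_precision_score,
    mean_absolute_error, mean_squared_error,
)

# Part 1: a "model" that always predicts the majority class (5% positives)
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)   # never predicts the positive class
y_score = np.zeros(100)             # constant score, i.e. no ranking ability

acc = accuracy_score(y_true, y_pred)               # 0.95
rec = recall_score(y_true, y_pred)                 # 0.0
f1 = f1_score(y_true, y_pred, zero_division=0)     # 0.0
ap = average_precision_score(y_true, y_score)      # 0.05, equal to the positive rate

print(f'Accuracy: {acc:.2f} | Recall: {rec:.2f} | F1: {f1:.2f} | PR-AUC: {ap:.2f}')

# Part 2: identical predictions except for one large miss
targets = np.full(5, 10.0)
clean_pred = np.array([11.0, 9.0, 11.0, 9.0, 10.0])
outlier_pred = np.array([11.0, 9.0, 11.0, 9.0, 30.0])

mae_clean = mean_absolute_error(targets, clean_pred)             # 0.8
rmse_clean = np.sqrt(mean_squared_error(targets, clean_pred))    # ~0.894
mae_outlier = mean_absolute_error(targets, outlier_pred)         # 4.8
rmse_outlier = np.sqrt(mean_squared_error(targets, outlier_pred))  # ~8.989

print(f'Clean:   MAE={mae_clean:.3f}  RMSE={rmse_clean:.3f}')
print(f'Outlier: MAE={mae_outlier:.3f}  RMSE={rmse_outlier:.3f}')
```

In the first part, 95% accuracy coexists with zero recall: the model finds no positive cases at all. In the second, one error of 20 raises MAE sixfold but RMSE roughly tenfold, which is what "penalizes large errors" means in practice.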
