95 MINS intermediate
Module 08: Feature Engineering
Transforming Raw Variables Into Model Signal
Feature engineering is where domain expertise meets statistical knowledge to create the inputs that make models actually work. A model is only as good as its inputs. In competitions like Kaggle, winning solutions often stand out for their features rather than for a more sophisticated model. In production data science, well-designed features help a model stay useful for years rather than months.
🔧 Encoding Categorical Variables
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
df = pd.read_csv('customer_data.csv')
# Strategy 1: One-Hot Encoding — for low-cardinality nominals
# Use when: < 10-15 unique values, no ordinal relationship
df_ohe = pd.get_dummies(df, columns=['region', 'product_type'],
                        drop_first=True,  # avoid multicollinearity
                        dtype=int)
# Strategy 2: Ordinal Encoding — for ordered categories
# Use when: categories have a meaningful order (Low < Medium < High)
size_order = [['Small', 'Medium', 'Large', 'Enterprise']]
ordinal_enc = OrdinalEncoder(categories=size_order)
df['company_size_encoded'] = ordinal_enc.fit_transform(df[['company_size']])
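To see why the explicit `categories` list matters, here is a minimal sketch on a hypothetical four-row frame (the toy data is an assumption for illustration, not the course dataset). Without an explicit order, `OrdinalEncoder` sorts the categories, which for strings means alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'company_size': ['Small', 'Enterprise', 'Medium', 'Large']})

# Default behaviour: categories are sorted, so the codes follow the alphabet,
# not the business meaning (Enterprise=0, Large=1, Medium=2, Small=3)
default_codes = OrdinalEncoder().fit_transform(toy[['company_size']]).ravel()

# Explicit order: Small < Medium < Large < Enterprise
size_order = [['Small', 'Medium', 'Large', 'Enterprise']]
ordered_codes = (
    OrdinalEncoder(categories=size_order)
    .fit_transform(toy[['company_size']])
    .ravel()
)

print(default_codes.tolist())  # [3.0, 0.0, 2.0, 1.0]
print(ordered_codes.tolist())  # [0.0, 3.0, 1.0, 2.0]
```

Only the second encoding preserves the intended ranking, which is what linear and distance-based models will read from the numbers.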
# Strategy 3: Target Encoding — for high-cardinality categoricals
# Use when: 50+ unique values (city, zip code, product ID)
# Replaces each category with the mean target value for that category
# IMPORTANT: Must be done with cross-validation to prevent leakage
from sklearn.preprocessing import TargetEncoder  # requires scikit-learn 1.3+
# scikit-learn's TargetEncoder cross-fits inside fit_transform: each row is
# encoded with category means learned from the other folds. Calling
# fit_transform on a plain (non-cross-fitted) target encoder would instead
# let each row's own target leak into its encoding.
target_enc = TargetEncoder(target_type='continuous',
                           smooth=10.0,  # smoothing shrinks rare categories toward the global mean
                           cv=5, shuffle=True, random_state=42)
df['city_encoded'] = target_enc.fit_transform(df[['city']], df['revenue']).ravel()
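What cross-validated encoding means in practice can be shown by hand. The following is a minimal sketch on a hypothetical ten-row frame: each row is encoded with the category mean computed from the other folds only, and a naive full-data mean is shown next to it for comparison. The toy data and the `city_oof` and `city_naive` column names are assumptions for illustration, and smoothing is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'revenue': [10.0, 12.0, 17.0, 20.0, 22.0, 24.0, 5.0, 6.0, 7.0, 8.0],
})

kf = KFold(n_splits=5, shuffle=True, random_state=42)
oof = pd.Series(np.nan, index=toy.index)

for fit_idx, enc_idx in kf.split(toy):
    fit_part = toy.iloc[fit_idx]
    # Category means come only from rows outside the fold being encoded
    fold_means = fit_part.groupby('city')['revenue'].mean()
    encoded = toy.iloc[enc_idx]['city'].map(fold_means)
    # A category missing from the fitting rows falls back to their overall mean
    oof.iloc[enc_idx] = encoded.fillna(fit_part['revenue'].mean()).to_numpy()

toy['city_oof'] = oof

# Naive encoding for comparison: each row's own target is part of its encoding
toy['city_naive'] = toy.groupby('city')['revenue'].transform('mean')

print(toy)
```

The out-of-fold column never sees the target of the row it encodes, so a downstream model cannot learn to read the label back out of the feature.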
# Strategy 4: Frequency Encoding — for very high cardinality
# Replaces category with how often it appears in the dataset
freq_map = df['customer_segment'].value_counts(normalize=True)
df['segment_frequency'] = df['customer_segment'].map(freq_map)
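When the same encoding is later applied to new data, the frequency map should come from the training rows, and categories that never appeared there need a fallback. A minimal sketch, assuming hypothetical `train` and `new` frames and a fallback frequency of 0:

```python
import pandas as pd

# Hypothetical frames, for illustration only
train = pd.DataFrame({
    'customer_segment': ['retail', 'retail', 'retail', 'smb', 'smb', 'enterprise']
})
new = pd.DataFrame({'customer_segment': ['retail', 'enterprise', 'government']})

# Learn relative frequencies from the training rows only
freq_map = train['customer_segment'].value_counts(normalize=True)

train['segment_frequency'] = train['customer_segment'].map(freq_map)

# 'government' was never seen in training, so map() yields NaN; fill with 0
new['segment_frequency'] = new['customer_segment'].map(freq_map).fillna(0.0)

print(new)
```

Filling with 0 is one choice among several; whether an unseen category should look like a very rare one depends on the model and the data.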
print(f'Original: {df.shape[1]} columns')
print(f'After OHE: {df_ohe.shape[1]} columns')
📅 Temporal Feature Engineering
import pandas as pd
import numpy as np
df = pd.read_csv('transactions.csv', parse_dates=['transaction_date', 'signup_date'])
# Extract every meaningful signal from datetime columns
df['year'] = df['transaction_date'].dt.year
df['month'] = df['transaction_date'].dt.month
df['quarter'] = df['transaction_date'].dt.quarter
df['day_of_week'] = df['transaction_date'].dt.dayofweek # 0=Monday
df['day_of_month'] = df['transaction_date'].dt.day
df['week_of_year'] = df['transaction_date'].dt.isocalendar().week
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['is_month_end'] = df['transaction_date'].dt.is_month_end.astype(int)
df['is_month_start'] = df['transaction_date'].dt.is_month_start.astype(int)
df['hour'] = df['transaction_date'].dt.hour
df['is_business_hours'] = df['hour'].between(9, 17, inclusive='left').astype(int)  # 09:00 to 16:59
# Cyclical encoding — preserve the circular nature of time
# Month 1 should be 'close' to month 12 in feature space
for col, period in [('month', 12), ('day_of_week', 7), ('hour', 24)]:
    df[f'{col}_sin'] = np.sin(2 * np.pi * df[col] / period)
    df[f'{col}_cos'] = np.cos(2 * np.pi * df[col] / period)
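A quick numeric check shows what the sine and cosine pair buys. In this minimal sketch the helper name `cyclical` is hypothetical; it places a month on the unit circle and compares distances between months in the encoded space.

```python
import numpy as np

def cyclical(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

jan, jun, dec = cyclical(1, 12), cyclical(6, 12), cyclical(12, 12)

# On the raw month number, December and January look far apart
raw_gap_dec_jan = abs(12 - 1)

# In (sin, cos) space they are neighbours, while June is far from January
dist_dec_jan = np.linalg.norm(dec - jan)
dist_jun_jan = np.linalg.norm(jun - jan)

print(raw_gap_dec_jan)         # 11
print(f'{dist_dec_jan:.3f}')   # 0.518
print(f'{dist_jun_jan:.3f}')   # 1.932
```

Both columns are needed: the sine alone cannot tell apart two months that sit at the same height on opposite sides of the circle.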
# Customer lifetime features
df['customer_age_days'] = (df['transaction_date'] - df['signup_date']).dt.days
df['days_since_first_purchase'] = (
    df.groupby('customer_id')['transaction_date']
    .transform(lambda x: (x - x.min()).dt.days)
)
# RFM Features — classic customer analytics
reference_date = df['transaction_date'].max()
rfm = df.groupby('customer_id').agg(
    recency=('transaction_date', lambda x: (reference_date - x.max()).days),
    frequency=('transaction_id', 'count'),
    monetary=('amount', 'sum')
).reset_index()
print(rfm.describe())
📉 Dimensionality Reduction with PCA
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# PCA requires scaled inputs — apply StandardScaler first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.select_dtypes(include='number'))
# Step 1: Fit PCA to understand explained variance
pca_full = PCA()
pca_full.fit(X_scaled)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
print('Components needed to explain:')
for threshold in [0.80, 0.90, 0.95, 0.99]:
    n_components = np.argmax(cumulative_variance >= threshold) + 1
    print(f' {threshold*100:.0f}%: {n_components} components')
# Step 2: Apply PCA with chosen components
n_components = np.argmax(cumulative_variance >= 0.95) + 1
pca = PCA(n_components=n_components, random_state=42)
X_pca = pca.fit_transform(X_scaled)
print(f'\nReduced from {X_scaled.shape[1]} to {X_pca.shape[1]} dimensions')
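scikit-learn can also choose the component count itself: a float between 0 and 1 passed as `n_components`, together with `svd_solver='full'`, keeps enough components to explain that share of the variance. The sketch below checks that this agrees with the manual `argmax` approach. The synthetic data (ten features driven by three latent factors) is an assumption for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed features generated from 3 latent factors plus noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 3))
weights = rng.normal(size=(3, 10))
X = latent @ weights + 0.1 * rng.normal(size=(500, 10))
X_scaled = StandardScaler().fit_transform(X)

# Manual route: fit all components, then find where cumulative variance reaches 95%
full = PCA(svd_solver='full').fit(X_scaled)
cumulative = np.cumsum(full.explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= 0.95) + 1)

# Built-in route: let PCA pick the count for a 95% variance target
pca_auto = PCA(n_components=0.95, svd_solver='full').fit(X_scaled)
n_auto = int(pca_auto.n_components_)

print(f'manual: {n_manual}, built-in: {n_auto}')
```

The manual route remains useful when you want to inspect the whole variance curve before committing to a threshold.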
# Step 3: Interpret components — what does each one represent?
feature_names = df.select_dtypes(include='number').columns
component_df = pd.DataFrame(
    pca.components_.T,
    index=feature_names,
    columns=[f'PC{i+1}' for i in range(n_components)]
)
print('\nTop 5 features by absolute loading for PC1:')
print(component_df['PC1'].abs().sort_values(ascending=False).head(5))
# Truncated SVD — PCA alternative for sparse data (text features, etc.)
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=50, random_state=42)
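# sparse_matrix is assumed to be an existing scipy sparse matrix with at least 50 feature columns (e.g. TF-IDF output)
X_svd = svd.fit_transform(sparse_matrix)  # works on scipy sparse matrices
Data Science: Feature Forge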
| ID | Age | Tier | Income | Score |
|---|---|---|---|---|
| 101 | 28 | Pro | $85,000 | 7.2 |
| 102 | NaN | Free | $42,000 | 4.1 |
| 103 | 45 | Max | $150,000 | 9.8 |
| 104 | NaN | Pro | $92,000 | 8 |
| 105 | 22 | Free | $35,000 | 5.5 |
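One way the sample table above could be prepared for modelling, as a minimal sketch. The tier order Free < Pro < Max, the choice of median imputation, and the `Age_missing` and `Tier_rank` column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# The sample table, typed in as it appears above
df = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105],
    'Age': [28, np.nan, 45, np.nan, 22],
    'Tier': ['Pro', 'Free', 'Max', 'Pro', 'Free'],
    'Income': ['$85,000', '$42,000', '$150,000', '$92,000', '$35,000'],
    'Score': [7.2, 4.1, 9.8, 8.0, 5.5],
})

# Income: strip the currency symbol and thousands separators, then cast to int
df['Income'] = df['Income'].str.replace(r'[$,]', '', regex=True).astype(int)

# Age: keep a missingness flag, then fill gaps with the median of the known ages
df['Age_missing'] = df['Age'].isna().astype(int)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Tier: ordinal rank under the assumed order Free < Pro < Max
tier_rank = {'Free': 0, 'Pro': 1, 'Max': 2}
df['Tier_rank'] = df['Tier'].map(tier_rank)

print(df)
```

With only three known ages the median is a rough estimate, which is why the flag column is kept: it lets a model treat imputed rows differently.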