4. Data Collection and Processing
Garbage In, Garbage Out
The most sophisticated neural network in the world cannot overcome poor data. "Garbage in, garbage out" is the iron law of machine learning. In real AI projects, data collection, cleaning, and preprocessing consume the majority of engineering time—and determine the ceiling of model performance far more than architecture choices.
This module teaches you to treat data as a first-class engineering concern: to collect it systematically, clean it rigorously, engineer features intelligently, and explore it deeply before any model ever sees it.
📥 Data Collection Techniques
Every AI project begins with data. Where it comes from determines what you can build.
Structured Sources:
- Relational Databases (SQL): Most enterprise AI projects begin with existing databases. Learn SQL. Even as an AI engineer, you will spend significant time writing queries to extract, aggregate, and join data from production databases.
- APIs: RESTful APIs provide structured JSON or CSV data. Useful for enriching datasets with external information (weather, financial data, social signals). Rate limiting and authentication are practical engineering challenges.
- Data Warehouses: BigQuery, Snowflake, Redshift. Purpose-built for analytical queries over massive datasets. The standard data infrastructure in data-mature organizations.
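As a minimal sketch of the extract, join, and aggregate pattern described above, the following uses Python's built-in sqlite3 module with an in-memory database. The customers and orders tables and their columns are hypothetical stand-ins for a production schema.

```python
import sqlite3

# In-memory database with two hypothetical tables standing in for production data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south'), (3, 'north');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 30.0), (3, 2, 15.0), (4, 3, 5.0);
""")

# Join orders to customers, then aggregate order count and revenue per region.
query = """
    SELECT c.region, COUNT(o.id) AS n_orders, SUM(o.amount) AS total_amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY c.region
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)  # [('north', 3, 55.0), ('south', 1, 15.0)]
```

Pushing the join and aggregation into the database, rather than pulling raw rows into Python first, usually keeps the extracted dataset small.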
Unstructured Sources:
- Web Scraping: Use Scrapy, BeautifulSoup, or Playwright to extract data from public web pages. Always check robots.txt and terms of service. Consider ethical and legal implications.
- Public Datasets: Kaggle, Hugging Face Datasets, UCI ML Repository, Google Dataset Search, government open data portals. Always start here before collecting custom data—free, labeled datasets save months of work.
- Synthetic Data Generation: For rare events or privacy-sensitive scenarios, generate synthetic training data using simulation, GANs, or LLM-based augmentation.
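The robots.txt check mentioned for web scraping can be sketched with the standard library's urllib.robotparser. The robots.txt content, the example.com URLs, and the "my-scraper" user agent below are all hypothetical; a real scraper would fetch the file from the target site and would still need to review the terms of service separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice it is fetched from the site root.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether our (hypothetical) user agent may fetch each URL.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/products/1")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

# Seconds to wait between requests, if the site declares one.
delay = parser.crawl_delay("my-scraper")

print(allowed_public, allowed_private, delay)
```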
Data Versioning: Track your datasets like code. Use DVC (Data Version Control) to version datasets alongside model code. Without versioning, reproducing results from three months ago becomes impossible when your dataset has changed.
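To illustrate why content-based tracking makes dataset changes detectable, here is a small sketch that fingerprints a file with a hash. This is not DVC itself and does not use its API; the fingerprint helper and the train.csv file name are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(path: Path) -> str:
    """Return a SHA-256 hex digest of the file's bytes (hypothetical helper)."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "train.csv"  # hypothetical dataset file
    data_file.write_text("id,label\n1,0\n2,1\n")
    v1 = fingerprint(data_file)

    data_file.write_text("id,label\n1,0\n2,1\n3,1\n")  # one row added
    v2 = fingerprint(data_file)

print(v1 != v2)  # True: any change to the data changes the fingerprint
```

Recording such a fingerprint next to each trained model is one way to tell later which version of the data produced a given result.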
🧹 Data Cleaning — The Critical 60%
Real-world data is messy. Missing values, duplicates, incorrect entries, inconsistent formats, and outliers are the norm, not the exception. Cleaning is not glamorous, but it is where AI projects succeed or fail.
Systematic Cleaning Checklist:
Handling Missing Data — Strategies and When to Use Them:
- Drop rows: df.dropna(). Only when missingness is random and you can afford to lose data. Dangerous with more than 5–10% missing data.
- Mean/Median Imputation: df['col'].fillna(df['col'].median()). Simple, preserves sample size. Mean for symmetric distributions, median for skewed or when outliers are present. Loses variance information.
- Mode Imputation: For categorical features. df['col'].fillna(df['col'].mode()[0])
- Model-based Imputation: Train a model to predict missing values from other features. Most accurate but computationally expensive.
- Missing as a category: For categorical features, treat "missing" as its own category. Sometimes missingness itself carries signal—a customer with no transaction history behaves differently from one with many.
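The simpler strategies above can be sketched with pandas as follows. The DataFrame and its column names (income, plan, referrer) are made up for illustration, and which strategy suits which column is a judgment call that depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with gaps in every column.
df = pd.DataFrame({
    "income": [40_000.0, np.nan, 55_000.0, 1_000_000.0, np.nan],
    "plan": ["basic", "pro", None, "basic", "basic"],
    "referrer": ["ad", None, "friend", None, "ad"],
})

# Share of missing values per column: check this before choosing a strategy.
missing_share = df.isna().mean()

cleaned = df.copy()
# Median imputation: income is skewed by one very large value.
cleaned["income"] = cleaned["income"].fillna(cleaned["income"].median())
# Mode imputation for a categorical feature.
cleaned["plan"] = cleaned["plan"].fillna(cleaned["plan"].mode()[0])
# Missing as its own category, in case missingness carries signal.
cleaned["referrer"] = cleaned["referrer"].fillna("missing")

print(missing_share)
print(cleaned)
```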
Duplicate Detection and Handling:
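One possible approach with pandas, shown here as a sketch: count exact duplicate rows first, then normalize the key column so that near-duplicates (differing only in case or whitespace) are caught too. The email and amount columns are hypothetical, and which columns define "the same record" depends on the dataset.

```python
import pandas as pd

# Hypothetical records: rows 0 and 1 differ only in case and trailing whitespace,
# rows 2 and 3 are exact duplicates.
df = pd.DataFrame({
    "email": ["a@example.com", "A@Example.com ", "b@example.com", "b@example.com"],
    "amount": [10, 10, 20, 20],
})

# Exact duplicates: rows identical to an earlier row in every column.
exact_dupes = df.duplicated().sum()

# Normalize the key before deduplicating so near-duplicates collapse as well.
normalized = df.assign(email=df["email"].str.strip().str.lower())
deduped = (
    normalized
    .drop_duplicates(subset=["email"], keep="first")
    .reset_index(drop=True)
)

print(exact_dupes)
print(deduped)
```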
Outlier Detection and Treatment:
- IQR Method: Values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR are statistical outliers. Visualize with box plots before acting.
- Z-Score Method: Values more than 3 standard deviations from the mean. Works best for normally distributed features.
- Domain Knowledge: A human age of 150 is an outlier. An income of -$5,000 is likely a data entry error. Statistical methods help identify candidates; domain knowledge decides what to do with them.
- Options: Remove (if truly erroneous), cap/winsorize (replace with percentile boundaries), or keep (if legitimately extreme and informative).
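The IQR rule, the z-score rule, and capping can be sketched together on a small, made-up age column. The thresholds (1.5×IQR and 3 standard deviations) follow the text above; the data is synthetic.

```python
import pandas as pd

# Synthetic ages: 20, 22, ..., 58 plus one implausible value.
ages = pd.Series(list(range(20, 60, 2)) + [150], name="age")

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = ages[(ages < lower) | (ages > upper)]

# Z-score method: flag values more than 3 standard deviations from the mean.
z_scores = (ages - ages.mean()) / ages.std(ddof=0)
z_outliers = ages[z_scores.abs() > 3]

# Capping (winsorizing) at the IQR fences instead of dropping rows.
capped = ages.clip(lower=lower, upper=upper)

print(iqr_outliers.tolist(), z_outliers.tolist(), capped.max())
```

Note that in very small samples the z-score rule can fail to flag anything, because a single extreme value inflates the standard deviation it is measured against.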
⚙️ Feature Engineering — Creating Signal from Noise
Feature engineering is the process of transforming raw data into inputs that make the model's job easier. Domain knowledge + creative feature engineering frequently outperforms complex model architectures on messy, real-world data.
- Date/Time Decomposition: A raw timestamp is almost never useful. Extract: hour of day, day of week (Monday vs Sunday has very different patterns in retail data), month, quarter, is_weekend, is_holiday, days_since_last_event.
- Interaction Features: Multiply or divide related features. revenue_per_user = total_revenue / user_count. Often captures nonlinear relationships that linear models miss.
- Polynomial Features: Adding x², x³ allows linear models to fit curved relationships. scikit-learn's PolynomialFeatures automates this.
- Numeric Features from Text Data: Word count in a text field, presence of specific keywords, sentiment score. Each converts unstructured text into a numeric feature.
- Binning/Discretization: Convert continuous features into discrete, non-overlapping categories. Age → [0-17, 18-34, 35-54, 55+]. Useful when the relationship between the feature and target is non-monotonic, at the cost of discarding variation within each bin.
- Lag Features (Time Series): For temporal data, include values from previous timesteps as features. Yesterday's sales, last week's price, last month's user count—these are often the most predictive features for forecasting.
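Several of the techniques in this list can be sketched with pandas on a tiny, invented sales table. Column names such as total_revenue and user_count follow the text; the data, the revenue_lag_1 name, and the age bin labels are assumptions for the example.

```python
import pandas as pd

# Invented daily sales records (2024-03-01 is a Friday).
sales = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-03-01 09:00", "2024-03-02 14:00",
        "2024-03-03 20:00", "2024-03-04 11:00",
    ]),
    "total_revenue": [200.0, 450.0, 300.0, 120.0],
    "user_count": [10, 15, 12, 6],
})

features = sales.copy()
# Date/time decomposition.
features["hour"] = sales["timestamp"].dt.hour
features["day_of_week"] = sales["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
features["is_weekend"] = features["day_of_week"] >= 5
# Interaction feature.
features["revenue_per_user"] = sales["total_revenue"] / sales["user_count"]
# Lag feature: the previous row's revenue (NaN for the first row).
features["revenue_lag_1"] = sales["total_revenue"].shift(1)

# Binning with half-open intervals [0, 18), [18, 35), [35, 55), [55, inf).
ages = pd.Series([12, 18, 34, 35, 70])
age_group = pd.cut(
    ages,
    bins=[0, 18, 35, 55, float("inf")],
    right=False,
    labels=["0-17", "18-34", "35-54", "55+"],
)

print(features)
print(age_group.tolist())
```

For lag features on real time series, the rows must be sorted by time (and grouped per entity, if there are several) before calling shift, otherwise the "previous" value is meaningless.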
📏 Data Normalization and Scaling
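Many machine learning algorithms are sensitive to the scale of input features, especially distance-based methods (k-NN, SVM) and models trained with gradient descent; tree-based models are largely unaffected. For the sensitive ones, a feature with values in the thousands (income) will dominate a feature with values between 0 and 1 (probability) unless you rescale.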
- Standardization (Z-score normalization): Subtracts the mean, divides by the standard deviation. Output has mean 0 and standard deviation 1. A good default for scale-sensitive models (regularized linear regression, SVM, neural networks). It changes the scale of a feature, not the shape of its distribution. StandardScaler in scikit-learn.
- Min-Max Normalization: Scales values to the [0, 1] range. Preserves the shape of the distribution but is sensitive to outliers. Use when you need bounded values. MinMaxScaler in scikit-learn.
- Robust Scaling: Uses median and IQR instead of mean and standard deviation, so it is far less affected by outliers. Use when your data has significant outliers you're keeping. RobustScaler in scikit-learn.
- Log Transformation: Applies log(x+1) to highly skewed, non-negative distributions (income, population, pageviews). Compresses extreme values and reveals log-linear relationships.
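To make the four transformations concrete, here is a sketch that writes each formula out with NumPy on a small, made-up income array. In a real pipeline the scikit-learn scalers would typically be used instead; spelling the arithmetic out only shows what they compute.

```python
import numpy as np

# Made-up incomes with one extreme value.
income = np.array([20_000.0, 30_000.0, 40_000.0, 50_000.0, 1_000_000.0])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (income - income.mean()) / income.std()

# Min-max normalization: map the minimum to 0 and the maximum to 1.
min_max = (income - income.min()) / (income.max() - income.min())

# Robust scaling: subtract the median, divide by the interquartile range.
q1, median, q3 = np.percentile(income, [25, 50, 75])
robust = (income - median) / (q3 - q1)

# Log transformation: log(x + 1), computed stably with log1p.
logged = np.log1p(income)

print(min_max.round(3))
print(robust)  # [-1.  -0.5  0.   0.5 48. ]
```

Notice how the outlier squeezes the first four min-max values toward 0, while robust scaling keeps them evenly spread.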
Critical Rule: Always fit scalers on training data only, then apply the learned parameters to transform both training and test data. Fitting on all data leaks test set statistics into training—a form of data leakage that produces optimistic evaluation results that don't generalize.
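The rule can be sketched with scikit-learn as follows: split first, fit the scaler on the training split, and reuse the fitted scaler on the test split. The data is randomly generated for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated features and labels, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split BEFORE fitting any preprocessing step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses those learned parameters; the scaler is never refit on test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(X_test_scaled.mean(axis=0).round(3))   # close to 0, but not exactly 0
```

The test split's mean is not exactly zero after scaling, and that is expected: it was scaled with statistics it did not contribute to.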
🔍 Exploratory Data Analysis (EDA)
EDA is the practice of systematically exploring your dataset before modeling to understand its structure, detect problems, and generate hypotheses. Skipping EDA is one of the most costly mistakes in AI engineering.
The EDA Checklist:
- Dataset Overview: Shape, data types, memory usage, first/last rows
- Target Variable Analysis: Distribution of the label you're predicting. Is it balanced (50/50) or imbalanced (99/1)? Class imbalance changes your entire approach.
- Feature Distributions: Histogram for each numeric feature. Bar chart for each categorical feature. Look for: unexpected distributions, values that shouldn't exist, extreme skew.
- Missing Data Patterns: Is missingness random or systematic? Certain users, time periods, or data collection methods may have systematically missing data.
- Correlation Analysis: Which features correlate with the target? Which features correlate with each other (multicollinearity)? Does any feature have a suspiciously perfect correlation with the target (data leakage)?
- Outlier Analysis: Box plots for every numeric feature. Are outliers legitimate extreme values or data errors?
- Temporal Patterns (if applicable): Plot features over time. Detect seasonality, trends, and distribution shifts.
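The non-visual parts of this checklist can be sketched with a few pandas calls. The DataFrame below is invented, with churned as a hypothetical target column; histograms and box plots would additionally need a plotting library and are left out here.

```python
import numpy as np
import pandas as pd

# Invented dataset; "churned" plays the role of the target variable.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, np.nan, 44],
    "monthly_spend": [20.0, 35.0, 80.0, 95.0, 50.0, 30.0, 60.0, 70.0],
    "segment": ["a", "b", "b", "a", "a", "b", "a", "b"],
    "churned": [0, 0, 0, 1, 0, 0, 0, 0],
})

# Dataset overview: shape and data types.
overview = {"shape": df.shape, "dtypes": df.dtypes.astype(str).to_dict()}

# Target variable analysis: class proportions reveal imbalance.
target_balance = df["churned"].value_counts(normalize=True)

# Missing data: share of missing values per column, worst first.
missing_share = df.isna().mean().sort_values(ascending=False)

# Feature distributions: summary statistics for numeric columns.
numeric_summary = df.describe()

# Correlation of each numeric feature with the target.
numeric = df.select_dtypes("number")
correlations = numeric.corr()["churned"].drop("churned")

print(overview["shape"])
print(target_balance)
print(missing_share)
print(correlations)
```

With only eight rows the correlations here mean nothing; the point is the sequence of checks, not the numbers.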