18. Tools & Ecosystem
Jupyter vs Production, Version Control, and the Full Data Science Stack
Professional data scientists work across two distinct environments: exploratory (Jupyter notebooks for experimentation) and production (version-controlled Python packages for reliable, repeatable systems). Knowing which mode you are in, and how to move from one to the other, is a career-defining skill. This module maps the data science tooling ecosystem and explains the professional workflows that distinguish senior practitioners from junior ones.
🔬 Jupyter vs Production: Know the Boundary
Jupyter notebooks excel at exploration. They fail at production for specific, avoidable reasons:
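- Hidden State: Cells can be run out of order, so the results on screen may depend on an execution history that is no longer visible. Restarting the kernel and running the notebook top to bottom can then produce different results. Production code must be deterministic and independent of execution order.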
- No Testing: There is no standard test framework for notebooks. A function that was correct last week may have been overwritten without detection.
- No Version Diffing: An .ipynb file is JSON that stores code, outputs, and metadata together, so Git diffs of notebooks are almost unreadable for code review.
- Import Nightmares: Global imports and dependencies scattered throughout cells create fragile environments.
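The hidden-state problem can be reproduced outside a notebook. The sketch below is illustrative only: each function stands in for one notebook cell, and the names (`state`, `cell_1_load`, `run`, `mean_scaled`) are hypothetical. It shows that re-running a single "cell" changes the final answer, while a pure function returns the same result however often it is called.

```python
# Each function stands in for one notebook cell that mutates shared state.
state = {}


def cell_1_load():
    state["prices"] = [100.0, 250.0, 400.0]


def cell_2_scale():
    # Rescales in place, so every extra run changes the data again.
    state["prices"] = [p / 100 for p in state["prices"]]


def cell_3_mean():
    return sum(state["prices"]) / len(state["prices"])


def run(cells):
    """Simulate a fresh kernel, then execute the cells in the given order."""
    state.clear()
    result = None
    for cell in cells:
        result = cell()
    return result


# Same "notebook", two execution histories, two different answers.
top_to_bottom = run([cell_1_load, cell_2_scale, cell_3_mean])
scale_run_twice = run([cell_1_load, cell_2_scale, cell_2_scale, cell_3_mean])


def mean_scaled(prices, divisor=100):
    """Pure alternative: the output depends only on the arguments."""
    scaled = [p / divisor for p in prices]
    return sum(scaled) / len(scaled)


print(top_to_bottom, scale_run_twice, mean_scaled([100.0, 250.0, 400.0]))
```

The pure version is the kind of function that is straightforward to move into a module and test.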
The professional pattern: Use notebooks for EDA and prototyping. Once an approach is validated, refactor the core logic into Python modules in src/ and import them from notebooks. The notebook becomes a thin orchestration layer that calls testable, documented, version-controlled functions.
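A minimal sketch of this refactoring, using only the standard library. The function name `drop_invalid_rows` and its placement in `src/data/transform.py` are assumptions made for illustration. A real project would more likely operate on a DataFrame, but the shape is the same: a small pure function plus a test that pytest could collect.

```python
from __future__ import annotations

from typing import Iterable


# Sketch of what could live in src/data/transform.py (hypothetical function).
def drop_invalid_rows(rows: Iterable[dict], required: tuple[str, ...]) -> list[dict]:
    """Return only the rows where every required field is present and not None."""
    return [row for row in rows if all(row.get(key) is not None for key in required)]


# Sketch of what could live in tests/test_transform.py; pytest collects it by name.
def test_drop_invalid_rows():
    rows = [
        {"id": 1, "price": 9.5},
        {"id": 2, "price": None},  # missing value
        {"id": 3},                 # missing field
    ]
    assert drop_invalid_rows(rows, required=("id", "price")) == [{"id": 1, "price": 9.5}]


# Called directly here so the sketch runs without pytest installed.
test_drop_invalid_rows()
```

In the notebook, the cell would then shrink to an import such as `from src.data.transform import drop_invalid_rows` followed by a single call, assuming the project root is on the import path.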
# Professional data science project structure
my_ds_project/
├── data/
│   ├── raw/                  # Never modify: original sources
│   ├── processed/            # Cleaned, validated datasets
│   └── features/             # Feature-engineered datasets
├── notebooks/
│   ├── 01_eda.ipynb          # Exploration only
│   └── 02_modeling.ipynb     # Model prototyping
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── ingest.py         # Data loading functions
│   │   ├── validate.py       # Validation logic
│   │   └── transform.py      # Cleaning functions
│   ├── features/
│   │   └── engineering.py    # Feature engineering
│   └── models/
│       ├── train.py          # Training pipeline
│       ├── evaluate.py       # Evaluation metrics
│       └── predict.py        # Inference functions
├── tests/
│   ├── test_transform.py     # Unit tests for src/
│   └── test_features.py
├── config/
│   └── config.yaml           # Configuration (not code)
├── Makefile                  # Common commands
├── requirements.txt          # Pinned dependencies
├── pyproject.toml            # Package metadata
└── README.md
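The convention that `data/raw/` is never modified can also be enforced in code rather than left to discipline. The following is a sketch under stated assumptions: `is_inside` and `write_output` are hypothetical helpers, not part of any library, and they assume every write goes through them.

```python
from pathlib import Path


def is_inside(path: Path, folder: Path) -> bool:
    """True if path is the folder itself or lies anywhere beneath it."""
    path, folder = path.resolve(), folder.resolve()
    return path == folder or folder in path.parents


def write_output(project_root: Path, relative_path: str, text: str) -> Path:
    """Write text under project_root, refusing any target inside data/raw/."""
    target = project_root / relative_path
    if is_inside(target, project_root / "data" / "raw"):
        raise PermissionError(f"data/raw/ is read-only by convention: {target}")
    # Create data/processed/, data/features/, etc. on first use.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

A call such as `write_output(Path("my_ds_project"), "data/processed/clean.csv", csv_text)` succeeds, while the same call with a path under `data/raw/` raises `PermissionError` before anything is written.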
🔧 The Complete Data Science Stack
| Category | Tool | Use Case |
|---|---|---|
| Data Manipulation | Pandas, Polars | DataFrame operations (Polars is often cited as 10-50x faster on large data; actual speedups depend on the workload) |
| ML | scikit-learn | Classical ML, preprocessing, pipelines, evaluation |
| Gradient Boosting | XGBoost, LightGBM, CatBoost | Tabular data, competitions, production models |
| Deep Learning | PyTorch, TensorFlow | Images, NLP, sequential data, custom architectures |
| Experiment Tracking | MLflow, Weights & Biases | Log parameters, metrics, artifacts across training runs |
| Hyperparameter Tuning | Optuna, Ray Tune | Bayesian and distributed hyperparameter optimization |
| Orchestration | Prefect, Airflow, Dagster | Schedule and monitor pipeline execution |
| Feature Store | Feast, Tecton, Hopsworks | Consistent features across training and serving |
| Visualization | Matplotlib, Seaborn, Plotly | Static (Matplotlib/Seaborn) and interactive (Plotly) charts |
| Dashboards | Dash, Streamlit, Metabase | Code-driven dashboards (Dash/Streamlit) or BI tools (Metabase) |
| Big Data | Apache Spark (PySpark) | Data larger than memory, distributed computation |
| Data Lake | Delta Lake, Apache Iceberg | ACID transactions, time travel, schema evolution on lakes |
| Interpretability | SHAP, LIME | Feature importance and individual prediction explanations |