Data Science | VoidX Academy

18. Tools & Ecosystem

Module 18: Tools

Jupyter vs Production, Version Control, and the Full Data Science Stack

Professional data scientists work in two distinct environments: exploratory (Jupyter notebooks for experimentation) and production (version-controlled Python packages for reliable, repeatable systems). Knowing which mode you are in, and how to move work from one to the other, is a career-defining skill. This module maps the data science tooling ecosystem and explains the professional workflows that distinguish senior practitioners from junior ones.

🔬 Jupyter vs Production: Know the Boundary

Jupyter notebooks excel at exploration. They fail in production for specific, avoidable reasons:

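Hidden State: Cells can be run out of order, so the state in memory may not match what the notebook shows from top to bottom. Restarting the kernel and rerunning every cell can produce different results. Production code must be deterministic and must not depend on execution history.
No Testing: Notebooks have no built-in testing workflow; tools such as nbval and testbook are add-ons rather than part of the default experience. A function that was correct last week may have been redefined in a later cell without anyone noticing.
No Version Diffing: A notebook is stored as JSON with code, outputs, and metadata interleaved, so plain Git diffs are almost unreadable in code review.
Import Nightmares: Imports and ad hoc installs scattered across cells make dependencies hard to track and environments fragile.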
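
The hidden-state problem can be illustrated without Jupyter at all. The sketch below is a simplified stand-in, not how the kernel is implemented: each "cell" is a hypothetical function that reads and writes one shared dictionary, much as real cells share the kernel's global namespace. The cell names and values are invented for illustration.

```python
# Each function plays the role of a notebook cell; `ns` plays the role
# of the kernel's shared global namespace.

def cell_1(ns):
    ns["prices"] = [100, 200, 300]

def cell_2(ns):
    # Overwrites shared state: every run adds another 10-unit fee per item.
    ns["prices"] = [p + 10 for p in ns["prices"]]

def cell_3(ns):
    ns["total"] = sum(ns["prices"])

def run(cells):
    """Run cells in the given order against a fresh namespace (a restarted kernel)."""
    ns = {}
    for cell in cells:
        cell(ns)
    return ns

top_to_bottom = run([cell_1, cell_2, cell_3])["total"]
cell_2_run_twice = run([cell_1, cell_2, cell_2, cell_3])["total"]
stale_total = run([cell_1, cell_3, cell_2])["total"]

print(top_to_bottom)     # 630
print(cell_2_run_twice)  # 660
print(stale_total)       # 600
```

The same three cells give three different totals depending on how they were executed, and nothing in the saved notebook records which history produced the number on screen.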

The professional pattern: Use notebooks for EDA and prototyping. Once an approach is validated, refactor the core logic into Python modules in src/ and import them from notebooks. The notebook becomes a thin orchestration layer that calls testable, documented, version-controlled functions.
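
As a minimal sketch of that refactoring, assume a cleaning step that started life in a notebook cell. The function name `clean_ages`, its parameters, and the record layout are hypothetical; the point is only that a pure function in a module can be tested, while the same logic inline in a cell cannot.

```python
# Logic that might live in src/data/transform.py (hypothetical contents).
def clean_ages(records, min_age=0, max_age=120):
    """Return copies of the records whose 'age' is an int within [min_age, max_age].

    The input list and its dicts are left unmodified.
    """
    cleaned = []
    for record in records:
        age = record.get("age")
        if isinstance(age, int) and min_age <= age <= max_age:
            cleaned.append(dict(record))
    return cleaned


# Tests that might live in tests/test_transform.py; pytest collects
# functions named test_*. In a real project this file would start with
# an import of clean_ages from the transform module.
def test_clean_ages_drops_invalid_rows():
    raw = [{"age": 34}, {"age": -5}, {"age": None}, {"age": 240}, {}]
    assert clean_ages(raw) == [{"age": 34}]


def test_clean_ages_does_not_mutate_input():
    raw = [{"age": 34}]
    clean_ages(raw)[0]["age"] = 99
    assert raw == [{"age": 34}]
```

The notebook then shrinks to an import and a call, and any later change to the cleaning rule shows up as a readable diff in a .py file and a passing or failing test.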

# Professional data science project structure
my_ds_project/
├── data/
│   ├── raw/          # Never modify — original sources
│   ├── processed/    # Cleaned, validated datasets
│   └── features/     # Feature-engineered datasets
├── notebooks/
│   ├── 01_eda.ipynb         # Exploration only
│   └── 02_modeling.ipynb    # Model prototyping
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── ingest.py         # Data loading functions
│   │   ├── validate.py       # Validation logic
│   │   └── transform.py      # Cleaning functions
│   ├── features/
│   │   └── engineering.py    # Feature engineering
│   └── models/
│       ├── train.py          # Training pipeline
│       ├── evaluate.py       # Evaluation metrics
│       └── predict.py        # Inference functions
├── tests/
│   ├── test_transform.py     # Unit tests for src/
│   └── test_features.py
├── config/
│   └── config.yaml           # Configuration (not code)
├── Makefile                  # Common commands
├── requirements.txt          # Pinned dependencies
├── pyproject.toml            # Package metadata
└── README.md
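
The "never modify" rule for data/raw/ is a convention, so it helps to have a check that notices when it is broken. One possible approach, sketched here with only the standard library, is to record a SHA-256 digest per file and compare against it later. The function names `build_manifest` and `changed_files` are hypothetical, and the temporary directory stands in for data/raw/.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(raw_dir):
    """Map each file path (relative to raw_dir) to its SHA-256 hex digest."""
    raw_dir = Path(raw_dir)
    return {
        path.relative_to(raw_dir).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(raw_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(raw_dir, manifest):
    """Return sorted names of files added, removed, or modified since the manifest."""
    current = build_manifest(raw_dir)
    names = set(current) | set(manifest)
    return sorted(name for name in names if current.get(name) != manifest.get(name))


# Demonstration: a temporary directory stands in for data/raw/.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "sales.csv").write_text("id,amount\n1,10\n")
    manifest = build_manifest(raw)
    unchanged = changed_files(raw, manifest)

    (raw / "sales.csv").write_text("id,amount\n1,99\n")  # someone edits raw data
    modified = changed_files(raw, manifest)

print(unchanged)  # []
print(modified)   # ['sales.csv']
```

This version reads each file fully into memory, which is fine for a sketch; very large raw files would call for hashing in chunks.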

🔧 The Complete Data Science Stack

Category	Tool	Use Case
Data Manipulation	Pandas, Polars	DataFrame operations (Polars is often substantially faster on large data; the gain depends on the workload)
ML	scikit-learn	Classical ML, preprocessing, pipelines, evaluation
Gradient Boosting	XGBoost, LightGBM, CatBoost	Tabular data, competitions, production models
Deep Learning	PyTorch, TensorFlow	Images, NLP, sequential data, custom architectures
Experiment Tracking	MLflow, Weights & Biases	Log parameters, metrics, artifacts across training runs
Hyperparameter Tuning	Optuna, Ray Tune	Bayesian and distributed hyperparameter optimization
Orchestration	Prefect, Airflow, Dagster	Schedule and monitor pipeline execution
Feature Store	Feast, Tecton, Hopsworks	Consistent features across training and serving
Visualization	Matplotlib, Seaborn, Plotly	Static (Matplotlib/Seaborn) and interactive (Plotly) charts
Dashboards	Dash, Streamlit, Metabase	Code-driven dashboards (Dash/Streamlit) or BI tools (Metabase)
Big Data	Apache Spark (PySpark)	Data larger than memory, distributed computation
Data Lake	Delta Lake, Apache Iceberg	ACID transactions, time travel, schema evolution on lakes
Interpretability	SHAP, LIME	Feature importance and individual prediction explanations

