Data Science | VoidX Academy

6. Data Visualization Systems

Module 06: Visualization

Matplotlib, Seaborn, Plotly, and Dashboard Design

A visualization that confuses its audience is worse than no visualization at all. Data visualization is not about making charts — it is about making data argue clearly for a specific interpretation. This module covers the complete visualization toolkit from production-grade static plots to interactive dashboards, with emphasis on the design principles that separate analysis from storytelling.

📊 The Visualization Decision Framework

Every visualization choice should be driven by the question you're answering:

Distribution: Histogram, KDE plot, box plot, violin plot — what does the spread look like?
Comparison: Bar chart, grouped bar, dot plot — how do values differ across categories?
Relationship: Scatter plot, bubble chart, heat map — how do two variables co-vary?
Composition: Stacked bar, area chart, treemap — how do parts contribute to the whole?
Trend over time: Line chart, area chart, candlestick — how does a value change through time?
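
The framework above can be read as a simple lookup from question type to candidate chart families. A minimal sketch, assuming a hypothetical helper named suggest_charts (not part of any library) and using only the chart names listed above:

```python
# Hypothetical lookup table: question type -> chart families from the list above.
CHART_OPTIONS = {
    "distribution": ["histogram", "KDE plot", "box plot", "violin plot"],
    "comparison": ["bar chart", "grouped bar", "dot plot"],
    "relationship": ["scatter plot", "bubble chart", "heat map"],
    "composition": ["stacked bar", "area chart", "treemap"],
    "trend over time": ["line chart", "area chart", "candlestick"],
}


def suggest_charts(question_type):
    """Return candidate chart types for a question type (case-insensitive)."""
    key = question_type.strip().lower()
    if key not in CHART_OPTIONS:
        valid = ", ".join(sorted(CHART_OPTIONS))
        raise ValueError(f"Unknown question type {question_type!r}; expected one of: {valid}")
    return CHART_OPTIONS[key]


print(suggest_charts("Distribution"))
```

The lookup only narrows the options; choosing among the candidates still depends on the audience and the data.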

🎨 Production-Quality Static Plots

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Professional style configuration
plt.rcParams.update({
    'figure.dpi': 150,
    'figure.facecolor': 'white',
    'font.family': 'DejaVu Sans',
    'font.size': 11,
    'axes.spines.top': False,
    'axes.spines.right': False,
    'axes.grid': True,
    'grid.alpha': 0.3,
    'axes.titlesize': 14,
    'axes.titleweight': 'bold',
})
sns.set_palette('husl', 8)

df = pd.read_csv('sales_data.csv', parse_dates=['date'])

# Figure 1: Multi-panel EDA dashboard
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
fig.suptitle('Sales Performance Dashboard — Q1 2024', fontsize=18, fontweight='bold', y=1.02)

# Panel 1: Revenue distribution
sns.histplot(df['revenue'], kde=True, ax=axes[0,0], color='steelblue', bins=40)
axes[0,0].axvline(df['revenue'].median(), color='red', linestyle='--', label=f'Median: ${df["revenue"].median():,.0f}')
axes[0,0].set_title('Revenue Distribution')
axes[0,0].legend()

# Panel 2: Revenue by region (with sample sizes)
region_stats = df.groupby('region').agg(mean=('revenue','mean'), count=('revenue','size')).reset_index()
bars = axes[0,1].bar(region_stats['region'], region_stats['mean'],
                     color=sns.color_palette('husl', len(region_stats)))
# Add count labels on bars (bar_label positions text relative to each bar,
# so no data-dependent pixel offset is needed; requires matplotlib >= 3.4)
axes[0,1].bar_label(bars, labels=[f'n={n}' for n in region_stats['count']],
                    padding=3, fontsize=9)
axes[0,1].set_title('Average Revenue by Region')
axes[0,1].set_ylabel('Average Revenue ($)')

# Panel 3: Correlation heatmap
numeric_cols = df.select_dtypes(include='number')
corr = numeric_cols.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # show only lower triangle
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
            center=0, ax=axes[0,2], square=True, cbar_kws={'shrink': 0.8})
axes[0,2].set_title('Feature Correlation Matrix')

# Panel 4: Time series trend
# ('ME' = month-end, available in pandas >= 2.2; use 'M' on older versions)
monthly = df.resample('ME', on='date')['revenue'].sum().reset_index()
axes[1,0].plot(monthly['date'], monthly['revenue'], marker='o', linewidth=2, color='steelblue')
axes[1,0].fill_between(monthly['date'], monthly['revenue'], alpha=0.2, color='steelblue')
axes[1,0].set_title('Monthly Revenue Trend')
axes[1,0].set_ylabel('Total Revenue ($)')
axes[1,0].tick_params(axis='x', rotation=45)

# Panel 5: Box plots for outlier visualization
# (df.boxplot(by=...) overwrites the figure-level suptitle, so clearing it
#  afterwards would also erase the dashboard title; sns.boxplot only draws on its axes)
sns.boxplot(data=df, x='region', y='revenue', ax=axes[1,1])
axes[1,1].set_title('Revenue Distribution by Region')

# Panel 6: Scatter with regression line
sns.regplot(data=df, x='units_sold', y='revenue', ax=axes[1,2],
            scatter_kws={'alpha': 0.4, 'color': 'steelblue'},
            line_kws={'color': 'red', 'linewidth': 2})
axes[1,2].set_title('Units Sold vs Revenue')

plt.tight_layout()
plt.savefig('eda_dashboard.png', dpi=150, bbox_inches='tight')
plt.show()
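
The script above reads sales_data.csv with the columns date, region, units_sold, and revenue. If that file is not at hand, a synthetic stand-in can be generated to try the plots. This is a minimal sketch: the function name make_synthetic_sales, the region names, and every value it produces are assumptions for illustration and do not represent real sales data.

```python
import numpy as np
import pandas as pd


def make_synthetic_sales(n_days=91, seed=0):
    """Return one synthetic row per day starting 2024-01-01 (illustrative values only)."""
    rng = np.random.default_rng(seed)
    regions = ["North", "South", "East", "West"]  # hypothetical region names
    units = rng.integers(10, 200, size=n_days)
    # Hypothetical unit price around 25, floored at 5 so revenue stays positive.
    price = np.maximum(rng.normal(25.0, 3.0, size=n_days), 5.0)
    return pd.DataFrame({
        "date": pd.date_range("2024-01-01", periods=n_days, freq="D"),
        "region": rng.choice(regions, size=n_days),
        "units_sold": units,
        "revenue": (units * price).round(2),
    })


sales = make_synthetic_sales()
print(sales.head())
# sales.to_csv("sales_data.csv", index=False)  # uncomment to write the file
```

With the default of 91 days the table covers January 1 through March 31, 2024, which matches the Q1 2024 label in the dashboard title.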

🚀 Interactive Visualizations with Plotly

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd

df = pd.read_csv('sales_data.csv', parse_dates=['date'])

# Interactive time series with range selector
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=df['date'], y=df['revenue'],
    mode='lines', name='Revenue',
    line=dict(color='royalblue', width=1.5),
    hovertemplate='%{x|%B %d, %Y}<br>Revenue: $%{y:,.2f}'
))
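# Add 30-day rolling average
# (rolling(30) is a 30-row window; it spans 30 days only if df has one row per day, sorted by date)
df['revenue_30d'] = df['revenue'].rolling(30).mean()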
fig.add_trace(go.Scatter(
    x=df['date'], y=df['revenue_30d'],
    mode='lines', name='30-Day MA',
    line=dict(color='orange', width=2, dash='dash')
))
fig.update_layout(
    title='Revenue Trend with Moving Average',
    xaxis_title='Date',
    yaxis_title='Revenue ($)',
    xaxis_rangeslider_visible=True,
    template='plotly_white',
    hovermode='x unified'
)
fig.write_html('interactive_chart.html')  # save for sharing
fig.show()

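# Animated scatter plot: one frame per month shows how each region moves over time
# (assumes the CSV has no 'month' column, so it is derived from 'date' here)
df['month'] = df['date'].dt.strftime('%Y-%m')
fig_animated = px.scatter(
    df.groupby(['month', 'region']).agg({'revenue': 'sum', 'units_sold': 'sum'}).reset_index(),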
    x='units_sold', y='revenue',
    color='region', size='revenue',
    animation_frame='month',
    title='Revenue vs Units by Region Over Time',
    template='plotly_white'
)
fig_animated.show()
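
The moving-average trace above uses rolling(30), which counts rows rather than days. When dates have gaps or repeats, a time-based window may be closer to what "30-day" implies. A minimal sketch on a three-row toy frame (the toy values are made up for illustration) contrasting the two window types:

```python
import pandas as pd

# Toy frame with a gap between the second and third dates (illustrative values).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-10"]),
    "revenue": [100.0, 200.0, 300.0],
})

# Row-based window: always averages the last 2 rows, regardless of the date gap.
row_window = toy["revenue"].rolling(2).mean()

# Time-based window: averages rows whose date falls within the trailing 2 days.
# Requires the 'date' column to be sorted in increasing order.
time_window = toy.rolling("2D", on="date")["revenue"].mean()

print(pd.DataFrame({"date": toy["date"], "rows_2": row_window, "days_2": time_window}))
```

On the last row the row-based window mixes January 2 with January 10, while the time-based window sees only January 10. The same pattern with "30D" gives a calendar-based 30-day average.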

Data Science: EDA Terminal

ecommerce_churn.csv

1000 rows × 4 columns

customer_id	tenure_months	total_spend	churn
usr_892	12	450.50	0
usr_104	2	45.00	1
usr_443	36	2100.00	0
usr_991	1	12.99	1
usr_202	24	1250.75	0
usr_331	48	3400.20	0
usr_705	3	110.00	1

Run df.describe() in Python to see statistical summaries.
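
A minimal sketch of that step, loading only the seven preview rows shown above (the full file is listed as 1000 rows, which are not reproduced here) and adding a per-class mean as one possible follow-up:

```python
import io
import pandas as pd

# The seven preview rows from ecommerce_churn.csv, tab-separated as displayed.
preview = io.StringIO(
    "customer_id\ttenure_months\ttotal_spend\tchurn\n"
    "usr_892\t12\t450.50\t0\n"
    "usr_104\t2\t45.00\t1\n"
    "usr_443\t36\t2100.00\t0\n"
    "usr_991\t1\t12.99\t1\n"
    "usr_202\t24\t1250.75\t0\n"
    "usr_331\t48\t3400.20\t0\n"
    "usr_705\t3\t110.00\t1\n"
)
df = pd.read_csv(preview, sep="\t")

# Statistical summary of the numeric columns.
summary = df.describe()
print(summary)

# Mean tenure and spend for retained (0) versus churned (1) customers.
by_churn = df.groupby("churn")[["tenure_months", "total_spend"]].mean()
print(by_churn)
```

In this preview the churned customers have much shorter tenure and lower spend than the retained ones, but seven rows are too few to generalize; the same calls would need to run on the full file.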

