Artificial Intelligence | VoidX Academy

12. Systems Design and MLOps

Module 12: Systems Design

From Notebook to Production

A model that only works in a Jupyter notebook is a prototype, not a product. Turning a trained model into a reliable, scalable, maintainable production system is a distinct engineering discipline called MLOps (Machine Learning Operations). Many AI projects fail not because the models are bad, but because the engineering around them is inadequate. This module covers the path from trained model to production deployment.

🏗️ End-to-End AI System Architecture

A production AI system consists of several interconnected components, each requiring careful engineering:

Data Pipeline: Ingestion from sources → validation → preprocessing → feature store. Automated, monitored, and versioned. Failures here silently corrupt model inputs.
Training Pipeline: Data loading → model training → evaluation → experiment tracking → model registry. Reproducible and automated.
Serving Infrastructure: Model server → API → load balancer → monitoring. Low latency, high availability, auto-scaling.
Monitoring and Observability: Input distribution monitoring, prediction monitoring, data drift detection, business metric tracking, alerting.
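The validation step in the data pipeline can be sketched as a range check applied to every record before it reaches the model. This is a minimal illustration, not a full validation framework: the `FeatureSpec` class, the `validate_record` function, and the `age` and `income` features are all hypothetical names chosen for the example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FeatureSpec:
    """Expected range for one numeric feature (hypothetical schema entry)."""
    name: str
    min_value: float
    max_value: float


def validate_record(record: Dict[str, object], schema: List[FeatureSpec]) -> List[str]:
    """Return a list of problems found in the record; an empty list means valid."""
    problems = []
    for spec in schema:
        value = record.get(spec.name)
        if value is None:
            problems.append(f"{spec.name}: missing")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{spec.name}: not numeric ({value!r})")
        elif math.isnan(value):
            problems.append(f"{spec.name}: NaN")
        elif not spec.min_value <= value <= spec.max_value:
            problems.append(
                f"{spec.name}: {value} outside [{spec.min_value}, {spec.max_value}]"
            )
    return problems


# Illustrative schema and records (assumed for this example).
SCHEMA = [FeatureSpec("age", 0, 120), FeatureSpec("income", 0, 1e7)]

good = {"age": 42, "income": 55000.0}
bad = {"age": -3, "income": float("nan")}

print(validate_record(good, SCHEMA))
print(validate_record(bad, SCHEMA))
```

Rejecting or quarantining records that fail such checks turns a silent corruption into a visible, countable event that monitoring can alert on.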

A minimal FastAPI service that serves a TorchScript model:

from typing import List

import torch
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

MODEL_VERSION = "1.0.0"

app = FastAPI(title="AI Inference API", version="1.0")

# Load the TorchScript model once at startup, not on every request.
model = torch.jit.load('model_scripted.pt')
model.eval()

class PredictionRequest(BaseModel):
    features: List[float]

class PredictionResponse(BaseModel):
    prediction: int
    confidence: float
    model_version: str

# Plain `def` rather than `async def`: FastAPI runs synchronous handlers in a
# thread pool, so the blocking forward pass does not stall the event loop.
@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    # Shape (1, n_features): a batch containing a single example.
    features = torch.tensor([request.features], dtype=torch.float32)

    with torch.no_grad():
        logits = model(features)
        probabilities = torch.softmax(logits, dim=1)
        # Highest class probability and its index for the single example.
        confidence, prediction = probabilities.max(dim=1)

    return PredictionResponse(
        prediction=int(prediction.item()),
        confidence=float(confidence.item()),
        model_version=MODEL_VERSION
    )

@app.get("/health")
def health_check():
    return {"status": "healthy", "model_loaded": model is not None}

if __name__ == "__main__":
    uvicorn.run(app)
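To make the post-processing inside the `/predict` handler concrete, here is the same softmax, argmax, and confidence computation written in plain Python for a single example. The logits are made-up values for illustration, and the `softmax` helper is a hypothetical stand-in for `torch.softmax`.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a 3-class model (illustrative only).
logits = [2.0, 1.0, 0.1]

probabilities = softmax(logits)
# argmax: index of the most probable class
prediction = max(range(len(probabilities)), key=probabilities.__getitem__)
# "confidence" as reported by the endpoint: the highest class probability
confidence = probabilities[prediction]

print(prediction, round(confidence, 3))
```

The reported confidence is simply the largest softmax probability. It is not guaranteed to be well calibrated, so treat it as a ranking signal unless calibration has been checked on held-out data.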

📦 Model Export and Optimization

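Three common export paths from a trained PyTorch model (TorchScript, ONNX, and dynamic quantization):

import io

import torch
import torch.nn as nn

# load_trained_model() and input_size are placeholders for your own
# model-loading code and input feature dimension.
model = load_trained_model()
model.eval()

# 1. TorchScript: a serialized model that loads without the Python class definition.
scripted = torch.jit.script(model)
torch.jit.save(scripted, 'model_scripted.pt')

# 2. ONNX: a framework-neutral format; the batch dimension is marked as dynamic.
dummy_input = torch.randn(1, input_size)
torch.onnx.export(
    model, dummy_input, 'model.onnx',
    input_names=['features'],
    output_names=['predictions'],
    dynamic_axes={'features': {0: 'batch_size'}, 'predictions': {0: 'batch_size'}}
)

# 3. Dynamic quantization: int8 weights for Linear layers.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def get_model_size(m):
    """Size in MB of the model's serialized state_dict."""
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

print(f"Original size: {get_model_size(model):.2f} MB")
print(f"Quantized size: {get_model_size(quantized):.2f} MB")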
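What the int8 conversion does to the numbers can be shown with a small pure-Python sketch. This is a simplified symmetric, per-tensor scheme for illustration only; it is not PyTorch's internal implementation, and `quantize_int8` and `dequantize` are hypothetical helper names.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [x * scale for x in q]


# Made-up weights for illustration.
weights = [0.42, -1.27, 0.05, 0.9]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight is stored as one byte plus a shared scale instead of four bytes, which is where the roughly 4x size reduction comes from. The round-trip error per value is bounded by half the scale, which is why accuracy usually drops only slightly.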

Optimization techniques for production:

Quantization: Reduce model precision from float32 to int8. Quantized weights take about a quarter of the space, and inference can be 2–4x faster on hardware with int8 support, usually with a small accuracy loss that should be measured on a validation set. Critical for edge deployment and cost reduction in cloud serving.
ONNX Runtime: Export to ONNX format and serve with ONNX Runtime, which applies graph and hardware-specific optimizations (operator fusion, memory planning) automatically.
TensorRT (NVIDIA): NVIDIA's inference optimizer. Applies layer fusion, precision calibration, and kernel auto-tuning for maximum GPU throughput. Speedups of 2–5x over baseline PyTorch inference on NVIDIA hardware are typical, depending on the model and precision used.
Batching: Process multiple requests together in a single forward pass. Dramatically improves GPU utilization. Dynamic batching (waiting up to N ms for additional requests to fill a batch) raises throughput at the cost of a bounded increase in per-request latency.
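The dynamic batching idea from the list above can be sketched with the standard library alone. This is a simplified single-consumer version, assuming requests arrive on a `queue.Queue`; the `collect_batch` function and its parameters are hypothetical names, and production model servers implement the same idea with more care.

```python
import queue
import time


def collect_batch(request_queue, max_batch_size, max_wait_s):
    """Block for the first request, then wait up to max_wait_s for more."""
    batch = [request_queue.get()]  # blocks until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough; run with a partial batch
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived before the deadline
    return batch


# Simulate five requests already waiting in the queue.
requests = queue.Queue()
for i in range(5):
    requests.put({"id": i, "features": [0.1 * i, 0.2 * i]})

first = collect_batch(requests, max_batch_size=4, max_wait_s=0.01)
second = collect_batch(requests, max_batch_size=4, max_wait_s=0.01)

print(len(first), len(second))
```

The first call returns as soon as the batch is full; the second waits out the 10 ms window and runs with the single leftover request. The two parameters express the trade-off directly: a larger `max_wait_s` fills batches more often but adds up to that much latency to each request.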

☁️ Cloud AI Services

AWS SageMaker: End-to-end ML platform. Managed Jupyter environments, distributed training, managed model deployment with auto-scaling, A/B testing between model versions, and built-in monitoring. Among the most feature-complete managed ML platforms.
Google Vertex AI: Google's unified ML platform. Deep integration with BigQuery for data, strong support for the TensorFlow ecosystem, AutoML for low-code model training, and managed prediction endpoints for serving.
Azure ML: Microsoft's ML platform. Deep integration with Azure services, strong enterprise features, and good support for Responsible AI tooling.
Hugging Face Inference Endpoints: Deploy models from the Hugging Face Hub to a managed endpoint in a few clicks, with no infrastructure to manage. Well suited to NLP models and LLM-based APIs.

📊 Monitoring and Observability

A deployed model is not a static artifact—its performance degrades as the world changes. Robust monitoring is what separates reliable AI systems from ones that silently fail:

Data Drift Detection: Monitor the distribution of input features over time. If the input distribution drifts significantly from the training distribution, the model's learned patterns may no longer apply. Use a statistical test such as the Kolmogorov-Smirnov (KS) test, or a summary score such as the Population Stability Index (PSI), to detect drift automatically.
Prediction Distribution Monitoring: Track the distribution of model outputs. A sudden shift in prediction distribution often signals a data pipeline problem or environmental change.
Business Metric Tracking: The ultimate measure of model performance in production. Conversion rate for recommendation systems, default rate for credit models, detection rate for fraud models. Correlate business metrics with model metrics to detect degradation.
Latency and Throughput: Track p50, p95, p99 latency percentiles. Alert on increases. Track requests per second and error rates. Set SLOs (Service Level Objectives) and page on-call when violated.
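As an illustration of the drift detection item above, the Population Stability Index can be computed in a few lines. This is a minimal sketch assuming equal-width bins derived from the training data; the function name, the bin count, and the `eps` floor for empty bins are choices made for this example.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a live sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all reference values identical; use a single effective bin

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = max(0, min(n_bins - 1, idx))  # clamp out-of-range values
            counts[idx] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic data for illustration: a training sample and a shifted live sample.
training = [i / 100 for i in range(1000)]
live_same = list(training)
live_shifted = [x + 3 for x in training]

psi_same = population_stability_index(training, live_same)
psi_shifted = population_stability_index(training, live_shifted)

print(f"PSI (no drift): {psi_same:.4f}")
print(f"PSI (shifted):  {psi_shifted:.4f}")
```

A commonly cited rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as moderate drift, and above 0.25 as significant drift. These thresholds are conventions rather than guarantees, so tune alerting against your own data.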
