Data Science | VoidX Academy

Build With Me: PySpark Big Data

Build With Me 05

Taming the Data Lake

You have been tasked with parsing server logs. The problem? Logs arrive at 500 MB every second. Pandas loads data into the memory of a single machine, so at that rate it will exhaust your laptop's RAM within seconds. We must use PySpark to distribute this workload across a cluster of worker nodes. Let's build the pipeline.

Step 1: Cluster Config & Extraction

We start by telling Spark how much computing power we want, and reading from our Data Lake.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan

# Configure our computing cluster
WORKER_NODES = 4
BATCH_SIZE = 500  # MB per second incoming

spark = SparkSession.builder.appName("Log_ETL").getOrCreate()

# Read raw JSON logs from S3
df = spark.read.json("s3://voidx-data-lake/raw_events/*.json")
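In the interactive exercise the simulator reads WORKER_NODES directly. On a real cluster the constant would have to reach Spark as configuration. A minimal sketch of one way to do that, assuming your cluster manager honors the standard `spark.executor.instances`, `spark.executor.cores` and `spark.executor.memory` properties; the helper names `executor_conf` and `build_session` and the core and memory values are hypothetical, not part of the lesson:

```python
def executor_conf(worker_nodes, cores_per_executor=2, memory_per_executor="4g"):
    """Translate the tutorial's WORKER_NODES constant into Spark properties.

    The core and memory defaults are illustrative assumptions, not tuned values.
    """
    return {
        "spark.executor.instances": str(worker_nodes),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": memory_per_executor,
    }


def build_session(app_name, conf):
    """Create (or reuse) a SparkSession with the given properties applied."""
    # Imported lazily so executor_conf can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


WORKER_NODES = 4
conf = executor_conf(WORKER_NODES)
print(conf)
# spark = build_session("Log_ETL", conf)  # run this on a machine with PySpark
```

Executor settings are read when the application starts, so they need to be set before the first session is created; `getOrCreate()` on an already running session will not resize it.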

Step 2: The Bottleneck

Let's add our transformation layer. We need to filter out corrupted logs where the user_id is missing.

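# Filter out bad rows: a field missing from the JSON is read as null, not NaN,
# so we test for null rather than using isnan()
clean_df = df.filter(col("user_id").isNotNull())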

# Save the clean data to our Warehouse
clean_df.write.format("parquet").save("s3://warehouse/clean/")
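When a JSON record lacks a field, the parsed value is null rather than the floating-point NaN, so "user_id is missing" is a null check. The same distinction shows up in plain Python, where a missing key yields None. A small illustration with made-up records (the `path` field and the `has_user_id` helper are hypothetical); in Spark the analogous null test is `col("user_id").isNotNull()`:

```python
import json
import math

good = json.loads('{"user_id": 42, "path": "/home"}')
bad = json.loads('{"path": "/login"}')  # user_id field is absent


def has_user_id(record):
    """Keep a record only if user_id is present and is not a float NaN."""
    value = record.get("user_id")
    if value is None:
        # Missing field: this is the case a NaN check alone never catches.
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return True


records = [good, bad, {"user_id": float("nan")}]
kept = [r for r in records if has_user_id(r)]
print(kept)
```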

Your Turn: Hit Submit Job. Look at the Spark Cluster node in the center. With only 4 worker nodes, the cluster cannot keep up with 500 MB/s of incoming logs. The box turns red, and the backlog grows every second. Your pipeline is failing!
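The growing backlog is simple arithmetic: whatever arrives faster than the cluster can process piles up. A rough sketch, assuming constant rates, perfectly linear scaling, and a per-node throughput of 50 MB/s. That per-node figure is a made-up number for illustration; the lesson does not state one.

```python
def backlog_mb(seconds, ingest_mb_s, nodes, per_node_mb_s):
    """Backlog in MB after `seconds`, assuming constant rates and linear scaling."""
    capacity = nodes * per_node_mb_s           # combined cluster throughput
    growth = max(0, ingest_mb_s - capacity)    # MB left unprocessed each second
    return growth * seconds


BATCH_SIZE = 500      # MB per second incoming (from the tutorial)
WORKER_NODES = 4      # from the tutorial
PER_NODE_MB_S = 50    # ASSUMPTION: hypothetical per-node throughput

for t in (10, 60, 600):
    gb = backlog_mb(t, BATCH_SIZE, WORKER_NODES, PER_NODE_MB_S) / 1000
    print(f"after {t:>3} s: {gb:.1f} GB waiting")
```

Under these assumptions the cluster handles 200 MB/s, so 300 MB is added to the backlog every second and the queue never drains.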

Step 3: Scaling the Cluster

This is the magic of distributed computing. Spark splits the input into partitions and processes them in parallel, so we don't need to rewrite our code to make it faster; we just add more worker nodes.

Your Turn: Click Stop Job. In your code editor, change WORKER_NODES = 4 to WORKER_NODES = 12. Hit Submit Job again.

Watch the visualizer! With 12 nodes, the cluster keeps pace with the incoming data. The backlog stays at 0.0 GB, and the clean data flows steadily into the Data Warehouse. You just scaled a Big Data pipeline!
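How many nodes are enough depends on how much each node can process. A back-of-the-envelope sizing sketch, again assuming linear scaling and a hypothetical 50 MB/s per node (an illustrative figure, not one given in the lesson):

```python
import math


def min_nodes(ingest_mb_s, per_node_mb_s):
    """Smallest node count whose combined throughput meets the ingest rate."""
    return math.ceil(ingest_mb_s / per_node_mb_s)


INGEST_MB_S = 500     # from the tutorial
PER_NODE_MB_S = 50    # ASSUMPTION: hypothetical per-node throughput
WORKER_NODES = 12     # the value chosen in Step 3

needed = min_nodes(INGEST_MB_S, PER_NODE_MB_S)
spare_mb_s = WORKER_NODES * PER_NODE_MB_S - INGEST_MB_S
print(f"minimum nodes: {needed}, spare capacity with {WORKER_NODES} nodes: {spare_mb_s} MB/s")
```

Real clusters rarely scale perfectly linearly, so leaving some spare capacity above the computed minimum is a common precaution.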

[Interactive visualizer, Data Science: Big Data Pipeline. Panels: S3 Data Lake (raw unstructured JSON, read volume) -> Spark Cluster (active executor nodes, processing backlog, filtered-out bad rows) -> Data Warehouse (structured Parquet, clean data written). Editor: pyspark_job.py, PySpark 3.5.]

