Spark

Introduction

Spark is a distributed data processing framework for big data. It is a powerful processing tool used as an ultra-fast in-memory analytics engine. Spark was first created at the University of California, Berkeley in 2009 and became open source in 2010 (Databricks 2022).

Benefits of Spark

There are some key benefits of using Spark, including speed, multi-language support, and advanced analytics.

Spark's speed is a significant benefit because it can dramatically decrease processing times when handling big data. By minimizing the number of read/write operations, Spark runs up to 100 times faster than Hadoop in memory and 10 times faster on disk storage (TutorialsPoint 2021). Batch processing further improves Spark's speed.

Spark also offers multi-language support. Apache Spark accepts applications written in Java, Scala, Python, R, and SQL, and it provides over 80 high-level operators for robust querying.

Spark's advanced analytics capabilities make it an ideal platform for large-scale analytics. Spark supports Machine Learning (ML) algorithms, graph processing, SQL queries, and data streaming. This allows Spark to handle even the most resource-intensive workloads.
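
As a brief illustration of those high-level operators, the sketch below runs the same query once through Spark SQL and once through DataFrame operators, in local mode so no cluster is needed; the view name and data are made up for the example.

from pyspark.sql import SparkSession

# Local-mode session: runs entirely in this process, no cluster needed.
spark = (
    SparkSession.builder
    .appName("sql-demo")
    .master("local[*]")
    .getOrCreate()
)

# A tiny illustrative DataFrame, registered as a temporary SQL view.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")

# The same query, expressed in SQL and with DataFrame operators.
spark.sql("SELECT name FROM people WHERE age > 30").show()
df.filter(df.age > 30).select("name").show()

spark.stop()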

Run Spark in a Container

It's probably best to deploy Spark in a container, even for development and testing purposes. A good standard for starting multi-container systems is a container compose file, such as a Docker Compose file.

Below is an example docker-compose file that can serve as a good starting point. Note that the Spark 3.2 image is a bit out of date, but during the PCDE course it was the version that worked with the PySpark shell.

There's also the Big Data Europe Spark GitHub repository.

# Copyright VMware, Inc.
# SPDX-License-Identifier: APACHE-2.0

version: '2'

services:
  spark:
    image: docker.io/bitnami/spark:3.2
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '8081:8080'
  spark-worker:
    image: docker.io/bitnami/spark:3.2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
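
Assuming the file above is saved as docker-compose.yml (the filename and worker count below are illustrative), the stack can be started, scaled, and inspected with standard Compose commands:

docker compose up -d                          # start the master and one worker
docker compose up -d --scale spark-worker=3   # scale out to three workers
docker compose logs -f spark                  # follow the master's logs

The master's web UI is then available at http://localhost:8081, mapped from container port 8080 as configured above.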

Spark in Python

Note: For more detailed information, read PySpark (Python Spark Library).

With Spark deployed in a container, it's time to start programming Spark sessions. For data engineers and scientists, one of the most relevant ways to do this is with Spark's PySpark module.

PySpark is a commonly used library through which Python applications can be run on Spark. Spark takes the operations expressed in Python and executes them using the most efficient plan it can find. In the sample Python application you will create, the goal is to use Spark when applications grow larger and require more resources.
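
Below is a minimal sketch of a PySpark session pointed at the compose stack above. The master URL assumes the master's RPC port 7077 has been published to the host (the compose file above only publishes the web UI port); if it is not, .master("local[*]") runs the same code without the cluster.

from pyspark.sql import SparkSession

# Connect to the standalone master from the compose file above.
# Assumes port 7077 is reachable from wherever this script runs
# (e.g., by adding '7077:7077' under the master's ports); use
# .master("local[*]") instead to run without the cluster.
spark = (
    SparkSession.builder
    .appName("pyspark-demo")
    .master("spark://localhost:7077")
    .getOrCreate()
)

# Distribute a small dataset and run a trivial computation on the workers.
rdd = spark.sparkContext.parallelize(range(1, 1001))
print(rdd.map(lambda x: x * x).sum())

spark.stop()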

Web Links

Note Links