Run Spark locally with Docker
When we work with Spark, we usually want to prototype locally first to see if everything works as expected, before we spin up big machines.
I spent an afternoon googling and starting and stopping the Docker container until the few lines of configuration finally worked.
So I want to share my basic local setup here; maybe it will save someone some time.
When looking for a Docker image with Spark and Jupyter, we find the pyspark-notebook image.
In my case I need to access AWS, so the image needs some additional libraries.
To add them, I created a new Dockerfile based on pyspark-notebook.
The additional libraries needed are boto3 for AWS and python-dotenv to access environment variables.
I decided to install boto3 with apt-get, as this installs it at the operating-system level. Make sure to add -y: if the install process asks a question, it is answered with yes automatically.
python-dotenv is added via a requirements.txt, so it is installed with pip, the Python package manager.
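For reference, the requirements.txt used here only needs a single line (add a version pin if you want reproducible builds):

python-dotenv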
Normally the notebook requires a token, but when developing locally we want to open JupyterLab quickly and stay on the same site, without having to look up a new token every time we change something.
So we need a custom configuration for that:
{ "NotebookApp": { "allow_root": true, "token": "" }}
In the Dockerfile we copy everything we need into the /home/jovyan/ directory. After some more googling I found out that jovyan, the default user in the Jupyter images, is a play on "Jovian", as in an inhabitant of Jupiter. Just in case you were also wondering.
The final Dockerfile looks like this:
FROM jupyter/pyspark-notebook

USER root

# add needed packages
RUN apt-get update && apt-get install python3-boto3 -y

# Install Python requirements
COPY requirements.txt /home/jovyan/
RUN pip install -r /home/jovyan/requirements.txt

COPY jupyter_lab_config.json /home/jovyan/

# Run the notebook (the config path matches the COPY destination above)
CMD ["/opt/conda/bin/jupyter", "lab", "--config", "/home/jovyan/jupyter_lab_config.json"]
In the docker-compose.yaml we need to:

- map the ports,
- map the volumes, so the notebooks are saved locally (otherwise everything would be lost once we shut down the container),
- tell Docker where the .env file is located,
- tell Docker to build the Dockerfile in the same folder, instead of using an image.
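Since the compose file only points at the .env file, here is a minimal sketch of what it might contain. The values are hypothetical placeholders; the variable names are the standard ones boto3 reads, and real credentials should of course never be committed:

AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key
AWS_DEFAULT_REGION=eu-central-1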
The final docker-compose.yaml looks like this:

version: "3.7"
services:
  # jupyterlab with pyspark
  pyspark:
    #image: jupyter/pyspark-notebook
    build: .
    env_file:
      - .env
    environment:
      JUPYTER_ENABLE_LAB: "yes"
    ports:
      - "8888:8888"
    volumes:
      - ./data:/home/jovyan/work

# docker run --rm -p 10000:8888 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook
To start the container use docker-compose up. If you changed something in the config, use docker-compose up --force-recreate --build to make sure the changes are built.
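Once JupyterLab is reachable at http://localhost:8888, a quick smoke test in a new notebook could look like the sketch below. This is a minimal example under some assumptions: the .env defines the standard AWS credential variables shown earlier, and the bucket listing is just a connectivity check.

import boto3
from dotenv import load_dotenv
from pyspark.sql import SparkSession

# No-op if there is no .env in the working directory; docker-compose
# already injects the values from env_file into the environment.
load_dotenv()

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY automatically
s3 = boto3.client("s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])  # can we talk to AWS?

# tiny Spark check
spark = SparkSession.builder.appName("local-prototype").getOrCreate()
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]).show()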
Have fun.
You can also find the code here.
Original Link: https://dev.to/barbara/run-spark-locally-with-docker-4com