
Run Spark locally with Docker

When we work with Spark, we usually want to prototype locally first to see if everything works as expected before we spin up big machines.
I spent an afternoon googling and starting and stopping the Docker container until I finally had a few lines of configuration working.
So I want to share my basic local setup here; maybe it will save someone some time.

When looking for a Docker image with Spark and Jupyter, we find the pyspark-notebook image.
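
To get a first feel for the stock image, it can also be started directly with docker run. This is essentially the upstream quick-start one-liner (it is also kept as a comment in the docker-compose.yaml further down); the notebook is then reachable on port 10000, with the access token printed in the container logs:

docker run --rm -p 10000:8888 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook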

In my case I need to access AWS, so I need some additional libraries in the Docker image.
To add them, I created a new Dockerfile based on the pyspark-notebook image.
The additional libraries needed are boto3 for AWS and python-dotenv to access environment variables.
I decided to install boto3 with apt-get, as this installs it on the operating system level. Make sure to add -y so that anything the package manager asks during the install process is automatically answered with yes.
python-dotenv is added via a requirements.txt, so it is installed with pip, the Python package manager.
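
A minimal sketch of that requirements.txt could look like this, assuming only python-dotenv goes in there (boto3 is already covered by the apt-get install above):

python-dotenv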

Normally the notebook server requires a token, but when we develop locally we want to access the Jupyter notebook quickly and stay on the same site, without having to look up the new token every time we change something.
So we need a custom configuration for that:

{
    "NotebookApp": {
        "allow_root": true,
        "token": ""
    }
}

In the Dockerfile we copy everything we need into the /home/jovyan/ directory. After some more googling I found out that the user jovyan stands for a Jupyter-like environment. Just in case you were also wondering.

The final Dockerfile looks like this:

FROM jupyter/pyspark-notebook

USER root

# add needed packages
RUN apt-get update && apt-get install python3-boto3 -y

# Install Python requirements
COPY requirements.txt /home/jovyan/
RUN pip install -r /home/jovyan/requirements.txt

COPY jupyter_lab_config.json /home/jovyan/

# Run the notebook
CMD ["/opt/conda/bin/jupyter", "lab", "--config", "/home/jovyan/jupyter_lab_config.json"]
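
If you want to check that the image builds before wiring it into docker-compose, you can build and run it directly; the tag name here is just an example:

docker build -t pyspark-local .
docker run --rm -p 8888:8888 pyspark-local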

In the docker-compose.yaml we

  • need to map the ports,
  • map the volumes to save the notebooks locally (otherwise everything would be lost once we shut down the container),
  • tell Docker where the .env file is located,
  • tell Docker to build the Dockerfile in the same folder, instead of using an image.

The final docker-compose.yaml looks like this:

version: "3.7"
services:
  # jupyterlab with pyspark
  pyspark:
    #image: jupyter/pyspark-notebook
    build: .
    env_file:
      - .env
    environment:
      JUPYTER_ENABLE_LAB: "yes"
    ports:
      - "8888:8888"
    volumes:
      - ./data:/home/jovyan/work

# docker run --rm -p 10000:8888 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook
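
The .env file referenced under env_file has to exist next to the docker-compose.yaml. Its exact contents depend on your AWS setup; a minimal sketch with placeholder values could look like this:

AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key
AWS_DEFAULT_REGION=eu-central-1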

To start the container use docker-compose up. If you changed something in the config, use docker-compose up --force-recreate --build to make sure the changes are built.
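
Once the container is up, the notebook is reachable at http://localhost:8888 without a token. As a quick smoke test inside a notebook, something along these lines should work; it is only a sketch and assumes the AWS credentials arrive through the .env file / container environment:

# minimal first notebook cell: local Spark session plus an AWS call via boto3
import os

import boto3
from dotenv import load_dotenv
from pyspark.sql import SparkSession

# load_dotenv is a no-op if no .env file is found; with docker-compose the
# variables are already injected via env_file
load_dotenv()

# local Spark session for prototyping
spark = SparkSession.builder.master("local[*]").appName("local-prototype").getOrCreate()
print(spark.range(10).count())

# boto3 picks the credentials up from the environment variables
s3 = boto3.client("s3", region_name=os.getenv("AWS_DEFAULT_REGION", "eu-central-1"))
print([b["Name"] for b in s3.list_buckets()["Buckets"]])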

Have fun.

You can also find the code here.

