Run Spark locally with Docker
When we work with Spark, we usually want to prototype locally first to see if everything works as expected, before we spin up big machines.
I spent an afternoon googling and starting and stopping the Docker container until the few lines of configuration finally worked.
So I want to share my basic local setup here; maybe it will save someone some time.
When looking for a Docker image with Spark and Jupyter, we find the pyspark-notebook image.
In my case I need to access AWS, so the image needs some additional libraries.
To add them, I created a new Dockerfile based on pyspark-notebook.
The additional libraries needed are boto3 for AWS and python-dotenv to access environment variables.
I decided to install boto3 with apt-get, as this installs it at the operating-system level. Make sure to add -y: if the install process asks a question, it is answered with yes automatically.
python-dotenv is added via a requirements.txt, so it is installed with pip, the Python package manager.
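For reference, the requirements.txt used here only needs a single line (add a version pin if you want reproducible builds):

python-dotenv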
Normally the notebook requires a token, but when developing locally we want to open JupyterLab quickly and stay on the same site, without having to look up a new token every time we change something.
So we need a custom configuration for that:
{ "NotebookApp": { "allow_root": true, "token": "" }}
In the Dockerfile we copy everything we need into the /home/jovyan/ directory. After some more googling I found out that jovyan, the default user in the Jupyter images, is a play on "Jovian", as in an inhabitant of Jupiter. Just in case you were also wondering.
The final Dockerfile looks like this:
FROM jupyter/pyspark-notebook

USER root

# add needed packages
RUN apt-get update && apt-get install python3-boto3 -y

# Install Python requirements
COPY requirements.txt /home/jovyan/
RUN pip install -r /home/jovyan/requirements.txt

COPY jupyter_lab_config.json /home/jovyan/

# Run the notebook (the config path matches the COPY destination above)
CMD ["/opt/conda/bin/jupyter", "lab", "--config", "/home/jovyan/jupyter_lab_config.json"]
In the docker-compose.yaml we need to:

- map the ports,
- map the volumes, so the notebooks are saved locally (otherwise everything would be lost once we shut down the container),
- tell Docker where the .env file is located,
- tell Docker to build the Dockerfile in the same folder, instead of using an image.
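Since the compose file only points at the .env file, here is a minimal sketch of what it might contain. The values are hypothetical placeholders; the variable names are the standard ones boto3 reads, and real credentials should of course never be committed:

AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key
AWS_DEFAULT_REGION=eu-central-1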
The final docker-compose.yaml looks like this:

version: "3.7"
services:
  # jupyterlab with pyspark
  pyspark:
    #image: jupyter/pyspark-notebook
    build: .
    env_file:
      - .env
    environment:
      JUPYTER_ENABLE_LAB: "yes"
    ports:
      - "8888:8888"
    volumes:
      - ./data:/home/jovyan/work

# docker run --rm -p 10000:8888 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook
To start the container use docker-compose up. If you changed something in the config, use docker-compose up --force-recreate --build to make sure the changes are built.
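Once JupyterLab is reachable at http://localhost:8888, a quick smoke test in a new notebook could look like the sketch below. This is a minimal example under some assumptions: the .env defines the standard AWS credential variables shown earlier, and the bucket listing is just a connectivity check.

import boto3
from dotenv import load_dotenv
from pyspark.sql import SparkSession

# No-op if there is no .env in the working directory; docker-compose
# already injects the values from env_file into the environment.
load_dotenv()

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY automatically
s3 = boto3.client("s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])  # can we talk to AWS?

# tiny Spark check
spark = SparkSession.builder.appName("local-prototype").getOrCreate()
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]).show()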
Have fun.
You can also find the code here.
Original Link: https://dev.to/barbara/run-spark-locally-with-docker-4com