Developing in Dagster

tl;dr: Use Poetry, Docker, and sensible folder structures to create a streamlined dev experience for building dagster pipelines. This technical blog post dives into how this was accomplished. This post is about environment management more than it is about writing actual dagster ops, jobs, etc. The goal is to make your life easier while you do those things :)

The associated code repo can be found here

Fixing containerized code in (2x) real-time

I've been exploring dagster for some of Mile Two's data orchestration needs and have been absolutely loving it. It hits all of the sweet spots for gradually developing data pipelines, but I found myself in a familiar situation: trying to logically structure my code such that it can easily be containerized and thrown into a CI/CD process. To that end, I've open-sourced a boilerplate project that enhances the dagster development experience with these valuable features:

  • Uses one multi-stage Dockerfile for development & deployment which can easily integrate with CI/CD processes
  • Containerized environment picks up code changes immediately (just hit Reload in dagit); *no more waiting for containers to spin down and up!*
  • Uses poetry for virtual environment creation and tractable package management
  • Dependencies are specified according to PEP 518 using pyproject.toml instead of setup.py, which means no more hideous pip freeze > requirements.txt

Below, I start with a brief comparison of dagster new-project and my project structure. Then, I walk through some features & configuration of poetry. Finally, I dive into the multi-stage Dockerfile and how it bridges the gap from development to deployment.

Improvements to New Projects

dagster comes with the ability to create template projects. Even though it's currently marked experimental, it's an excellent starting point for the project structure.

$ dagster new-project fresh-user-code
ExperimentalWarning: "new_project_command" is an experimental function.
Creating a new Dagster repository in fresh-user-code...
Done.

And the resulting project structure

The resulting project structure

Overall, it's lovely! Code is organized into appropriate submodules and has auto-generated environment setup instructions (as long as you're using conda or virtualenv). It even configures user code as an editable package and creates setup.py for packaging.

Let's compare it against the enhanced project structure (differences highlighted on the left).

Our enhanced dagster user code boilerplate. The photo above contains the entire setup process! :)


Change #1: pyproject.toml and the generated poetry.lock replace setup.py
Change #2: .venv contains our virtual environment, including the installed dependencies (exists only after running poetry)
Change #3: Notice the nested folder! This allows poetry to auto-resolve & package our code. Also, this project doesn't have subdirectories for job, op, etc. for demonstration purposes, but they could easily be added
Change #4: Docker-related files
Change #5: I like to use a convention where each job has a corresponding default YAML configuration named job_name.yaml so it can easily be loaded programmatically (see the sketch below); each of those configs lives in this directory
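As an aside, here's roughly what that programmatic loading might look like. This is a minimal sketch rather than code from the repo; the load_default_config helper, the example op/job, and the job_configs path are assumptions based on the convention described in Change #5.

from pathlib import Path

import yaml
from dagster import job, op

@op
def say_hello(context):
    context.log.info("hello from the example job")

@job
def example_job():
    say_hello()

def load_default_config(job_name, config_dir="job_configs"):
    # Convention: the default run config for <job_name> lives at <config_dir>/<job_name>.yaml
    config_path = Path(config_dir) / f"{job_name}.yaml"
    with config_path.open() as f:
        return yaml.safe_load(f) or {}

if __name__ == "__main__":
    # e.g. job_configs/example_job.yaml would hold this job's default run_config
    example_job.execute_in_process(run_config=load_default_config(example_job.name))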

The first three changes are poetry- and PEP 517/518-related and are discussed in the next section. In the section after that, I'll dive into the contents of Dockerfile and docker-compose and how they support both local development and deployment.

Managing Via Poetry

Poetry is a great choice when working exclusively in a python ecosystem because it allows us to distinguish between

  • specified dependencies: packages we explicitly include in pyproject.toml
  • resolved dependencies: any package in poetry.lock

If we were using conda's environment.yml or a more traditional requirements.txt, the specified dependencies would not be tracked and so we lose the context of which packages are desired. When managing packages later in a project's lifecycle, it's helpful to understand which packages are intended to be included and which ones can be pruned.

you vs the package manager they told you not to worry about


To understand why the ability to track specified dependencies is important, imagine you have been asked to remove dagster and dagit from the project (for some silly reason). With poetry, you remove both packages from the dependencies sections of pyproject.toml and run poetry update. In pip, you would do pip uninstall dagster dagit, but that doesn't clean up any of their dependencies. Over time, the requirements.txt grows with more and more unnecessary packages until the painful day you decide to sift through the codebase in search of "Which packages am I actually importing?" The following video demonstrates just how easy this cleanup can be when using poetry:

When removing dagster, poetry removes *59 packages* for us that are no longer needed. If we were using pip, those 59 packages would still be cluttering up our environment and our requirements.txt

Major sections of pyproject.toml:

Below, I break down the sections of pyproject.toml and what each one does. For even more detail, take a look at the poetry pyproject documentation

# Section 1
[tool.poetry]
name = "dagster-example-pipeline"
version = "1.0.0"
description = ""
authors = ["Alex Service <[email protected]>"]

The first section defines our python package. A couple of notable things happen automatically here:

  • When packaging our source code, poetry will automatically search src for a subdirectory with a matching name. This behavior can be overridden if desired
    • Note: pyproject.toml expects hyphens for the name, but the directory itself should use underscores, e.g. src/dagster_example_pipeline
  • poetry respects semantic versioning. If you wish to bump the version number, you can manually change it, or use the poetry version command
    • e.g. poetry version minor would change the version to 1.1.0
# Section 2
[tool.poetry.dependencies]
python = "~3.9"
pandas = "^1.3.2"
google-cloud-storage = "^1.42"
dagster = "0.13.19"
dagster-gcp = "0.13.19"

The second section is where we include our specified dependencies. These are the packages we want at all times, both in production and during development. This section should only include the names of packages you explicitly want to define. Do not fill this with the output of pip freeze! poetry will resolve each package's dependencies for us.

# Section 3
[tool.poetry.dev-dependencies]
dagit = "0.13.19"
debugpy = "^1.4.1"
# jupyterlab = "^3.2.2"

The third section specifies our dev-dependencies, which are packages we only want to install during development. dagit is a good example: we already have an existing dagit deployment, but I want to be able to test in the UI locally. It doesn't need to be deployed with my user code, so it can be included as a dev-dependency. For my workflow, I often include a few types of dev-dependencies:

  • Packages for Exploratory Data Analysis, e.g. jupyterlab, matplotlib
  • Debugging packages. As a VSCode user, I find debugpy to be very helpful (see the sketch after this list)
  • New packages I'm trialing to see if they solve my problems; if they do, I'll promote them to become a regular dependency by moving them out of the dev-dependencies
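To give a concrete picture of that debugging workflow, here's a small sketch of how debugpy could be wired into local development. This is my own illustration, not something from the repo; the port number and the ENABLE_DEBUGPY environment variable are assumptions.

import os

# Only pull in the dev-dependency when explicitly requested,
# so production code paths never touch it
if os.getenv("ENABLE_DEBUGPY") == "1":
    import debugpy

    # Listen on all interfaces so VSCode can attach from outside the container
    debugpy.listen(("0.0.0.0", 5678))
    print("debugpy: waiting for a debugger to attach on port 5678...")
    debugpy.wait_for_client()

With something like that in place, a standard VSCode remote-attach launch configuration pointed at port 5678 can hit breakpoints in op code.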
# Section 4
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

The final section configures the python build system to use poetry instead of setuptools in accordance with PEP 517

poetry install

TIP: Before running the following commands, if you configure poetry to create the virtualenv inside of the project (via poetry config virtualenvs.in-project true), then VSCode will automatically recognize the new environment and ask you to select it as your environment :)

The command poetry install does a few things

  1. Creates a lock file and resolves the dependency tree (i.e. it resolves all sub-dependencies for our specified dependencies), marking each package as either main or dev
  2. Downloads & caches all of the dependencies and sub-dependencies from the previous step
  3. Adds our code as an editable package to the environment
$ poetry install
Updating dependencies
Resolving dependencies... (9.5s)
Writing lock file
Package operations: 124 installs, 0 updates, 0 removals
  Installing protobuf (3.19.4)
  Installing pyasn1 (0.4.8)
  # ... omitted output
  Installing pytest (6.2.5)
Installing the current project: dagster-example-pipeline (1.0.0)

Activate the Environment

To actually use all of these packages, it's very simple:

$ poetry shell
Spawning shell within /path/to/.venv
. /path/to/.venv/bin/activate
(.venv) bash-3.2$

Run Dagster Daemon and Dagit (without a container)

We'll explore containerization in a moment, but first let's demonstrate that the environment is properly set up:

(.venv) bash-3.2$ dagit
Using temporary directory /path/to/dagster-example-pipeline/tmp7wdyoxas for storage. This will be removed when dagit exits.
To persist information across sessions, set the environment variable DAGSTER_HOME to a directory to use.
2022-02-15 16:07:08 -0500 - dagit - INFO - Serving dagit on http://127.0.0.1:3000 in process 14650

Navigate to http://localhost:3000 and try running the job, which simply grabs the top 5 items from Hacker News :)

Job result
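For a sense of what that job involves, here's an approximate sketch. It is not the exact code from the repo; the op and job names, and the use of the public Hacker News Firebase API, are my assumptions.

import requests
from dagster import job, op

HN_API = "https://hacker-news.firebaseio.com/v0"

@op
def get_top_story_ids(context):
    # Fetch the IDs of the current top stories and keep the first 5
    ids = requests.get(f"{HN_API}/topstories.json").json()[:5]
    context.log.info(f"Top story ids: {ids}")
    return ids

@op
def log_story_titles(context, story_ids):
    # Look up each story and log its title
    for story_id in story_ids:
        story = requests.get(f"{HN_API}/item/{story_id}.json").json()
        context.log.info(story["title"])

@job
def hacker_news_top_five():
    log_story_titles(get_top_story_ids())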

The Problem With Containerizing Dagster

A major selling point of containerization is how it blurs the lines between "works on my machine" and deploying to production. The fundamental problem is this: there is a tradeoff between support for hot-loading code changes and support for CI/CD build processes. This problem isn't dagster-specific; it exists almost everywhere when trying to containerize a dev environment.

In more detail, this problem might sound familiar:

  • I want my python code to be editable, so that code changes are loaded immediately and I have a faster development loop. So, I will mount my project inside of a docker container with a configured python environment
  • My CI/CD build process expects a container with my project copied inside of it. I could use this container for local development, but will have to rebuild and rerun the container with each code change

It sounds like we have to either write multiple dockerfiles, or we have to give up the ability to hot-load our code*

*To be fair, this is a false dichotomy. Other approaches, such as VSCode devcontainers, do exist, but in my experience they don't quite scratch the itch.

The Solution: Multi-Stage Dockerfile

Using poetry and docker, we can use a multi-stage Dockerfile to support both needs and speed up the development of dagster user-code environments! Here's how:

  1. Create a Dockerfile with 3 stages: dev, build, and deploy
    1. dev installs all of the necessary dependencies using poetry and runs dagit when targeted; it only expects code to be volume-mounted if the dev stage is targeted
    2. build uninstalls dev dependencies, copies our project into the container, and then builds a python package of our code, which gives us a standard python wheel file
    3. deploy copies only the wheel file and installs it using pip (no poetry, no volume mount, no mess)
  2. Create a docker-compose file that targets the dev stage of our Dockerfile and mounts our project as a volume in the container. This will be used for local development
    1. Bonus: Use an external environment variable manager like direnv to centralize all project environment variables into a single .envrc file and simply reference these variables in docker-compose.yml
  3. Let our CI/CD process run through all stages of the Dockerfile, resulting in a container ready to be deployed as a dagster user-code environment

Let's dive into each of the three stages to understand what's going on.

  • A quick note about deployments: Elementl provides an example of deploying via docker, but even their documentation states that the user code container has to be restarted to reflect code changes

Dockerfile Stage 1: dev

Here are the critical bits from the first stage:

ARG BASE_IMAGE=python:3.9.8-slim-buster
FROM "${BASE_IMAGE}" as dev

The only exciting part above is that we label our first stage so it can be referenced later in the build stage

COPY poetry.lock pyproject.toml ./
RUN poetry install

poetry.lock and pyproject.toml are the only files copied into the dev container, because it is expected that everything else will be mounted. As a result, the only reason to restart the dev container is if we make changes to our dependencies :)

RUN echo "poetry install" > /usr/bin/dev_command.shRUN echo "poetry run dagit -h 0.0.0.0 -p 3000" >> /usr/bin/dev_command.shRUN chmod +x /usr/bin/dev_command.shCMD ["bash", "dev_command.sh"]

It might seem weird that poetry install gets called a second time, but because dev_command.sh is executed after our code is mounted, it's necessary in order to add our code to the environment.

To use the newly-created dev environment, in docker-compose.yml simply specify the build and image tags for a service:

dagsterdev:
  build:
    context: .
    dockerfile: Dockerfile
    target: dev
  image: dagster-example-pipeline-dev
  volumes:
    - ./:/usr/src/app

With a simple docker compose up, the dev environment is ready to go!

Dockerfile Stage 2: build

This stage is wonderfully simple

FROM dev as build
RUN poetry install --no-dev
COPY . .

The build stage extends the dev stage, meaning all installed packages are still present. Above, poetry searches for any dependencies labeled dev and removes them. Also, we finally copy the actual project into the container

RUN poetry build --format wheel | grep "Built" | sed 's/^.*\s\(.*\.whl\)/\1/' > package_name

The magic happens! poetry builds a python wheel from our code and packages it up with only the necessary dependencies. The rest of the line looks scary, but it's just extracting and saving the filename of the wheel. For reference, the output of poetry build looks like this:

$ poetry build --format wheel
Building dagster-example-pipeline (1.0.0)
  - Building wheel
  - Built dagster_example_pipeline-1.0.0-py3-none-any.whl

Dockerfile Stage 3: deploy

Now that the code is packaged as a wheel, poetry is no longer needed. In fact, nothing is needed outside of a fresh python environment, the wheel, and any configuration for dagster!

FROM "${BASE_IMAGE}"# remember, BASE_IMAGE is just a python image# ... omitted some python setup. I'll be honest, not sure how much #     of this is actually needed :) ...# copy the directory with our wheelCOPY --from=build /usr/src/app/dist repo_package# copy the file containing our wheel filenameCOPY --from=build /usr/src/app/package_name package_nameRUN pip install --no-cache-dir repo_package/$(cat package_name)COPY workspace.yaml workspace.yamlCOPY job_configs job_configs

And there we go! Everything from the previous stages is discarded except for the wheel that was just created. Once installed and configured, this final stage is ready to be deployed

The Result: Faster Dev, Easier Deploys, & Cleaner Repositories

In the end, I now have everything I wanted:

  • The ability to develop & test jobs without constantly waiting for containers to build and spin up or down
  • Containerization handled without cluttering up my project (and mental) workspace
  • Package management that maintains a history of specified, intended packages so I don't have to consider, months later, whether the package I want to remove is a dependency of a dependency of a dependency of...

Conclusion

Even if you don't need the repository, I hope you've found the technical discussion above to be useful to your projects. I'd love if you could clone the repo and try it for yourself!


Original Link: https://dev.to/alexserviceml/developing-in-dagster-2flh
