October 29, 2021 01:30 pm GMT

Spark is lit once again

@pdambrauskas and I are marking Hacktoberfest by releasing our little in-house project...

Lighter - Running Spark applications on Kubernetes

Here at Exacaster, Spark applications have been used extensively for years. We started running them on our Hadoop clusters, with YARN as the application manager. However, with our recent product we started moving towards a cloud-based solution and decided to use Kubernetes for our infrastructure needs.

Livy

When running Spark applications on YARN, you can submit jobs using:

  • Spark client
  • Apache Livy - an open-source REST API for interacting with Apache Spark from anywhere.

The latter was our go-to solution back when we were only using Spark on YARN. Sadly, Apache Livy is no longer maintained: it has no K8s support, and its Spark client grows more outdated with every passing day. For some time we used @jahstreet's fork, which added K8s support, but when we saw that the Livy project still wasn't receiving any updates, we decided to implement our own solution - Exacaster Lighter.

Lighter

Exacaster Lighter is heavily inspired by Apache Livy. The idea is the same: hide the Spark application client behind a REST API. However, we focus on running those applications on a K8s cluster; YARN mode is also supported. We designed the application to be extensible with different execution backends.
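To give a feel for what "hiding the Spark client behind a REST API" looks like from the client side, here is a minimal, Livy-style sketch. The host, port, path and payload field names are assumptions for illustration only - see the Lighter documentation for the actual API.

```python
import requests

# Hypothetical Lighter endpoint - host, port and path are assumptions for illustration.
LIGHTER_URL = "http://lighter:8080/lighter/api/batches"

# Livy-style batch payload: application file, arguments and Spark configuration.
payload = {
    "name": "daily-aggregation",
    "file": "s3a://jobs/daily-aggregation.py",
    "args": ["--date", "2021-10-29"],
    "conf": {"spark.executor.instances": "2"},
}

response = requests.post(LIGHTER_URL, json=payload)
response.raise_for_status()
print("Submitted batch:", response.json())
```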

Lighter has a lightweight, React-based UI written in TypeScript and a back-end written in Java with minor Python integration points.

Simplified illustration of the architecture:

[Architecture diagram: a Client submits applications through Lighter's REST API, which stores them in internal storage. The App executor picks up new applications and executes them on the backend (YARN/K8s). The Status tracker checks application status on the backend and syncs it back to the internal storage.]

More information can be found on our documentation page.

UI

This is the job list view:
Job list

You can see the configuration of the submitted job inside:
Job configurations

Driver logs are also available for each job:
Job logs

How does it work?

Glad you asked. It is quite simple. Lighter uses Spark Launcher to launch Spark applications on a Kubernetes cluster. The launcher takes care of creating all the Pods needed for the Spark application to run. When launching applications, we tag them with a unique identifier by setting the spark.kubernetes.driver.label.spark-app-tag config property. We then use that identifier to check application status and retrieve application logs by calling the Pods API with the labelSelector property.
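As a rough illustration of the status-tracking side, here is a minimal sketch using the official Kubernetes Python client. The namespace and tag value are placeholders, and the label key mirrors the spark-app-tag labeling described above; this is not Lighter's actual code.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

app_tag = "my-app-7f3c2a"   # placeholder: the unique identifier set at submit time
namespace = "spark"         # placeholder namespace

# Find the driver pod by the label set via spark.kubernetes.driver.label.spark-app-tag
pods = v1.list_namespaced_pod(namespace, label_selector=f"spark-app-tag={app_tag}")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)  # e.g. Pending, Running, Succeeded, Failed
    print(v1.read_namespaced_pod_log(pod.metadata.name, namespace, tail_lines=20))
```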

Things get a bit more complicated for interactive sessions. We've created a Sparkmagic-compatible REST API so that the Sparkmagic kernel can communicate with Lighter the same way it does with Apache Livy. When a user creates an interactive session, the Lighter server submits a custom PySpark application containing an infinite loop that constantly checks for new commands to execute. Each Sparkmagic command is saved in a Java collection, retrieved by the PySpark application through a Py4J gateway and executed.
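For intuition, the session driver loop can be sketched roughly as below. The gateway address, port and the getNextCommand entry-point method are hypothetical stand-ins for Lighter's actual Py4J interface, not its real code.

```python
import time
from py4j.java_gateway import JavaGateway, GatewayParameters
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lighter-session").getOrCreate()

# Connect back to the Lighter server's Py4J gateway (address and port are assumptions).
gateway = JavaGateway(gateway_parameters=GatewayParameters(address="lighter", port=25333))
server = gateway.entry_point  # hypothetical entry point exposing queued Sparkmagic commands

while True:
    command = server.getNextCommand()  # hypothetical method: next statement or None
    if command is None:
        time.sleep(1)                  # nothing queued yet, poll again
        continue
    exec(command, {"spark": spark})    # run the statement against the session's SparkSession
```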

Use cases

Spark on K8s

Since Apache Spark 2.4, applications can be executed on a K8s cluster. When you submit your Spark application, driver and executor Pods are created for it and removed after the application completes. But if you want to track application status and report it to end users in a nice manner, things get complicated. Haha.

Spark on YARN

In the early days of the Big Data era, when K8s hadn't even been born yet, the common open-source go-to solution was the Hadoop stack. We wrote several old-fashioned MapReduce jobs and Pig scripts until we came across Spark. Since then, Spark has become one of the most popular data processing engines. It is very easy to start using Lighter on YARN deployments: just run the Docker image with proper configuration and mount the necessary configuration files in the default paths.

Jupyterlab

For ad-hoc data analysis, JupyterLab on top of Spark is an elegant solution. These two great tools cannot communicate with each other directly, however, so Lighter together with Sparkmagic acts as a bridge. You only need to provide the correct configuration to Sparkmagic to have it working.
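As a hedged example, the snippet below writes a minimal ~/.sparkmagic/config.json pointing Sparkmagic at Lighter. The kernel_python_credentials, url and auth keys are standard Sparkmagic configuration, while the Lighter URL itself is an assumption - check the Lighter documentation for the exact endpoint.

```python
import json
import os

# Minimal Sparkmagic configuration pointing the kernel at Lighter's Livy-compatible API.
sparkmagic_config = {
    "kernel_python_credentials": {
        "url": "http://lighter:8080/lighter/api",  # assumption: replace with your Lighter endpoint
        "auth": "None",
    },
}

path = os.path.expanduser("~/.sparkmagic/config.json")
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as f:
    json.dump(sparkmagic_config, f, indent=2)
print("Wrote Sparkmagic config to", path)
```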

Closing remarks

Lighter is a freshly baked tool, open-sourced for everyone to use. Since we developed it for the use cases familiar to us, feel free to contribute if you see any opportunities to make it better.


Original Link: https://dev.to/exacaster/spark-is-lit-once-again-41p7
