Your Web News in One Place

Help Webnuz

Referal links:

Sign up for GreenGeeks web hosting
April 14, 2022 06:13 pm GMT

Building dashboards over a semantic layer with Superset and Cube

What can be more troubling and horrifying for an Apache Superset user than a scarlet error message revealing that the underlying dataset is broken and charts arent going to provide any insights anytime soon? Sigh.

Screenshot 2022-04-10 at 22.49.20.png

Indeed, modern data pipelines consist of many movingand sometimes fragileparts. Superset, the Swiss knife of data exploration and visualization, deserves rock-solid data sources but in the real world, upstream changes are inevitable. They may be connected to immature technology in the data stack as well as ever-changing business requirements and balkanized ownership of data pipelines.

Within minutes, you can find conversations on Supersets Slack or GitHub revolving around identifying broken dashboards and finding ways to reduce the impact on users:

Screenshot 2022-04-11 at 16.55.39.png

Im happy to share that a solution exists. You can use Cube together with Superset and define your datasets over a data modeling layer (or a semantic layer) rather than raw data:

Cube, Superset, and semantic layer

In this tutorial, well explore how Cube, a headless BI platform, can be used together with Superset to build reliable and robust data visualizations on top of a semantic layer. Youll learn techniques to prevent your dashboards from breakingand tools that would notify you about incompatible upstream changes in data before your users even notice.

What is Headless BI?

Cubeis an open-sourceheadless BI platformwith nearly 13,000 stars onGitHubto date. It can be broken down into four layers that are complemented by a head or multiple heads like Apache Superset, another BI tool, or a front-end application with embedded analytics. Here are the layers:

  • Data modeling. Contains consistent metrics definitions and provides a view of the upstream data in terms of measures and dimensions. Makes data semantic.
  • Access control. Safeguards data access and enables row-level security, role-based access, and multitenancy. Makes data secure.
  • Caching. Accelerates data access and makes end-user applications responsive and rewarding to use. Makes apps fast.
  • APIs. Exposes data to downstream applications via REST, GraphQL, or SQL API. Makes data accessible.

You can configure Cube to connect to any database, declaratively define your metrics, and instantly get an API that you can use with Superset or many other BI tools.

In case of any upstream change (be it a renamed column, a dropped table, or a migrated data source), you can update the data model in Cube and keep hundreds of charts and dozens of dashboards in your Superset installation intact. Sounds refreshing, huh?

Lets try Superset with Cube and see how it works!

Running Cube with Superset

To keep things simple and save time, we'll run both Cube and Superset in fully managed environments: Cube Cloud and Preset Cloud. Note that both tools have generous free tiers, e.g., you can use Preset Cloud forever for free on the Starter plan. (If you'd like to run Cube or Superset locally with Docker, please see Cube and Supeset instructions, respectively.)

Running Cube. First, please proceed to Cube Clouds sign up page and fill in your details. Note that Cube Cloud supports signing up with your GitHub account. Within a few seconds you will be taken to your account where you can create your first Cube deployment:

Screenshot 2021-11-24 at 03.06.16.png

Proceed with providing a name for your deployment, selecting a cloud provider and a region:

Screenshot 2022-04-09 at 02.16.36.png

At the next step, choose Create to start a new Cube project from scratch. Then, pick Postgres to proceed to the screen where you can enter the following credentials:

Hostname:  demo-db-examples.cube.devPort:      5432Database:  ecom_supersetUsername:  cubePassword:  12345

Screenshot 2022-04-09 at 02.18.04.png

Cube will connect to a publicly available Postgres database that I've already set up.

The last part of configuration is thedata schemawhich declaratively describes the metrics we'll be putting on the dashboard. Actually, Cube can generate it for us! Pick the top-level public database from the list and select only the orders table:

Screenshot 2022-04-09 at 02.18.53.png

In a while, your Cube deployment will be up and running:

Screenshot 2022-04-09 at 02.25.27.png

Defining metrics. Please navigate to the "Schema" tab. You will see the Orders.js file under the "schema" folder. Lets review it.

Screenshot 2022-04-09 at 02.27.36.png

This file defines the metrics within the "Orders" cube. It is different from the one in Cube Cloud, but we'll take care of that later.

cube(`Orders`, {  sql: `SELECT * FROM public.orders`,  dimensions: {    id: {      sql: `id`,      type: `number`,      primaryKey: true    },    // 'processing', 'shipped', or 'completed'    status: {      sql: `status`,      type: `string`    },    createdAt: {      sql: `created_at`,      type: `time`    },    shippedAt: {      sql: `shipped_at`,      type: `time`    },    completedAt: {      sql: `completed_at`,      type: `time`    },    // Calculated, uses SQL functions and column references    daysBeforeShipped: {      sql: `EXTRACT(DAYS FROM (shipped_at - created_at))`,      type: `number`    },    // Calculated, uses SQL functions and column references    daysBeforeCompleted: {      sql: `EXTRACT(DAYS FROM (completed_at - created_at))`,      type: `number`    }  },  measures: {    count: {      type: `count`    },    // Calculated, uses a dimension reference    avgDaysBeforeShipped: {      sql: `${daysBeforeShipped}`,      type: `avg`    },    // Calculated, uses a dimension reference    avgDaysBeforeCompleted: {      sql: `${daysBeforeCompleted}`,      type: `avg`    }  },});

Key learnings here:

  • the cube is a logical, domain-level entity that includes measures and dimensions
  • using the sql statement, this cube is defined over the entire public.line_items table; actually, cubes can be defined over arbitrary SQL statements that select data
  • dimensions (qualitative data features) are defined over textual, temporal, or numeric columns in the dataset
  • measures (quantitative data features) are defined as aggregations (e.g., count, avg, etc.) over columns or dimensions in the dataset
  • you can define complex measures and dimensions with custom sql statements or references to other measures and dimensions

Development mode. Now, let's update the schema file in Cube Cloud to match the contents above. First, click Enter Development Mode to unlock the schema files for editing. This essentially creates a "fork" of the Cube API that tracks your changes in the data schema.

Navigate to Orders.js and replace its contents with the code above. Then, save your changes by clicking Save All to apply the changes to the development version of your API. You can apply as many changes as you wish, but we're done for now. Click Commit & Push to merge your changes back to the main branch:

Screenshot 2022-04-10 at 19.28.29.png

On the "Overview" tab you will see your changes deployed:

Screenshot 2022-04-10 at 19.29.40.png

Now you can explore the metrics on the "Playground" tab:

Screenshot 2022-04-10 at 20.25.46.png

When youre ready, head over to the Overview tab, click Deploy SQL API and use the How to connect your BI tool link shortly thereafter.

Screenshot 2022-04-10 at 19.48.16.png

You will see the credentials that well use to connect the metrics store that weve built with Cube to Superset in a few moments. How?

Running Superset. Please proceed to Presets sign up page and fill in your details. Note that Preset Cloud supports signing up with your Google account. Within a few seconds you will be taken to your account with a readily available workspace:

Screenshot 2021-11-24 at 02.50.55.png

Switching to that workspace will reveal a few example dashboards that you can review later.

Screenshot 2021-11-24 at 02.54.51.png

Connect Superset to Cube. Navigate to Data / Databases via the top menu, click + Database, select MySQL, and fill in the credentials from your Cube Cloud instance or use the credentials below:

  • Host: green-yak.sql.aws-us-east-2.cubecloudapp.dev
  • Port: 3306
  • Database name: db
  • Username: cube@green-yak
  • Password: 82e26064349b19c340f8b074dfcee4af
  • Display name: Cube Cloud (it's important)

Screenshot 2022-04-10 at 20.03.08.png

You can press Connect now.

Define the datasets. Navigate to Data / Datasets via the top menu, click + Dataset, and fill in the following credentials:

  • Database: Cube Cloud (the one we've just created)
  • Schema: db
  • See table schema: Orders

Screenshot 2022-04-10 at 20.05.21.png

You can press Add now. Then, hover over the newly added dataset and click the pencil icon in the Actions column. Well need to make sure Superset recognizes a couple of measures. First, navigate to the Columns tab and deselect Is dimension for avgDaysBeforeShipped and avgDaysBeforeCompleted:

Screenshot 2022-04-10 at 20.21.55.png

Then, switch to the Metrics tab and add these two:

  • Metric: avgDaysBeforeShipped; SQL expression: AVG(avgDaysBeforeShipped)
  • Metric: avgDaysBeforeCompleted; SQL expression: AVG(avgDaysBeforeCompleted)

Screenshot 2022-04-10 at 20.21.50.png

Press Save in the end. Whew, were all set! Now, lets display the data on a chart.

Building a dashboard. Navigate to Charts via the top menu, click + Chart, and use the following configuration for your first chart:

  • Choose a dataset: Orders (the one we've just created)
  • Choose a chart type: Table

When youre done, click Create new chart. Feel free to experiment with parameters on this screen and click Run Query to explore the data. To proceed, please set up a chart like this:

  • Visualization type: Table
  • Time column: createdAt
  • Time grain: any value
  • Group by: status
  • Metrics: COUNT(*), avgDaysBeforeShipped, avgDaysBeforeCompleted

Heres the data table youll get:

Screenshot 2022-04-10 at 22.19.47.png

You can click Save to give your chart a name and add it to a new dashboard:

Screenshot 2022-04-10 at 22.21.47.png

Looks good? We now have a dashboard in Superset displaying a chart that queries data from Cube using the measures and dimensions in Cubes data model.

Screenshot 2022-04-10 at 22.23.31.png

Cube translates queries coming from Superset into queries to the upstream data source. What would happen if anything happens upstream? Heres when all the fun starts!

Mishap 1: Column renamed or removed

Lets say you (or your users) open the dashboard one day and behold this orange introduction into a few less-than-productive work hours. Apparently, the status column is not present in the data source anymore and a few uninformed business decisions might be made (or put on hold) until your team debugs the circumstances of the removal.

Screenshot 2022-04-10 at 22.49.20.png

What can be done to mitigate this? Well, lets reproduce this situation first. You can do so by navigating to Cube Cloud, switching to the Development Mode, and editing the Orders.js file. Just change the SQL expression to reference orders_wo_status table that, obviously enough, has exactly the same data as the orders table but the status column is removed:

Screenshot 2022-04-10 at 22.43.48.png

Then, you can save and commit the changes:

Screenshot 2022-04-10 at 22.43.36.png

After the Cube Cloud deployment is redeployed in a few moments, you can refresh your dashboard in Superset to switch it into the annoying orange state. (If the dashboard doesnt break, try pressing the button with three dots in the top right corner and requesting the dashboard to refresh.)

Unbreaking the dashboard, a simple way. Lets say that the status column was removed because its redundant: its values can be calculated using the values in created_at, shipped_at, and completed_at columns. Then, how to fix the dashboard after the column removal?

One way to do so is to edit the dataset in Superset. You can delete the missing column and add it back as a calculated one using Supersets semantic layer. (If youd like to dig deeper, check out this post in Presets blog.)

However, there are a few downsides: first, this calculated column wouldnt be visible in Supersets SQL Editor where you might run SQL queries to perform data exploration; second, the calculation wouldnt propagate upstream. If Superset is not the only tool you use (or would ever use) to explore your data, its better to apply the fix upstream rather than make duplicated fixes all over different tools.

Unbreaking the dashboard, a better way. An alternative way is to replace the status column with a calculation in Cubes data model. Then, Superset and all other tools would work as if that column was never removed and your dashboard never broke.

You can go back to Cube Cloud, switch to the Development Mode, and edit the Orders.js file once again. Lets replace the status dimension with the following code snippet:

    status: {      case: {        when: [          { sql: `completed_at IS NOT NULL`, label: `completed` },          { sql: `shipped_at IS NOT NULL`, label: `shipped` },        ],        else: { label: `processing` },      },      type: `string`,    },

You can save and deploy your changes. Note that were using a case dimension that provides some syntactic sugar over a regular CASE... WHEN... THEN... ELSE... END SQL statement and provides the ability to reference other dimensions:

Screenshot 2022-04-10 at 23.44.32.png

Congratulations! Without any changes in Superset, your dashboard is back to a healthy state:

Screenshot 2022-04-10 at 22.23.31.png

Moreover, if you ever need to introduce a more complex calculation to any dimension or measure, you can always do that in Cube Cloud and still wont need to change anything in tools that consume data, including Superset.

Mishap 2: Table altered or removed

Lets say you (or your users) open the dashboard and see this unfortunate orange notification once again. Apparently, the shipped_at column doesnt exist:

Screenshot 2022-04-11 at 00.11.11.png

The reason is that developers of the upstream system decided to implement the event sourcing design pattern, meaning that theres no orders table anymore. Instead, theres the orders_events table that, obviously enough, has exactly the same data but completely reshaped as a stream of events related to orders status changes. In that table, every row represents only partial information about an order and the timestamp column replaced previously existing created_at, shipped_at, and completed_at columns. Heres the DDL statement:

create table orders_events(    order_id  integer,    user_id   integer,    status    text,    timestamp timestamp);

Wanna try that yourself? You can do so by navigating to Cube Cloud, switching to the Development Mode, and editing the Orders.js file. Just change the SQL expression to reference orders_events table:

Screenshot 2022-04-11 at 00.23.27.png

After the Cube Cloud deployment is redeployed, you can refresh your dashboard in Superset to switch it into the annoying orange state. (If the dashboard doesnt break, try pressing the button with three dots in the top right corner and requesting the dashboard to refresh.)

Unbreaking the dashboard, a complex way. One way to do so is to introduce a data transformation tool to your data pipeline (assuming you dont use one) that would transform the data back into the desired state.

However, there are a few downsides: first, this is more like a month-long project than a quick fix to the data pipeline; second, it might bring unnecessary and unwanted complexity to your technology stack. (Note that if you actually use dbt or dbt Metrics, Cube has an integration.)

Unbreaking the dashboard, a better way. An alternative way is to update Cubes data model to transform data back to the desired state. Then, Superset and all other tools would work as if the table was never removed and your dashboard never broke.

You can go back to Cube Cloud, switch to the Development Mode, and edit the Orders.js file once again. Lets replace the source code of the file with the following code snippet:

cube(`Orders`, {  sql: `    WITH bareOrders AS (        SELECT order_id AS id, user_id        FROM public.orders_events        GROUP BY order_id, user_id    ),    orders AS (        SELECT          bareOrders.*,          (CASE WHEN completed.timestamp IS NOT NULL THEN completed.status ELSE (CASE WHEN shipped.timestamp IS NOT NULL THEN shipped.status ELSE processing.status END) END) AS status,          COALESCE(processing.timestamp, shipped.timestamp, completed.timestamp) AS created_at,          COALESCE(shipped.timestamp, completed.timestamp) AS shipped_at,          completed.timestamp AS completed_at        FROM bareOrders        LEFT JOIN public.orders_events AS processing          ON bareOrders.id = processing.order_id AND processing.status = 'processing'        LEFT JOIN public.orders_events AS shipped          ON bareOrders.id = shipped.order_id AND shipped.status = 'shipped'        LEFT JOIN public.orders_events AS completed          ON bareOrders.id = completed.order_id AND completed.status = 'completed'    )    SELECT * FROM orders  `,  dimensions: {    id: {      sql: `id`,      type: `number`,      primaryKey: true    },    // 'processing', 'shipped', or 'completed'    status: {      sql: `status`,      type: `string`    },    createdAt: {      sql: `created_at`,      type: `time`    },    shippedAt: {      sql: `shipped_at`,      type: `time`    },    completedAt: {      sql: `completed_at`,      type: `time`    },    // Calculated, uses SQL functions and column references    daysBeforeShipped: {      sql: `EXTRACT(DAYS FROM (shipped_at - created_at))`,      type: `number`    },    // Calculated, uses SQL functions and column references    daysBeforeCompleted: {      sql: `EXTRACT(DAYS FROM (completed_at - created_at))`,      type: `number`    }  },  measures: {    count: {      type: `count`    },    // Calculated, uses a dimension reference    avgDaysBeforeShipped: {      sql: `${daysBeforeShipped}`,      type: `avg`    },    // Calculated, uses a dimension reference    avgDaysBeforeCompleted: {      sql: `${daysBeforeCompleted}`,      type: `avg`    }  },});

As you can see, Cubes data model is flexible enough to allow cubes to be defined over arbitrary SQL expressions that can contain common-table expressions (CTE), unions, joins, etc. The rule of thumb for structuring the data model in Cube is that cubes, measures, and dimensions should make sense at the logical, domain level. And if your data is stored differently, you should feel free to reshape it in Cubes data model.

You can save and deploy your updates in Cube Cloud. Again, without any changes in Superset, your dashboard is back to a healthy state:

Screenshot 2022-04-10 at 22.23.31.png

Mishap 3: Data source changed

Lets say something rare yet impactful happened: the team decided to upgrade the data pipeline, introduce new tools, etc. Eventually, the data will move from Postgres to BigQuery, new ETL and data transformation tools will be used. What can we do to prevent the dashboards from breaking or avoid rebuilding the dashboards in Superset?

Screenshot 2022-04-11 at 16.12.03.png

Since weve already connected Superset to Cube, we can sort things out on Cubes side. Just like Superset, Cube supports numerous data sources and a Cube instance can use any subset of these data sources at the same time.

Unbreaking the dashboard. Heres how you can configure Cube to connect to Postgres and BigQuery and update the data model so the orders data is read from BigQuery.

In Cube Cloud, youll need to switch to the Development Mode. First, lets update Orders.js to match the following snippet. Check out the new dataSource option and the updated definitions for daysBeforeShipped and daysBeforeCompleted that use BigQuery-specific TIMESTAMP_DIFF function instead of Postgres-specific EXTRACT function:

cube(`Orders`, {  dataSource: 'bigquery',  sql: `SELECT * FROM superset_examples_2.orders`,  dimensions: {    id: {      sql: `id`,      type: `number`,      primaryKey: true    },    // 'processing', 'shipped', or 'completed'    status: {      sql: `status`,      type: `string`    },    createdAt: {      sql: `created_at`,      type: `time`    },    shippedAt: {      sql: `shipped_at`,      type: `time`    },    completedAt: {      sql: `completed_at`,      type: `time`    },    // Calculated, uses SQL functions and column references    daysBeforeShipped: {      sql: `TIMESTAMP_DIFF(shipped_at, created_at, DAY)`,      type: `number`    },    // Calculated, uses SQL functions and column references    daysBeforeCompleted: {      sql: `TIMESTAMP_DIFF(completed_at, created_at, DAY)`,      type: `number`    }  },  measures: {    count: {      type: `count`    },    // Calculated, uses a dimension reference    avgDaysBeforeShipped: {      sql: `${daysBeforeShipped}`,      type: `avg`    },    // Calculated, uses a dimension reference    avgDaysBeforeCompleted: {      sql: `${daysBeforeCompleted}`,      type: `avg`    }  }});

Next, lets edit the cube.js file to match the following snippet:

// Cube.js configuration options: https://cube.dev/docs/configconst PostgresDriver = require('@cubejs-backend/postgres-driver');const BigQueryDriver = require('@cubejs-backend/bigquery-driver');module.exports = {  contextToAppId: ({ dataSource }) => dataSource,  dbType: ({ dataSource }) => {    return dataSource === 'bigquery'      ? 'bigquery'      : 'postgres';  },  driverFactory: ({ dataSource }) => {    return dataSource === 'bigquery'      ? new BigQueryDriver()      : new PostgresDriver();  }};

Note the driverFactory extension point where the database selection happens: for cubes with the bigquery data source, BigQuery will be used, and Postgres will be used otherwise. Also note that we dont need to specify the credentials for Postgres and BigQuery explicitly because they are securely stored in Cube Clouds settings. Not sharing BigQuery credentials with you because who would do that?

Screenshot 2022-04-11 at 16.37.19.png

With these changes in Cube Cloud and without any changes in Superset, your dashboard will be back to a healthy state:

Screenshot 2022-04-10 at 22.23.31.png

Oh, and theres one more thing! Cube also supports data federation meaning that you can join data from different data sources, e.g., from Snowflake and Athena, if needed. To implement that, you can use pre-aggregations and declaratively set up a rollupJoin pre-aggregation that would seamlessly join data from multiple data sources under the hood.

Cherry on top: Alerts

In all examples above it was either us or our users who were to discover the broken dashboard. In production scenarios, that definitely should not happen. What can we do to get a notification before users see a broken dashboard?

Since weve decided to run Cube in Cube Cloud, we can use the Alerts feature there (not to be confused with Supersets Alerts and Reports feature which is a convenient way to send reports with dashboard data to email). Alert conditions are checked every minute. If an alert is triggered, configured recipients will get email notifications.

You can set up alerts for events like database unavailability or pre-aggregations build errors. In production scenarios, Cube is always used with pre-aggregations. It means that if a database becomes unavailable or the data within the database changes in a way that Cube cant rebuild a pre-aggregation, you will instantly receive a notification.

Configuring alerts. You can go to Cube Cloud, click on your avatar in the top right corner, click Alerts to get to the Alerts page, then click New Alert to set up one.

Screenshot 2022-04-11 at 14.51.48.png

Lets go with almost default settings:

  • Event type: Pre-aggregation build failure
  • Deployments: All
  • Send alerts to: All users on this account

You can click Add now.

Screenshot 2022-04-11 at 15.39.18.png

Testing alerts. The simplest way to test this newly setup alert is to configure pre-aggregations and then break the data model, e.g., change the table name to a wrong one.

You can go back to Cube Cloud, switch to the Development Mode, and edit the Orders.js file once again. Lets update the very beginning of the file in this fashion:

cube(`Orders`, {  dataSource: 'bigquery',  // An utterly wrong table name  sql: `SELECT * FROM superset_is_not_great`,  // A simple pre-aggregation scheduled to refresh every hour  preAggregations: {    main: {      refreshKey: {        every: '1 hour'      },      dimensions: [        status      ]    }  },// Dimensions and measures go below

After you deploy your changes, Cube Cloud would try to build the pre-aggregation, it probably wouldnt find the table with this wicked name and an email notification would hit your inbox:

Screenshot 2022-04-11 at 15.52.53.png

If you dont have thousands of active Supserset users, chances are that youll know about a broken dashboard way ahead of your users. (Supersets built-in caching mechanism could also help serve unbroken dashboards to users for some time.)

Wrapping up

Thanks for following this tutorial. My sincere hope is that now you have a much better understanding of these topics:

  • How Headless BI tools (and Cube in particular) can be helpful to build a think semantic layer and manage the data model upstream of Superset.
  • How Cube can help make your Superset dashboards robust and unbreakableand let you know if anything breaks in advance.

Please don't hesitate to like and bookmark this post, write a comment, and give a star toCube and Superseton GitHub. I hope these tools would be a part of your toolkit from now on.

Good luck and have fun!


Original Link: https://dev.to/cubejs/building-dashboards-over-a-semantic-layer-with-superset-and-cube-57ck

Share this article:    Share on Facebook
View Full Article

Dev To

An online community for sharing and discovering great ideas, having debates, and making friends

More About this Source Visit Dev To