Your Web News in One Place

Help Webnuz

Referal links:

Sign up for GreenGeeks web hosting
November 24, 2022 03:35 am GMT

How does Apache DolphinScheduler solve the troubles of data warehouse scheduling in ITcast?

Image description

At the Apache DolphinScheduler&Hudi joint meetup in July, Kong Shuai, a senior researcher of Chuanzhi Education, shared the practice of Apache DolphinScheduler in ITcast.

The speech is divided into 3 parts:

  1. The evolution of the data warehouse structure of ITcast

  2. DolphinSchedulers practice and thinking in ITcast

  3. Looking to the future

1 The evolution of the data warehouse structure of ITcast

First, lets take a look at the evolution of the data warehouse structure of ITcast.

Data Warehouse 1.0 Architecture

Before we dive into the architecture system of ITcast Data Warehouse 1.0, Ill first introduce the main business process of ITcast.

Image description

The above picture shows the core analysis process of ITcast. First of all, customers will visit the official website. consult online customer service to learn about our courses. If customers feel that these courses are helpful and have a learning intention, they will leave us some contact clues, such as phone calls, WeChat, QQ, etc.

After we get these clues, offline customer service will further communicate with customers based on these clues. If these clues cannot be contacted, they will be set as invalid clues; if they can be contacted, customer service will communicate with the customer by phone or WeChat.

When the customer signs up, they will start to learn the corresponding courses.

During this series of operations, all data records will be stored on the server, and our big data system will use the T+1 method to collect and analyze data every morning. The analysis and calculation mainly include access data, consultation data, intention data, and clue data, as well as some registration data, attendance data, etc. After analyzing and calculating the results of these data, we will synchronize these results to application systems, such as decision-making systems, reporting systems, etc., and display them in BI in the application system.

Based on this business process, we designed the education data warehouse 1.0 architecture. In this version, the original data we use mainly are structured data including internal consulting systems, offline face-to-face teaching systems, online education systems, customer management systems, etc.

For these structured data, Sqoop can collect them well, so in the early days, Sqoop is used for ETL, HDFS for data storage, and Hive based on MapReduce engine for analysis to a great extent.

We designed 6 separate layers in the data warehouse. The first layer is ODS original data layer, which retains the original data and will not make any changes. This layer will also keep the most complete data. data. To ensure data security, our ODS layer uses external tables.

Next is the DWD layer, which mainly performs cleaning and conversion operations. Cleaning is primarily to filter out invalid garbage. Conversion is to convert enumeration values, fields, etc., as well as data desensitization.

After the information conversion of the DWD layer, we can provide relatively clean application data.

Then is the DWM layer, which is mainly used to generate some components that can be reused. There are two main operations. One is to generate a detailed wide table to improve performance for subsequent queries, and reduce server consumption; the other is to do some light aggregation work.

Based on these days data and hourly data, we conduct further data aggregation in the following DWS layer. This is a Data Mart, which will gather data based on the data in the middle layer and calculate the time dimension, such as year, month, Quarterly, weekly, etc., and calculate the total cumulative value data.

The Data Mart will export the final data to the RPT application layer, which mainly saves personalized business data.

In addition, we also have a DIM dimension layer, which mainly stores some standard dimension data, and finally exports these data to the application system. The application system generally uses a MySQL database, and the exported data is provided with a data interface through Spring Boot and displayed with BI.

Because we used offline data warehouses and the T+1 calculation method before, the initial scheme can meet our computing needs.

As for the initial task script, we chose the popular Shell script at that time, and in version 1.0, our scheduling tasks were not abundant, so the functions and performance of Oozie at that time could meet our scheduling needs.

The offline data warehouse design ideas of some core themes and the basic process are to extract different subject tables to the ODS layer to form a detailed wide table, which has relevant indicators and dimensions of the wide table, merge the wide table into a large data mart, and summarize and calculate at the DWM layer.

2.0 Architecture

With more and more unstructured data and business scenarios emerging, the original ETL tool Sqoop can no longer meet our ETL needs, so we made some adjustments to the 1.0 architecture, including:

  • In addition to Sqoop, Flume was added for log file collection, and ETL tools were developed using Python;
  • With the gradual increase in the amount of data, Hive can no longer meet the needs of computing performance. Through the comparison of various analysis engines, Presto was finally selected;
  • Task scripts are also gradually being systemized. Shell scripts are not as maintainable and extensible as Python, so the scripts are replaced by Python.

Image description

In addition, some changes have been made to the system layering. In version 2.0, we removed the dimension layer and placed the dimension tables directly in the DWD layer; the original DWS layer, the Data Mart layer was replaced by the DM layer, which is also the data layer and plays the same role.

Besides, a new light summary layer and detailed wide surface layer are introduced, that is, the DWB layer and the DWS layer. The main function of the DWB layer is to widen the table in the DWD layer and convert the Snowflake model to a star model. After the detailed wide table is generated, the performance can be improved and the loss of the server can be reduced in the subsequent calculation.

The DWS layer mainly performs light aggregation, such as light aggregation by the hour and day, and then calculated by year, quarter, etc. in the DM layer.

The RPT and Presto are the entry for all layered data across the company.

Finally, we also changed the scheduling tool. Due to the complexity, function lackage, and unfriendliness of upgrade compatibility and access control, the previous scheduling tool Oozie is replaced by DolphinScheduler.

Core technologies

The first is Sqoop. In the data warehouse version 1.0, our original data is mainly structured, Sqoop happens to be a professional ETL tool for structured data transmission, but with the increase of data and business scenarios, the ETL scenarios become more and more complex, and Sqoop can no longer meet our needs. So in version 2.0, we replaced Sqoop with the DolphinScheduler system. Based on DolphinScheduler and Flume, we can support various complex business scenarios more flexibly through our self-developed Python ETL tools.

Hive has the advantages of low cost, stability, and ecological compatibility, but its shortcomings are also obvious: too slow. With the huge increase in data volume, Hive can no longer meet our demand for computing performance. After various comparisons, we chose Presto as the new engine in version 2.0.

We also investigated in SparkSql, Impala, HAWQ, and ClickHouse before, and the comparison experience are as follows:

  • Although SparkSql is faster than Hive, the query performance of SparkSql and Presto is not outstanding compared to single-table & multi-table.
  • Impalas multi-table query performance is excellent, but the single-table query is not as good as Presto, and it does not support OrcFile, some syntaxes such as update, delete, grouping sets, nor Date data type.
  • HAWQ and Greenplum are relatively moderate, complicated to use, and not outstanding in performance.
  • The single-table performance of ClickHouse is better, but not outstanding in multi-table performance, and the compatibility is not as good as Presto.
    Advantages of Presto:

  • Both Presto and Hive are open-sourced by Facebook, and the two are compatible. Trino is forked from Presto and is very similar to it.

  • Presto/Trino performs single-table & multi-table query performance well and supports EB-level data warehouses and data lakes.

  • Presto/Trino is compatible with Hive, and also supports relational databases such as MySQL, Oracle, and non-relational databases such as Kafka, Redis, and MongoDB.

  • Presto/Trino can perform cross-database read and write operations of heterogeneous data sources.

Next comes the Shell script. We mainly use Shell in the 1.0 version, because there were few tasks at that time, and it was not particularly complicated, but with the systemization of scripts, we need the scripts to be maintainable and scalable, and these requirements make its shortcomings prominent. Shell is a script used for system management, with limited functions, low performance, and high overhead. Python is better than Shell script in terms of performance, consistency, and scalability, and has now become the Top1 programming language in the world.

Therefore, in the data warehouse architecture system of 2.0, we changed the shell script to a Python script.

The last technology related is the scheduling tool. We used Oozie at first, but due to the limitation of its functions, we re-investigated various scheduling tools in the need of constant iteration, and finally chose the domestic open-source project Apache DolphinScheduler.

We use some functions of Apache DolphinScheduler, such as file management and data sources. The file management function is very convenient, allowing us to refer to files easily, and these files can also refer to each other.

After the data source is created, it can also be continuously reused, which is very helpful for us to improve efficiency.

Due to the problem left over by using Oozie, currently, Sqoop, HiveSQL, Presto, and other tasks mainly are still using Shell components in DolphinScheduler, but benefit from DolphinSchedulers functions of visual interface, resource management, multi-tenant management, rights management, alarm group management, multi-project management, etc., the scheduling efficiency has been greatly improved.

Image description

This is the technical evolution of our data warehouse architecture. Next, Ill introduce the pain points of the scheduling tools in the process of the technological evolution of ITcast, and how we solved these problems.

2 The practice of DolphinScheduler in ITcast

Pain points of scheduling

  • XML configuration is complexWe used Oozie for the workflow scheduling before. Oozie is a workflow engine, which is featured in defaulted HPDL language (XML) to define the process. The visualization support depends on third-party tool software (such as HUE), the built-in visualization interface is relatively weak, and the installation is also complicated.

There are two core components of an Oozie workflow: job.properites and workflow.xml. The former mainly saves some commonly used parameter variables, the latter is the core file, and the specific workflow is defined in the workflow.

Image description

Image description

The images above show a very simple workflow that prints Oozie and returns an error message for output. But as you can see, such a simple workflow process requires configuring so many XML tags, which is very troublesome and low efficient. As our scheduling work increases and business scenarios become more complicated, Oozie has severely impacted our production efficiency.

  • Few functional componentsIn addition, there are few functional components in Oozie, and it responds slowly to popular technologies, such as PySpark and other popular analytical computing engines. Although the official has claimed to support PySpark tasks, in practical applications, PySpark task support is poor and troublesome.

Moreover, Oozie does not have functions such as multi-tenant management, alarm group management, multi-environment management, worker group management, project management, and resource management.

  • Blocking deadlock

Image description

During the execution of Oozie, each task will start an oozie-launcher loader, and oozie-launcher will take up a lot of memory;

Resources will not be released during the life cycle of the Oozie launcher. If the data task does not have sufficient resources at this time, the data task has to wait until there are sufficient resources;

If multiple Oozie tasks are submitted at the same time, or Oozie performs multiple parallel subtasks, it will cause insufficient memory, and the Oozie launcher will keep waiting for resources when there are not enough resources, causing resources and tasks to wait for each other, and deadlock occurs.

  • Access Control & Upgrade CompatibilityAccess Control: Oozie has no access control and no multi-tenancy function;

Compatibility: Oozie depends on the Hadoop cluster version, and updates to the latest version are prone to incompatibility with existing clusters.

3 DolphinScheduler addresses scheduling issues

These are some of the pain points we encountered with our scheduling in version 1.0. So how do we address these pain points?

Apache DolphinScheduler is a distributed and scalable visual workflow task scheduling platform.

Compared with Oozies complex XML configuration process, all of DolphinSchedulers flow and timing operations are visualized, and DAG is drawn by dragging and dropping tasks and can be monitored in real time.

At the same time, DolphinScheduler supports one-click deployment without a complicated installation process to improve work efficiency.

  • Abundant featuresOne of the benefits of DolphinScheduler is that it is feature-rich, and compared to Oozie, DolphinScheduler keeps close pace with popular technologies.

DolphinScheduler has achieved rapid upgrade and compatibility with the analysis and calculation engines of current processes such as PySpark. And the new components are convenient and efficient to use, which greatly improves work efficiency.

At the same time, DolphinScheduler is constantly upgrading and improving various functions, such as multi-tenant management, alarm group management, multi-environment management, worker group management, project management, resource management, and other functions.

  • High reliability

Image description

Another feature of DolphinScheduler is its high reliability. Different from Oozies blocking deadlock phenomenon, DolphinScheduler uses a task buffer queue mechanism to avoid overload; the number of tasks that can be scheduled on a single machine can be flexibly configured, and when there are too many tasks, they will be cached in the task queue, avoiding cause the machine to get stuck, just like a traffic light.

At the same time, DolphinScheduler supports a decentralized multi-Master and multi-Worker service peer-to-peer architecture, which can avoid excessive pressure on a single Master.

  • Access Control & Upgrade CompatibilityAccess Control: Oozie has no access control; DolphinScheduler can authorize users to access resources, projects, and data sources, and different users do not affect each other.

Compatibility: Oozie is prone to incompatibility with existing clusters; the DolphinScheduler upgrade will not affect the previous cluster settings, and the upgrade is simple.

In general, DolphinScheduler helps us solve many scheduling troubles.

4 Looking to the future

Finally, this is our outlook for the future.

Although we have not yet used all the components of Apache DolphinScheduler, it has helped us solve most of the previous troubles we have met and greatly improved our work efficiency. We plan to promote it to more work groups in the company later.

In the current use, DolphinScheduler supports PySpark tasks well, but there are still some adaptation problems when using the Presto plug-in. We are adapting the Presto function, and planning to optimize the data source type to support dynamic hot swapping, to allow any type of data source available at any time.

As mentioned above, due to the fast version update, some adaptation problems will inevitably occur during use, and we are excited to participate in the community and put forward our ideas to optimize and improve the project. We hope Apache DolphinScheduler takes it to the next level!

Welcome to fill in this survey to let me know how do you think about Apache DolphinScheduler:

Original Link:

Share this article:    Share on Facebook
View Full Article

Dev To

An online community for sharing and discovering great ideas, having debates, and making friends

More About this Source Visit Dev To