
Data Lake explained

WHAT - the famous analogy

A data warehouse is like a producer of water that hands you bottled water in a particular size and shape of bottle. A data lake, in contrast, is a place many streams of water flow into, and it is up to everyone to get the water the way they want it.

WHY

The data lake is the new data warehouse. It shares the data warehouse's goal of supporting business insights beyond day-to-day transaction handling. The main factors driving this evolution of the data warehouse are the following:

Abundance of Unstructured Data

We are collecting more and more data of all kinds (text, XML, JSON, audio, voice, sensor data...), so we need a better way to process it.
It is possible to transform data during the ETL process, but committing to a particular transformation up front can cost us the flexibility we need later for analysis. This applies to deeply nested JSON structures, where we do not want to distill only some elements, and to text/PDF documents that have to be stored as blobs but are useless until they are processed to extract some metrics.
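As a minimal sketch (the file and field names are made up), Spark can load such nested JSON untouched and drill into the structure only when a question actually comes up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-example").getOrCreate()

# Load the nested JSON "as-is" - no up-front flattening or distilling.
dfEvents = spark.read.json("data/events.json")
dfEvents.printSchema()

# Nested fields stay available for analyses we did not anticipate.
dfEvents.select("user.id", "device.temperature").show(5)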

The Rise of Big Data Technologies

HDFS (the Hadoop Distributed File System) made it possible to store petabytes of data on commodity hardware, at a lower cost per TB than an MPP (Massively Parallel Processing) database like Redshift.
Thanks to new processing tools like MapReduce or Spark we can process data at scale on the same hardware used for storage.
Schema-on-read makes it possible to do data analytics without inserting the data into a predefined schema, and to process unstructured text.

New Roles and Advanced Analytics

As data is treated as the new oil, people want to get the most value out of it. A data scientist often needs to represent and join data sets from external sources. For this use case the clean, consistent and performant model a data warehouse provides for business users does not work. The data lake needs to cope with these agile, ad-hoc data exploration activities. Machine learning and natural language processing also need to access the data in a different form than, for example, a star schema provides.
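A hedged sketch of such an ad-hoc exploration (paths and column names are assumptions): join data already sitting in the lake with an external reference file, without modelling it into a schema first:

# Join lake data with an external CSV on the fly - no star schema needed.
dfSales = spark.read.parquet("lake/sales")
dfRegions = spark.read.csv("external/regions.csv", header=True, inferSchema=True)

dfSales.join(dfRegions, on="region_id", how="left") \
       .groupBy("region_name") \
       .sum("amount") \
       .show()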

BENEFITS

Lower costs - more possibilities

  • ETL offloading: the same hardware is used for storage and processing - a big data cluster. There is no more need for a special ETL grid or additional storage for a staging area.
  • Dimensional modelling with conformed dimensions or data marts for high/known-value data
  • Low cost per TB makes it possible to store low/unknown value data for analytics

Schema-on-Read

With the big data tools in the Hadoop ecosystem, like Spark, it is as easy to work with a file as it is to work with a database, but without creating a database and inserting into it. This is called schema-on-read: the schema of a table is either inferred or specified, and the data is not inserted into it; instead, the data is checked against the specified schema when it is read.

Spark - Example Schema Inference

dfExample = spark.read.csv("data/example.csv",
                           inferSchema=True,
                           header=True,
                           sep=";")

The schema is inferred, but we want to make sure the types are set correctly. For example, a date field should not end up as a string.
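To see what was actually inferred, we can print the schema of the dfExample dataframe from above:

# Check the inferred types; a date column may show up as a plain string.
dfExample.printSchema()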

To better control types and malformed data, we can specify a schema (a StructType) to make sure everything is correct. It is still schema-on-read, though.
We can also specify what should happen to a row that does not conform to our schema: drop it, replace its fields with null, or fail.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

exampleSchema = StructType([
    StructField("id", IntegerType()),
    StructField("amount", IntegerType()),
    StructField("info", StringType()),
    StructField("date", DateType())
])

dfExample = spark.read.csv("data/example.csv",
                           schema=exampleSchema,
                           sep=";",
                           mode="DROPMALFORMED")

With that we can do direct querying on the fly without database insertions.

from pyspark.sql.functions import desc

dfExample.groupBy("info") \
         .sum("amount") \
         .orderBy(desc("sum(amount)")) \
         .show(3)

We can also write SQL by creating a temporary view. Nothing is written to a database here.

dfExample.createOrReplaceTempView("example")

spark.sql("""
          SELECT info, sum(amount) as total_amount
          FROM example
          GROUP BY info
          ORDER BY total_amount desc
          """).show(3)

Unstructured data support

Spark can read and write files in

  • text-based formats,
  • binary formats like Avro (saves space) and Parquet, which is a columnar storage format, and
  • compressed formats like gzip and snappy.

dfText = spark.read.text("text.gz")
dfSample = spark.read.csv("sample.csv")
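As a small sketch (the output path is made up), the same dataframe can be written back as snappy-compressed Parquet and read again without any schema definition:

# Write as columnar Parquet with snappy compression.
dfSample.write.option("compression", "snappy").mode("overwrite").parquet("sample_parquet")

# Reading it back preserves the schema automatically.
dfParquet = spark.read.parquet("sample_parquet")
dfParquet.printSchema()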

Spark can read and write files from a variety of file systems (local, HDFS, S3...) and a variety of databases (SQL, MongoDB, Cassandra, Neo4j...).

Everything is exposed in a single abstraction - the dataframe - and can be processed with SQL.
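A hedged sketch of that idea - bucket, table and connection details below are placeholders, and the S3 read assumes the cluster has the hadoop-aws connector configured:

# Read from object storage and from a relational database...
dfS3 = spark.read.parquet("s3a://my-bucket/events/")

dfDb = spark.read.format("jdbc") \
            .option("url", "jdbc:postgresql://dbhost:5432/shop") \
            .option("dbtable", "orders") \
            .option("user", "reader") \
            .option("password", "secret") \
            .load()

# ...and both land in the same abstraction, so SQL works on either of them.
dfS3.createOrReplaceTempView("events")
dfDb.createOrReplaceTempView("orders")
spark.sql("SELECT count(*) FROM events").show()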

ISSUES

  • A data lake can easily turn into a chaotic data garbage dump.
  • Data governance is hard to implement, as a data lake may hold cross-department and external data.
  • Sometimes it is unclear in which cases a data lake should replace, offload or work in parallel with a data warehouse or data marts. In all cases dimensional modelling remains a valuable practice.

The Data Lake - SUMMARY

  • All types of data are welcome.
  • Data is stored "as-is"; transformations are done later: Extract-Load-Transform (ELT) instead of ETL (see the sketch after this list).
  • Data is processed with schema-on-read. There is no predefined star schema before the transformation.
  • Massive parallelism and scalability come out of the box with all big data processing tools. We can use columnar storage (Parquet) without expensive MPP databases.
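A minimal ELT sketch under assumed paths and columns: land the raw data in the lake untouched, and shape it only when an analysis needs it:

# Extract + Load: store the incoming file "as-is" in a raw zone of the lake.
dfRaw = spark.read.json("incoming/clicks.json")
dfRaw.write.mode("append").parquet("lake/raw/clicks")

# Transform: later, and only as much as the analysis requires.
dfClicks = spark.read.parquet("lake/raw/clicks")
dfClicks.filter("country = 'NL'") \
        .groupBy("page") \
        .count() \
        .show()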

[Sketch: the data lake]

COMPARISON

Aspect       | Data Warehouse                                                                     | Data Lake
Data form    | Tabular format                                                                     | All formats
Data value   | High only                                                                          | High-value, medium-value and to-be-discovered
Ingestion    | ETL                                                                                | ELT
Data model   | Star and snowflake schemas with conformed dimensions or data marts and OLAP cubes | All representations are possible
Schema       | Schema-on-write (known before ingestion)                                           | Schema-on-read (on the fly at the time of the analysis)
Technology   | MPP databases, expensive in disks and connectivity                                 | Commodity hardware with parallelism as a first principle
Data quality | High, with effort for consistency and clear rules for accessibility               | Mixed, everything is possible
Users        | Business analysts                                                                  | Data scientists, business analysts & ML engineers
Analytics    | Reports and Business Intelligence visualisations                                  | Machine learning, graph analytics and data exploration

Original Link: https://dev.to/barbara/data-lake-explained-3cel
