October 13, 2021 04:36 pm GMT

Advanced MessagePack capabilities

Author: Eugene Leonovich
MessagePack is a binary format for data serialization. It is positioned by the authors as a more efficient alternative to JSON. Due to its speed and portability, it's often used as a format for data exchange in high-performance systems. The other reason this format became popular is that it's very easy to implement. Your favorite programming language probably already has several libraries designed to work with it.
In this article, I'm not going to talk about the design of MessagePack or compare it to similar products. There is a lot of information about that on the internet. However, there is not enough information about MessagePack's extension type system. So I'm going to explain what it is, show some examples, and talk about how to make serialization even more effective with extension types.

The Extension type

The MessagePack specification defines 9 base types:

  • Nil
  • Boolean
  • Integer
  • Float
  • String
  • Binary
  • Array
  • Map
  • Extension

Extension is a container designed for storing extension types. Let's look closely at how it works. It will help us with writing our own types.
Here is how the container is structured:
[Figure: the Extension container, laid out as Header | Type | Data]

Header is the container's header (1 to 5 bytes). It contains the payload size, i.e., the length of the Data field. To learn more about how the header is formed, take a look at the specification.
Type is the ID of the stored type, an 8-bit signed integer. Negative values are reserved for official types. User types' IDs can take any value in the range from 0 to 127.
Data is an arbitrary byte string up to 4 GiB long, which contains the actual data. The format of official types is described in the specification, while the format of user types may depend entirely on the developer's imagination.
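
To make the layout concrete, here is a minimal sketch that builds the container by hand, using only PHP's built-in pack(). The format bytes are taken from the MessagePack specification; the encodeExt() helper is a name made up for this illustration and is not part of any library:

```php
<?php

/**
 * Wraps $data into a MessagePack Extension container.
 * Illustrative helper, not a library function.
 */
function encodeExt(int $type, string $data): string
{
    if ($type < -128 || $type > 127) {
        throw new OutOfRangeException("Type must fit into a signed byte, $type given");
    }

    $length = strlen($data);
    $typeByte = pack('c', $type); // signed 8-bit integer

    // fixext 1/2/4/8/16: the payload size is implied by the format byte,
    // so the header is a single byte.
    $fixext = [1 => "\xd4", 2 => "\xd5", 4 => "\xd6", 8 => "\xd7", 16 => "\xd8"];
    if (isset($fixext[$length])) {
        return $fixext[$length].$typeByte.$data;
    }

    // ext 8/16/32: a format byte followed by a big-endian payload size,
    // which gives headers of 2, 3, and 5 bytes.
    if ($length <= 0xff) {
        return "\xc7".pack('C', $length).$typeByte.$data;
    }
    if ($length <= 0xffff) {
        return "\xc8".pack('n', $length).$typeByte.$data;
    }
    if ($length <= 0xffffffff) {
        return "\xc9".pack('N', $length).$typeByte.$data;
    }

    throw new LengthException('Payload is too long for an Extension container');
}

echo bin2hex(encodeExt(5, 'abc')), "\n"; // c70305616263
```

In the printed value, c7 03 is the header (ext 8, payload of 3 bytes), 05 is the type, and 616263 is the data.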

The list of official types currently includes only Timestamp with the ID of -1. Occasionally, we see suggestions to add new types (such as UUID, multidimensional arrays, or geo coordinates), but since the discussions are not very active, I wouldn't expect any updates in the near future.
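
As an illustration of an official type, the sketch below encodes and decodes the shortest Timestamp variant, "timestamp 32", which the specification defines as a fixext 4 container with type -1 holding an unsigned 32-bit number of seconds. The specification also describes 64-bit and 96-bit variants with nanoseconds, which are left out here. Both function names are hypothetical:

```php
<?php

/** Encodes whole seconds since the Unix epoch as a "timestamp 32" value. */
function encodeTimestamp32(int $seconds): string
{
    if ($seconds < 0 || $seconds > 0xffffffff) {
        throw new OutOfRangeException('timestamp 32 holds an unsigned 32-bit number of seconds');
    }

    // 0xd6 is the fixext 4 header, 0xff is type -1,
    // followed by 4 bytes of big-endian seconds.
    return "\xd6\xff".pack('N', $seconds);
}

/** Reads the seconds back from a "timestamp 32" value. */
function decodeTimestamp32(string $bytes): int
{
    if (strlen($bytes) !== 6 || $bytes[0] !== "\xd6" || $bytes[1] !== "\xff") {
        throw new InvalidArgumentException('Not a timestamp 32 value');
    }

    return unpack('N', substr($bytes, 2))[1];
}

$packed = encodeTimestamp32(time());
echo gmdate('Y-m-d H:i:s', decodeTimestamp32($packed)), "\n";
```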

Hello, World!

That's enough theory, let's start coding! For these examples, we'll use the msgpack.php MessagePack library since it provides a convenient API to handle extension types. I hope you'll find these code examples easy to understand even if you use other libraries.
Since I mentioned UUID, let's implement support for this data type as an example. To do so, we'll need to write an extension: a class that serializes and deserializes UUID values. We will use the symfony/uid library to make handling such values easier.

This example can be adapted for any UUID library, be it the popular ramsey/uuid, the PECL uuid module, or a user implementation.
Let's name our class UuidExtension. The class must implement the Extension interface:

use MessagePack\BufferUnpacker;
use MessagePack\Packer;
use MessagePack\TypeTransformer\Extension;
use Symfony\Component\Uid\Uuid;

final class UuidExtension implements Extension
{
    public function getType(): int
    {
        // TODO
    }

    public function pack(Packer $packer, mixed $value): ?string
    {
        // TODO
    }

    public function unpackExt(BufferUnpacker $unpacker, int $extLength): Uuid
    {
        // TODO
    }
}

We determined earlier what the type (ID) of the extension is, so we can easily implement the getType() method. In the simplest case, this method could return a fixed constant, globally defined for the whole project. However, to make the class more versatile, we'll let it define the type when initializing the extension. Let's add a constructor with one integer argument, $type:

/** @readonly */
private int $type;

public function __construct(int $type)
{
    if ($type < 0 || $type > 127) {
        throw new \OutOfRangeException(
            "Extension type is expected to be between 0 and 127, $type given"
        );
    }

    $this->type = $type;
}

public function getType(): int
{
    return $this->type;
}

Now let's implement the pack() method. From the method's signature, we can see that it takes two parameters: a Packer class instance and a $value of any type. The method must return either a serialized value (wrapped into the Extension container) or null if the class of the extension does not support the value type:

public function pack(Packer $packer, mixed $value): ?string
{
    if (!$value instanceof Uuid) {
        return null;
    }

    return $packer->packExt($this->type, $value->toBinary());
}

The reverse operation isn't much harder to implement. The unpackExt() method takes a BufferUnpacker instance and the length of the serialized data (the size of the Data field from the schema above). Since we've saved the binary representation of a UUID object in this field, all we need to do is read this data and build a Uuid object:

public function unpackExt(BufferUnpacker $unpacker, int $extLength): Uuid
{
    return Uuid::fromString($unpacker->read($extLength));
}

Our extension is ready! The last step is to register a class object with a specific ID. Let the ID be 0:

$uuidExt = new UuidExtension(0);
$packer = $packer->extendWith($uuidExt);
$unpacker = $unpacker->extendWith($uuidExt);

Let's make sure everything works correctly:

$uuid = new Uuid('7e3b84a4-0819-473a-9625-5d57ad1c9604');
$packed = $packer->pack($uuid);
$unpacked = $unpacker->reset($packed)->unpack();

assert($uuid->equals($unpacked));
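
To get a feel for what ends up on the wire, the following library-free sketch assembles the bytes by hand. It assumes the packer chooses the shortest possible header, which for a 16-byte payload is fixext 16, and compares the result with the same UUID sent as a plain 36-character string:

```php
<?php

$uuid = '7e3b84a4-0819-473a-9625-5d57ad1c9604';

// The 16 raw bytes of the UUID: the payload that goes into the Data field.
$binary = hex2bin(str_replace('-', '', $uuid));

// fixext 16 header (0xd8) + type 0 + 16 bytes of data
$asExtension = "\xd8\x00".$binary;

// str 8 header (0xd9) + length byte + 36 characters
$asString = "\xd9".chr(strlen($uuid)).$uuid;

printf(
    "extension: %d bytes, string: %d bytes\n",
    strlen($asExtension),
    strlen($asString)
); // extension: 18 bytes, string: 38 bytes
```

So besides restoring a proper Uuid object on the receiving side, the extension form is also less than half the size of the textual one.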

That was an example of a simple UUID extension. Similarly, you can add support for any other type used in your application: DateTime, Decimal, Money. Or you can write a versatile extension that allows serializing any object (as it was done in KPHP).
However, this is not the only use for extensions. I'll now show you some interesting examples that demonstrate other advantages of using extension types.

"Lorem ipsum," or compressing the incompressible

If you've ever inquired about MessagePack before, you probably know the phrase from its official website, msgpack.org: "It's like JSON, but fast and small."
In fact, if you compare how much memory space the same data occupies in JSON and MessagePack, you'll see why the latter is a much more compact format. For example, the number 100 takes 3 bytes in JSON and only 1 in MessagePack. The difference becomes more significant as the number's order of magnitude grows. For the maximum value of int64 (9223372036854775807), the size of the stored data differs by as much as 10 bytes (19 against 9)!
The same is true for boolean values: 4 or 5 bytes in JSON against 1 byte in MessagePack. It is also true for arrays because many syntactic symbols, such as commas separating the elements, colons separating keys from values, and brackets marking the array limits, don't exist in a binary format. Obviously, the larger the array is, the more syntactic litter accumulates along with the payload.
String values, however, are a little more complicated. If your strings don't consist entirely of quotation marks, line feeds, and other special symbols that require escaping, then you won't see a difference between their sizes in JSON and in MessagePack. For example, "foobar" has a length of 8 bytes in JSON and 7 in MessagePack. Note that the above only applies to UTF-8 strings. For binary strings, JSON's disadvantage against MessagePack is obvious.
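
These numbers are easy to check. The sketch below compares json_encode() output with sizes computed from the MessagePack specification's format table. The two size helpers are hypothetical, cover only non-negative integers and strings, and assume 64-bit PHP:

```php
<?php

/** Size in bytes of a non-negative integer in MessagePack. */
function msgpackUintSize(int $n): int
{
    if ($n < 0) {
        throw new InvalidArgumentException('This sketch covers non-negative integers only');
    }
    if ($n <= 0x7f) { return 1; }        // positive fixint
    if ($n <= 0xff) { return 2; }        // uint 8
    if ($n <= 0xffff) { return 3; }      // uint 16
    if ($n <= 0xffffffff) { return 5; }  // uint 32

    return 9;                            // uint 64
}

/** Size in bytes of a string in MessagePack. */
function msgpackStrSize(string $s): int
{
    $length = strlen($s);
    if ($length <= 31) { return 1 + $length; }      // fixstr
    if ($length <= 0xff) { return 2 + $length; }    // str 8
    if ($length <= 0xffff) { return 3 + $length; }  // str 16

    return 5 + $length;                             // str 32
}

foreach ([100, PHP_INT_MAX, 'foobar'] as $value) {
    $json = strlen(json_encode($value));
    $msgpack = is_int($value) ? msgpackUintSize($value) : msgpackStrSize($value);

    printf("%s: JSON %d bytes, MessagePack %d bytes\n", var_export($value, true), $json, $msgpack);
}
```

For the string, the only saving is one byte: JSON needs two quotation marks, while MessagePack needs a one-byte header.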

Knowing this peculiarity of MessagePack, you can have a good laugh reading articles that compare the two formats in terms of data compression efficiency while using mainly string data for the tests. Conclusions based on the results of such tests have little practical value. So take those articles skeptically and run comparative tests on *your own* data.

At some point, there were discussions about whether to add string compression (individual or in frames) to the specification to make string serialization more compact. However, the idea was rejected, and the implementation of this feature was left to users. So let's try it.
Let's create an extension that will compress long strings. We will use whatever compression tool is at hand, for example, zlib.

Choose the data compression algorithm based on the specifics of your data. For example, if you are working with lots of short strings, take a look at [SMAZ](https://github.com/antirez/smaz).

Let's start with the constructor for our new class, TextExtension. The first argument is the extension ID, and as a second optional argument, we'll add minimum string length. Strings shorter than this value will be serialized in a standard way, without compression. In this way, we will avoid cases where the compressed string ends up longer than the initial one:

final class TextExtension implements Extension
{
    /** @readonly */
    private int $type;

    /** @var positive-int */
    private int $minLength;

    public function __construct(int $type, int $minLength = 100)
    {
        ...

        $this->type = $type;
        $this->minLength = $minLength;
    }

    ...
}

To implement the pack() method, we might write something like this:

public function pack(Packer $packer, mixed $value): ?string
{
    if (!is_string($value)) {
        return null;
    }

    if (strlen($value) < $this->minLength) {
        return $packer->packStr($value);
    }

    // compress and pack
    ...
}

However, this wouldn't work. String is one of the basic types, so the packer will serialize it before our extension is called. msgpack.php is implemented this way for performance reasons: otherwise, before serializing each value, the packer would need to scan the available extensions, considerably slowing down the process.
Therefore, we need to tell the packer not to serialize certain strings as, you know, strings but to use an extension. As you might guess from the UUID example, it can be done via a value object. Let's call it Text, similar to the extension class:

/**
 * @psalm-immutable
 */
final class Text
{
    public function __construct(
        public string $str
    ) {}

    public function __toString(): string
    {
        return $this->str;
    }
}

So instead of

$packed = $packer->pack("a very long string");

we'll use a Text object to mark long strings:

$packed = $packer->pack(new Text("a very long string"));
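
The reason the wrapper helps can be pictured with a toy packer. This is a hypothetical sketch of the dispatch order described above, not msgpack.php's actual code. toyPack() and the placeholder return values exist only for illustration, and Text is redeclared so that the snippet runs on its own:

```php
<?php

final class Text
{
    public function __construct(public string $str) {}
}

/**
 * Toy dispatcher: base types are handled first, and extensions only see
 * the values that no base type has claimed.
 *
 * @param list<callable(mixed): ?string> $extensions
 */
function toyPack(mixed $value, array $extensions): string
{
    if (is_string($value)) {
        return 'plain string'; // extensions are never consulted
    }

    foreach ($extensions as $extension) {
        $packed = $extension($value);
        if ($packed !== null) {
            return $packed;
        }
    }

    throw new InvalidArgumentException('Unsupported type: '.get_debug_type($value));
}

// Stand-in for an extension's pack(): returns null for unsupported values.
$textExtension = static fn (mixed $value): ?string
    => $value instanceof Text ? 'compressed extension' : null;

echo toyPack('a very long string', [$textExtension]), "\n";           // plain string
echo toyPack(new Text('a very long string'), [$textExtension]), "\n"; // compressed extension
```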

Let's update the pack() method:

public function pack(Packer $packer, mixed $value): ?string
{
    if (!$value instanceof Text) {
        return null;
    }

    $length = strlen($value->str);
    if ($length < $this->minLength) {
        return $packer->packStr($value->str);
    }

    // compress and pack
    ...
}

Now we just need to compress the string and put the result in an Extension. Note that the minimum length limit does not guarantee that the string will take less space after compression. For this reason, you might want to compare the lengths of the compressed string and the original and choose whichever is more compact:

$context = deflate_init(ZLIB_ENCODING_GZIP);
$compressed = deflate_add($context, $value->str, ZLIB_FINISH);

return isset($compressed[$length - 1])
    ? $packer->packStr($value->str)
    : $packer->packExt($this->type, $compressed);

Deserialization:

public function unpackExt(BufferUnpacker $unpacker, int $extLength): string
{
    $compressed = $unpacker->read($extLength);
    $context = inflate_init(ZLIB_ENCODING_GZIP);

    return inflate_add($context, $compressed, ZLIB_FINISH);
}
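
The following standalone sketch shows why both the minimum length and the size comparison matter. It uses the same zlib calls as the extension, wrapped in two hypothetical helpers. The gzip framing alone adds a 10-byte header and an 8-byte trailer, so short inputs come out larger than they went in:

```php
<?php

/** Gzip-compresses a string with the same zlib calls the extension uses. */
function gzipCompress(string $data): string
{
    $context = deflate_init(ZLIB_ENCODING_GZIP);

    return deflate_add($context, $data, ZLIB_FINISH);
}

/** Restores a string produced by gzipCompress(). */
function gzipDecompress(string $compressed): string
{
    $context = inflate_init(ZLIB_ENCODING_GZIP);

    return inflate_add($context, $compressed, ZLIB_FINISH);
}

$short = 'foobar';
$long = str_repeat('Lorem ipsum dolor sit amet. ', 20);

foreach ([$short, $long] as $original) {
    $compressed = gzipCompress($original);

    printf("%d bytes -> %d bytes\n", strlen($original), strlen($compressed));

    // The round trip must return the original string unchanged.
    assert(gzipDecompress($compressed) === $original);
}
```

If the framing overhead matters for your data, a raw deflate stream (ZLIB_ENCODING_RAW on both sides) would avoid it, at the cost of losing gzip's checksum. Whether that trade-off is acceptable depends on your application.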

Let's see the result:

$longString = <<<STR
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
STR;

$packedString = $packer->pack($longString); // 448 bytes
$packedCompressedString = $packer->pack(new Text($longString)); // 291 bytes

In this example, we saved 157 bytes, or 35% of the standard serialization result, on a single string!

From "schema-less" to "schema-mixed"

Compressing long strings is not the only way to save space. MessagePack is a schemaless, or schema-on-read, format, which has its advantages and disadvantages. One of the disadvantages in comparison with schema-full (schema-on-write) formats is highly inefficient serialization of repeated data structures. For example, this happens with selections from a database where all elements in the resulting array have the same structure:

$userProfiles = [
    [
        'id' => 1,
        'first_name' => 'First name 1',
        'last_name' => 'Last name 1',
    ],
    [
        'id' => 2,
        'first_name' => 'First name 2',
        'last_name' => 'Last name 2',
    ],
    ...
    [
        'id' => 100,
        'first_name' => 'First name 100',
        'last_name' => 'Last name 100',
    ],
];

If you serialize this array with MessagePack, the repeated keys of each element in the array will take a substantial part of the total data size. But what if we could save the keys of such structured arrays just once? It would significantly cut down the size and also speed up serialization since the packer would have fewer operations to perform.
Like before, we are going to use extension types for that. Our type will be a value object wrapped around an arbitrary structured array:

/**
 * @psalm-immutable
 */
final class StructList
{
    public function __construct(
        public array $list,
    ) {}
}

If your project includes a library for database handling, there is probably a special class in that library to store table selection results. You can use this class as a type instead of/along with StructList.

Here is how we are going to serialize such arrays. First, we'll check the array size. Of course, if the array is empty or has only one element, there is no reason to store keys separately from values. We'll serialize arrays like these in a standard way.
In other cases, we'll first save a list of keys and then a list of values. We won't be storing an associative array list, which is the standard MessagePack option. Instead, we'll write data in a more compact form:
[Figure: compact layout inside the extension: keys array | number of rows | values of row 1 | values of row 2 | ...]

Implementation:

final class StructListExtension implements Extension
{
    ...

    public function pack(Packer $packer, mixed $value): ?string
    {
        if (!$value instanceof StructList) {
            return null;
        }

        $size = count($value->list);
        if ($size < 2) {
            return $packer->packArray($value->list);
        }

        $keys = array_keys(reset($value->list));

        $values = '';
        foreach ($value->list as $item) {
            foreach ($keys as $key) {
                $values .= $packer->pack($item[$key]);
            }
        }

        return $packer->packExt($this->type,
            $packer->packArray($keys).
            $packer->packArrayHeader($size).
            $values
        );
    }

    ...
}

To deserialize, we need to unpack the keys array and then use it to restore the initial array:

public function unpackExt(BufferUnpacker $unpacker, int $extLength): array
{
    $keys = $unpacker->unpackArray();
    $size = $unpacker->unpackArrayHeader();

    $list = [];
    for ($i = 0; $i < $size; ++$i) {
        foreach ($keys as $key) {
            $list[$i][$key] = $unpacker->unpack();
        }
    }

    return $list;
}

That's it! Now, if we serialize $userProfiles from the example above as a normal array and as a structured StructList, we'll see a great difference in size: the latter will be 47% more compact.
$packedList = $packer->pack($userProfiles); // 5287 bytes
$packedStructList = $packer->pack(new StructList($userProfiles)); // 2816 bytes
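
The two sizes can be reproduced without any library. The sketch below hand-encodes both layouts with a few minimal helpers that cover only the MessagePack formats this data set needs. The mp*() functions are hypothetical and written for this check, and the extension type is assumed to be 0:

```php
<?php

function mpInt(int $n): string
{
    if ($n < 0 || $n > 0xffff) {
        throw new OutOfRangeException('This sketch supports 0..65535 only');
    }
    if ($n <= 0x7f) { return chr($n); }          // positive fixint
    if ($n <= 0xff) { return "\xcc".chr($n); }   // uint 8

    return "\xcd".pack('n', $n);                 // uint 16
}

function mpStr(string $s): string
{
    $length = strlen($s);
    if ($length > 0xff) {
        throw new LengthException('This sketch supports strings up to 255 bytes');
    }

    // fixstr for up to 31 bytes, str 8 otherwise
    return $length <= 31 ? chr(0xa0 | $length).$s : "\xd9".chr($length).$s;
}

// Header of an array (0x90, 0xdc) or a map (0x80, 0xde):
// the "fix" form holds up to 15 items, the 16-bit form up to 65535.
function mpHeader(int $fix, string $wide, int $size): string
{
    return $size <= 15 ? chr($fix | $size) : $wide.pack('n', $size);
}

function mpValue(int|string $value): string
{
    return is_int($value) ? mpInt($value) : mpStr($value);
}

$userProfiles = [];
for ($i = 1; $i <= 100; ++$i) {
    $userProfiles[] = ['id' => $i, 'first_name' => "First name $i", 'last_name' => "Last name $i"];
}

// Standard layout: an array of maps, with the keys repeated in every row.
$plain = mpHeader(0x90, "\xdc", count($userProfiles));
foreach ($userProfiles as $row) {
    $plain .= mpHeader(0x80, "\xde", count($row));
    foreach ($row as $key => $value) {
        $plain .= mpStr($key).mpValue($value);
    }
}

// Compact layout: the keys once, then the row count, then bare values.
$keys = array_keys($userProfiles[0]);
$payload = mpHeader(0x90, "\xdc", count($keys)).implode('', array_map('mpStr', $keys));
$payload .= mpHeader(0x90, "\xdc", count($userProfiles));
foreach ($userProfiles as $row) {
    foreach ($keys as $key) {
        $payload .= mpValue($row[$key]);
    }
}

// ext 16 container: 0xc8, 2-byte payload size, type byte, payload
$struct = "\xc8".pack('n', strlen($payload))."\x00".$payload;

printf("plain: %d bytes, struct list: %d bytes\n", strlen($plain), strlen($struct));
// plain: 5287 bytes, struct list: 2816 bytes
```

Each row of the standard layout spends 25 bytes on the map header and the three keys, which adds up to 2500 of the 5287 bytes. The compact layout pays for the keys only once.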

We could go further and create a specialized Profiles type to store information about the array structure in the extension code. This way, we wouldn't need to save the keys array. However, in this case, we would lose versatility.

Conclusion

We've taken a look at just a few examples of using extension types in MessagePack. To see more examples, check the msgpack.php library. For the implementations of all extension types supported by the Tarantool protocol, see the tarantool/client library.
I hope this article gave you a sense of what extension types are and how they can be useful. If you're already using MessagePack but didn't know about this feature, this information may inspire you to revise your current methods and start writing custom types.
If you're just wondering which serialization format to choose for your next project, the article might help you make a reasonable choice, adding a point in favor of MessagePack :)

Links

Get Tarantool on our website
Get help in our Telegram channel


Original Link: https://dev.to/tarantool/advanced-messagepack-capabilities-4735
