This guide serves as a gentle introduction to OpenZL concepts via hands-on application of the CLI and API. Through this exercise, you will develop a conceptual understanding of the OpenZL engine, learn some common use patterns, and have a good grasp of the CLI.
Prerequisites
Build the library and CLI tool using CMake
The examples in this guide assume you have the OpenZL benchmark corpus on your machine. Download it here:
The input file names in the code snippets correspond to files in the openzl_corpus/getting_started folder.
Basic usage
Serial data in OpenZL
Let's start with a simple example. We will use the CLI to compress a file named myfile.txt using the serial compression profile.
The --profile option allows you to select a predefined compression profile. In this case, we are using the serial profile, which is suited for serial data.
More about profiles: Under the hood, a profile is simply a pre-configured OpenZL graph. Since we did not do any ACE training, the graph contains a single Zstd node. Other profiles contain more complex graphs, as we will see later.
Numeric data in OpenZL
Now let's try to compress a file of numbers. We have a preconfigured profile called le-i32 that takes advantage of the fixed-width structure of integer data to compress better than byte-wise LZ.
For comparison, we can try compressing the same file with the serial profile.
We can see the immediate benefit of moving from serial to le-i32.
Tip: The more you know about your data, the better you can tune your compression. Simply knowing that your data is integral dramatically improves compression.
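To make the tip concrete, here is a minimal sketch of why knowing the element width helps. It does not use OpenZL and does not claim to show how the le-i32 profile works internally. It uses Python's zlib as a stand-in for a byte-wise LZ compressor, and a delta step followed by a byte transposition as one assumed example of a width-aware transform. The synthetic data (sorted 32-bit integers with small gaps) is also an assumption chosen for illustration.

```python
import random
import struct
import zlib


def make_sorted_ints(count, seed=0):
    """Synthetic data: increasing 32-bit integers with small random gaps."""
    rng = random.Random(seed)
    value = 1_000_000
    values = []
    for _ in range(count):
        value += rng.randint(1, 10)
        values.append(value)
    return values


def pack_le_i32(values):
    """Serialize integers as little-endian signed 32-bit values."""
    return struct.pack(f"<{len(values)}i", *values)


def delta_then_transpose(values):
    """Width-aware transform: delta-encode, then group bytes by position."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    packed = pack_le_i32(deltas)
    # Plane k holds byte k of every integer, so similar bytes sit together.
    return b"".join(packed[k::4] for k in range(4))


values = make_sorted_ints(20_000)
raw = pack_le_i32(values)

# Byte-wise view: the compressor sees an undifferentiated byte stream.
bytewise_size = len(zlib.compress(raw, 9))
# Structured view: same information, rearranged using the known 4-byte width.
structured_size = len(zlib.compress(delta_then_transpose(values), 9))

print(f"raw bytes:        {len(raw)}")
print(f"byte-wise zlib:   {bytewise_size}")
print(f"width-aware zlib: {structured_size}")
```

The transform is lossless and leaves the byte count unchanged; only the arrangement differs. The back-end compressor is the same in both cases, so any gain comes purely from telling it that the input is a sequence of 4-byte integers.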
Custom Compressors
The true power of OpenZL lies in its configurability. Creating your own custom compressor is documented in the API reference. Once created, a compressor can be serialized into a CBOR file. The CLI supports compressing with a serialized compressor.
The main way of creating serialized compressors is via export from code. The other way is via training.
What's in a serialized compressor? A serialized compressor contains one or more serialized graphs along with associated LocalParams. Although there may be multiple graphs registered to a compressor, only one, the starting graph, is used when compressing with that compressor. You can explore a sample serialized compressor by putting a .zsc file into a CBOR-to-JSON converter.
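If no converter is at hand, a CBOR-to-JSON conversion can be sketched in a few dozen lines of standard-library Python. The decoder below is a minimal illustration, not a complete CBOR implementation. It handles integers, byte and text strings, arrays, maps, tags, booleans, null, and floats, and rejects indefinite-length items. It assumes that a .zsc file is a single CBOR item, which this guide implies but does not spell out. The sample bytes are hand-written for the demonstration and are not taken from a real compressor.

```python
import json
import struct


def decode_item(data, pos=0):
    """Decode one CBOR item at data[pos]; return (value, next_pos)."""
    major, info = data[pos] >> 5, data[pos] & 0x1F
    pos += 1

    if major == 7:  # simple values and floats
        if info in (20, 21):
            return info == 21, pos
        if info in (22, 23):  # null and undefined both become None
            return None, pos
        float_formats = {25: ">e", 26: ">f", 27: ">d"}
        if info in float_formats:
            fmt = float_formats[info]
            value = struct.unpack_from(fmt, data, pos)[0]
            return value, pos + struct.calcsize(fmt)
        raise ValueError(f"unsupported simple value {info}")

    # All other major types start with an unsigned integer argument.
    if info < 24:
        arg = info
    elif info <= 27:
        size = 1 << (info - 24)
        arg = int.from_bytes(data[pos:pos + size], "big")
        pos += size
    else:
        raise ValueError("reserved or indefinite-length encoding not supported")

    if major == 0:
        return arg, pos
    if major == 1:
        return -1 - arg, pos
    if major == 2:
        return bytes(data[pos:pos + arg]), pos + arg
    if major == 3:
        return data[pos:pos + arg].decode("utf-8"), pos + arg
    if major == 4:
        items = []
        for _ in range(arg):
            item, pos = decode_item(data, pos)
            items.append(item)
        return items, pos
    if major == 5:
        mapping = {}
        for _ in range(arg):
            key, pos = decode_item(data, pos)
            value, pos = decode_item(data, pos)
            # JSON object keys must be strings.
            mapping[key if isinstance(key, str) else repr(key)] = value
        return mapping, pos
    # major == 6: a tag; this sketch keeps only the tagged value.
    return decode_item(data, pos)


def cbor_to_json(data):
    value, _ = decode_item(data)
    # Byte strings have no JSON form, so they are rendered as hex text.
    return json.dumps(value, indent=2, default=lambda b: b.hex())


# Hand-written sample: the CBOR encoding of {"a": [1, h'ff']}.
sample = bytes.fromhex("a16161820141ff")
print(cbor_to_json(sample))
```

To inspect a real file, read its bytes in binary mode and pass them to cbor_to_json. If the file uses CBOR features outside this subset, the sketch raises ValueError rather than guessing.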
Training
Oftentimes, the performance of a configurable graph is much improved by preconfiguring it with a number of representative data samples. This process is called training the graph. Conceptually, the rationale and outcome of graph training are very similar to those of training Zstd dictionaries. However, the mechanics are quite different.
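The dictionary analogy can be illustrated without OpenZL or Zstd. The sketch below uses the preset-dictionary feature of Python's zlib as a stand-in. "Training" here is deliberately naive (the dictionary is just the concatenated training samples), and the record format is hypothetical. It shows only the shared idea that a compressor prepared on representative samples does better on similar unseen data. It says nothing about how OpenZL training works mechanically.

```python
import zlib


def make_record(i):
    """Hypothetical log-like record; most of its structure is shared."""
    return (
        f'{{"user_id": {1000 + i}, "event": "page_view", '
        f'"region": "us-east-1", "latency_ms": {20 + (i * 7) % 50}}}'
    ).encode("ascii")


def compress(data, zdict=None):
    """Compress with zlib, optionally primed with a preset dictionary."""
    if zdict is None:
        compressor = zlib.compressobj(level=9)
    else:
        compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(data) + compressor.flush()


def decompress(blob, zdict):
    """Decompress data that was compressed with the same dictionary."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()


# Keep training and evaluation data separate.
train_records = [make_record(i) for i in range(0, 20)]
test_records = [make_record(i) for i in range(100, 120)]

# Naive "training": the dictionary is the concatenation of the samples.
dictionary = b"".join(train_records)

plain_total = sum(len(compress(r)) for r in test_records)
trained_total = sum(len(compress(r, dictionary)) for r in test_records)

print(f"without dictionary: {plain_total} bytes")
print(f"with dictionary:    {trained_total} bytes")
```

As with a trained compressor, the dictionary must be available on both the compression and decompression sides.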
The train command takes an unconfigured compressor in the form of a profile or serialized compressor and outputs a serialized compressor with a configuration trained on the provided samples.
Terminology - (Un)configured graph: A graph is considered configured if all of its graph components are configured. A codec is configurable only via its LocalParams. Thus, an unconfigured codec is one that takes params but has blank LocalParams passed to it. A configured codec is then one with populated LocalParams. Selectors and function graphs additionally can set successor graph IDs. Then, a selector or function graph without a complete list of successors is also unconfigured.
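The definition above can be restated as a small toy model. The class and field names below (Component, takes_params, local_params, successors) are hypothetical and are not part of the OpenZL API. The sketch only encodes the rule as stated: a component is unconfigured if it takes params but has blank LocalParams, or if any successor slot is unset, and a graph is configured only if all of its components are.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Component:
    """Toy stand-in for a codec, selector, or function graph."""
    name: str
    takes_params: bool = False
    local_params: Dict[str, int] = field(default_factory=dict)
    # Only selectors and function graphs have successor slots.
    # None marks a slot whose successor graph ID has not been set.
    successors: List[Optional[int]] = field(default_factory=list)

    def is_configured(self) -> bool:
        if self.takes_params and not self.local_params:
            return False  # takes params but was given blank LocalParams
        return all(graph_id is not None for graph_id in self.successors)


def graph_is_configured(components: List[Component]) -> bool:
    """A graph is configured only if every component is configured."""
    return all(component.is_configured() for component in components)


codec = Component("codec", takes_params=True)
selector = Component("selector", successors=[3, None])
before = graph_is_configured([codec, selector])

# Populate the params (hypothetical key) and fill the missing successor.
codec.local_params = {"level": 3}
selector.successors = [3, 4]
after = graph_is_configured([codec, selector])

print(before, after)  # prints "False True"
```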
Let's see what happens when we run a compression using the trained graph. First, the untrained graph...
...and now the trained graph. The CLI allows you to compress using a serialized compressor. We will use the one generated by the train command.
The difference is stark. Training nets us a 10% ratio win at no other cost. (Technically, to be rigorous, we should have split our data into train and test directories, but the result is very similar.)
Training effectively: The typical training workflow is to collect a small number of representative samples, train on those samples, and use the trained compressor on the entire dataset. In production settings, data may evolve over time, so it is useful to collect new samples at designated intervals (maybe daily or weekly) and replace the trained compressor with a newly trained one.
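For the more rigorous evaluation mentioned above, the samples need to be divided into a training set and a held-out test set. A minimal sketch of a deterministic split follows. The file names, the 20% test fraction, and the helper name split_samples are assumptions for illustration, and nothing here is part of the OpenZL tooling.

```python
import random


def split_samples(paths, test_fraction=0.2, seed=0):
    """Return (train, test) lists; deterministic for a given seed."""
    shuffled = sorted(paths)  # sort first so the input order does not matter
    random.Random(seed).shuffle(shuffled)
    test_count = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_count:], shuffled[:test_count]


# Hypothetical sample names, for illustration only.
samples = [f"sample_{i:03d}.bin" for i in range(10)]
train_files, test_files = split_samples(samples)

print(len(train_files), "training samples,", len(test_files), "held-out samples")
```

The training list would then be copied into the directory used for training, and the compression ratio measured only on the held-out files. The same helper can be rerun on freshly collected samples whenever the compressor is retrained.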
Inline Training
Because the impact of training is so stark, the CLI requires that trainable compressors be trained in some form before using them. The typical way to do this is by first training and then compressing with the trained .zsc compressor, but inline training is also supported.