Using OpenZL

When building a good specialized compressor, there are two stages. The first stage is to parse the data to extract the structure. The second stage is to use good backend compressors that exploit the structure to achieve good compression. We have built tools for both of these stages in OpenZL. Refer to the flowchart below to learn which tools suit your application.

Parsing

graph TD
    structure[Is the data highly structured with repetitive consistent structure? See table in 'Stuctured Data' section for examples]
    no-parse[Do not parse and directly go to backend compression]
    schema[Does the data share a single schema?]
    supported[Is your data format already supported?]
    use-supported[Use the existing prebuilt compression profile]
    sddl-compat[Can your data satisfy requirements to be parsed with SDDL? Refer to 'SDDL Compatibility' in section below]
    use-sddl[Write a description of format layout in SDDL format]
    ext-parse[Can the data be easily parsed into a simpler structure outside of OpenZL to be SDDL compatible?]
    use-custom[Write a custom parser in C/C++ with OpenZL library]
    structure ---> |No|no-parse
    structure ---> |Yes|schema
    schema ---> |No|no-parse
    schema ---> |Yes|supported
    supported ---> |No|sddl-compat
    supported ---> |Yes|use-supported
    sddl-compat ---> |No|ext-parse
    sddl-compat ---> |Yes|use-sddl
    ext-parse ---> |No|use-custom
    ext-parse ---> |Yes and I prefer to|use-sddl

Structured Data

Good Formats	Bad formats
csv	html
parquet
protobuff
thrift
json

Most data formats can benefit from parsing although to different extents. Csv and json are both highly structured and repetitive and are representative examples as formats that benefit significantly from parsing correctly. On the other hand, html is structured but not repetitive and will therefore not benefit much from parsing.

SDDL Compatibility

While SDDL has the capability to describe any data format, some formats are easier to describe. Simpler formats have the following properties:

No nested structures
No variably sized structures

SDDL is generably suitable for data containing vector(s) with headers and footers. SDDL is however not suitable when there are stricter speed requirements due to being an interpreted language.

Parsing Next Steps

SDDL usage
Prebuilt format compressors in the CLI
Writing a custom parser

Backend Compression

graph TD
    tensor[Is your data a multi-dimensional tensor eg. image/ video?]
    nd[Is compression achieved from treating data as 1-dim sufficient?]
    special[Write your own prediction codec and custom graph. See notes below on tensor compression.]
    homogeneous[Is it possible to split your data into homogeneous samples?]
    generic[Use Compress Generic]
    structure[Is your data multiple 'structures' or just a singular chunk?]
    ace[Use ACE training]
    ace-cluster[Use Clustering + ACE training]
    custom[Write custom graph backends to suit your specialized application]
    tensor ---> |No|homogeneous
    tensor ---> |Yes|nd
    nd ---> |Yes, treat data as 1-dim and continue|homogeneous
    nd ---> |No|special
    homogeneous ---> |No|generic
    homogeneous ---> |Yes|structure
    structure ---> |No|ace
    structure ---> |Yes|ace-cluster
    ace ---> |Default ACE seeded graphs are insufficient|custom
    ace-cluster ---> |Default ACE seeded graphs are insufficient|custom

Tensor Compression

Typically image/video data benefits from using lossy codecs as they can achieve much greater compression ratios by ignoring some noise. If lossless compression is required, then a custom prediction codec followed by entropy coding (Huffman/ FSE) can work well.

ACE training

ACE is a powerful tool that uses codecs and graphs in OpenZL to create a good compressor based on the data it is compressing. In specialized applications where special codecs are required, ACE can also be seeded with these codecs to perform better.

On the Roadmap

There are more features we are building related to training such as dictionary support which we expect to improve trained results over time.

Backend Compression Next Steps

Try the numeric_array example
See ACE tutorial
Training in the CLI