Simple Data Description Language

Introduction

OpenZL's backend compression graphs are most effective on streams of self-similar, typed data.

The Simple Data Description Language (SDDL) module provides a lightweight tool to decompose simple structured data formats into their components, so that they can be effectively compressed by OpenZL's backends.

The SDDL functionality consists of a few pieces:

The SDDL graph, which takes a compiled description and uses it to parse and decompose the input into a number of output streams.
The SDDL compiler, which takes a description written in the SDDL language and translates it into the binary format that the SDDL engine accepts.
The SDDL profile, a pre-built compressor available in the CLI which runs SDDL on the input and passes all of its outputs to the generic clustering graph.

These components and how to use them are described below.

Writing an SDDL Description

The Simple Data Description Language is a Domain-Specific Language that makes it easy to describe to OpenZL the components that make up simple data formats.

The fundamental task of an SDDL Description is to associate each byte of the input stream with a corresponding Field, whose purpose is to give OpenZL a hint for how to group and handle that part of the input. SDDL has a number of built-in fields, like UInt32LE or Byte. Descriptions can also construct compound fields (analogous to C structs and arrays) out of other fields.

The operative part of an SDDL Description, the part that actually constructs that association between a part of the input stream and a Field, is an operation called consumption. Consumption is represented by the : operator and looks like : Byte, for example.

Introductory Example

Imagine you are trying to compress a stream of readings from an accelerometer. Suppose the format you receive is an array of samples, each of which is laid out like this C struct, with the members packed back to back (no padding) and stored little-endian:

struct AccelerometerSample {
    uint64_t timestamp;
    float x_accel;
    float y_accel;
    float z_accel;
};

A possible SDDL description of this format would be:

# Declare a new compound field "AccelerometerSample" which describes the
# structure of an individual sample using a Record, SDDL's aggregate type.
AccelerometerSample = {
    timestamp : UInt64LE;
    x_accel : Float32LE;
    y_accel : Float32LE;
    z_accel : Float32LE;
}

# Consume the whole input as an array of AccelerometerSample records.
: AccelerometerSample[]

When executing an SDDL Description, the SDDL engine starts at the beginning of the input. As the execution proceeds, and the description is applied to the input, each consumption operation associates the next byte(s), starting at the current position, with the consumed field and then advances past the consumed content.

The task of the description is complete when all of the input has been consumed. (If the description ends and the whole input hasn't been consumed, or if the description tries to consume bytes past the end of the input, the execution fails.)
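To make the execution model concrete, here is a rough Python emulation of what the accelerometer description asks the engine to do. It is only a sketch: the name split_samples and the dictionary-of-lists representation of the output streams are invented for illustration, and it assumes each sample is packed into exactly 20 bytes, which is what the description consumes.

```python
import struct

# One packed sample: uint64 timestamp followed by three float32 values,
# all little-endian, with no padding (an assumption of this sketch).
SAMPLE = struct.Struct("<Qfff")


def split_samples(data: bytes) -> dict:
    """Emulate `: AccelerometerSample[]` by routing each member to its own list."""
    if len(data) % SAMPLE.size != 0:
        # Mirrors the rule that the whole input must be consumed exactly.
        raise ValueError("input is not a whole number of samples")
    streams = {"timestamp": [], "x_accel": [], "y_accel": [], "z_accel": []}
    for ts, x, y, z in SAMPLE.iter_unpack(data):
        streams["timestamp"].append(ts)
        streams["x_accel"].append(x)
        streams["y_accel"].append(y)
        streams["z_accel"].append(z)
    return streams


raw = SAMPLE.pack(1000, 0.0, 0.5, -1.0) + SAMPLE.pack(1010, 0.25, 0.5, -1.0)
streams = split_samples(raw)
print(streams["timestamp"])  # [1000, 1010]
```

Grouping like values together in this way is what gives the compression backends self-similar, typed streams to work on.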

Examples

While there's an in-depth language/syntax reference available below, the best way to gain familiarity with SDDL is probably to look at some examples. Here are a few:

SAO

SAO File Description

# Send all header fields to the same output
HeaderInt = UInt32LE

Header = {
  STAR0: HeaderInt
  STAR1: HeaderInt  # First star number in file
  STARN: HeaderInt  # Number of stars in file
  STNUM: HeaderInt  # star i.d. number presence
  MPROP: HeaderInt  # True if proper motion is included
  NMAG : HeaderInt  # Number of magnitudes present
  NBENT: HeaderInt  # Number of bytes per star entry
}

Row = {
  SRA0 : Float64LE  # Right ascension in degrees
  SDEC0: Float64LE  # Declination in degrees
  IS   : Byte[2]    # Instrument status flags
  MAG  : UInt16LE   # Magnitude * 100
  XRPM : Float32LE  # X-axis rate per minute
  XDPM : Float32LE  # X-axis drift per minute
}

# Read the header
header: Header

# Validate format expectations
expect header.STNUM == 0
expect header.MPROP == 1
expect header.NMAG  == 1
expect header.NBENT == sizeof Row

# The header is followed by STARN records
data: Row[header.STARN]
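The check expect header.NBENT == sizeof Row relies on Row having a fixed size. As a quick sanity check of that arithmetic, the Python snippet below computes the same sizes with the standard struct module. The format strings are this sketch's own transcription of Header and Row, not something produced by OpenZL.

```python
import struct

# Transcription of Header: seven UInt32LE fields.
HEADER = struct.Struct("<7I")

# Transcription of Row: two float64, two raw bytes, one uint16, two float32,
# little-endian with no padding.
ROW = struct.Struct("<dd2sHff")

header_size = HEADER.size
row_size = ROW.size
print(header_size, row_size)  # 28 28
```

A file that passes the expect checks therefore stores 28 in NBENT, and its total size should be 28 + 28 * STARN bytes.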

Bitmap Images

BMP v3 Format Description

# Supports uncompressed BMP Version 3 formatted files, described here:
# https://gibberlings3.github.io/iesdp/file_formats/ie_formats/bmp.htm
# Convert images to this with:
# ```
# magick $SOURCE -compress none BMP3:$DEST
# ```

GenericU8 = UInt8
GenericU16 = UInt16LE
GenericU32 = UInt32LE

FileHeader = {
  signature   : GenericU16
  file_size   : GenericU32
  reserved    : GenericU32
  data_offset : GenericU32
}

file_header : FileHeader

expect file_header.signature == 0x4d42  # "BM"
expect file_header.reserved == 0

InfoHeader = {
  header_size      : GenericU32
  width            : GenericU32
  height           : GenericU32
  planes           : GenericU16
  bits_per_pixel   : GenericU16
  compression      : GenericU32
  image_size       : GenericU32
  x_pixels_per_m   : GenericU32
  y_pixels_per_m   : GenericU32
  colors_used      : GenericU32
  important_colors : GenericU32
}

info_header : InfoHeader

expect info_header.compression == 0

width = info_header.width
height = info_header.height
bits_per_pixel = info_header.bits_per_pixel

num_colors = (
  (bits_per_pixel == 1) * 2 +
  (bits_per_pixel == 4) * 16 +
  (bits_per_pixel == 8) * 256
)

ColorTableEntry = {
  red      : GenericU8;
  green    : GenericU8;
  blue     : GenericU8;
  reserved : GenericU8;
}

color_table_entries : ColorTableEntry[num_colors];

row1_bytes  = 4 * ((width + 31) / 32)
row4_bytes  = 4 * ((width +  7) /  8)
row8_bytes  = 4 * ((width +  3) /  4)
row16_bytes = 4 * ((width +  1) /  2)
row24_bytes = 4 * ((width +  1) * 3 / 4)

Image = {
  : GenericU8[row1_bytes][height][bits_per_pixel == 1]
  : GenericU8[row4_bytes][height][bits_per_pixel == 4]
  : GenericU8[row8_bytes][height][bits_per_pixel == 8]
  : GenericU16[row16_bytes / 2][height][bits_per_pixel == 16]
  : GenericU8[row24_bytes][height][bits_per_pixel == 24]
}

image : Image
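The rowN_bytes expressions round each pixel row up to a multiple of 4 bytes, which is how BMP stores rows. The Python check below compares them against the generic rounding formula. It assumes that SDDL's / truncates like integer division, which seems plausible given that all SDDL numbers are integers but is an assumption here; the function names are invented for this sketch.

```python
def padded_row_bytes(width: int, bits_per_pixel: int) -> int:
    """Generic form: round width * bits_per_pixel up to a whole 32-bit word."""
    return 4 * ((width * bits_per_pixel + 31) // 32)


def sddl_row_bytes(width: int, bits_per_pixel: int) -> int:
    """The per-depth expressions from the description, with / as integer division."""
    return {
        1: 4 * ((width + 31) // 32),
        4: 4 * ((width + 7) // 8),
        8: 4 * ((width + 3) // 4),
        16: 4 * ((width + 1) // 2),
        24: 4 * ((width + 1) * 3 // 4),
    }[bits_per_pixel]


# The two forms agree for every supported depth over a range of widths.
for bpp in (1, 4, 8, 16, 24):
    for w in range(1, 200):
        assert sddl_row_bytes(w, bpp) == padded_row_bytes(w, bpp)

print(sddl_row_bytes(5, 24))  # 16: 15 pixel bytes padded to a multiple of 4
```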

Binary STL Files

STLB Format Description

# Binary STL Files begin with an 80-byte header.
header : Byte[80]

# Followed by the number of triangles they store.
triangle_count : UInt32LE

Triangle = {
    normal_vec : Float32LE[3]
    vertices   : Float32LE[3][3]
    attributes : Byte[2]
}

# The rest of the file is triangles
: Triangle[triangle_count]

Running SDDL

The SDDL Profile

The easiest way to run SDDL over an input is via the SDDL profile built into the CLI.

Example

Start by writing an SDDL Description for your data. Here's a trivial one that splits the input into alternating integer streams:

cat <<EOF >desc.sddl
Row = {
  UInt32LE
  UInt32LE
}
num_rows = _rem / sizeof Row
: Row[num_rows]

# Consume any remaining input.
: Byte[_rem]
EOF

Then compress an input using that description:

./zli compress --profile sddl --profile-arg desc.sddl --train-inline my_input_file -o my_input_file.zl

Since the SDDL profile passes the results of the parse produced by the SDDL graph to the generic clustering graph, which needs to be trained, the --train-inline flag is important to get good performance.

If you are compressing many inputs with the same profile, it's much faster to do the training once and use the resulting trained profile for each input rather than training on each and every input separately:

./zli train --profile sddl --profile-arg desc.sddl input_dir/ -o trained_sddl.zlc

for f in input_dir/*; do
  ./zli compress --compressor trained_sddl.zlc "$f" -o "output_dir/$(basename "$f").zl"
done

The SDDL Graph

The SDDL Graph allows you to integrate SDDL into compressors other than the prebuilt SDDL profile. You can create an SDDL graph with ZL_Compressor_buildSDDLGraph:

ZL_Result_ZL_GraphID ZL_Compressor_buildSDDLGraph(ZL_Compressor* compressor,
                                                  const void* description,
                                                  size_t descriptionSize,
                                                  ZL_GraphID successor);

Builds a Simple Data Description Language graph with the provided (pre-compiled) description and successor graph.

See the SDDL page in the documentation for a complete description of this component.

The SDDL Graph has the following structure:

flowchart TD
    subgraph SDDL Graph
        Desc([Description]);
        Input([Input]);
        Conv@{ shape: procs, label: "Type Conversions"};
        Engine[SDDL Engine];
        Inst([Instructions]);
        Disp[/Dispatch Transform\];
        Succ[Successor Graph];

        Desc --> Engine;
        Input --> Engine;
        Engine --> Inst;
        Inst -->|Dispatch Instructions| Disp;
        Input --> Disp;
        Inst -->|Type Information| Conv;
        Disp ==>|Many Streams| Conv;
        Conv ==>|Many Streams| Succ;
    end

    OuterInput[ZL_Input] --> Input;
    OuterParam[ZL_LocalCopyParam] --> Desc;

This graph takes a single serial input and applies the given description to it, using that description to decompose the input into fields which are mapped to one or more output streams. These streams, as well as two control streams, are all sent to a single invocation of the successor graph. The successor must therefore be a multi-input graph able to accept any number of numeric and serial streams (at least).

(The control streams are a numeric stream containing, for each field, the index of the stream into which it was placed, and a numeric stream containing the size of each field. They are passed to the successor graph as the first and second streams respectively, followed in order by the streams into which the input has been dispatched. For more information about the dispatch operation, see the documentation for dispatchN_byTag and, in particular, ZL_Edge_runDispatchNode(), the underlying component that this graph uses to decompose the input.)
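As an illustration of the dispatch step (not OpenZL's implementation), the Python sketch below splits an input buffer using the two control streams: one output index and one size per field. The names dispatch and undispatch are invented for this example.

```python
def dispatch(data: bytes, tags: list, sizes: list, num_outputs: int) -> list:
    """Route consecutive segments of `data` to the outputs chosen by `tags`."""
    if len(tags) != len(sizes):
        raise ValueError("need exactly one tag per segment size")
    if sum(sizes) != len(data):
        raise ValueError("segments must cover the input exactly")
    outputs = [bytearray() for _ in range(num_outputs)]
    pos = 0
    for tag, size in zip(tags, sizes):
        outputs[tag] += data[pos:pos + size]
        pos += size
    return [bytes(o) for o in outputs]


def undispatch(outputs: list, tags: list, sizes: list) -> bytes:
    """Rebuild the original byte order from the outputs and control streams."""
    cursors = [0] * len(outputs)
    result = bytearray()
    for tag, size in zip(tags, sizes):
        start = cursors[tag]
        result += outputs[tag][start:start + size]
        cursors[tag] = start + size
    return bytes(result)


# Two 5-byte records: a 4-byte integer (output 0) and a 1-byte flag (output 1).
data = bytes([1, 0, 0, 0, 0xAA, 2, 0, 0, 0, 0xBB])
tags = [0, 1, 0, 1]
sizes = [4, 1, 4, 1]
outs = dispatch(data, tags, sizes, num_outputs=2)
print(outs[1])  # b'\xaa\xbb'
```

In this sketch the two control streams, together with the outputs, are enough to reconstruct the input exactly.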

The streams on which the successor graph is invoked are also tagged with int metadata, with key 0 set to their index. (For the moment. Future work may allow for more robust/stable tagging.) This makes this graph compatible with the generic clustering graph (see ZL_Clustering_registerGraph()), and the sddl profile in the demo CLI, for example, is set up that way, with the SDDL graph succeeded by the generic clusterer.

The description given to the graph must be pre-compiled. Use the SDDL Compiler, described in the next section, to translate your description to the compiled representation that the graph accepts.

The SDDL Compiler

The SDDL compiler can be invoked in a few different ways:

The SDDL Profile

The sddl profile in the CLI internally invokes the compiler as part of setting up the compressor.

Standalone Binary

A simple, standalone compiler binary is also available.

CMake Build Instructions

mkdir tmp
cd tmp
cmake .. -DOPENZL_BUILD_TOOLS=ON
make sddl_compiler

Make Build Instructions

make sddl_compiler

The SDDL compiler binary takes in a program from stdin and writes the compiled code to stdout. It takes any number of -v flags as arguments, which increase the verbosity of the compiler. This can be a useful aid when your program isn't compiling correctly.
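Because the binary reads from stdin and writes to stdout, it is easy to drive from a script. The helper below is a minimal sketch of that pattern in Python. The function name is invented, the command to run is passed in by the caller because the location of the built binary depends on your build, and the sketch assumes the compiler exits with a non-zero status on failure.

```python
import subprocess


def compile_description(source: str, compiler_cmd: list) -> bytes:
    """Pipe an SDDL description to the compiler's stdin and return its stdout."""
    result = subprocess.run(
        compiler_cmd,
        input=source.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

A call might look like compile_description(text, ["./sddl_compiler", "-v"]), where the path to the binary is a placeholder for wherever your build put it. Diagnostics written to stderr are left attached to the terminal so that -v output remains visible.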

C++ API

A C++ API is offered in tools/sddl/compiler/Compiler.h:

std::string compile(poly::string_view source, poly::string_view filename);

This function translates a program source in the Simple Data Description Language to the binary compiled representation that the SDDL graph accepts in OpenZL.

Parameters:

source – a human-readable description in the SDDL language.

filename – an optional string identifying the source of the source code, which is included in the formatted error message if compilation fails. If the input did not come from a source that can be meaningfully named for the consumer of error messages, use a placeholder such as [input].

Returns:

the compiled binary representation of the description, which the SDDL graph accepts. See the SDDL graph documentation for a description of the format of this representation.

Exceptions:

CompilerException – if compilation fails. Additional context can be found in the output log provided to the compiler during construction, if a suitably high verbosity has been selected.

Compiled Representation

The SDDL engine accepts, and the SDDL compiler produces, a translated / compiled representation of the description, which is a CBOR-serialized expression tree.

Danger

The compiled format is unstable and subject to change!

You should not expect SDDL descriptions compiled with one version of OpenZL to work with SDDL graphs from other versions of OpenZL. Nor should you currently build codegen that targets this unstable format.

SDDL Syntax Reference

This section serves as an in-depth reference for the features of SDDL. For an introduction to SDDL, see the overall SDDL documentation.

Warning

The SDDL Language is under active development. Its capabilities are expected to grow significantly. As part of that development, the syntax and semantics of existing features may change or break without warning.

An SDDL Description is a series of Statements. A Statement is an Expression terminated by a newline or a semicolon. There are multiple kinds of Expressions, described in the sections that follow.

Fields

A Field is a tool to identify the type of data stored in a part of the input as well as to group appearances of that type of data in the input.

Fields are declared either by instantiating a built-in field or by composing one or more existing fields into a compound field. Input content is associated with fields via the consume operation.

Built-In Fields

The following table enumerates the predefined fields available in SDDL:

Name	`ZL_Type`	Size	Sign?	Endian?	Returns
`Byte`	Serial	1	No	N/A	`int64_t`
`Int8`	Numeric	1	Yes	N/A	`int64_t`
`UInt8`	Numeric	1	No	N/A	`int64_t`
`Int16LE`	Numeric	2	Yes	Little	`int64_t`
`Int16BE`	Numeric	2	Yes	Big	`int64_t`
`UInt16LE`	Numeric	2	No	Little	`int64_t`
`UInt16BE`	Numeric	2	No	Big	`int64_t`
`Int32LE`	Numeric	4	Yes	Little	`int64_t`
`Int32BE`	Numeric	4	Yes	Big	`int64_t`
`UInt32LE`	Numeric	4	No	Little	`int64_t`
`UInt32BE`	Numeric	4	No	Big	`int64_t`
`Int64LE`	Numeric	8	Yes	Little	`int64_t`
`Int64BE`	Numeric	8	Yes	Big	`int64_t`
`UInt64LE`	Numeric	8	No	Little	`int64_t` (2s-complement)
`UInt64BE`	Numeric	8	No	Big	`int64_t` (2s-complement)
`Float8`	Numeric	1	Yes	N/A	None
`Float16LE`	Numeric	2	Yes	Little	None
`Float16BE`	Numeric	2	Yes	Big	None
`Float32LE`	Numeric	4	Yes	Little	None
`Float32BE`	Numeric	4	Yes	Big	None
`Float64LE`	Numeric	8	Yes	Little	None
`Float64BE`	Numeric	8	Yes	Big	None
`BFloat8`	Numeric	1	Yes	N/A	None
`BFloat16LE`	Numeric	2	Yes	Little	None
`BFloat16BE`	Numeric	2	Yes	Big	None
`BFloat32LE`	Numeric	4	Yes	Little	None
`BFloat32BE`	Numeric	4	Yes	Big	None
`BFloat64LE`	Numeric	8	Yes	Little	None
`BFloat64BE`	Numeric	8	Yes	Big	None

A consumption operation invoked on one of these field types evaluates to the value of the bytes consumed, interpreted according to the type of the field. E.g., if the next 4 bytes of the input are "\x01\x23\x45\x67", the expression result : UInt32BE will store the value 0x01234567 in result. "\xff" consumed as an Int8 will produce -1, whereas if it were instead consumed as UInt8 or Byte it would evaluate to 255.
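The same interpretations can be reproduced with Python's int.from_bytes, which makes the effect of endianness and signedness easy to experiment with. This is only an analogy for how the values are derived, not how the SDDL engine is implemented.

```python
raw = b"\x01\x23\x45\x67"

# The same four bytes read as UInt32BE and as UInt32LE.
as_u32be = int.from_bytes(raw, byteorder="big", signed=False)
as_u32le = int.from_bytes(raw, byteorder="little", signed=False)
print(hex(as_u32be))  # 0x1234567
print(hex(as_u32le))  # 0x67452301

# The single byte 0xff read as Int8 and as UInt8 (or Byte).
as_i8 = int.from_bytes(b"\xff", byteorder="big", signed=True)
as_u8 = int.from_bytes(b"\xff", byteorder="big", signed=False)
print(as_i8, as_u8)  # -1 255
```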

Compound Fields

Arrays

An array is constructed from a field and a length:

Foo = Byte
len = 1234

ArrayOfFooFields = Foo[len]

Consuming an Array consumes the inner field a number of times, equal to the provided length of the array.

Note

The field and length are evaluated when the array is declared, not when it is used. E.g.,

Foo = Byte
len = 42

Arr = Foo[len]

Foo = UInt32LE
len = 10

: Arr

This will consume 42 bytes, not 10 32-bit integers.

Records

A Record is a sequential collection of Fields. A Record is declared by listing its member fields as a comma-separated list between curly braces:

Row = {
  Byte,
  Byte,
  UInt32LE[8],
}

A member field of type T in a record can be expressed in the following three ways:

{
  T,       # Bare field, implies the consumption of the field
  : T,     # An instruction to consume the field, equivalent to the previous
  var : T, # Consumption of the field, with the result assigned to a variable
}

Consuming a Record expands to an in-order consumption of its member fields.

The return value of consuming a Record is a scope object, which contains variables captured during consumption. Fields' values will be captured into this returned scope when they are expressed in the variable : Field syntax. These values can be retrieved from the returned scope using the . member access operator.

Example

Header = {
  magic : UInt32LE,
  size : UInt32LE,
}

hdr : Header

expect hdr.magic == 1234567890
contents : Byte[hdr.size]

This example demonstrates the declaration, consumption, and then use of values of member fields of a Record.

Field Instances

Each field declaration instantiates a new field. Different instances of a field, even when they have otherwise identical properties, may be treated differently by the SDDL engine.

Each use of a built-in field name is considered a declaration. E.g.,

Foo = {
  UInt64LE
  UInt64LE
  UInt64LE
  UInt64LE
}

is different from

U64 = UInt64LE

Foo = {
  U64
  U64
  U64
  U64
}

In the former, four different integer fields are declared, whereas in the latter only one is.

In the future, we intend for the SDDL engine to make intelligent decisions about how to map each field to output streams. For the moment, though, each field instance is mechanically given its own output stream. This means that the two examples above produce different parses of the input:

In the former, the content consumed by Foo will be mapped to four different output streams, whereas in the latter it will all be sent to a single output stream.
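The difference can be modelled in a few lines of Python by letting object identity stand in for field instance identity. This is a toy model of the current one-stream-per-instance behavior described above; the class and function names are invented for the sketch.

```python
class FieldInstance:
    """Stand-in for one declared field instance; identity decides the stream."""

    def __init__(self, size: int):
        self.size = size


def parse_record(data: bytes, members: list) -> list:
    """Consume `members` in order, grouping bytes by field instance identity."""
    streams = {}  # id(instance) -> bytearray, in first-use order
    pos = 0
    for field in members:
        chunk = data[pos:pos + field.size]
        streams.setdefault(id(field), bytearray()).extend(chunk)
        pos += field.size
    return [bytes(s) for s in streams.values()]


data = bytes(range(32))  # four 8-byte values

# First form: every use of UInt64LE declares a new instance.
separate = parse_record(data, [FieldInstance(8) for _ in range(4)])

# Second form: U64 is declared once and reused four times.
u64 = FieldInstance(8)
shared = parse_record(data, [u64, u64, u64, u64])

print(len(separate), len(shared))  # 4 1
```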

Numbers

Besides Fields, the other kind of value that SDDL manipulates is Numbers.

Warning

All numbers in SDDL are signed 64-bit integers.

Smaller signed types are sign-extended, and smaller unsigned types zero-extended, to 64-bit width. Unsigned 64-bit fields are converted to signed 64-bit values via two's-complement conversion.

Numbers arise from integer literals that appear in the description, as the result of evaluating arithmetic expressions, or as the result of consuming a numeric type.
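The widening rules can be sketched in Python, whose integers are unbounded, by explicitly reinterpreting the low 64 bits as a signed value. The helper name to_sddl_number is invented for this illustration.

```python
def to_sddl_number(raw: bytes, signed: bool, byteorder: str = "little") -> int:
    """Widen an integer field of up to 8 bytes to a signed 64-bit value."""
    value = int.from_bytes(raw, byteorder=byteorder, signed=signed)
    # Reinterpret the low 64 bits as signed. The net effect only changes
    # unsigned 64-bit values of 2**63 or more, which wrap to negative numbers.
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


print(to_sddl_number(b"\xff\xff", signed=True))    # -1     (Int16LE)
print(to_sddl_number(b"\xff\xff", signed=False))   # 65535  (UInt16LE)
print(to_sddl_number(b"\xff" * 8, signed=False))   # -1     (UInt64LE)
```

The last case is the one to watch for: a UInt64 field holding a value of 2**63 or more shows up as a negative Number, so comparisons and array lengths derived from it will not behave as unsigned arithmetic would.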

Operations

Op	Syntax	Types	Effect
`expect`	`expect <A>`	I -> N	Fails the run if `A` evaluates to 0.
`consume`	`[L] : <R>`	V?, FL -> IS?	Consumes the field provided as `R`, stores the result into an optional variable `L`. The expression as a whole also evaluates to that result value.
`sizeof`	`sizeof <A>`	F -> I	Evaluates to the size in bytes of the given field `A`. Fails the run if invoked on anything other than a static field.
`assign`	`<L> = <R>`	V, * -> *	Evaluates the expression `R` and stores the resolved value in the variable `L`. The assignment expression as a whole also evaluates to that resolved value.
`member`	`<L>.<R>`	S, V -> *	Retrieves the value held by the variable `R` in the scope `L`. Cannot be used as the left-hand argument of assignment.
`bind`	`<L>(<R...>)`	L, T -> L	Binds the first `n` args of function `L` to the `n` comma-separated args `R`, returning a new function with `n` fewer unbound arguments.
`eq`	`<L> == <R>`	I, I -> I	Evaluates to 1 if `L` and `R` are equal, 0 otherwise.
`neq`	`<L> != <R>`	I, I -> I	Evaluates to 0 if `L` and `R` are equal, 1 otherwise.
`gt`	`<L> > <R>`	I, I -> I	Evaluates to 1 if `L` is greater than `R`, 0 otherwise.
`ge`	`<L> >= <R>`	I, I -> I	Evaluates to 1 if `L` is greater than or equal to `R`, 0 otherwise.
`lt`	`<L> < <R>`	I, I -> I	Evaluates to 1 if `L` is less than `R`, 0 otherwise.
`le`	`<L> <= <R>`	I, I -> I	Evaluates to 1 if `L` is less than or equal to `R`, 0 otherwise.
`neg`	`- <A>`	I -> I	Negates `A`.
`add`	`<L> + <R>`	I, I -> I	Evaluates to the sum of `L` and `R`.
`sub`	`<L> - <R>`	I, I -> I	Evaluates to the difference of `L` and `R`.
`mul`	`<L> * <R>`	I, I -> I	Evaluates to the product of `L` and `R`.
`div`	`<L> / <R>`	I, I -> I	Evaluates to the quotient of `L` divided by `R`.
`mod`	`<L> % <R>`	I, I -> I	Evaluates to the remainder of `L` divided by `R`.

In this table, the "Types" column denotes the signature of the operation, using the following abbreviations for the types of objects in SDDL:

F: Field.
I: Integer.
N: Null.
V: Variable name.
L: Function (a "lambda").
S: Scope.
T: Tuple.

Evaluation Order

Note that unlike C, which largely avoids defining the relative order in which different parts of an expression are evaluated (adding only a limited number of sequence points), SDDL defines the order in which the parts of an expression are evaluated:

For binary operators, the left-hand side is always evaluated before the right-hand side.

This means that the behavior of valid expressions is always well-defined. E.g.:

foo = :UInt32LE + 2 * :Byte

must sequence the 32-bit int before the byte. Of course, it's probably better to avoid relying on this.
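Python happens to guarantee left-to-right evaluation of operands as well, so the effect of this rule can be demonstrated with two reader functions that share a stream. The helper names are invented for this sketch.

```python
import io

stream = io.BytesIO(b"\x01\x00\x00\x00\x03")


def consume_u32le() -> int:
    """Read the next 4 bytes as an unsigned little-endian integer."""
    return int.from_bytes(stream.read(4), "little")


def consume_byte() -> int:
    """Read the next single byte as an unsigned integer."""
    return stream.read(1)[0]


# Mirrors `foo = :UInt32LE + 2 * :Byte`: the left operand is evaluated first,
# so the 4-byte read happens before the 1-byte read.
foo = consume_u32le() + 2 * consume_byte()
print(foo)  # 7
```

Had the byte been read first, the same input would have been split differently and produced a different value, which is why a defined order matters when operands have side effects on the cursor.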

Consuming Fields

Consuming Atomic Fields

Consuming an atomic Field of size N has the following effects:

The next N bytes of the input stream, starting at the current cursor position, are associated with the field being consumed. This means they will be dispatched into the output stream associated with this field.
The cursor is advanced N bytes.
Those bytes are interpreted according to the field type (type, signedness, endianness), and the consumption operation evaluates to that value.
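The three effects listed above can be modelled with a small cursor class. This is a toy Python model with invented names, intended only to show the sequence of steps, not to reflect the engine's internals.

```python
class Cursor:
    """Toy model of the three effects of consuming an atomic field."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0
        self.streams = {}  # field name -> bytes dispatched to that field

    def consume(self, field: str, size: int, signed: bool = False,
                byteorder: str = "little") -> int:
        if self.pos + size > len(self.data):
            raise ValueError("consumption runs past the end of the input")
        chunk = self.data[self.pos:self.pos + size]
        # 1. Associate the bytes with the field (append to its output stream).
        self.streams.setdefault(field, bytearray()).extend(chunk)
        # 2. Advance the cursor by the field size.
        self.pos += size
        # 3. Interpret the bytes and evaluate to that value.
        return int.from_bytes(chunk, byteorder=byteorder, signed=signed)


cur = Cursor(b"\x02\x00\xfe")
length = cur.consume("length", 2)             # like UInt16LE
delta = cur.consume("delta", 1, signed=True)  # like Int8
print(length, delta, cur.pos)  # 2 -2 3
```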

Consuming Compound Fields

Consuming a compound Field is recursively expanded to be a consumption of the leaf atomic fields that comprise the compound Field.

Consuming a Record evaluates to a scope object, as described under Records. Other compound fields do not currently produce a value when consumed.

Variables

A Variable holds a value, the result of evaluating an Expression.

Variable names begin with an alphabetical character or underscore and contain any number of subsequent underscores or alphanumeric characters.

Variables are assigned to via either the = operator or as part of the : operator:

var = 2 + 2
expect var == 4

# consumes the Byte field, and stores the value read from the field into var.
var : Byte

Other than when it appears on the left-hand side of an assignment operation, referring to a variable in an expression resolves to the value it contains.

Special Variables

The SDDL engine exposes some information to the description environment via the following implicit variables:

Name	Type	Value
`_rem`	Number	The number of bytes remaining in the input.

These variables cannot be assigned to.
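As a worked example of _rem, consider the alternating-integers description from the Running SDDL section, which computes num_rows = _rem / sizeof Row and then consumes the leftover bytes with : Byte[_rem]. The Python sketch below traces that arithmetic. It assumes / is integer division, and the function name plan is invented.

```python
ROW_SIZE = 8  # sizeof Row for a record of two UInt32LE fields


def plan(input_size: int) -> tuple:
    """Return (num_rows, trailing_bytes) for the alternating-integers description."""
    rem = input_size             # _rem before anything is consumed
    num_rows = rem // ROW_SIZE   # num_rows = _rem / sizeof Row
    rem -= num_rows * ROW_SIZE   # _rem after `: Row[num_rows]`
    return num_rows, rem         # `: Byte[_rem]` then consumes the rest


print(plan(21))  # (2, 5)
```

Because _rem shrinks as input is consumed, the same variable yields 21 on the first use and 5 on the second for a 21-byte input.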