Simple Data Description Language
Introduction
OpenZL's backend compression graphs are most effective on streams of self-similar, typed data.
The Simple Data Description Language (SDDL) module provides a lightweight tool to decompose simple structured data formats into their components, so that they can be effectively compressed by OpenZL's backends.
The SDDL functionality consists of three pieces:
- The SDDL graph, which takes a compiled description and uses it to parse and decompose the input into a number of output streams.
- The SDDL compiler, which takes a description written in the SDDL language and translates it into the binary format that the SDDL engine accepts.
- The SDDL profile, a pre-built compressor available in the CLI which runs SDDL on the input and passes all of its outputs to the generic clustering graph.
These components and how to use them are described below.
Writing an SDDL Description
The Simple Data Description Language is a Domain-Specific Language that makes it easy to describe to OpenZL the components that make up simple data formats.
The fundamental task of an SDDL Description is to associate each byte of the input stream with a corresponding Field, whose purpose is to give OpenZL a hint for how to group and handle that part of the input. SDDL has a number of built-in fields, like UInt32LE or Byte. Descriptions can also construct compound fields (analogous to C structs and arrays) out of other fields.
The operative part of an SDDL Description, the part that actually constructs that association between a part of the input stream and a Field, is an operation called consumption. Consumption is represented by the : operator and looks like : Byte, for example.
Introductory Example
Imagine you are trying to compress a stream of readings from an accelerometer. Suppose the format you receive is an array of samples, each of which looks like this in C:
A possible SDDL description of this format would be:
# Declare a new compound field "AccelerometerSample" which describes the
# structure of an individual sample using a Record, SDDL's aggregate type.
AccelerometerSample = {
  timestamp : UInt64LE;
  x_accel : Float32LE;
  y_accel : Float32LE;
  z_accel : Float32LE;
}
# Consume the whole input as an array of AccelerometerSample records.
: AccelerometerSample[]
When executing an SDDL Description, the SDDL engine starts at the beginning of the input. As the execution proceeds, and the description is applied to the input, each consumption operation associates the next byte(s), starting at the current position, with the consumed field and then advances past the consumed content.
The task of the description is complete when all of the input has been consumed. (If the description ends and the whole input hasn't been consumed, or if the description tries to consume bytes past the end of the input, the execution fails.)
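To make the cursor-based execution model concrete, here is a minimal Python sketch of what applying the accelerometer description amounts to. It is not the SDDL engine or any OpenZL API. It assumes each sample is a packed 20-byte record with no padding, and that the unsized array keeps consuming samples until the input is exhausted. The names SAMPLE and split_samples are hypothetical.

```python
import struct

# Assumed packed layout of one sample, mirroring AccelerometerSample:
# UInt64LE timestamp followed by three Float32LE values (20 bytes, no padding).
SAMPLE = struct.Struct("<Qfff")


def split_samples(data: bytes):
    """Walk the input with a cursor and group each field's values into its own list."""
    streams = {"timestamp": [], "x_accel": [], "y_accel": [], "z_accel": []}
    pos = 0
    while pos < len(data):
        if pos + SAMPLE.size > len(data):
            # Mirrors the rule that consuming past the end of the input fails.
            raise ValueError("input ends in the middle of a sample")
        t, x, y, z = SAMPLE.unpack_from(data, pos)
        streams["timestamp"].append(t)
        streams["x_accel"].append(x)
        streams["y_accel"].append(y)
        streams["z_accel"].append(z)
        pos += SAMPLE.size  # advance past the consumed bytes
    return streams


raw = SAMPLE.pack(1000, 0.5, -1.0, 9.75) + SAMPLE.pack(1010, 0.25, -1.5, 9.5)
streams = split_samples(raw)
```

Grouping like values together (all timestamps, all x readings, and so on) is the point of the decomposition: each resulting stream is self-similar and typed, which is what the backend compression graphs work best on.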
Examples
While there's an in-depth language/syntax reference available below, the best way to gain familiarity with SDDL is probably to look at some examples. So here are a couple:
SAO
SAO File Description
# Send all header fields to the same output
HeaderInt = UInt32LE
Header = {
  STAR0: HeaderInt
  STAR1: HeaderInt # First star number in file
  STARN: HeaderInt # Number of stars in file
  STNUM: HeaderInt # star i.d. number presence
  MPROP: HeaderInt # True if proper motion is included
  NMAG : HeaderInt # Number of magnitudes present
  NBENT: HeaderInt # Number of bytes per star entry
}
Row = {
  SRA0 : Float64LE # Right ascension in degrees
  SDEC0: Float64LE # Declination in degrees
  IS : Byte[2] # Instrument status flags
  MAG : UInt16LE # Magnitude * 100
  XRPM : Float32LE # X-axis rate per minute
  XDPM : Float32LE # X-axis drift per minute
}
# Read the header
header: Header
# Validate format expectations
expect header.STNUM == 0
expect header.MPROP == 1
expect header.NMAG == 1
expect header.NBENT == sizeof Row
# The header is followed by STARN records
data: Row[header.STARN]
Bitmap Images
BMP v3 Format Description
# Supports uncompressed BMP Version 3 formatted files, described here:
# https://gibberlings3.github.io/iesdp/file_formats/ie_formats/bmp.htm
# Convert images to this with:
# ```
# magick $SOURCE -compress none BMP3:$DEST
# ```
GenericU8 = UInt8
GenericU16 = UInt16LE
GenericU32 = UInt32LE
FileHeader = {
  signature : GenericU16
  file_size : GenericU32
  reserved : GenericU32
  data_offset : GenericU32
}
file_header : FileHeader
expect file_header.signature == 0x4d42 # "BM"
expect file_header.reserved == 0
InfoHeader = {
  header_size : GenericU32
  width : GenericU32
  height : GenericU32
  planes : GenericU16
  bits_per_pixel : GenericU16
  compression : GenericU32
  image_size : GenericU32
  x_pixels_per_m : GenericU32
  y_pixels_per_m : GenericU32
  colors_used : GenericU32
  important_colors : GenericU32
}
info_header : InfoHeader
expect info_header.compression == 0
width = info_header.width
height = info_header.height
bits_per_pixel = info_header.bits_per_pixel
num_colors = (
  (bits_per_pixel == 1) * 2 +
  (bits_per_pixel == 4) * 16 +
  (bits_per_pixel == 8) * 256
)
ColorTableEntry = {
  red : GenericU8;
  green : GenericU8;
  blue : GenericU8;
  reserved : GenericU8;
}
color_table_entries : ColorTableEntry[num_colors];
row1_bytes = 4 * ((width + 31) / 32)
row4_bytes = 4 * ((width + 7) / 8)
row8_bytes = 4 * ((width + 3) / 4)
row16_bytes = 4 * ((width + 1) / 2)
row24_bytes = 4 * ((width + 1) * 3 / 4)
Image = {
  : GenericU8[row1_bytes][height][bits_per_pixel == 1]
  : GenericU8[row4_bytes][height][bits_per_pixel == 4]
  : GenericU8[row8_bytes][height][bits_per_pixel == 8]
  : GenericU16[row16_bytes / 2][height][bits_per_pixel == 16]
  : GenericU8[row24_bytes][height][bits_per_pixel == 24]
}
image : Image
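The BMP description relies on two arithmetic idioms: comparisons evaluate to 1 or 0, so they can select a value (num_colors) or act as an array length that switches a member on or off, and each pixel row is padded to a multiple of 4 bytes. The Python sketch below re-derives those formulas so they can be checked by hand. It assumes that / in SDDL is integer division on non-negative values; the function names are hypothetical.

```python
def num_colors(bits_per_pixel: int) -> int:
    # Comparisons yield 1 or 0, so at most one term is non-zero.
    return (
        int(bits_per_pixel == 1) * 2
        + int(bits_per_pixel == 4) * 16
        + int(bits_per_pixel == 8) * 256
    )


def row_bytes(width: int, bits_per_pixel: int) -> int:
    """Bytes per pixel row, padded to a multiple of 4 (same formulas as the description)."""
    if bits_per_pixel == 1:
        return 4 * ((width + 31) // 32)
    if bits_per_pixel == 4:
        return 4 * ((width + 7) // 8)
    if bits_per_pixel == 8:
        return 4 * ((width + 3) // 4)
    if bits_per_pixel == 16:
        return 4 * ((width + 1) // 2)
    if bits_per_pixel == 24:
        return 4 * ((width + 1) * 3 // 4)
    raise ValueError("unsupported bit depth")
```

For example, a 5-pixel-wide 24-bit image has 15 bytes of pixel data per row, which row_bytes(5, 24) rounds up to 16.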
Binary STL Files
STLB Format Description
Running SDDL
The SDDL Profile
The easiest way to run SDDL over an input is via the SDDL profile built into the CLI.
Example
Start by writing an SDDL Description for your data. Here's a trivial one that splits the input into alternating integer streams:
cat <<EOF >desc.sddl
Row = {
  UInt32LE
  UInt32LE
}
num_rows = _rem / sizeof Row
: Row[num_rows]
# Consume any remaining input.
: Byte[_rem]
EOF
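As a quick check of how this description divides up an input, the following Python sketch computes how many rows it consumes and how many trailing bytes are left for the final Byte array. It assumes / is truncating integer division and that sizeof Row is 8 (two UInt32LE members); ROW_SIZE and plan are hypothetical names.

```python
ROW_SIZE = 8  # sizeof Row: two UInt32LE members


def plan(input_size: int):
    """Return (num_rows, trailing_bytes) the description would consume."""
    num_rows = input_size // ROW_SIZE            # _rem / sizeof Row
    trailing = input_size - num_rows * ROW_SIZE  # _rem once the rows are consumed
    return num_rows, trailing
```

A 27-byte input, for instance, yields three rows and three trailing bytes, so the whole input is consumed and the run succeeds.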
Then compress an input using that description:
./zli compress --profile sddl --profile-arg desc.sddl --train-inline my_input_file -o my_input_file.zl
Since the SDDL profile passes the results of the parse produced by the SDDL graph to the generic clustering graph, which needs to be trained, the --train-inline flag is important to get good performance.
If you are compressing many inputs with the same profile, it's much faster to do the training once and use the resulting trained profile for each input rather than training on each and every input separately:
The SDDL Graph
The SDDL Graph allows you to integrate SDDL into compressors other than the prebuilt SDDL profile. You can create an SDDL graph with ZL_Compressor_buildSDDLGraph:
Builds a Simple Data Description Language graph with the provided (pre-compiled) description and successor graph.
See the SDDL page in the documentation for a complete description of this component.
The SDDL Graph has the following structure:
flowchart TD
subgraph SDDL Graph
Desc([Description]);
Input([Input]);
Conv@{ shape: procs, label: "Type Conversions"};
Engine[SDDL Engine];
Inst([Instructions]);
Disp[/Dispatch Transform\];
Succ[Successor Graph];
Desc --> Engine;
Input --> Engine;
Engine --> Inst;
Inst -->|Dispatch Instructions| Disp;
Input --> Disp;
Inst -->|Type Information| Conv;
Disp ==>|Many Streams| Conv;
Conv ==>|Many Streams| Succ;
end
OuterInput[ZL_Input] --> Input;
OuterParam[ZL_LocalCopyParam] --> Desc;
This graph takes a single serial input and applies the given description to it, using that description to decompose the input into fields which are mapped to one or more output streams. These streams, as well as two control streams, are all sent to a single invocation of the successor graph. The successor must therefore be a multi-input graph able to accept any number of numeric and serial streams (at least).
(The control streams are: a numeric stream containing the stream indices into which each field has been placed and a numeric stream containing the size of each field. These are respectively the first and second streams passed into the successor graph, and the streams into which the input has been dispatched follow, in order. For more information about the dispatch operation, see the documentation for dispatchN_byTag and, in particular, ZL_Edge_runDispatchNode(), which is the underlying component that this graph uses to actually decompose the input.)
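The shape of what the successor graph receives can be illustrated with a small Python model. This is only a sketch of the idea described above, not the implementation of ZL_Edge_runDispatchNode(). The dispatch function and its (stream_index, size) instruction format are hypothetical.

```python
def dispatch(data: bytes, instructions):
    """instructions: list of (stream_index, size) pairs, one per consumed field."""
    tags = [tag for tag, _ in instructions]
    sizes = [size for _, size in instructions]
    num_streams = max(tags) + 1 if tags else 0
    outputs = [bytearray() for _ in range(num_streams)]
    pos = 0
    for tag, size in instructions:
        outputs[tag] += data[pos:pos + size]  # route this field's bytes
        pos += size
    if pos != len(data):
        raise ValueError("instructions must cover the whole input")
    # Control streams first (indices, then sizes), dispatched streams after.
    return [tags, sizes] + [bytes(o) for o in outputs]


# Two rows of (UInt32LE, UInt16LE), with each member sent to its own stream.
data = bytes([1, 0, 0, 0, 9, 0, 2, 0, 0, 0, 8, 0])
result = dispatch(data, [(0, 4), (1, 2), (0, 4), (1, 2)])
```

Here result holds four streams: the indices [0, 1, 0, 1], the sizes [4, 2, 4, 2], the eight bytes of the two 32-bit members, and the four bytes of the two 16-bit members.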
The streams on which the successor graph is invoked are also tagged with int metadata, with key 0 set to their index. (For the moment. Future work may allow for more robust/stable tagging.) This makes this graph compatible with the generic clustering graph (see ZL_Clustering_registerGraph()). The sddl profile in the demo CLI, for example, is set up that way, with the SDDL graph succeeded by the generic clusterer.
The description given to the graph must be pre-compiled. Use the SDDL Compiler to translate your description to the compiled representation that the graph accepts:
The SDDL Compiler
The SDDL compiler can be invoked in a few different ways:
The SDDL Profile
The sddl profile in the CLI internally invokes the compiler as part of setting up the compressor.
Standalone Binary
A simple, standalone compiler binary is also available.
The SDDL compiler binary takes in a program from stdin and writes the compiled code to stdout. It takes any number of -v flags as arguments, which increase the verbosity of the compiler. This can be a useful aid when your program isn't compiling correctly.
C++ API
A C++ API is offered in tools/sddl/compiler/Compiler.h:
This function translates a program source in the Simple Data Description Language to the binary compiled representation that the SDDL graph accepts in OpenZL.
Parameters:
- source – a human-readable description in the SDDL Language.
- filename – an optional string identifying the origin of the source code, which will be included in the pretty error message if compilation fails. If the input didn't come from a source readily identifiable with a string that would be meaningful to the user / consumer of error messages, you can use a placeholder such as [input].
Returns:
the compiled binary representation of the description, which the SDDL graph accepts. See the SDDL graph documentation for a description of the format of this representation.
Exceptions:
- CompilerException – if compilation fails. Additional context can be found in the output log provided to the compiler during construction, if a suitably high verbosity has been selected.
Compiled Representation
The SDDL engine accepts, and the SDDL compiler produces, a translated / compiled representation of the description, which is a CBOR-serialized expression tree.
Danger
The compiled format is unstable and subject to change!
You should not expect SDDL descriptions compiled with one version of OpenZL to work with SDDL graphs from other versions of OpenZL. Nor should you currently build codegen that targets this unstable format.
SDDL Syntax Reference
This section serves as an in-depth reference for the features of SDDL. For an introduction to SDDL, see the overall SDDL documentation.
Warning
The SDDL Language is under active development. Its capabilities are expected to grow significantly. As part of that development, the syntax and semantics of existing features may change or break without warning.
An SDDL Description is a series of Statements. A statement is a newline- or semicolon-terminated Expression. There are multiple kinds of Expressions:
Fields
A Field is a tool to identify the type of data stored in a part of the input as well as to group appearances of that type of data in the input.
Fields are declared either by instantiating a built-in field or by composing one or more existing fields into a compound field. Input content is associated with fields via the consume operation.
Built-In Fields
The following table enumerates the predefined fields available in SDDL:
Name | ZL_Type | Size | Sign? | Endian? | Returns |
---|---|---|---|---|---|
Byte | Serial | 1 | No | N/A | int64_t |
Int8 | Numeric | 1 | Yes | N/A | int64_t |
UInt8 | Numeric | 1 | No | N/A | int64_t |
Int16LE | Numeric | 2 | Yes | Little | int64_t |
Int16BE | Numeric | 2 | Yes | Big | int64_t |
UInt16LE | Numeric | 2 | No | Little | int64_t |
UInt16BE | Numeric | 2 | No | Big | int64_t |
Int32LE | Numeric | 4 | Yes | Little | int64_t |
Int32BE | Numeric | 4 | Yes | Big | int64_t |
UInt32LE | Numeric | 4 | No | Little | int64_t |
UInt32BE | Numeric | 4 | No | Big | int64_t |
Int64LE | Numeric | 8 | Yes | Little | int64_t |
Int64BE | Numeric | 8 | Yes | Big | int64_t |
UInt64LE | Numeric | 8 | No | Little | int64_t (2s-complement) |
UInt64BE | Numeric | 8 | No | Big | int64_t (2s-complement) |
Float8 | Numeric | 1 | Yes | N/A | None |
Float16LE | Numeric | 2 | Yes | Little | None |
Float16BE | Numeric | 2 | Yes | Big | None |
Float32LE | Numeric | 4 | Yes | Little | None |
Float32BE | Numeric | 4 | Yes | Big | None |
Float64LE | Numeric | 8 | Yes | Little | None |
Float64BE | Numeric | 8 | Yes | Big | None |
BFloat8 | Numeric | 1 | Yes | N/A | None |
BFloat16LE | Numeric | 2 | Yes | Little | None |
BFloat16BE | Numeric | 2 | Yes | Big | None |
BFloat32LE | Numeric | 4 | Yes | Little | None |
BFloat32BE | Numeric | 4 | Yes | Big | None |
BFloat64LE | Numeric | 8 | Yes | Little | None |
BFloat64BE | Numeric | 8 | Yes | Big | None |
A consumption operation invoked on one of these field types will evaluate to the value of the bytes consumed, interpreted according to the type of the field. E.g., if the next 4 bytes of the input are "\x01\x23\x45\x67", the expression result : UInt32BE will store the value 0x01234567 in result. "\xff" consumed as an Int8 will produce -1, whereas if it were instead consumed as UInt8 or Byte it would evaluate to 255.
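The same interpretations can be reproduced with Python's built-in int.from_bytes, which is a convenient way to sanity-check what a given field type should yield for known bytes. This is an illustration only and does not involve the SDDL engine; the variable names are hypothetical.

```python
raw = b"\x01\x23\x45\x67"

# The same four bytes read with each byte order.
as_u32be = int.from_bytes(raw, "big", signed=False)
as_u32le = int.from_bytes(raw, "little", signed=False)

# A single 0xff byte read as signed and as unsigned.
as_i8 = int.from_bytes(b"\xff", "big", signed=True)
as_u8 = int.from_bytes(b"\xff", "big", signed=False)
```

Reading the bytes as big-endian gives 0x01234567, while little-endian gives 0x67452301, which is why choosing the right LE or BE variant matters whenever the value is later used in an expression.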
Compound Fields
Arrays
An array is constructed from a field and a length:
Consuming an Array consumes the inner field a number of times, equal to the provided length of the array.
Note
The field and length are evaluated when the array is declared, not when it is used. E.g.,
This will consume 42 bytes, not 10 32-bit integers.
Records
A Record is a sequential collection of Fields. A Record is declared by listing its member fields as a comma-separated list between curly braces:
A member field of type T in a record can be expressed in the following three ways:
{
  T, # Bare field, implies the consumption of the field
  : T, # An instruction to consume the field, equivalent to the previous
  var : T, # Consumption of the field, with the result assigned to a variable
}
Consuming a Record expands to an in-order consumption of its member fields.
The return value of consuming a Record is a scope object, which contains variables captured during consumption. Fields' values will be captured into this returned scope when they are expressed in the variable : Field syntax. These values can be retrieved from the returned scope using the . member access operator.
Example
Header = {
  magic : UInt32LE,
  size : UInt32LE,
}
hdr : Header
expect hdr.magic == 1234567890
contents : Contents[hdr.size]
This example demonstrates the declaration, consumption, and then use of values of member fields of a Record.
Field Instances
Each field declaration instantiates a new field. Different instances of a field, even when they have otherwise identical properties, may be treated differently by the SDDL engine.
Each use of a built-in field name is considered a declaration. E.g.,
is different from
In the former, four different integer fields are declared, whereas in the latter only one is.
In the future, we intend for the SDDL engine to make intelligent decisions about how to map each field to output streams. For the moment, though, each field instance is mechanically given its own output stream. This means that the two examples above produce different parses of the input:
In the former, the content consumed by Foo will be mapped to four different output streams, whereas in the latter it will all be sent to a single output stream.
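The distinction between field instances behaves much like object identity in a general-purpose language. The Python sketch below models it under that analogy. It is a hypothetical illustration rather than a description of how the engine tracks instances, and the Field class and count_streams helper are invented for the purpose.

```python
class Field:
    """Stand-in for a declared field; every construction is a new instance."""

    def __init__(self, size: int):
        self.size = size


def count_streams(record) -> int:
    # One output stream per distinct field instance (identity, not equality).
    return len({id(field) for field in record})


# Each mention of the built-in declares a fresh instance.
separate = [Field(4), Field(4), Field(4), Field(4)]

# One declaration, referred to four times.
shared_field = Field(4)
shared = [shared_field] * 4
```

With this model, count_streams(separate) is 4 and count_streams(shared) is 1, matching the two outcomes described above. Aliasing a built-in once (as the SAO example does with HeaderInt) is therefore the way to send like-typed members to a single stream.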
Numbers
Besides Fields, the other kind of value that SDDL manipulates is Numbers.
Warning
All numbers in SDDL are signed 64-bit integers.
Smaller signed types are sign-extended to 64 bits, and smaller unsigned types are zero-extended. Unsigned 64-bit fields are converted to signed 64-bit values via two's-complement conversion.
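The widening rule can be sketched in Python as follows. The sketch assumes that unsigned fields narrower than 64 bits keep their non-negative value (consistent with UInt8 evaluating "\xff" to 255) and that only unsigned 64-bit values at or above 2^63 wrap to negative numbers. The to_number helper is hypothetical.

```python
def to_number(raw: bytes, signed: bool, byteorder: str = "little") -> int:
    """Interpret a field's bytes and widen the result to a signed 64-bit Number."""
    value = int.from_bytes(raw, byteorder, signed=signed)
    # Reinterpret anything above the int64 range as two's complement.
    if value >= 1 << 63:
        value -= 1 << 64
    return value
```

One practical consequence: a UInt64LE field holding 0xFFFFFFFFFFFFFFFF evaluates to -1, so comparisons such as expect x > 0 on very large unsigned values may not behave as they would in unsigned arithmetic.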
Numbers arise from integer literals that appear in the description, as the result of evaluating arithmetic expressions, or as the result of consuming a numeric type.
Operations
Op | Syntax | Types | Effect |
---|---|---|---|
expect | expect <A> | I -> N | Fails the run if A evaluates to 0. |
consume | [L] : <R> | V?, FL -> IS? | Consumes the field provided as R and stores the result into an optional variable L. The expression as a whole also evaluates to that result value. |
sizeof | sizeof <A> | F -> I | Evaluates to the size in bytes of the given field A. Fails the run if invoked on anything other than a static field. |
assign | <L> = <R> | V, * -> * | Resolves the value of the expression R and stores it in the variable L. The assignment expression also evaluates as a whole to that resolved value. |
member | <L>.<R> | S, V -> * | Retrieves the value held by the variable R in the scope L. Cannot be used as the left-hand argument of assignment. |
bind | <L>(<R...>) | L, T -> L | Binds the first n args of function L to the n comma-separated args R, returning a new function with n fewer unbound arguments. |
eq | <L> == <R> | I, I -> I | Evaluates to 1 if L and R are equal, 0 otherwise. |
neq | <L> != <R> | I, I -> I | Evaluates to 0 if L and R are equal, 1 otherwise. |
gt | <L> > <R> | I, I -> I | Evaluates to 1 if L is greater than R, 0 otherwise. |
ge | <L> >= <R> | I, I -> I | Evaluates to 1 if L is greater than or equal to R, 0 otherwise. |
lt | <L> < <R> | I, I -> I | Evaluates to 1 if L is less than R, 0 otherwise. |
le | <L> <= <R> | I, I -> I | Evaluates to 1 if L is less than or equal to R, 0 otherwise. |
neg | - <A> | I -> I | Negates A. |
add | <L> + <R> | I, I -> I | Evaluates to the sum of L and R. |
sub | <L> - <R> | I, I -> I | Evaluates to the difference of L and R. |
mul | <L> * <R> | I, I -> I | Evaluates to the product of L and R. |
div | <L> / <R> | I, I -> I | Evaluates to the quotient of L divided by R. |
mod | <L> % <R> | I, I -> I | Evaluates to the remainder of L divided by R. |
In this table, the "Types" column denotes the signature of the operation, using the following abbreviations for the types of objects in SDDL:
- F: Field.
- I: Integer.
- N: Null.
- V: Variable name.
- L: Function (a "lambda").
- S: Scope.
- T: Tuple.
Evaluation Order
Note that unlike C, which largely avoids defining the relative order in which different parts of an expression are evaluated (instead, only adding a limited number of sequence points), SDDL has a defined order in which the parts of an expression are evaluated:
For binary operators, the left-hand side is always evaluated before the right-hand side.
This means that the behavior of valid expressions is always well-defined. E.g.:
must sequence the 32-bit int before the byte. Of course, it's probably better to avoid relying on this.
Consuming Fields
Consuming Atomic Fields
Consuming an atomic Field of size N has the following effects:
- The next N bytes of the input stream, starting at the current cursor position, are associated with the field being consumed. This means they will be dispatched into the output stream associated with this field.
- The cursor is advanced N bytes.
- Those bytes are interpreted according to the field type (type, signedness, endianness), and the consumption operation evaluates to that value.
Consuming Compound Fields
Consuming a compound Field is recursively expanded to be a consumption of the leaf atomic fields that comprise the compound Field.
Currently, consuming an array does not produce a value. Consuming a Record evaluates to a scope object, as described under Records.
Variables
A Variable holds a value, the result of evaluating an Expression.
Variable names begin with an alphabetical character or underscore and contain any number of subsequent underscores or alphanumeric characters.
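The naming rule above corresponds to a familiar identifier pattern. As a sketch, assuming "alphabetical" and "alphanumeric" mean ASCII letters and digits (the text does not say whether other scripts are accepted), it can be checked in Python like this; NAME and is_valid_name are hypothetical helpers.

```python
import re

# A leading letter or underscore, then any number of letters, digits, or underscores.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_name(text: str) -> bool:
    return NAME.fullmatch(text) is not None
```

Names such as x_accel and _rem pass this check, while 2fast and a-b do not.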
Variables are assigned to via either the = operator or as part of the : operator:
var = 2 + 2
expect var == 4
# consumes the Byte field, and stores the value read from the field into var.
var : Byte
Other than when it appears on the left-hand side of an assignment operation, referring to a variable in an expression resolves to the value it contains.
Special Variables
The SDDL engine exposes some information to the description environment via the following implicit variables:
Name | Type | Value |
---|---|---|
_rem | Number | The number of bytes remaining in the input. |
These variables cannot be assigned to.