Arrays and Collections
Chapter 5 - Working with repeated data
Arrays are fundamental to binary format descriptions. Most real-world formats contain sequences of repeated structures—pixels in images, samples in audio, records in databases. This chapter explores how SDDL handles arrays, from simple fixed-size collections to complex structure-of-arrays layouts.
For a high-level overview of how arrays fit into SDDL's overall syntax, see the Language Elements Overview.
Because arrays repeat the same type many times, they amplify whatever instant-parse behavior that type has. An element that takes a small scan penalty once will take it thousands of times inside an array. The previous chapter's instant-parse rules are the lens through which we evaluate array performance. For concrete specs that use these layouts, see the coverage map entries for fixed arrays, parameterized arrays, and auto-sized arrays; a future row tracks the pending example for scan-based arrays.
Fixed-Size Arrays
A common array form specifies an exact element count.
Basic Syntax
values: Int32LE[100] # Exactly 100 integers
pixels: UInt8[1920][1080] # 2D array: 1920 × 1080 pixels
colors: RGB[256] # 256 RGB color entries
The number in brackets specifies how many elements to parse. This can be:
- A literal constant: Int32LE[100]
- A parameter: Int32LE[count], where count is a record parameter
- A variable: Int32LE[num_items], where num_items was defined with var
- An expression: Int32LE[width * height]
Multi-Dimensional Arrays
SDDL supports multi-dimensional arrays with multiple bracket pairs:
Layout is row-major: the rightmost index varies fastest. In the example above, the three channel values for pixel (0,0) come first, followed by channels for pixel (0,1), and so on.
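The row-major rule can be modeled in a few lines of Python. This is a sketch of the arithmetic only, not SDDL; the helper name row_major_offset and the UInt8[1080][1920][3] shape are assumptions chosen for illustration.

```python
def row_major_offset(index, shape, elem_size=1):
    """Byte offset of `index` in a row-major array with dimensions `shape`."""
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same number of dimensions")
    linear = 0
    for i, dim in zip(index, shape):
        if not 0 <= i < dim:
            raise IndexError(f"index {i} out of range for dimension of size {dim}")
        # Accumulate left to right so the rightmost index varies fastest.
        linear = linear * dim + i
    return linear * elem_size


# Hypothetical image shape: 1080 rows, 1920 pixels per row, 3 channels per pixel.
shape = (1080, 1920, 3)
print(row_major_offset((0, 0, 2), shape))  # 2: last channel of pixel (0,0)
print(row_major_offset((0, 1, 0), shape))  # 3: first channel of pixel (0,1)
print(row_major_offset((1, 0, 0), shape))  # 5760: first channel of row 1
```

Stepping the rightmost index moves one element; stepping the leftmost index moves a whole row (1920 × 3 = 5760 bytes here).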
Arrays of Records
Arrays work with any type, including records:
Record Point() = {
x: Float32LE,
y: Float32LE,
z: Float32LE
}
Record PointCloud(num_points) = {
points: Point[num_points]
}
count: UInt32LE
cloud: PointCloud(count)
This creates an array of Point structures laid out sequentially in memory.
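To make the sequential layout concrete, here is a minimal Python sketch that reads the same bytes by hand with the standard struct module. It assumes the count field is immediately followed by the points with no padding; parse_point_cloud is a hypothetical helper, not part of any SDDL tooling.

```python
import struct

POINT = struct.Struct("<3f")  # x, y, z as little-endian 32-bit floats (12 bytes)


def parse_point_cloud(data):
    """Parse a UInt32LE count followed by that many 12-byte Point records."""
    (count,) = struct.unpack_from("<I", data, 0)
    expected = 4 + count * POINT.size
    if len(data) < expected:
        raise ValueError(f"need {expected} bytes, got {len(data)}")
    # Point i starts at a fixed offset: 4 bytes of count, then i * 12.
    return [POINT.unpack_from(data, 4 + i * POINT.size) for i in range(count)]


sample = struct.pack("<I", 2) + POINT.pack(1.0, 2.0, 3.0) + POINT.pack(4.0, 5.0, 6.0)
print(parse_point_cloud(sample))  # [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
```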
Parameterized Element Types
Array elements can themselves be parameterized:
Record Block(size) = {
header: Bytes(4),
data: Bytes(size - 4)
}
Record FileData(block_size, block_count) = {
blocks: Block(block_size)[block_count]
}
Each Block in the array uses the same block_size parameter.
Auto-Sized Arrays
When the array size is neither known nor declared in advance, use an auto-sized array.
Reading Until End of Scope
An array without a size specification repeats until there's no more data:
Record Entry() = {
id: Int32LE,
value: Float64LE
}
entries: Entry[] # Read Entry records until end of file
This consumes all remaining data in the current scope, parsing Entry structures repeatedly.
Within a scope, only a single array can be auto-sized.
What Is "Scope"?
Scope depends on context:
- At the top level: End of file
- Inside a record with known size: End of the record's data
Example with explicit scope:
Record Container(payload_size) = {
header: Bytes(16),
entries: Entry[] # Reads until payload_size is exhausted
} pad_to payload_size
If the auto-sized array is not the last member of the scope, all following members in the same scope must be instant-parse, and all elements required to determine their size must be known before the array.
Partial Elements and Leftover Data
What happens when the remaining data in scope isn't evenly divisible by the element size? The behavior depends on the enclosing scope.
File Scope (or Non-Padded Record Scope)
When an auto-sized array appears at file level, or inside a record that uses neither pad_to nor pad_align, all remaining data must form complete elements. Leftover bytes that don't constitute a full element cause a parse error.
Record Point() = {
x: Float32LE, # 4 bytes
y: Float32LE, # 4 bytes
z: Float32LE # 4 bytes
}
points: Point[] # Each Point is 12 bytes
For example, depending on the file size:
- 96 bytes = 8 complete points: ✓ Valid
- 100 bytes = 8 complete points + 4 leftover bytes: ✗ Parse error
The 4 leftover bytes are insufficient to form a complete Point structure.
Padded Record Scope
When an auto-sized array appears within a record that uses pad_to or pad_align, leftover bytes are allowed and treated as padding.
Record Container(payload_size) = {
header: Bytes(16),
entries: Entry[] # Auto-sized array
} pad_to payload_size
If payload_size is 100 and Entry is 12 bytes:
- The header consumes 16 bytes
- Remaining space: 84 bytes
- 84 ÷ 12 = 7 complete entries (84 bytes used)
- Leftover: 0 bytes (padding not needed in this case)
If payload_size is 104:
- The header consumes 16 bytes
- Remaining space: 88 bytes
- 88 ÷ 12 = 7 complete entries (84 bytes used)
- Leftover: 4 bytes ✓ Valid (treated as padding)
This allows the Container to meet its pad_to size requirement while the array consumes only complete elements.
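The two leftover rules reduce to one division. The following Python sketch models them under the assumption that every element has the same fixed size; auto_sized_count and its padded_scope flag are hypothetical names used only for this illustration.

```python
def auto_sized_count(remaining_bytes, element_size, padded_scope=False):
    """Return (element_count, leftover_bytes) for an auto-sized array.

    In a non-padded scope, leftover bytes are an error.
    In a padded scope (pad_to / pad_align), they are accepted as padding.
    """
    if element_size <= 0:
        raise ValueError("element_size must be positive")
    count, leftover = divmod(remaining_bytes, element_size)
    if leftover and not padded_scope:
        raise ValueError(f"{leftover} leftover bytes do not form a complete element")
    return count, leftover


print(auto_sized_count(96, 12))                           # (8, 0)
print(auto_sized_count(104 - 16, 12, padded_scope=True))  # (7, 4)
```

Calling auto_sized_count(100, 12) without padded_scope raises ValueError, matching the file-scope parse error described earlier.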
Combining Fixed and Auto-Sized Dimensions
You can mix fixed and auto-sized dimensions:
Record Scanline(width) = {
pixels: RGB[width]
}
# Read scanlines until end of data
# Each scanline has a fixed width
image: Scanline(1920)[]
The first dimension is auto-sized (unknown number of scanlines); the second is fixed (exactly 1920 pixels per scanline).
The scan Keyword
When array elements require scanning (they're not instant-parse), you must use the scan keyword:
Record VariableEntry() = {
size: UInt16LE,
data: Bytes(size) # Size depends on local field: requires scan
}
# Must use 'scan' because VariableEntry requires scanning
entries: scan VariableEntry[]
The scan keyword makes the scanning requirement explicit. If you forget it, the compiler will remind you with an error.
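As a rough picture of what scanning involves, this Python sketch walks a buffer of size-prefixed entries shaped like VariableEntry. It is an illustrative model, not the SDDL runtime; scan_variable_entries is a hypothetical name.

```python
import struct


def scan_variable_entries(data):
    """Walk UInt16LE-size-prefixed entries until the buffer is exhausted."""
    entries = []
    offset = 0
    while offset < len(data):
        if offset + 2 > len(data):
            raise ValueError("truncated size field")
        (size,) = struct.unpack_from("<H", data, offset)
        start = offset + 2
        end = start + size
        if end > len(data):
            raise ValueError("truncated entry data")
        entries.append(data[start:end])
        # The next entry's position is only known after reading this size.
        offset = end
    return entries


buf = struct.pack("<H", 3) + b"abc" + struct.pack("<H", 0) + struct.pack("<H", 2) + b"hi"
print(scan_variable_entries(buf))  # [b'abc', b'', b'hi']
```

The loop cannot skip ahead: each iteration depends on the size read in the previous one, which is the cost the scan keyword makes visible.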
When Auto-Sized Arrays Make Sense
Auto-sized arrays are useful when:
- The format doesn't store a count
- You're parsing a stream of unknown length
- The file format is "records until EOF"
- You want to consume all remaining data
Arrays and Instant-Parse
Whether an array is instant-parse depends on its element type and size specification.
Instant-Parse Arrays
An array is instant-parse when:
- Its element type is instant-parse
- Its size depends only on parameters or constants
If Cell is instant-parse and width/height are parameters, the entire Grid is instant-parse.
Non-Instant-Parse Arrays
Arrays become non-instant-parse when:
- The element type requires scanning
- The size depends on local fields
This Data record is not instant-parse because count is a local field, not a parameter.
Note that Data.values itself is an instant-parse array, since Int32LE is instant-parse.
Array Layout and Alignment
Default Layout
Arrays are laid out with elements immediately following each other, with no padding between elements.
Aligned Elements
If the element type has alignment requirements, padding may appear between elements:
Record Aligned() = {
value: UInt8,
important: align(8) Int64LE # Must start at 8-byte boundary
}
records: Aligned[100]
Each Aligned record may include padding after value to ensure important starts at an 8-byte boundary. The array will include padding between elements to maintain alignment.
Record Padding
Records with pad_align or pad_to affect array layout:
Record Padded() = {
data: Bytes(10)
} pad_align 16 # Each record is a multiple of 16 bytes
records: Padded[50] # Each record occupies 16 bytes (10 data + 6 padding)
The padding ensures consistent element sizes, which can improve cache performance.
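The stride arithmetic behind pad_align can be sketched in Python. This assumes pad_align rounds each record's size up to the next multiple of the alignment and that the array itself starts at offset 0; align_up and padded_element_offsets are hypothetical helper names.

```python
def align_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return -(-size // alignment) * alignment  # ceiling division, then scale


def padded_element_offsets(data_size, alignment, count):
    """Return (stride, offsets) for `count` records padded to `alignment`."""
    stride = align_up(data_size, alignment)
    return stride, [i * stride for i in range(count)]


stride, offsets = padded_element_offsets(10, 16, 3)
print(stride)   # 16 (10 data bytes + 6 padding bytes)
print(offsets)  # [0, 16, 32]
```

With a uniform stride, the array of 50 Padded records occupies 50 × 16 = 800 bytes, and any record can be located by multiplication alone.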
Working with Arrays: Common Patterns
Pattern 1: Count-Prefixed Array
The most common array pattern: a count field followed by that many elements.
Pattern 2: Type-Specific Arrays
Different array types based on a flag or version:
Record Header() = {
version: UInt16LE,
count: UInt32LE
}
header: Header
when header.version == 1 { data_v1: DataV1[header.count] }
when header.version == 2 { data_v2: DataV2[header.count] }
Pattern 3: Nested Arrays
Arrays of arrays for grid or matrix data:
Record Row(width) = {
cells: Cell[width]
}
Record Grid(width, height) = {
rows: Row(width)[height]
}
Pattern 4: Chunked Data
Split data into fixed-size chunks:
Each chunk is exactly 4096 bytes, making the data easy to process in blocks.
Pattern 5: Mixed Fixed and Variable
Combine fixed-size headers with variable-size payloads:
Record PacketHeader() = {
id: UInt32LE,
payload_size: UInt16LE,
flags: UInt16LE
}
Record Packet() = {
header: PacketHeader,
payload: Bytes(header.payload_size)
}
packets: scan Packet[] # Variable-size packets until end of file
Structure-of-Arrays Layout
By default, arrays store elements sequentially: all fields of element 0, then all fields of element 1, and so on. This is called array-of-structures (AOS).
Array-of-Structures (Default)
Record Particle() = {
x: Float32LE,
y: Float32LE,
z: Float32LE,
vx: Float32LE,
vy: Float32LE,
vz: Float32LE
}
particles: Particle[1000]
Memory layout:
x0 y0 z0 vx0 vy0 vz0 | x1 y1 z1 vx1 vy1 vz1 | ... | x999 y999 z999 vx999 vy999 vz999
Each particle's data is contiguous.
Structure-of-Arrays with soa
Some binary formats store data in structure-of-arrays (SOA) layout.
Record Particle() = {
x: Float32LE,
y: Float32LE,
z: Float32LE,
vx: Float32LE,
vy: Float32LE,
vz: Float32LE
}
particles: soa Particle[1000]
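Memory layout:
x0 x1 ... x999 | y0 y1 ... y999 | z0 z1 ... z999 | vx0 vx1 ... vx999 | vy0 vy1 ... vy999 | vz0 vz1 ... vz999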
All x values are contiguous, then all y values, and so on.
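The difference between the two layouts comes down to which index is multiplied by what. This Python sketch computes byte offsets for the Particle fields in both layouts, assuming tightly packed 4-byte fields with no padding; aos_offset and soa_offset are hypothetical helpers.

```python
FIELDS = ["x", "y", "z", "vx", "vy", "vz"]
FIELD_SIZE = 4  # every Particle field is a Float32LE


def aos_offset(i, field):
    """Offset of `field` of particle i when whole particles are stored one after another."""
    return (i * len(FIELDS) + FIELDS.index(field)) * FIELD_SIZE


def soa_offset(i, field, count):
    """Offset of `field` of particle i when each field is stored as its own array."""
    return (FIELDS.index(field) * count + i) * FIELD_SIZE


print(aos_offset(1, "x"))        # 24: after the 6 fields of particle 0
print(soa_offset(1, "x", 1000))  # 4: right after x of particle 0
print(soa_offset(0, "y", 1000))  # 4000: after all 1000 x values
```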
Alternative description: You could describe this same layout as six separate arrays:
x_values: Float32LE[1000],
y_values: Float32LE[1000],
z_values: Float32LE[1000],
vx_values: Float32LE[1000],
vy_values: Float32LE[1000],
vz_values: Float32LE[1000]
Why use soa instead: Using soa Particle[1000] is clearer because:
- It makes explicit that these six arrays represent 1000 particles (semantic relationship)
- It ensures all arrays have the same count (enforced by the format)
- It's compatible with auto-sizing: soa Particle[] reads until end of scope
SOA and Nested Records
SOA layout only unwraps the first level of fields. If a field is itself a record, it remains in array-of-structures (AOS) layout:
Record Color() = {
r: UInt8,
g: UInt8,
b: UInt8
}
Record Pixel() = {
position: Int16LE,
color: Color # Nested record
}
pixels: soa Pixel[100]
Layout:
position0 position1 ... position99 | (r0 g0 b0) (r1 g1 b1) ... (r99 g99 b99)
This creates 2 arrays: one for position fields, one for color structures (each color structure contains r, g, b in sequence).
SOA and Array Members
If a record field is itself a fixed-size array, it remains as a contiguous block:
Record Item() = {
id: UInt32LE,
values: Float32LE[3] # Fixed-size array member
}
items: soa Item[100]
Layout:
id0 id1 ... id99 | (values0[0] values0[1] values0[2]) ... (values99[0] values99[1] values99[2])
This creates 2 arrays: one for id fields, one for values triplets.
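Because only the first level of fields is unwrapped, the resulting layout can be modeled as one block per top-level field, where nested records and array members count as a single element of their full size. The Python sketch below assumes the field arrays are stored back to back with no padding; soa_field_arrays is a hypothetical helper.

```python
def soa_field_arrays(fields, count):
    """Map each top-level field to (start_offset, byte_length) in an SOA layout.

    `fields` is a list of (name, bytes_per_element) pairs in declaration order.
    """
    layout = {}
    offset = 0
    for name, size in fields:
        length = size * count
        layout[name] = (offset, length)
        offset += length
    return layout


# Pixel: Int16LE position (2 bytes) and a nested Color record (3 bytes).
print(soa_field_arrays([("position", 2), ("color", 3)], 100))
# {'position': (0, 200), 'color': (200, 300)}

# Item: UInt32LE id (4 bytes) and Float32LE[3] values (12 bytes).
print(soa_field_arrays([("id", 4), ("values", 12)], 100))
# {'id': (0, 400), 'values': (400, 1200)}
```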
soa Requirements
The element type must be structured so that each field's size and layout can be determined from fields that appear earlier in the record definition. This ensures the SOA layout can be parsed field-array by field-array.
Simple instant-parse records (all fixed-size fields) always satisfy this requirement:
Record Point(dimension) = {
coords: Float32LE[dimension]
}
points: soa Point(3)[1000] # OK: instant-parse, simple structure
Records with variable-size fields work if dependencies follow field order:
Record VariableItem() = {
size: UInt16LE, # Array 0
data: Bytes(size) # Array 1: depends on earlier field
}
items: soa VariableItem[100] # OK: size array determines data array layout
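One way to picture field-array-by-field-array parsing is the Python sketch below. It assumes the layout is all size values first, followed by the data blobs concatenated in the same order; that reading of the layout, and the helper name parse_soa_variable_items, are assumptions made for illustration.

```python
import struct


def parse_soa_variable_items(data, count):
    """Parse `count` items stored as a UInt16LE size array followed by a data array."""
    # Array 0: all sizes, read in one step because each size is a fixed 2 bytes.
    sizes = struct.unpack_from(f"<{count}H", data, 0)
    offset = 2 * count
    items = []
    # Array 1: the blobs, whose boundaries come from the sizes parsed above.
    for size in sizes:
        end = offset + size
        if end > len(data):
            raise ValueError("truncated data array")
        items.append((size, data[offset:end]))
        offset = end
    return items


buf = struct.pack("<3H", 2, 0, 3) + b"hi" + b"abc"
print(parse_soa_variable_items(buf, 3))  # [(2, b'hi'), (0, b''), (3, b'abc')]
```

The second array can only be split once the first has been read, which is why the dependency has to point at an earlier field.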
soa Limitations
You can combine soa with auto-sizing only if the element type is instant-parse, meaning all of its fields have statically known sizes. The parser needs to calculate the element count from the remaining bytes, which is only possible when each element has a fixed, known size.
Practical Examples
Example 1: Time Series Data
Record TimePoint() = {
timestamp: Int64LE,
temperature: Float32LE,
humidity: Float32LE,
pressure: Float32LE
}
count: UInt32LE
# SOA: all timestamps together, all temperatures together, etc.
measurements: soa TimePoint[count]
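For comparison, a hand-written Python reader for the same layout might look like the sketch below. It assumes the count is followed directly by the four field arrays with no padding; parse_measurements is a hypothetical helper and the sample values are made up.

```python
import struct

# (field name, struct format code) in declaration order.
COLUMNS = [("timestamp", "q"), ("temperature", "f"), ("humidity", "f"), ("pressure", "f")]


def parse_measurements(data):
    """Parse a UInt32LE count followed by one contiguous array per TimePoint field."""
    (count,) = struct.unpack_from("<I", data, 0)
    offset = 4
    columns = {}
    for name, code in COLUMNS:
        fmt = f"<{count}{code}"
        columns[name] = list(struct.unpack_from(fmt, data, offset))
        offset += struct.calcsize(fmt)
    return columns


sample = (
    struct.pack("<I", 2)
    + struct.pack("<2q", 1000, 2000)
    + struct.pack("<2f", 20.5, 21.0)
    + struct.pack("<2f", 40.0, 41.5)
    + struct.pack("<2f", 1013.0, 1012.5)
)
cols = parse_measurements(sample)
print(cols["timestamp"])    # [1000, 2000]
print(cols["temperature"])  # [20.5, 21.0]
```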
Example 2: Custom Image Format
In this custom example, we'll imagine a format where each color plane, named a channel, is stored separately and contiguously, in SOA layout.
Record ImageHeader() = {
magic: Bytes(4),
width: UInt32LE,
height: UInt32LE,
channels: UInt8,
} pad_align 4
Union Pixel(channels) = {
case 1: gray: UInt8,
case 3: Record() {
r: UInt8,
g: UInt8,
b: UInt8
},
case 4: Record() {
r: UInt8,
g: UInt8,
b: UInt8,
a: UInt8
}
}
header: ImageHeader where header.magic == "IMGF"
var num_pixels = header.width * header.height
# This format uses structure-of-arrays layout
pixels: soa Pixel(header.channels)[num_pixels]
Note: The anonymous Record() { ... } syntax means the fields (r, g, b, a) become direct members of Pixel, not nested under a named field. When channels == 3, a Pixel has three direct fields: r, g, and b. This is essential for SOA to work: soa Pixel(3)[n] creates three separate arrays (one for each color channel), not a single array of nested structures.
The soa layout then stores all red channel values together, all green values together, and so on.
Example 3: Mixed-Format Records
Record RecordHeader() = {
type: UInt8,
size: UInt16LE
}
Record TextRecord(size) = {
text: Bytes(size)
}
Record BinaryRecord(size) = align(4) {
expect size % 4 == 0
data: Bytes(size)
}
# Variable-size, type-dispatched records
Record GenericRecord() = {
header: RecordHeader,
when header.type == 1 { text: TextRecord(header.size) },
when header.type == 2 { binary: BinaryRecord(header.size) }
}
records: scan GenericRecord[]
Performance Considerations
Instant-Parse vs Scan Performance
Instant-parse arrays can be parsed much faster:
# Fast: instant-parse
Record Fixed(size) = {
data: Bytes(size)
}
count: UInt32LE
fixed: Fixed(100)[count] # Size known from parameter
# Slower: requires scan
Record Variable() = {
size: UInt16LE,
data: Bytes(size) # Size from local field
}
count: UInt32LE
variable: scan Variable[count]
The first example allows parallel processing and zero-copy optimizations. The second requires sequential scanning.
This difference compounds with scale. If Variable appears once, the scan cost is small. If it appears 50,000 times in an array, the runtime must evaluate size 50,000 times and cannot jump directly to record i without walking through the previous i - 1 records. For compressors and other downstream tools, that means lower throughput and weaker opportunities for parallelism.
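The access-cost difference can be illustrated with a small Python model, assuming records shaped like Fixed(100) and Variable above. Both helper names are hypothetical, and the sketch ignores everything a real runtime would do beyond locating a record.

```python
import struct


def fixed_record_offset(i, record_size):
    """Instant-parse case: one multiplication, independent of i."""
    return i * record_size


def variable_record_offset(data, i):
    """Scan case: every earlier record's size field must be read first."""
    offset = 0
    for _ in range(i):
        (size,) = struct.unpack_from("<H", data, offset)
        offset += 2 + size  # 2-byte size field plus payload
    return offset


# Four variable records with payload sizes 5, 0, 7 and 3.
buf = b"".join(struct.pack("<H", n) + bytes(n) for n in (5, 0, 7, 3))
print(fixed_record_offset(3, 100))     # 300
print(variable_record_offset(buf, 3))  # 18
```

Finding record 3 in the variable case takes three dependent reads; in the fixed case the offset is known before any data is touched, which is what makes parallel processing possible.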
Summary
Arrays cover a few different patterns: fixed counts (Type[count]), parameterized counts that remain instant-parse, auto-sized arrays (Type[]) that read to the end of scope, and structure-of-arrays layouts (soa Type[count]). Arrays whose element sizes depend on parsed data require scan and must be walked sequentially, while instant-parse arrays support random access and parallel processing. Choose the form that matches the source format: constants or parameters when you know the length up front, auto-sized arrays when the file omits counts, and soa when the format stores each field as its own contiguous array.
Where to Go Next
- Alignment and Padding for padding directives that interact with arrays.
- Conditional & Variant Data if array elements are optional or variant.
- Variables and Expressions to compute sizes and indices.