Feather file format

Here is the general structure of the file:

<4-byte magic number "FEA1">
<ARRAY 0>
<ARRAY 1>
...
<ARRAY n>
<METADATA>
<uint32: metadata size>
<4-byte magic number "FEA1">

There is a stream of arrays laid out end-to-end, with the metadata appended on completion at the end of the file.

For the arrays themselves, the memory layout is type dependent.

Primitive arrays
Variable-length (BINARY and UTF8) arrays
Encoded variants of 1 and 2 (e.g. dictionary encoding)

Primitive arrays

<null bitmask, optional> <values>

The null bitmask is byte-aligned, but contains 1 bit per value indicating null (0) or not null (1) (this is also how PostgreSQL stores nulls). We use LSB right-to-left bit-numbering (reference). For example, an array with length 6 with [valid, valid, null, valid, null, valid] would have a single byte with the values

0 0 1 0 1 0 1 1

In C, the code to check a bit looks like:

bits[i / 8] & (1 << (i % 8))

The values is a contiguous array with elements equal to the fixed-width byte size.

If, in the metadata, null_count = 0, then the null bitmask is omitted.

Variable-length arrays

We use the Apache Arrow encoding for storing variable-length values

<null bitmask, optional> <int32_t* value offsets> <uint8_t* value data>

For an array of N elements, N+1 offsets are written so the end of last value can be determined.

Dictionary encoding

Dictionary-encoded data is stored in the following layout.

<uint32: dictionary size>
<array: dictionary values>
<array: dictionary indices, int32 type>

The total_bytes stored in the metadata is the cumulative size of all of these pieces of data.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FORMAT.md

FORMAT.md

Feather file format

Primitive arrays

Variable-length arrays

Dictionary encoding

Files

FORMAT.md

Latest commit

History

FORMAT.md

File metadata and controls

Feather file format

Primitive arrays

Variable-length arrays

Dictionary encoding