Attempt to document the current state of the union. Part 1: Layout #365
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -190,8 +190,12 @@ guarantee that `Option<&mut T>` has the same size as `&mut T`. | |
|
||
While all niches are invalid bit-patterns, not all invalid bit-patterns are | ||
niches. For example, the "all bits uninitialized" is an invalid bit-pattern for | ||
`&mut T`, but this bit-pattern cannot be used by layout optimizations, and is not a | ||
niche. | ||
`&mut T`, but this bit-pattern cannot be used by layout optimizations, and is not a niche. | ||
|
||
It is a surprisingly common misconception that niches can occur in [padding] bytes. | ||
They cannot: A niche representation must be invalid for `T`. | ||
But a padding byte must be irrelevant to the value of `T`. | ||
A byte that participates in deciding whether or not the representation is valid cannot, therefore, be a padding byte. | ||
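As a minimal sketch of the niche guarantee discussed above (the helper name `same_size_as_option` is hypothetical and only used for illustration), the following checks that the null niche of `&mut T` lets `Option<&mut T>` occupy the same space, while `u32`, which has no invalid bit-patterns, offers no niche:

```rust
use std::mem::size_of;

/// Hypothetical helper: true if wrapping `T` in `Option` costs no extra space.
fn same_size_as_option<T>() -> bool {
    size_of::<Option<T>>() == size_of::<T>()
}

fn main() {
    // `&mut u32` can never be null, so `None` can be stored in that niche.
    assert!(same_size_as_option::<&mut u32>());
    // Every bit-pattern of `u32` is valid, so there is no niche to reuse.
    assert!(!same_size_as_option::<u32>());
}
```

Note that this only observes the effect of a niche on size; it says nothing about padding bytes, which by definition cannot influence validity.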
|
||
#### Zero-sized type / ZST | ||
|
||
|
@@ -207,6 +211,8 @@ requirement of 2. | |
|
||
*Padding* (of a type `T`) refers to the space that the compiler leaves between fields of a struct or enum variant to satisfy alignment requirements, and before/after variants of a union or enum to make all variants equally sized. | ||
|
||
Padding for a type is either [interior padding], which is part of one or more fields, or [exterior padding], which is before, between, or after the fields. | ||
|
||
Padding can be thought of as `[Pad; N]` for some hypothetical type `Pad` (of size 1) with the following properties: | ||
* `Pad` is valid for any byte, i.e., it has the same validity invariant as `MaybeUninit<u8>`. | ||
* Copying `Pad` ignores the source byte, and writes *any* value to the target byte. Or, equivalently (in terms of Abstract Machine behavior), copying `Pad` marks the target byte as uninitialized. | ||
|
@@ -217,8 +223,26 @@ for all values `v` and lists of bytes `b` such that `v` and `b` are related at ` | |
changing `b` at index `i` to any other byte yields a `b'` such that `v` and `b'` are related (`Vrel_T(v, b')`). | ||
In other words, the byte at index `i` is entirely ignored by `Vrel_T` (the value relation for `T`), and two lists of bytes that only differ in padding bytes relate to the same value(s), if any. | ||
|
||
This definition works fine for product types (structs, tuples, arrays, ...). | ||
The desired notion of "padding byte" for enums and unions is still unclear. | ||
This definition works fine for product types (structs, tuples, arrays, ...) and for unions. The desired notion of "padding byte" for enums is still unclear. | ||
|
||
#### Padding (exterior) | ||
[exterior padding]: #exterior-padding | ||
|
||
Exterior padding bytes are [padding] bytes that are not part of one or more fields. They are exactly the padding bytes that are not [interior padding], and therefore must be before, between, or after the fields of the type. Padding that comes after all fields is called [tail padding]. | ||
|
||
#### Padding (interior) | ||
[interior padding]: #interior-padding | ||
|
||
Interior padding bytes are [padding] bytes that are part of one or more fields of a type. | ||
|
||
We can say that a field `f: F` *contains* the byte at index `i` in the type `T` if the layout of `T` places `f` at offset `j` and we have `j <= i < j + size_of::<F>()`. Then a padding byte is interior padding if and only if there exists a field `f` that contains it. | ||
|
||
It follows that, provided `T` is not an enum, for any such `f`, the byte at index `i - j` in `F` is a padding byte of `F`. This is because all values of `f` give rise to distinct values of `T`. | ||
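The "contains" relation above can be sketched as follows. The types `Inner` and `Outer` and the function `contains` are hypothetical names chosen for illustration, and the byte positions in the comments assume that `u16` has alignment 2, as on common platforms:

```rust
use std::mem::{offset_of, size_of};

#[allow(dead_code)]
#[repr(C)]
struct Inner {
    x: u8,  // byte 0 of `Inner`; byte 1 is padding so that `y` is 2-aligned
    y: u16, // bytes 2..4 of `Inner`
}

#[allow(dead_code)]
#[repr(C)]
struct Outer {
    a: u8,        // byte 0 of `Outer`; byte 1 is padding before `inner`
    inner: Inner, // bytes 2..6 of `Outer`
}

/// True if a field at offset `j` with size `size` contains byte index `i`,
/// i.e. `j <= i < j + size`.
fn contains(j: usize, size: usize, i: usize) -> bool {
    j <= i && i < j + size
}

fn main() {
    let j = offset_of!(Outer, inner);
    let size = size_of::<Inner>();
    // Byte 1 of `Outer` lies in no field: it is exterior padding.
    assert!(!contains(0, size_of::<u8>(), 1) && !contains(j, size, 1));
    // Byte 3 of `Outer` lies inside `inner`: it is interior padding,
    // and corresponds to byte `3 - j` of `Inner`.
    assert!(contains(j, size, 3));
    assert_eq!(3 - j, 1);
}
```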
alercah marked this conversation as resolved.
|
||
|
||
#### Padding (tail) | ||
[tail padding]: #tail-padding | ||
|
||
Tail padding is [exterior padding] that comes after all fields of a type. | ||
Why is it useful to distinguish tail padding from other padding? We have padding before, between (only for structs), and after fields. Seems strange to not treat them the same everywhere, and if we do treat them the same, we don't need these special terms. |
||
|
||
#### Place | ||
|
||
|
@@ -254,8 +278,8 @@ The relation should be functional for a fixed list of bytes (i.e., every list of | |
It is partial in both directions: not all values have a representation (e.g. the mathematical integer `300` has no representation at type `u8`), and not all lists of bytes correspond to a value of a specific type (e.g. lists of the wrong size correspond to no value, and the list consisting of the single byte `0x10` corresponds to no value of type `bool`). | ||
For a fixed value, there can be many representations (e.g., when considering type `#[repr(C)] Pair(u8, u16)`, the second byte is a [padding byte][padding] so changing it does not affect the value represented by a list of bytes). | ||
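For the `Pair` example above, a small sketch can make the position of the padding byte visible. The function name `second_field_offset` is hypothetical, and the expected numbers assume that `u16` has alignment 2:

```rust
use std::mem::size_of;

#[allow(dead_code)]
#[repr(C)]
struct Pair(u8, u16);

/// Offset of the second field, computed from the addresses of a live value.
fn second_field_offset() -> usize {
    let p = Pair(1, 2);
    let base = &p as *const Pair as usize;
    let field = &p.1 as *const u16 as usize;
    field - base
}

fn main() {
    // Byte 0 holds the `u8`, bytes 2..4 hold the `u16`,
    // which leaves byte 1 as the padding byte.
    assert_eq!(second_field_offset(), 2);
    assert_eq!(size_of::<Pair>(), 4);
}
```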
|
||
See the [value domain][value-domain] for an example how values and representation relations can be made more precise. | ||
See the [MiniRust page on values][minirust-values] for an example of how values and representation relations can be made more precise. | ||
|
||
[stacked-borrows]: https://github.com/rust-lang/unsafe-code-guidelines/blob/master/wip/stacked-borrows.md | ||
[value-domain]: https://github.com/rust-lang/unsafe-code-guidelines/tree/master/wip/value-domain.md | ||
[place-value-expr]: https://doc.rust-lang.org/reference/expressions.html#place-expressions-and-value-expressions | ||
[minirust-values]: https://github.com/RalfJung/minirust/blob/master/lang/values.md | ||
[place-value-expr]: https://doc.rust-lang.org/reference/expressions.html#place-expressions-and-value-expressions |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,10 +1,12 @@ | ||
# Layout of unions | ||
|
||
**Disclaimer:** This chapter represents the consensus from issue | ||
[#13]. The statements in here are not (yet) "guaranteed" | ||
not to change until an RFC ratifies them. | ||
**Disclaimer**: This chapter is a work-in-progress. | ||
What's contained here represents the consensus from [various issues][union | ||
discussion]. | ||
The statements in here are not (yet) "guaranteed" not to change until an RFC | ||
ratifies them. | ||
|
||
[#13]: https://github.com/rust-rfcs/unsafe-code-guidelines/issues/13 | ||
[union discussion]: https://github.com/rust-lang/unsafe-code-guidelines/blob/master/active_discussion/unions.md | ||
|
||
### Layout of individual union fields | ||
|
||
|
@@ -29,8 +31,23 @@ largest field, and the offset of each union field within its variant. How these | |
are picked depends on certain constraints like, for example, the alignment | ||
requirements of the fields, the `#[repr]` attribute of the `union`, etc. | ||
|
||
[padding]: ../glossary.md#padding | ||
[layout]: ../glossary.md#layout | ||
Unions may contain both [exterior][exterior padding] and [interior padding]. | ||
In the below diagram, exterior padding is marked by `EXT`, interior padding by | ||
`INT`, and bytes that are padding bytes for a particular field but not padding | ||
for the union as a whole are marked `NON`: | ||
|
||
```text | ||
[ EXT [ field0_0_ty | INT | field0_1_ty | INT ] EXT ] | ||
[ EXT [ field1_0_ty | INT | NON NON NON | INT ] EXT ] | ||
[ EXT | NON NON NON | INT [ field2_0_ty ] INT | EXT ] | ||
``` | ||
IMO a simpler way to define this is to say: for each 'row' (variant) of the union, we define the set of padding bytes as
Then the padding of the union is simply the intersection of those sets of padding bytes. In particular for unions, the exterior/interior padding distinction is kind of murky, since we can have a layout like
where the same byte is exterior padding in one variant and interior padding in another variant.
I was assuming that for a Repr-raw union, we would want the set of padding bytes to be exactly the exterior padding bytes, i.e., the bytes that are padding before and after all fields, ignoring field-internal padding. Is this not true?
It is not true for current unions; there were examples somewhere of types like
For a "preserve all bits union" I was assuming we have no padding at all. |
||
|
||
It is necessarily the case that any byte that is a non-padding byte for any | ||
field is also a non-padding byte for the union. | ||
It is, in general, **unspecified** whether the converse is true. | ||
Specific reprs may specify whether or not bytes are padding bytes. | ||
|
||
Padding bytes in unions have subtle implications; see the union [value model]. | ||
|
||
### Unions with default layout ("`repr(Rust)`") | ||
|
||
|
@@ -40,6 +57,12 @@ layout of Rust unions is, _in general_, **unspecified**. | |
That is, there are no _general_ guarantees about the offset of the fields, | ||
whether all fields have the same offset, what the call ABI of the union is, etc. | ||
|
||
**Major footgun:** The layout of `#[repr(Rust)]` unions allows for the [padding footgun] to also exist with `#[repr(Rust)]`, and this behaviour *is* | ||
extant in Rustc as of this writing. It is [**TBD**][#354] whether it will be | ||
removed. | ||
|
||
[padding footgun]: #padding-footgun | ||
|
||
<details><summary><b>Rationale</b></summary> | ||
|
||
As of this writing, we want to keep the option of using non-zero offsets open | ||
|
@@ -107,23 +130,24 @@ the layout of `U1` is **unspecified** because: | |
* `Zst2` is not a [1-ZST], and | ||
* `SomeOtherStruct` has an unspecified layout and could contain padding bytes. | ||
|
||
### C-compatible layout ("repr C") | ||
### C-compatible layout (`#[repr(C)]`) | ||
|
||
The layout of `repr(C)` unions follows the C layout scheme. Per sections | ||
[6.5.8.5] and [6.7.2.1.16] of the C11 specification, this means that the offset | ||
of every field is 0. Unsafe code can cast a pointer to the union to a field type | ||
to obtain a pointer to any field, and vice versa. | ||
The layout of `repr(C)` unions follows the C layout scheme. | ||
Per sections [6.5.8.5] and [6.7.2.1.16] of the C11 specification, this means that the offset | ||
of every field is 0, and the alignment of the union is the largest alignment of its fields. | ||
Unsafe code can cast a pointer to the union to a field type to obtain a pointer to any field, and vice versa. | ||
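A minimal sketch of these `repr(C)` guarantees follows; the union `IntOrByte` and the function `read_int_via_cast` are hypothetical names used only for illustration:

```rust
use std::mem::{align_of, size_of};

#[repr(C)]
union IntOrByte {
    byte: u8,
    int: u32,
}

/// Reads the `int` field through a pointer cast instead of a field access.
///
/// # Safety
/// All four bytes of `*u` must be initialized, e.g. because `u` was
/// created by writing the `int` field.
unsafe fn read_int_via_cast(u: &IntOrByte) -> u32 {
    // Every field is at offset 0, so a pointer to the union is also a
    // pointer to `int`; the union's alignment is at least that of `u32`.
    let p = u as *const IntOrByte as *const u32;
    unsafe { *p }
}

fn main() {
    // The alignment is the largest alignment among the fields.
    assert_eq!(align_of::<IntOrByte>(), align_of::<u32>());
    assert_eq!(size_of::<IntOrByte>(), 4);

    let u = IntOrByte { int: 0xDEAD_BEEF };
    // SAFETY: `u` was initialized through `int`.
    assert_eq!(unsafe { read_int_via_cast(&u) }, 0xDEAD_BEEF);
    // SAFETY: same as above; `byte` overlaps the first byte of `int`.
    let _first_byte: u8 = unsafe { u.byte };
}
```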
|
||
[6.5.8.5]: http://port70.net/~nsz/c/c11/n1570.html#6.5.8p5 | ||
[6.7.2.1.16]: http://port70.net/~nsz/c/c11/n1570.html#6.7.2.1p16 | ||
|
||
#### Padding | ||
|
||
Since all fields are at offset 0, `repr(C)` unions do not have padding before | ||
their fields. They can, however, have padding in each union variant *after* the | ||
field, to make all variants have the same size. | ||
Since all fields are at offset 0, `repr(C)` unions do not have [padding] before | ||
their fields. | ||
They can, however, have padding in each union variant *after* the field, to make | ||
all variants have the same size. | ||
|
||
Moreover, the entire union can have trailing padding, to make sure the size is a | ||
Moreover, the entire union can have tail padding, to make sure the size is a | ||
multiple of the alignment: | ||
|
||
```rust | ||
|
@@ -138,9 +162,47 @@ assert_eq!(size_of::<U>(), 2); | |
# } | ||
``` | ||
|
||
> **Note**: Fields are overlapped instead of laid out sequentially, so | ||
> unlike structs there is no "between the fields" that could be filled | ||
> with padding. | ||
#### Padding Footgun | ||
|
||
**Major footgun:** In general, unions can have padding. | ||
On some platform ABIs, such as the popular arm64, C unions may even have [interior padding] *within* fields, where a byte is padding in every variant: | ||
|
||
```rust | ||
# use std::{mem::{self, transmute}, slice}; | ||
#[repr(C)] | ||
union U { | ||
x: (u8, u16), // [u8, 1*pad, u16] | ||
y: (u8, u8), // [u8, 1*pad, u8, 1*pad] | ||
} | ||
let u = unsafe { mem::zeroed::<U>() }; // resulting bytes: [0, uninit (!!), 0, 0] | ||
let buf: &[u8] = unsafe { slice::from_raw_parts(transmute(&u), 4) }; // UB! | ||
``` | ||
|
||
This is undefined behaviour, which is surprising because it appears that the union is | ||
fully initialized and therefore ought to be castable to a slice. | ||
However, because byte 1 is a padding byte in both variants, it can be a padding | ||
byte in the union type as well. | ||
Therefore, when the result of `mem::zeroed` is copied onto the stack, the | ||
padding byte is uninitialized, not 0. | ||
|
||
This behaviour is platform-specific; on some platforms, this example may be | ||
well-defined. | ||
|
||
**C/C++ compatibility hazard:** This footgun exists for compatibility with the | ||
C/C++ platform ABI, but it is not well-known in C/C++ communities. | ||
In particular, unions are sometimes treated as non-exhaustive, with an expectation that they will be ABI-compatible with future versions of the same code that have additional variants for the union. | ||
Padding, however, can cause unions not to actually be ABI-compatible with future versions of the same type. | ||
(Note that adding a new variant might also change the parameter-passing conventions, even in the absence of padding!) | ||
So whenever dealing with a union that might have padding across FFI boundaries, you should be particularly careful not to assume that all bytes are initialized. | ||
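One way to act on this advice, sketched under the assumption that inspecting the raw bytes is actually needed, is to view the union as a slice of `MaybeUninit<u8>` rather than `u8`, since any byte (initialized or not) is valid at that type. The names `ByteThenShort`, `U`, and `raw_bytes` are hypothetical:

```rust
use std::mem::{size_of, MaybeUninit};
use std::slice;

#[allow(dead_code)]
#[repr(C)]
#[derive(Clone, Copy)]
struct ByteThenShort {
    a: u8,
    b: u16, // typically preceded by one padding byte
}

#[allow(dead_code)]
#[repr(C)]
union U {
    x: ByteThenShort,
    y: u8,
}

/// Views the union as possibly-uninitialized bytes.
fn raw_bytes(u: &U) -> &[MaybeUninit<u8>] {
    // SAFETY: the slice covers exactly the memory of `*u`, and
    // `MaybeUninit<u8>` places no validity requirement on its bytes.
    unsafe { slice::from_raw_parts(u as *const U as *const MaybeUninit<u8>, size_of::<U>()) }
}

fn main() {
    let u = U { y: 7 };
    let bytes = raw_bytes(&u);
    assert_eq!(bytes.len(), size_of::<U>());
    // SAFETY: byte 0 is the `y` field, which was just written.
    assert_eq!(unsafe { bytes[0].assume_init() }, 7);
    // The remaining bytes must not be assumed initialized.
}
```

Each byte can then be converted with `assume_init` only where the caller knows, from the layout and the active field, that it is initialized.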
|
||
<details><summary><b>Rationale</b></summary> | ||
|
||
Look. It wasn't our idea. | ||
|
||
We could try to limit the blast radius to `extern "C"` functions, but really, | ||
that's just sawing off the end of the footgun. | ||
|
||
</details> | ||
|
||
|
||
#### Zero-sized fields | ||
|
||
|
@@ -172,4 +234,63 @@ translation of that code into Rust will not produce a compatible result. Refer | |
to the [struct chapter](structs-and-tuples.md#c-compatible-layout-repr-c) for | ||
further details. | ||
|
||
### Transparent layout (`#[repr(transparent)]`) | ||
|
||
`#[repr(transparent)]` is currently unstable for unions, but [RFC 2645] | ||
documents most of its semantics. | ||
Notably, it causes unions to be passed using the same ABI as the non-1-ZST | ||
field. | ||
|
||
**Major footgun:** Matching the interior ABI means that all padding bytes of the | ||
non-1-ZST field will also be padding bytes of the union, so the [interior | ||
padding footgun] exists with `#[repr(transparent)]` unions. | ||
|
||
**Note:** If `U` is a transparent union wrapping a `T`, `U` may not inherit | ||
`T`'s niches, and therefore `Option<U>` and `Option<T>`, for instance, will not | ||
necessarily have the same layout or even the same size. | ||
|
||
This is because, if `U` contains any zero-sized fields in addition to the `T` | ||
field, the [value model] forces `U` to support uninitialized bytes, and that in | ||
turn prevents `T`'s niches from being present in `U`. | ||
Currently, `U` also supports uninitialized bytes if it does not contain any | ||
additional fields, but it is [**TBD**][#364] if single-field transparent unions | ||
might support niches. | ||
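Since `#[repr(transparent)]` on user-defined unions is unstable, the loss of niches can be illustrated on stable Rust with the standard library's `MaybeUninit<T>`, which is declared as a `#[repr(transparent)]` union with an additional zero-sized field. This is only a sketch using `MaybeUninit` as a stand-in; the function name `option_sizes` is hypothetical, and the sizes are the ones given in the `MaybeUninit` documentation:

```rust
use std::mem::{size_of, MaybeUninit};

/// Sizes of `Option<bool>` and `Option<MaybeUninit<bool>>`, in that order.
fn option_sizes() -> (usize, usize) {
    (
        size_of::<Option<bool>>(),
        size_of::<Option<MaybeUninit<bool>>>(),
    )
}

fn main() {
    // The wrapper itself has the same size as the wrapped type...
    assert_eq!(size_of::<MaybeUninit<bool>>(), size_of::<bool>());

    let (plain, wrapped) = option_sizes();
    // ...but `bool`'s niche is usable by `Option` only in the plain case.
    assert_eq!(plain, 1);
    assert_eq!(wrapped, 2);
}
```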
|
||
[RFC 2645]: https://github.com/rust-lang/rfcs/blob/master/text/2645-transparent-unions.md | ||
|
||
### Bag-o-bytes layout (Repr-raw) | ||
|
||
There are applications where it is desirable that unions behave simply as a | ||
buffer of abstract bytes, with no constraints on validity and no interior | ||
padding bytes that can [get surprisingly reset to uninit][interior padding | ||
footgun]. | ||
|
||
Thus, we propose that Rust support a repr, which we are tentatively calling the Raw-repr, which gives these semantics to unions. The Raw-repr may be `#[repr(Rust)]` or it may be a new repr, say `#[repr(Raw)]`; which one is TBD. The Raw-repr will have the following properties: | ||
|
||
* All fields are laid out at offset 0. | ||
* The alignment of the union is the greatest alignment among fields (or 1, in the case of an empty union). | ||
* There are no padding bytes---even the bytes that aren't part of any variant, that would otherwise be tail padding, are not padding. | ||
* If the union is over-aligned with a `#[repr(align(n))]` attribute, then any bytes beyond the "natural" alignment are tail padding. | ||
|
||
Note that Raw-repr unions are *not* a substitute for `#[repr(C)]` unions, although it would be nice if we could avoid the [padding footgun] that way. | ||
|
||
<details><summary><b>Rationale</b></summary> | ||
|
||
We need at least one repr without the [padding footgun], because interior padding in particular is surprising. | ||
In particular, if users want to treat unions as non-exhaustive in a way that is ABI compatible with future versions with more fields, then such unions must not contain any padding. | ||
The presence of tail padding---such as with `union([u8; 3], u16)`, which could have a single byte of tail padding---is less surprising. | ||
But it would still prevent ABI forwards-compatibility if a `u32` field were added later. | ||
|
||
This layout is extremely constrained, so it would generally be against the philosophy of `#[repr(Rust)]` to impose these constraints on the default layout instead of introducing a new one. However, without such constraints, `#[repr(Rust)]` is just a giant, largely useless footgun, which is a rationale to simply constrain it and leave any potential relaxations, e.g. for safe transmutes and niches, to other reprs. Thus, whether it becomes a new repr or not is still TBD. | ||
|
||
</details> | ||
|
||
[#354]: https://github.com/rust-lang/unsafe-code-guidelines/issues/354 | ||
[#364]: https://github.com/rust-lang/unsafe-code-guidelines/issues/364 | ||
[1-ZST]: ../glossary.md#zero-sized-type--zst | ||
[exterior padding]: ../glossary.md#exterior-padding | ||
[interior padding]: ../glossary.md#interior-padding | ||
[layout]: ../glossary.md#layout | ||
[padding]: ../glossary.md#padding | ||
[union values]: ../validity/unions.md#values | ||
[value model]: ../glossary.md#value-model |
What is this useful for?
Unclear if you mean the sentence or the definitions in general.
The definitions of interior and exterior padding I added early on, expecting to need to refer to them, but revisions may have made them less necessary. I will make sure to take a look on whether they are still needed.
This particular sentence was because I find "X is either Y or Z" type statements helpful in the context of a definition of X, to make clear that all X is exactly one of Y or Z. But if you don't like it I don't mind removing it.
I was referring to the distinction of interior and exterior padding... as my later comments indicated, I think that is terminology we just don't need.