diff --git a/reference/src/glossary.md b/reference/src/glossary.md index 2e70d7bf..01980cd4 100644 --- a/reference/src/glossary.md +++ b/reference/src/glossary.md @@ -190,8 +190,12 @@ guarantee that `Option<&mut T>` has the same size as `&mut T`. While all niches are invalid bit-patterns, not all invalid bit-patterns are niches. For example, the "all bits uninitialized" is an invalid bit-pattern for -`&mut T`, but this bit-pattern cannot be used by layout optimizations, and is not a -niche. +`&mut T`, but this bit-pattern cannot be used by layout optimizations, and is not a niche. + +It is a surprisingly common misconception that niches can occur in [padding] bytes. +They cannot: A niche representation must be invalid for `T`. +But a padding byte must be irrelevant to the value of `T`. +A byte that participates in deciding whether or not the representation is valid cannot, therefore, be a padding byte. #### Zero-sized type / ZST @@ -207,6 +211,8 @@ requirement of 2. *Padding* (of a type `T`) refers to the space that the compiler leaves between fields of a struct or enum variant to satisfy alignment requirements, and before/after variants of a union or enum to make all variants equally sized. +Padding for a type is either [interior padding], which is part of one or more fields, or [exterior padding], which is before, between, or after the fields. + Padding can be though of as `[Pad; N]` for some hypothetical type `Pad` (of size 1) with the following properties: * `Pad` is valid for any byte, i.e., it has the same validity invariant as `MaybeUninit`. * Copying `Pad` ignores the source byte, and writes *any* value to the target byte. Or, equivalently (in terms of Abstract Machine behavior), copying `Pad` marks the target byte as uninitialized. @@ -217,8 +223,26 @@ for all values `v` and lists of bytes `b` such that `v` and `b` are related at ` changing `b` at index `i` to any other byte yields a `b'` such `v` and `b'` are related (`Vrel_T(v, b')`). 
In other words, the byte at index `i` is entirely ignored by `Vrel_T` (the value relation for `T`), and two lists of bytes that only differ in padding bytes relate to the same value(s), if any.
-This definition works fine for product types (structs, tuples, arrays, ...).
-The desired notion of "padding byte" for enums and unions is still unclear.
+This definition works fine for product types (structs, tuples, arrays, ...) and for unions. The desired notion of "padding byte" for enums is still unclear.
+
+#### Padding (exterior)
+[exterior padding]: #exterior-padding
+
+Exterior padding bytes are [padding] bytes that are not part of any field. They are exactly the padding bytes that are not [interior padding], and therefore must be before, between, or after the fields of the type. Padding that comes after all fields is called [tail padding].
+
+#### Padding (interior)
+[interior padding]: #interior-padding
+
+Interior padding bytes are [padding] bytes that are part of one or more fields of a type.
+
+We can say that a field `f: F` *contains* the byte at index `i` in the type `T` if the layout of `T` places `f` at offset `j` and we have `j <= i < j + size_of::<F>()`. Then a padding byte is interior padding if and only if there exists a field `f` that contains it.
+
+It follows that, provided `T` is not an enum, for any such `f`, the byte at index `i - j` in `F` is a padding byte of `F`. This is because all values of `f` give rise to distinct values of `T`.
+
+#### Padding (tail)
+[tail padding]: #tail-padding
+
+Tail padding is [exterior padding] that comes after all fields of a type.
#### Place
@@ -254,8 +278,8 @@ The relation should be functional for a fixed list of bytes (i.e., every list of
It is partial in both directions: not all values have a representation (e.g. the mathematical integer `300` has no representation at type `u8`), and not all lists of bytes correspond to a value of a specific type (e.g.
lists of the wrong size correspond to no value, and the list consisting of the single byte `0x10` corresponds to no value of type `bool`). For a fixed value, there can be many representations (e.g., when considering type `#[repr(C)] Pair(u8, u16)`, the second byte is a [padding byte][padding] so changing it does not affect the value represented by a list of bytes).
-See the [value domain][value-domain] for an example how values and representation relations can be made more precise.
+See the [MiniRust page on values][minirust-values] for an example of how values and representation relations can be made more precise.
[stacked-borrows]: https://github.com/rust-lang/unsafe-code-guidelines/blob/master/wip/stacked-borrows.md
-[value-domain]: https://github.com/rust-lang/unsafe-code-guidelines/tree/master/wip/value-domain.md
-[place-value-expr]: https://doc.rust-lang.org/reference/expressions.html#place-expressions-and-value-expressions
+[minirust-values]: https://github.com/RalfJung/minirust/blob/master/lang/values.md
+[place-value-expr]: https://doc.rust-lang.org/reference/expressions.html#place-expressions-and-value-expressions
\ No newline at end of file
diff --git a/reference/src/layout/unions.md b/reference/src/layout/unions.md
index b9f018b4..108b891c 100644
--- a/reference/src/layout/unions.md
+++ b/reference/src/layout/unions.md
@@ -1,10 +1,12 @@
# Layout of unions
-**Disclaimer:** This chapter represents the consensus from issue
-[#13]. The statements in here are not (yet) "guaranteed"
-not to change until an RFC ratifies them.
+**Disclaimer**: This chapter is a work-in-progress.
+What's contained here represents the consensus from [various issues][union
+discussion].
+The statements in here are not (yet) "guaranteed" not to change until an RFC
+ratifies them.
-[#13]: https://github.com/rust-rfcs/unsafe-code-guidelines/issues/13
+[union discussion]: https://github.com/rust-lang/unsafe-code-guidelines/blob/master/active_discussion/unions.md
### Layout of individual union fields
@@ -29,8 +31,23 @@ largest field, and the offset of each union field within its variant. How these are picked depends on certain constraints like, for example, the alignment requirements of the fields, the `#[repr]` attribute of the `union`, etc.
-[padding]: ../glossary.md#padding
-[layout]: ../glossary.md#layout
+Unions may contain both [exterior][exterior padding] and [interior padding].
+In the diagram below, exterior padding is marked by `EXT`, interior padding by
+`INT`, and bytes that are padding bytes for a particular field but not padding
+for the union as a whole are marked `NON`:
+
+```text
+[ EXT [ field0_0_ty | INT | field0_1_ty | INT ] EXT ]
+[ EXT [ field1_0_ty | INT | NON NON NON | INT ] EXT ]
+[ EXT | NON NON NON | INT [ field2_0_ty ] INT | EXT ]
+```
+
+It is necessarily the case that any byte that is a non-padding byte for any
+field is also a non-padding byte for the union.
+It is, in general, **unspecified** whether the converse is true.
+Specific reprs may specify whether or not bytes are padding bytes.
+
+Padding bytes in unions have subtle implications; see the union [value model].
### Unions with default layout ("`repr(Rust)`")
@@ -40,6 +57,12 @@ layout of Rust unions is, _in general_, **unspecified**. That is, there are no _general_ guarantees about the offset of the fields, whether all fields have the same offset, what the call ABI of the union is, etc.
+**Major footgun:** The layout of `#[repr(Rust)]` unions allows the [padding footgun] to exist with `#[repr(Rust)]` as well, and this behaviour *is*
+extant in rustc as of this writing. It is [**TBD**][#354] whether it will be
+removed.
+
+[padding footgun]: #padding-footgun
+
Rationale As of this writing, we want to keep the option of using non-zero offsets open @@ -107,23 +130,24 @@ the layout of `U1` is **unspecified** because: * `Zst2` is not a [1-ZST], and * `SomeOtherStruct` has an unspecified layout and could contain padding bytes. -### C-compatible layout ("repr C") +### C-compatible layout (`#[repr(C)]`) -The layout of `repr(C)` unions follows the C layout scheme. Per sections -[6.5.8.5] and [6.7.2.1.16] of the C11 specification, this means that the offset -of every field is 0. Unsafe code can cast a pointer to the union to a field type -to obtain a pointer to any field, and vice versa. +The layout of `repr(C)` unions follows the C layout scheme. +Per sections [6.5.8.5] and [6.7.2.1.16] of the C11 specification, this means that the offset +of every field is 0, and the alignment of the union is the largest alignment of its fields. +Unsafe code can cast a pointer to the union to a field type to obtain a pointer to any field, and vice versa. [6.5.8.5]: http://port70.net/~nsz/c/c11/n1570.html#6.5.8p5 [6.7.2.1.16]: http://port70.net/~nsz/c/c11/n1570.html#6.7.2.1p16 #### Padding -Since all fields are at offset 0, `repr(C)` unions do not have padding before -their fields. They can, however, have padding in each union variant *after* the -field, to make all variants have the same size. +Since all fields are at offset 0, `repr(C)` unions do not have [padding] before +their fields. +They can, however, have padding in each union variant *after* the field, to make +all variants have the same size. -Moreover, the entire union can have trailing padding, to make sure the size is a +Moreover, the entire union can have tail padding, to make sure the size is a multiple of the alignment: ```rust @@ -138,9 +162,47 @@ assert_eq!(size_of::(), 2); # } ``` -> **Note**: Fields are overlapped instead of laid out sequentially, so -> unlike structs there is no "between the fields" that could be filled -> with padding. 
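The `repr(C)` properties described in this section (every field at offset 0, union alignment equal to the largest field alignment, and size rounded up to a multiple of that alignment) can be checked with a small sketch. The union `IntOrBytes` and the helper functions are hypothetical names chosen for illustration and do not come from the chapter.

```rust
use std::mem::{align_of, size_of};

// Hypothetical union used only for illustration.
#[repr(C)]
union IntOrBytes {
    int: u32,
    bytes: [u8; 6],
}

/// Returns `(size, alignment)` of the union.
/// On targets where `u32` has alignment 4 this is expected to be `(8, 4)`:
/// six bytes of data plus two bytes of tail padding.
fn layout_of_union() -> (usize, usize) {
    (size_of::<IntOrBytes>(), align_of::<IntOrBytes>())
}

/// Casts a pointer to the union into a pointer to its `int` field.
/// This relies on `repr(C)` placing every field at offset 0.
fn int_ptr(u: *mut IntOrBytes) -> *mut u32 {
    u.cast::<u32>()
}

/// Writes through the field pointer, then reads the field back via the union.
fn write_then_read(value: u32) -> u32 {
    let mut u = IntOrBytes { bytes: [0; 6] };
    // SAFETY: `int_ptr` points at offset 0 of `u`, which is valid for writes
    // and sufficiently aligned for `u32`, because the union's alignment is the
    // largest alignment among its fields. The read then sees the `u32` just written.
    unsafe {
        int_ptr(&mut u).write(value);
        u.int
    }
}
```

The cast in `int_ptr` is only justified for `repr(C)` unions; for the default layout, the chapter leaves field offsets unspecified.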
+#### Padding Footgun
+
+**Major footgun:** In general, unions can have padding.
+On some platform ABIs, such as the popular arm64, C unions may even have [interior padding] *within* fields, where a byte is padding in every variant:
+
+```rust
+# use std::{mem::{self, transmute}, slice};
+#[repr(C)]
+union U {
+    x: (u8, u16), // [u8, 1*pad, u16]
+    y: (u8, u8), // [u8, 1*pad, u8, 1*pad]
+}
+let u = unsafe { mem::zeroed::<U>() }; // resulting bytes: [0, uninit (!!), 0, 0]
+let buf: &[u8] = unsafe { slice::from_raw_parts(transmute(&u), 4) }; // UB!
+```
+
+This is undefined behaviour, which is surprising because the union appears to be
+fully initialized and therefore ought to be castable to a byte slice.
+However, because byte 1 is a padding byte in both variants, it can be a padding
+byte in the union type as well.
+Therefore, when the result of `mem::zeroed` is copied onto the stack, the
+padding byte is uninitialized, not 0.
+
+This behaviour is platform-specific; on some platforms, this example may be
+well-defined.
+
+**C/C++ compatibility hazard:** This footgun exists for compatibility with the
+C/C++ platform ABI, but it is not well-known in C/C++ communities.
+In particular, unions are sometimes treated as non-exhaustive, with an expectation that they will be ABI-compatible with future versions of the same code that have additional variants for the union.
+Padding, however, can cause unions not to actually be ABI-compatible with future versions of the same type.
+(Note, however, that adding a new variant might also change the parameter-passing conventions, even in the absence of padding!)
+So whenever a union that might have padding crosses an FFI boundary, you should be particularly careful not to assume that all bytes are initialized.
+
+
Rationale
+
+Look. It wasn't our idea.
+
+We could try to limit the blast radius to `extern "C"` functions, but really,
+that's just sawing off the end of the footgun.
+
+
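One conservative way to inspect the bytes of a union like the one in the padding footgun example, without asserting that padding is initialized, is to view them as `MaybeUninit<u8>` rather than `u8`. The following is a minimal sketch under stated assumptions, not a vetted recommendation. The structs `ByteThenU16` and `ByteThenI16` are hypothetical `repr(C)` stand-ins for the tuple fields (so that their layout is pinned down), and `raw_bytes` and `first_byte` are illustrative helpers.

```rust
use std::mem::{size_of, MaybeUninit};
use std::slice;

// Hypothetical field types: with a 2-aligned 16-bit integer, byte 1 is
// padding in both of them, so it may also be padding in the union.
#[repr(C)]
#[derive(Clone, Copy)]
struct ByteThenU16 {
    a: u8,
    b: u16,
}

#[repr(C)]
#[derive(Clone, Copy)]
struct ByteThenI16 {
    a: u8,
    b: i16,
}

#[repr(C)]
#[derive(Clone, Copy)]
union U {
    x: ByteThenU16,
    y: ByteThenI16,
}

/// Views the bytes of `value` without claiming that any of them is initialized.
fn raw_bytes<T: Copy>(value: &T) -> &[MaybeUninit<u8>] {
    let ptr = (value as *const T).cast::<MaybeUninit<u8>>();
    // SAFETY: `ptr` is valid for `size_of::<T>()` bytes for the lifetime of
    // the borrow, `MaybeUninit<u8>` has alignment 1 and accepts any byte
    // (including uninitialized padding), and `T: Copy` rules out interior
    // mutability.
    unsafe { slice::from_raw_parts(ptr, size_of::<T>()) }
}

/// Reads byte 0, which is the `a` field in both variants.
///
/// # Safety
/// The caller must guarantee that byte 0 of `u` is initialized, for example
/// because `u` was created by writing one of its fields.
unsafe fn first_byte(u: &U) -> u8 {
    // SAFETY: guaranteed by the caller. Byte 1, by contrast, must never be
    // passed to `assume_init`, since it may be padding.
    unsafe { raw_bytes(u)[0].assume_init() }
}
```

The design choice here is to push the initialization claim down to individual bytes, where it can be argued field by field, instead of making it for the whole buffer at once.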
+
#### Zero-sized fields
@@ -172,4 +234,63 @@ translation of that code into Rust will not produce a compatible result. Refer to the [struct chapter](structs-and-tuples.md#c-compatible-layout-repr-c) for further details.
+### Transparent layout (`#[repr(transparent)]`)
+
+`#[repr(transparent)]` is currently unstable for unions, but [RFC 2645]
+documents most of its semantics.
+Notably, it causes unions to be passed using the same ABI as the non-1-ZST
+field.
+
+**Major footgun:** Matching the interior ABI means that all padding bytes of the
+non-1-ZST field will also be padding bytes of the union, so the interior
+[padding footgun] exists with `#[repr(transparent)]` unions.
+
+**Note:** If `U` is a transparent union wrapping a `T`, `U` may not inherit
+`T`'s niches, and therefore `Option<T>` and `Option<U>`, for instance, will not
+necessarily have the same layout or even the same size.
+
+This is because, if `U` contains any zero-sized fields in addition to the `T`
+field, the [value model] forces `U` to support uninitialized bytes, and that in
+turn prevents `T`'s niches from being present in `U`.
+Currently, `U` also supports uninitialized bytes even if it does not contain any
+additional fields, but it is [**TBD**][#364] whether single-field transparent unions
+might support niches.
+
+[RFC 2645]: https://github.com/rust-lang/rfcs/blob/master/text/2645-transparent-unions.md
+
+### Bag-o-bytes layout (Repr-raw)
+
+There are applications where it is desirable that unions behave simply as a
+buffer of abstract bytes, with no constraints on validity and no interior
+padding bytes that can [get surprisingly reset to uninit][padding footgun].
+
+Thus, we propose that Rust support a repr, tentatively called the Raw-repr, that gives these semantics to unions. The Raw-repr may be `#[repr(Rust)]` or it may be a new repr, say `#[repr(Raw)]`; which one is TBD. The Raw-repr will have the following properties:
+
+* All fields are laid out at offset 0.
+* The alignment of the union is the greatest alignment among its fields (or 1, in the case of an empty union).
+* There are no padding bytes---even the bytes that aren't part of any variant, that would otherwise be tail padding, are not padding.
+  * If the union is over-aligned with an `#[repr(align(n))]` attribute, then any bytes beyond the "natural" alignment are tail padding.
+
+Note that Raw-repr unions are *not* a substitute for `#[repr(C)]` unions, although it would be nice if we could avoid the [padding footgun] that way.
+
+
Rationale
+
+We need at least one repr without the [padding footgun], because interior padding in particular is surprising.
+Specifically, if users want to treat unions as non-exhaustive in a way that is ABI-compatible with future versions with more fields, then such unions must not contain any padding.
+The presence of tail padding---such as with `union([u8; 3], u16)`, which could have a single byte of tail padding---is less surprising.
+But it would still prevent ABI forwards-compatibility if a `u32` field were added later.
+
+This layout is extremely constrained, so it would generally be against the philosophy of `#[repr(Rust)]` to impose these constraints on the default layout instead of introducing a new one. However, without such constraints, `#[repr(Rust)]` is just a giant, largely useless footgun, which is an argument for simply constraining it and leaving any potential relaxations, e.g. for safe transmutes and niches, to other reprs. Thus, whether it becomes a new repr or not is still TBD.
+
+
+
+[#354]: https://github.com/rust-lang/unsafe-code-guidelines/issues/354
+[#364]: https://github.com/rust-lang/unsafe-code-guidelines/issues/364
[1-ZST]: ../glossary.md#zero-sized-type--zst
+[exterior padding]: ../glossary.md#exterior-padding
+[interior padding]: ../glossary.md#interior-padding
+[layout]: ../glossary.md#layout
+[padding]: ../glossary.md#padding
+[union values]: ../validity/unions.md#values
+[value model]: ../glossary.md#value-model
\ No newline at end of file
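The glossary's distinction between interior and exterior padding, which the unions chapter relies on, can be illustrated with the "contains" test `j <= i < j + size_of::<F>()` from the glossary hunk. This is a rough sketch: the structs `Inner` and `Outer` and all function names are hypothetical, and the byte offsets in the comments assume that `u16` has alignment 2.

```rust
use std::mem::{offset_of, size_of};

// Hypothetical types; expected bytes under the stated assumption:
// Inner = [a, pad, b, b], Outer = [head, pad, inner, inner, inner, inner].
#[repr(C)]
struct Inner {
    a: u8,
    b: u16,
}

#[repr(C)]
struct Outer {
    head: u8,
    inner: Inner,
}

/// The glossary's test: a field at offset `j` with size `size` contains byte `i`.
fn contains(j: usize, size: usize, i: usize) -> bool {
    j <= i && i < j + size
}

/// True if byte `i` of `Inner` lies inside one of its fields.
fn in_some_field_of_inner(i: usize) -> bool {
    contains(offset_of!(Inner, a), size_of::<u8>(), i)
        || contains(offset_of!(Inner, b), size_of::<u16>(), i)
}

/// True if byte `i` of `Outer` lies inside one of its fields.
fn in_some_field_of_outer(i: usize) -> bool {
    contains(offset_of!(Outer, head), size_of::<u8>(), i)
        || contains(offset_of!(Outer, inner), size_of::<Inner>(), i)
}

/// True if byte `i` of `Outer` is interior padding: it is contained in the
/// `inner` field at offset `j`, and byte `i - j` is padding of `Inner`.
fn is_interior_padding_of_outer(i: usize) -> bool {
    let j = offset_of!(Outer, inner);
    contains(j, size_of::<Inner>(), i) && !in_some_field_of_inner(i - j)
}
```

Under the stated assumption, byte 1 of `Outer` is contained in no field and is therefore exterior padding, while byte 3 is contained in `inner` and corresponds to the padding byte at index 1 of `Inner`, which makes it interior padding.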