-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Introduce max_serialized_size
function
#209
Conversation
0b3f3d8
to
68a965a
Compare
The new borsh::max_serialized_size helper function returns the length of the longest possible encoding of given type. This may be used to preallocate output buffers or calculate minimum size of a memory regions needed to keep the values (e.g. in case of Solana accounts).
borsh/src/schema_helpers.rs
Outdated
Ok(Definition::Sequence { elements }) => { | ||
// Assume that sequence has MAX_LEN elements since that’s the most | ||
// it can have. | ||
let sz = max_serialized_size_impl(MAX_LEN, elements, defs)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preallocating 4GB of memory for String
or Vec<u8>
, or HashMap<u64, u32>
(it's actually 12 * 4 GB here) doesn't appear terribly useful, if one instead expects the type to casually contain 200 bytes, or 2MB.
Using a number of types like BoundedVec you mentioned instead might be better imo.
Thus there're 2 alternatives:
- Add
max_length
to Definition::Sequence :
Sequence { max_length: u32, elements: Declaration },
which will usually be populated with u32::MAX
, and with something less for types such as BoundedVec<T, L, U>
.
In this variant it can be made to work with statically bounded types with current helper function max_serialized_size
after small adjustments.
- To allow to compute max size for types, whose limit may be set at runtime:
let mut rng = rand::thread_rng();
let up: u64 = rng.next_u64() % 100 + 1;
let up1: usize = up as usize;
// error[E0435]: attempt to use a non-constant value in a constant
let vec: BoundedVec<u8, 0, up1> = BoundedVec::from_vec(vec![20, 20, 20]).unwrap();
pub fn max_serialized_size<T: BorshSchema>() -> Option<usize>
may be reverted back to be a
BorshSchema
method, but with &self
parameter, unused in default implementation:
fn max_serialized_size(&self) -> Option<usize> {
...
}
Also this variant allows a hypothetical VarInt compute its own maximum size without need to add a separate Definition::Varint
(VarInt
type would have a "varint"
Declaration
and empty Definition
):
Varint { min_bytes: u8, max_bytes: u8 },
@frol what do you think?
Side note: i've noticed that derived implementation of serde
for BoundedVec<T, L, U>
doesn't attempt to check the imposed length bounds at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the second point, this would require changing schema so that rather than having a mapping from Declaration to Definition, there would be a mapping from Declaration to &dyn BorshSchema
and then additional fn definition(&self) -> Option<Definition>
method. Or Definition would need to be changed to be a trait with definition(&self) -> Option<Definition>;
and max_serialized_size(&self) -> Option<usize>
methods.
However, problem with that approach is that this isn’t serialisable. If the idea behind BorshSchema is to be able to read encoded data without having the definition of the types, it’s potentially pointless to rely on traits and such to get the max size information.
Maybe I’m overthinking this, but I’m imagining something like:
pub enum Definition {
/// A sequence of elements of the same type.
Sequence {
/// How many bytes does the length prefix occupy.
///
/// Zero means this is a fixed-length array or a custom encoding schema
/// which doesn’t follow typical borsh conventions. In the former case,
/// `min_length` and `max_length` will be set to the same value. In the
/// latter case, it may be impossible to deserialise the data without
/// additional knowledge of the declared type.
length_width: u8,
/// Minimum length of the sequence. Usually `0` for dynamically-sized
/// sequences.
min_length: u64,
/// Maximum length of the sequence. Usually `2**length_width - 1`.
max_length: u64,
/// Type of the elements of the sequence.
elements: Declaration,
},
/// A fixed-size tuple with the length known at the compile time and the
/// elements of different types.
Tuple {
/// Types of elements of the tuple.
elements: Vec<Declaration>,
},
/// A tagged union, a.k.a enum. Tagged-unions have variants with associated
/// structures.
Enum {
/// Width of the discriminant tag.
tag_width: u8,
/// Possible variants of the enumeration.
variants: Vec<(VariantName, Declaration)>,
},
/// A structure, structurally similar to a tuple.
Struct {
/// Named fields of the structure with their types.
///
/// Tuple-like structures are modelled as a `Tuple`.
fields: Vec<(FieldName, Declaration)>,
},
}
With that
[u8; 10]
would beSequence { length_width: 0, min_length: 10, max_length: 10, elements: "u8" }
,Vec<u8>
would beSequence { length_width: 4, min_length: 0, max_length: u32::MAX, elements: "u8" }
,SmallVec<u8>
would beSequence { length_width: 1, min_length: 0, max_length: 255, elements: "u8" }
,BoundVec<1, 10, u8>
would beSequence { length_width: 4, min_length: 1, max_length: 10, elements: "u8" }
andVarInt<u32>
would beSequence { length_width: 0, min_length: 1, max_length: 5, elements: "u8" }
.
And the tag_width
addition to Enum
would future-proof the spec for cases where users want to use more than 256 enum variants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sequence { length_width: u8, min_length: u64, max_length: u64, elements: Declaration }
doesn't look that bad, especially due to being able to accommodate VarInt
.
I would still not remove Definition::Array
, though Sequence { length_width: 0, min_length: 10, max_length: 10, elements: "u8" }
is fine to be understood too.
tag_width
for Enum
is redundant imo, as Enum
is supposed to be human-readable, and if one has an Enum
with over-than-u8-range variants, or over-than-u16-range variants, he/she probably has to factor something out, to make the definition of Enum
human readable again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like explicitness, so I like the proposed extended Definition
@@ -333,7 +333,7 @@ where | |||
#[cfg(hash_collections)] | |||
pub mod hashes { | |||
//! | |||
//! Module defines [BorshSchema](crate::schema::BorshSchema) implementation for | |||
//! Module defines [BorshSchema](crate::schema::BorshSchema) implementation for | |||
//! [HashMap](std::collections::HashMap)/[HashSet](std::collections::HashSet). | |||
use crate::BorshSchema; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace Definition
in this file with the one proposed (or, strictly speaking, very close to the one proposed)
pub enum Definition {
/// A sequence of elements of the same type.
Sequence {
/// How many bytes does the length prefix occupy.
///
/// Zero means this is a fixed-length array or a custom encoding schema
/// which doesn’t follow typical borsh conventions. In the former case,
/// `min_length` and `max_length` will be set to the same value. In the
/// latter case, it may be impossible to deserialise the data without
/// additional knowledge of the declared type.
length_width: u8,
/// Minimum length of the sequence. Usually `0` for dynamically-sized
/// sequences.
min_length: u64,
/// Maximum length of the sequence. Usually `2**length_width - 1`.
max_length: u64,
/// Type of the elements of the sequence.
elements: Declaration,
},
/// A fixed-size tuple with the length known at the compile time and the
/// elements of different types.
Tuple {
/// Types of elements of the tuple.
elements: Vec<Declaration>,
},
/// A tagged union, a.k.a enum. Tagged-unions have variants with associated
/// structures.
Enum {
/// Width of the discriminant tag.
tag_width: u8,
/// Possible variants of the enumeration.
variants: Vec<(VariantName, Declaration)>,
},
/// A structure, structurally similar to a tuple.
Struct {
fields: Fields,
},
}
adjust max_serialized_size
implementation accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of questions about this:
First of all, what’s even the point of Fields? Specifically, why do
we bother encoding unnamed struct as a struct rather than as a tuple.
This only complicates clients of the code since they have two
identical cases to consider separately.
And secondly, (while I would even go as far as get rid of field names,
but baring that) why not get rid of Struct and instead use:
Tuple {
/// Types of elements of the tuple.
elements: Vec<(Option<FieldName>, Declaration)>,
},
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mina86 let's address only Array
, Enum
and Sequence
in this pr, as these are where new details are being added, the proposed changes to Struct
aren't adding new information, they are just rearranging existing representation in a slightly different manner, albeit (arguably) in a more correct one.
We can try to include another pr with Definition::Struct/Tuple
changes into 1.0.0 too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mina86 on second thought, it might be better to accept this pr as is.
And then issue 3 separate prs:
- remove
Definition::Array
, replace it with new extendedDefinition::Sequence
+ a test for a localBoundedVec<T, L, U>
max_serialized_size
- add
tag_width
toDefinition::Enum
- make the change so that tuple structs become modelled as a
Tuple
/// A fixed-size tuple with the length known at the compile time and the
/// elements of different types.
Tuple {
/// Types of elements of the tuple.
elements: Vec<Declaration>,
},
/// A structure, structurally similar to a tuple.
Struct {
/// Named fields of the structure with their types.
///
/// Tuple-like structures are modelled as a `Tuple`.
fields: Vec<(FieldName, Declaration)>,
},
Thus it will be easier for everyone to digest it in 4 separate chunks and do additional clarification per each point if needed. I completely skipped the part with point 3. on first pass.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m already working on the other things as separate PRs. Whether this PR should be accepted now or later I have no opinion.
max_serialized_size
function; enhance schema::Definition::{Sequence, Enum}
; remove schema::Definition::Array
max_serialized_size
function; enhance schema::Definition::{Sequence, Enum}
; remove schema::Definition::Array
max_serialized_size
function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new borsh::max_serialized_size helper function returns the length
of the longest possible encoding of given type. This may be used to
preallocate output buffers or calculate minimum size of a memory
regions needed to keep the values (e.g. in case of Solana accounts).