-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ACP: Designing an alternative FromStr
#296
Comments
This feels contradictory to the opening of "Motivating examples or use cases" which establishes that there are clearly different intended use cases. |
Prior art for parser traits
|
If parsers are meant to be composed, it'd be a real pain to deal with incomplete data in the |
If these are meant to be composed, then any error can be a "soft" error and cause backtracking within the parser. |
Thank you for the list of parser traits! I omitted your notes + ordered them alphabetically (for fairness) in the ACP, but you do provide useful insight into the design of these. I had looked into these as I was designing the example a while ago, although for some reason it completely didn't cross my mind to include them in the ACP.
I should add, this is 100% based upon the naming of
That's fair, and it'd be a very good case for having a custom |
side note: imo in designs I'd recommend referring to It doesn't have to be a choice between |
Big fan of extending it to support at least some basic compositionality. The fact that one can't trivially use That said, I'm really unsure how far we can reasonably go in Maybe there's a way to offer some shared things for the basics of simple length-limited integers and characters and such that can be reused by other things, but that the bigger infrastructure should be left to crates? |
I mean, what you're proposing is a valid other option: design things that fit an API similar to that proposed, but don't provide any actual traits to do so. I think that there should be at least some separate parser object rather than a series of methods like parse_prefixed_etc but maybe an alternative could be to standardise on a way to indicate the maximum length buffer something could have. But this could throw a wrench into generalising since, for example, 0255 is a valid This is mostly why I propose separate parsing objects that can be passed additional strings, since that way we don't have to worry about those cases. But I could see an argument made for what you're proposing if done properly. |
sounds like an excuse to add an pub trait IsEmpty {
fn is_empty(&self) -> bool;
}
impl<T> IsEmpty for [T] { ... }
impl<T> IsEmpty for Vec<T> { ... }
impl<T: ?Sized + IsEmpty> IsEmpty for &'_ T { ... }
impl<T: ?Sized + IsEmpty> IsEmpty for &'_ mut T { ... }
impl<T: ?Sized + IsEmpty> IsEmpty for Box<T> { ... }
... |
I feel this is missing the answer and motivation for the core question: why should this be included in the The proposal also includes way too many features & issues to hope to find a single best solution. Statefulness, returning unparsed trailing bytes or erroring on their existence. Parsers which support parsing from multiple data parts! Many applications don't need some of those features. For example, I have never required multipart parsers in my line of work, and stateful parsers which can be continued from partial parse states are usually more trouble than they're worth. Once you have a fast lexer, unless we're talking about multi-megabyte inputs it's easier to reparse the entire input if need be than to try to keep track of parser state (which could be entirely contained in local variables of the relevant functions).
That's essentially trying to include deserialization as a special case of this API, which means that The only compelling reason to change |
Have you considered that the parser state in question would be of the form of a struct which could be contained in local variables of the relevant functions? For example, the incredibly complicated parser state for an integer parser might just be… the current value of the integer being parsed, wrapped in some newtype, maybe with a boolean flag or two for ensuring you fail on empty input/multiple signs passed. If some value straddles multiple I/O reads, instead of being forced to allocate a buffer two (or more) chunks long, it's much easier to just store the state between reads. That state isn't kept in some complicated global parser state, just a local variable, like you said; the difference is that it's encapsulated in some API struct rather than a handful of hand-written variables. Also--
This issue also applies to lexing, since tokens representing numbers, dates, or any other pre-processed value would otherwise require allocating additional space between reads. Much easier for a lexer in a situation like this to hold onto its state between the reads than to allocate a bigger buffer and parse multiple times. |
Proposal
Note: this is a proposed alternative to #287 and probably other proposals before it. The solution is not complete, but it describes the problem in a way that can be agreed upon so that future experiments can be better specified. Basically, we have to agree on the shape of a solution before we can agree on the actual solution.
Problem statement
Right now, the current mechanism for "parse this thing from a string" is
FromStr
, which has the following issues:TryFrom
API under a different name (FromStr
was designed beforeTryFrom
existed)TryFrom
andFrom
(when shouldTryFrom
,From
, andFromStr
differ?)str
content (most common case is byte strings)Motivating examples or use cases
The principle example of the separation between
FromStr
andTryFrom
is JSON objects, likeserde_json::Value
. A string can be represented in JSON, usually where the conversion is aValue::String
constructor. However, strings can also be parsed into JSON, representing other objects.From
andTryFrom
are the former type of conversion, and while there are some unanswered questions for things like truncating integers, etc., those are orthogonal to this RFC. The main discussion here is how we can offer a mechanism that parses objects from strings.Since we know that this mechanism is different from
TryFrom
, ideally, it will not take the same form asTryFrom
. Ideally, parsing will be good enough that the relatively simple parsing of things like integers, IP addresses, etc., will be reusable in other places.There are two main problems I'd like to solve with the solution, but these are the points that should be discussed, since they do complicate the solution shape.
Extraneous Input
Parsers should be able to handle extraneous input from a string. For example, I should be able to pass in a string and get out a parsed number and the remainder of the string past the parsed number.
However, this also means that we need an additional case for the usual behaviour, which states that if the remaining string is nonempty, it should fail.
For example, take the string
"1234abcd"
. In this example, we could parse the number (in decimal)1234
and then return"abcd"
since none of the characters are valid decimal digits. If we wanted the entire string to be a number, we could fail based upon the string"abcd"
noting that there's extra characters after the number.Insufficient Input
Parsers should be able to handle insufficient input from a string. Combined with the above, we get the ability to do proper, buffered parsing of values without the need to split on any known terminator like
\n
. If we pass in a string to the parser and there is no extraneous input, we should be able to continue passing in strings until either input is exhausted or something is invalid, which tells us that we're finished.For example, if the strings
"1234"
,"5678"
, and"90ab"
get parsed in, we should end up with the number1234567890
and the extra input"ab"
. The strings don't have to be the same length, but they are to demonstrate the buffering example.This requires distinguishing whether we could accept more input if the extraneous input is an empty string. For example, if trying to parse a
u8
, the string"55"
could accept more input and become a validu8
, whereas"255"
could not, since any future digits would overflow. A simple implementation could decide to ignore all of this "overflow foresight" and instead just keep accepting digits and stopping on non-digits, although this is worth keeping in mind.Solution sketch
Note: a current (working) solution exists in this repo, but it's explicitly not the solution I'm going to propose here. I'm just sharing it for completeness to indicate what I've tried to accomplish for this in the past.
Parsers
To solve the insufficient input issue, we would have to separate the notion of parsing from a parser, since we need some intermediate value that can accept a series of strings. This is the trade-off we get if we choose to allow insufficient-input parsing, but I personally believe it's worthwhile. In the example shown, I distinguished them using a
BuildParser
andParse
trait similar toBuildHasher
andHash
, but any possible design is valid.This means that, instead of having a method like
from_str_radix
on integers, we'd have awith_radix
method that accepts a radix and returns a parser. Similar toBuildHasher
'shash_one
method, we could also have aparse_one
method onBuildParser
, andfrom_str_radix
could be considered a shorthand for this.Parser States
Parsers should take in string inputs and return results with… something. Parsed values can have the usual hard errors, but they can also have "soft" errors indicating extraneous input, etc. A good example of a hard error for parsing integers is if there have been no valid digits encountered; an empty string is not a valid number, so, simply starting with the string
"abcd"
would result in some error along the lines of, no digits encountered.While the examples given will simply use
Result
and indicate hard errors withErr
, it's noteworthy that some ways of combining parsers would turn all hard errors into soft errors, and that it's useful for some kinds of parsers to indicate whether an error in the data itself is hard or soft. This is something to consider in the shape of the final API, but not as important for the experimental stage.Now, besides the hard error state, there are three possible states for the parser:
"127."
for an IP would be this state."::"
for an IP would be in this state; you could parse"::1"
, for example."[1]"
is valid JSON and any other input would be extraneous.I'm choosing to treat 3 as the same state regardless of whether any extraneous input is actually provided. For example, there's no reasonable distinction to be had whether you type
"[1]"
or"[1]a"
, since the extraneous input in the former could simply be treated as an empty string. We're already dealing with an API that can get complicated, and I don't want to further complicate it.Accepting Input
So, while separating out a parser and a parsed value is one half of the API distinction, the other half is how we cut off the input returned to the parser. There are a few different ways we could handle this. I'll use
Option
andResult
terms for brevity but we could in theory create other types instead. (The example actually usesPoll
.) This lists the potential return values after passing in input:Ok(None)
, state 2 returnsOk(None)
, state 3 returnsOk(Some((val, extra)))
. At any point, a user can call a "finish" method to signal EOF; this returnsErr(Insufficient)
for state 1,Ok(val)
for state 2, and the result of parsing an empty string for state 3 (eitherErr(Insufficient)
orOk(val)
depending on whether an empty string is okay).Ok(None)
, state 2 and 3 returnOk(Some((handle, extra)))
.handle
borrows the parser and must be dropped or converted into a value before adding more input.Ok(None)
, state 2 and 3 returnOk(Some((val, extra)))
and parsers can only parse exactly one value. If state 2 happens before state 3, multiple values are constructed, and previous intermediate results can be dropped if more input is obtained. There can be methods on parsers that clear them, but those are outside the API here.I personally prefer 3 because it's the simplest, and it actually allows us to secretly do option 2 as well. If we allow the parser to borrow itself when returning the value, we could avoid throwing away any intermediate results by simply directly borrowing the result.
For example, we could literally make the parsed value be the handle we discussed, and then calling a method to remove the value would reset the parser.
Parameterization
Parameterizing the API based upon what strings are being parsed feels pretty straightforward: the traits can accept a generic type that would default to
str
, and then you can pass in[u8]
,[u16]
,OsStr
,CStr
,Path
, etc. as you please.I'm going to mostly leave this as an afterthought to the API because I believe it can be easily incorporated later, and we do have GATs to be able to depend on lifetimes for things, but it's still a point that should be addressed.
In the example I gave, I used
Default + PartialEq
as the bound on the parameter, under the assumption that you could do something likeextra == Default::default()
to check if there's any extraneous input. We could probably replace it with a custom trait or trait alias. I don't think that we need something particularly complicated, though.Errors
In the example I provided, I added two methods,
insufficient
andextraneous
, which conjure errors based upon insufficient and extraneous input when such errors are desired. Ultimately, these options will have to be available either through the parser or the error itself, and it's worth mentioning that they are required.This could be motivation for a standard
ParseError
type, likeio::Error
, where these variants are automatically included. I personally prefer parameterization, since parse errors tend to be sophisticated and being able to recover from them is usually helpful, but it is a valid path as well.Common Parsers
"Overflow foresight" was something I mentioned earlier, and it's something I would propose to avoid in libstd parsers. I think that any numerical parser should continue accepting input as long as only digits have been seen, and if overflow happens, an error is returned.
Ultimately, I don't think that anyone would ever want to parse the string "255255" as two
u8
s concatenated together instead of one invalidu8
, and we should probably act similarly.Combinators
It's worth mentioning that, like iterators, parsers of this form can easily be combined together to create more complicated parsers. I would like to explicitly propose here that the standard library should not take on this role, and this is much better suited to the ecosystem. However, I would also like any solution to be able to reuse libstd implementations of parsers without having to roll your own.
For example, parsing a network prefix like
127.0.0.1/8
should not require searching the string for a/
, parsing the part before as an IP, and then parsing the 8; it should simply require parsing the IP, then parsing the remaining input as/8
or an empty string.Alternatives
We could have parsing be another type of conversion, like the truncating conversions proposed in various attempts to convert
as
into aFrom
-like trait. Or, we could just removeFromStr
entirely and say that we don't need a custom parse method.I personally disagree with this, since obtaining input from I/O is a common operation and the solution should mirror that. However, I don't think that an alternative similar to
FromStr
is worth it: in that case, we should just rely onTryFrom
instead.One other, interesting alternative is to allow the entire API, but not parameterize it by strings and explicitly only accept byte strings. This would go more in-line with allowing buffered input, and means we could strap it into the
io
module somehow in the future. I would prefer such an implementation to simply parameterize[u8]
there, but it's worth mentioning regardless.Links and related work
Alternative ACPs:
FromByteStr
trait with blanket implFromStr
#287Ecosystem parser traits (thanks @epage):
chumsky::Parser
combine::Parser
nom::Parser
winnow::Parser::parse_next
yap::Tokens
Other discussions:
(I know they exist, but am not going to go looking right now. We'll add them here as they get pointed out.)
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
Second, if there's a concrete solution:
The text was updated successfully, but these errors were encountered: