-
Notifications
You must be signed in to change notification settings - Fork 546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: UTF-8 string validation #3958
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
leanprover-community-mathlib4-bot
added a commit
to leanprover-community/batteries
that referenced
this pull request
Apr 20, 2024
leanprover-community-mathlib4-bot
added a commit
to leanprover-community/mathlib4
that referenced
this pull request
Apr 20, 2024
nomeata
approved these changes
Apr 20, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks!
Should we have some tests for the validator? Is there maybe an “official” set of test inputs that we can use in a test?
2 tasks
Mathlib CI status (docs):
|
leanprover-community-mathlib4-bot
added a commit
to leanprover-community/batteries
that referenced
this pull request
Apr 20, 2024
leanprover-community-mathlib4-bot
added a commit
to leanprover-community/mathlib4
that referenced
this pull request
Apr 20, 2024
leanprover-community-mathlib4-bot
added a commit
to leanprover-community/batteries
that referenced
this pull request
Apr 20, 2024
leanprover-community-mathlib4-bot
added a commit
to leanprover-community/mathlib4
that referenced
this pull request
Apr 20, 2024
github-merge-queue bot
pushed a commit
that referenced
this pull request
Jun 19, 2024
Continuation of #3958. To ensure that lean code is able to uphold the invariant that `String`s are valid UTF-8 (which is assumed by the lean model), we have to make sure that no lean objects are created with invalid UTF-8. #3958 covers the case of lean code creating strings via `fromUTF8Unchecked`, but there are still many cases where C++ code constructs strings from a `const char *` or `std::string` with unclear UTF-8 status. To address this and minimize accidental missed validation, the `(lean_)mk_string` function is modified to validate UTF-8. The original function is renamed to `mk_string_unchecked`, with several other variants depending on whether we know the string is UTF-8 or ASCII and whether we have the length and/or utf8 char count on hand. I reviewed every function which leads to `mk_string` or its variants in the C code, and used the appropriate validation function, defaulting to `mk_string` if the provenance is unclear. This PR adds no new error handling paths, meaning that incorrect UTF-8 will still produce incorrect results in e.g. IO functions, they are just not causing unsound behavior anymore. A subsequent PR will handle adding better error reporting for bad UTF-8.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
builds-mathlib
CI has verified that Mathlib builds against this PR
toolchain-available
A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Previously, there was a function
opaque fromUTF8Unchecked : ByteArray -> String
which would convert a list of bytes into a string, but as the name implies it does not validate that the string is UTF-8 before doing so and as a result it produces unsound results in the compiler (because the lean model ofString
indirectly asserts UTF-8 validity). This PR replaces that function byso that while the function is still "unchecked", we have a proof witness that the string is valid. To recover the original, actually unchecked version, use
lcProof
or other unsafe methods to produce the proof witness.Because this was the only
ByteArray -> String
conversion function, it was used in several places in an unsound way (e.g. reading untrusted input from IO and treating it as UTF-8). These have been replaced byfromUTF8?
orfromUTF8!
as appropriate.