Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: UTF-8 string validation #3958

Merged
merged 3 commits into from
Apr 20, 2024
Merged

Conversation

digama0
Copy link
Collaborator

@digama0 digama0 commented Apr 20, 2024

Previously, there was a function opaque fromUTF8Unchecked : ByteArray -> String which would convert a list of bytes into a string, but as the name implies it does not validate that the string is UTF-8 before doing so and as a result it produces unsound results in the compiler (because the lean model of String indirectly asserts UTF-8 validity). This PR replaces that function by

opaque validateUTF8 (a : @& ByteArray) : Bool

opaque fromUTF8 (a : @& ByteArray) (h : validateUTF8 a) : String

so that while the function is still "unchecked", we have a proof witness that the string is valid. To recover the original, actually unchecked version, use lcProof or other unsafe methods to produce the proof witness.

Because this was the only ByteArray -> String conversion function, it was used in several places in an unsound way (e.g. reading untrusted input from IO and treating it as UTF-8). These have been replaced by fromUTF8? or fromUTF8! as appropriate.

@github-actions github-actions bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label Apr 20, 2024
leanprover-community-mathlib4-bot added a commit to leanprover-community/batteries that referenced this pull request Apr 20, 2024
leanprover-community-mathlib4-bot added a commit to leanprover-community/mathlib4 that referenced this pull request Apr 20, 2024
Copy link
Collaborator

@nomeata nomeata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks!

Should we have some tests for the validator? Is there maybe an “official” set of test inputs that we can use in a test?

@digama0 digama0 requested a review from tydeu as a code owner April 20, 2024 15:46
@leanprover-community-mathlib4-bot leanprover-community-mathlib4-bot added the builds-mathlib CI has verified that Mathlib builds against this PR label Apr 20, 2024
@leanprover-community-mathlib4-bot
Copy link
Collaborator

leanprover-community-mathlib4-bot commented Apr 20, 2024

Mathlib CI status (docs):

leanprover-community-mathlib4-bot added a commit to leanprover-community/batteries that referenced this pull request Apr 20, 2024
leanprover-community-mathlib4-bot added a commit to leanprover-community/mathlib4 that referenced this pull request Apr 20, 2024
leanprover-community-mathlib4-bot added a commit to leanprover-community/batteries that referenced this pull request Apr 20, 2024
leanprover-community-mathlib4-bot added a commit to leanprover-community/mathlib4 that referenced this pull request Apr 20, 2024
@nomeata nomeata added this pull request to the merge queue Apr 20, 2024
Merged via the queue into leanprover:master with commit 62cdb51 Apr 20, 2024
11 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Apr 22, 2024
- [x] Depends on: #3958 
- [x] Depends on: #3960

This makes the UTF-8 encode and decode functions have lean definitions,
so that we can prove properties about them downstream.
github-merge-queue bot pushed a commit that referenced this pull request Jun 19, 2024
Continuation of #3958. To ensure that lean code is able to uphold the
invariant that `String`s are valid UTF-8 (which is assumed by the lean
model), we have to make sure that no lean objects are created with
invalid UTF-8. #3958 covers the case of lean code creating strings via
`fromUTF8Unchecked`, but there are still many cases where C++ code
constructs strings from a `const char *` or `std::string` with unclear
UTF-8 status.

To address this and minimize accidental missed validation, the
`(lean_)mk_string` function is modified to validate UTF-8. The original
function is renamed to `mk_string_unchecked`, with several other
variants depending on whether we know the string is UTF-8 or ASCII and
whether we have the length and/or utf8 char count on hand. I reviewed
every function which leads to `mk_string` or its variants in the C code,
and used the appropriate validation function, defaulting to `mk_string`
if the provenance is unclear.

This PR adds no new error handling paths, meaning that incorrect UTF-8
will still produce incorrect results in e.g. IO functions, they are just
not causing unsound behavior anymore. A subsequent PR will handle adding
better error reporting for bad UTF-8.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
builds-mathlib CI has verified that Mathlib builds against this PR toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants