feat: UTF-8 string validation #3958

digama0 · 2024-04-20T14:19:59Z

Previously, there was a function opaque fromUTF8Unchecked : ByteArray -> String that converted a byte array into a string. As the name implies, it did not validate that the bytes were UTF-8 before doing so, which made it unsound in the compiler, because the Lean model of String indirectly asserts UTF-8 validity. This PR replaces that function with

opaque validateUTF8 (a : @& ByteArray) : Bool

opaque fromUTF8 (a : @& ByteArray) (h : validateUTF8 a) : String

so that while the function is still "unchecked", we have a proof witness that the string is valid. To recover the original, actually unchecked version, use lcProof or other unsafe methods to produce the proof witness.
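As a minimal sketch of how the proof witness can be obtained without unsafe methods, a dependent `if` on the validator's result puts the required hypothesis in scope. The name `decodeChecked` is hypothetical, and the sketch assumes both declarations live in the `String` namespace with the signatures shown above; later Lean releases may differ.

```lean
-- Hypothetical helper, not part of this PR: decode only when validation succeeds.
def decodeChecked (bytes : ByteArray) : Option String :=
  if h : String.validateUTF8 bytes then
    -- Here `h : String.validateUTF8 bytes = true`, which is exactly the
    -- witness that `String.fromUTF8` asks for.
    some (String.fromUTF8 bytes h)
  else
    -- Validation failed, so no proof exists and no string is produced.
    none
```

The point of the design is that the unchecked conversion can only be reached through a branch in which validation has already returned `true`.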

Because this was the only ByteArray -> String conversion function, it was used in several places in an unsound way (e.g. reading untrusted input from IO and treating it as UTF-8). These have been replaced by fromUTF8? or fromUTF8! as appropriate.
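For illustration, untrusted bytes might be handled with `fromUTF8?` roughly as follows. This is a sketch only: `readText` and its error message are hypothetical, and it assumes `String.fromUTF8?` returns an `Option String`.

```lean
-- Hypothetical wrapper: turn a failed decode into an explicit error value.
def readText (bytes : ByteArray) : Except String String :=
  match String.fromUTF8? bytes with
  | some s => .ok s
  | none   => .error "input is not valid UTF-8"

-- The bytes 0x68 0x69 are plain ASCII and therefore valid UTF-8.
#eval readText (ByteArray.mk #[0x68, 0x69])

-- The byte 0xFF never occurs in well-formed UTF-8.
#eval readText (ByteArray.mk #[0xff, 0xfe])
```

`fromUTF8!` suits call sites where the bytes are already known to be valid, since it panics rather than returning `none` on bad input.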

nomeata

Looks good to me, thanks!

Should we have some tests for the validator? Is there maybe an “official” set of test inputs that we can use in a test?
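One possible starting point, sketched here rather than taken from this PR's test file, is a handful of byte sequences that RFC 3629 classifies as ill-formed. The name `illFormedSamples` is hypothetical, and the sketch assumes `validateUTF8` is reachable as `String.validateUTF8`.

```lean
-- Hypothetical sample set: each entry pairs a label with bytes that
-- RFC 3629 treats as ill-formed UTF-8.
def illFormedSamples : List (String × ByteArray) :=
  [ ("lone continuation byte",      ByteArray.mk #[0x80]),
    ("truncated 3-byte sequence",   ByteArray.mk #[0xe2, 0x82]),
    ("overlong encoding of U+0000", ByteArray.mk #[0xc0, 0x80]),
    ("encoded surrogate U+D800",    ByteArray.mk #[0xed, 0xa0, 0x80]),
    ("code point above U+10FFFF",   ByteArray.mk #[0xf4, 0x90, 0x80, 0x80]) ]

-- Print each label next to the validator's verdict. A validator that
-- follows RFC 3629 strictly would report `false` for every entry.
#eval illFormedSamples.map fun (label, bytes) =>
  (label, String.validateUTF8 bytes)
```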

leanprover-community-mathlib4-bot · 2024-04-20T15:47:58Z

Mathlib CI status (docs):

✅ Mathlib branch lean-pr-testing-3958 has successfully built against this PR. (2024-04-20 15:47:56) View Log
✅ Mathlib branch lean-pr-testing-3958 has successfully built against this PR. (2024-04-20 17:20:46) View Log
✅ Mathlib branch lean-pr-testing-3958 has successfully built against this PR. (2024-04-20 19:30:14) View Log

- [x] Depends on: #3958
- [x] Depends on: #3960

This makes the UTF-8 encode and decode functions have Lean definitions, so that we can prove properties about them downstream.
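To give a sense of what a Lean definition of the encoder can look like, here is a minimal sketch of UTF-8 encoding for a single code point. The name `encodeCodePoint` is hypothetical and this is not claimed to be the definition added by that PR. It assumes the input is a valid Unicode scalar value and does not reject surrogates or values above U+10FFFF.

```lean
-- Hypothetical model of UTF-8 encoding for one code point.
-- Assumes `v` is a valid Unicode scalar value; nothing here checks that.
def encodeCodePoint (v : UInt32) : List UInt8 :=
  if v ≤ 0x7f then
    -- 1 byte: 0xxxxxxx
    [v.toUInt8]
  else if v ≤ 0x7ff then
    -- 2 bytes: 110xxxxx 10xxxxxx
    [((v >>> 6).toUInt8 &&& 0x1f) ||| 0xc0,
     (v.toUInt8 &&& 0x3f) ||| 0x80]
  else if v ≤ 0xffff then
    -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    [((v >>> 12).toUInt8 &&& 0x0f) ||| 0xe0,
     ((v >>> 6).toUInt8 &&& 0x3f) ||| 0x80,
     (v.toUInt8 &&& 0x3f) ||| 0x80]
  else
    -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    [((v >>> 18).toUInt8 &&& 0x07) ||| 0xf0,
     ((v >>> 12).toUInt8 &&& 0x3f) ||| 0x80,
     ((v >>> 6).toUInt8 &&& 0x3f) ||| 0x80,
     (v.toUInt8 &&& 0x3f) ||| 0x80]

-- U+00E9 falls in the 2-byte range; its UTF-8 encoding is C3 A9.
#eval encodeCodePoint ('é').val
```

Because a definition of this kind is ordinary Lean code rather than an opaque constant, statements about it (for example, that the output length is between 1 and 4) can be proved downstream.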

Continuation of #3958. To ensure that Lean code is able to uphold the invariant that `String`s are valid UTF-8 (which is assumed by the Lean model), we have to make sure that no Lean objects are created with invalid UTF-8. #3958 covers the case of Lean code creating strings via `fromUTF8Unchecked`, but there are still many cases where C++ code constructs strings from a `const char *` or `std::string` with unclear UTF-8 status.

To address this and minimize accidental missed validation, the `(lean_)mk_string` function is modified to validate UTF-8. The original function is renamed to `mk_string_unchecked`, with several other variants depending on whether we know the string is UTF-8 or ASCII and whether we have the length and/or UTF-8 char count on hand. I reviewed every function which leads to `mk_string` or its variants in the C code, and used the appropriate validation function, defaulting to `mk_string` if the provenance is unclear.

This PR adds no new error handling paths, meaning that incorrect UTF-8 will still produce incorrect results in e.g. IO functions; it just no longer causes unsound behavior. A subsequent PR will handle adding better error reporting for bad UTF-8.

feat: UTF-8 string validation

c84371f

github-actions bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label Apr 20, 2024

leanprover-community-mathlib4-bot added a commit to leanprover-community/batteries that referenced this pull request Apr 20, 2024

Update lean-toolchain for testing leanprover/lean4#3958

4401c25

leanprover-community-mathlib4-bot added a commit to leanprover-community/mathlib4 that referenced this pull request Apr 20, 2024

Update lean-toolchain for testing leanprover/lean4#3958

afa32bb

nomeata approved these changes Apr 20, 2024

View reviewed changes

digama0 mentioned this pull request Apr 20, 2024

feat: add model implementation for UTF8 enc/dec #3961

Merged

2 tasks

fix test

a5c648a

digama0 requested a review from tydeu as a code owner April 20, 2024 15:46

leanprover-community-mathlib4-bot added the builds-mathlib CI has verified that Mathlib builds against this PR label Apr 20, 2024

leanprover-community-mathlib4-bot added a commit to leanprover-community/batteries that referenced this pull request Apr 20, 2024

Trigger CI for leanprover/lean4#3958

3660410

leanprover-community-mathlib4-bot added a commit to leanprover-community/mathlib4 that referenced this pull request Apr 20, 2024

Trigger CI for leanprover/lean4#3958

05a76ba

add validation test

319afdd

leanprover-community-mathlib4-bot added a commit to leanprover-community/batteries that referenced this pull request Apr 20, 2024

Trigger CI for leanprover/lean4#3958

d9fe716

leanprover-community-mathlib4-bot added a commit to leanprover-community/mathlib4 that referenced this pull request Apr 20, 2024

Trigger CI for leanprover/lean4#3958

5db5451

nomeata added this pull request to the merge queue Apr 20, 2024

Merged via the queue into leanprover:master with commit 62cdb51 Apr 20, 2024
11 checks passed

digama0 mentioned this pull request Apr 20, 2024

fix: validate UTF-8 at C++ -> Lean boundary #3963

Merged
