4-byte UTF8 code point reported as 2 chars in textDocument/didChange #20170

sean-mcmanus · 2017-02-08T04:28:54Z

VSCode Version: 1.9
OS Version: All.

Steps to Reproduce:

Add a 4-byte UTF8 code point (e.g. 𐍈 or 𠜎 ) to a new file (e.g. C++).
Delete it and put a breakpoint in an extension that handles the textDocument/didChange protocol (e.g. the C++ extension).

Bug: The result from VS Code is that TextDocumentContentChangeEvent rangeLength is 2, so our extension sees that the 1st character is correctly 4 bytes, but then it tries to process a non-existent 2nd character (crashing the process or corrupting data). Other APIs that pass the unexpected character position may also break functionality.

It seems like VS Code should make available the real character length of an edit so extensions can behavior correctly. VS Code reports the 4-byte code point as taking up 2 columns, which is consistent with the rangeLength of 2.

As a workaround, we will assume that all 4-byte code points take up 2 characters and all other code points take up 1 character. Does that seem like a good workaround? I don't know for sure which UTF8 code points count as 2 or more characters.

rmunn · 2017-02-08T09:25:06Z

Codepoints encoded in UTF-8 will be:

1 byte if they are between U+0000 and U+007F
2 bytes if they are between U+0080 and U+07FF (most major European and Middle Eastern languages)
3 bytes if they are between U+0800 and U+FFFF (the rest of the Basic Multilingual Plane)
4 bytes if they are between U+10000 and U+10FFFF (all the supplemental planes)

Note that the characters that take 4 bytes in UTF-8 tend to be problematic in UTF-16 too, because they take two characters (a surrogate pair) to represent, and are a common source of problems for programmers who didn't sufficiently test their code. There are plenty of characters from the supplemental plane in common use (for example, here are 62 Chinese ideograms from the supplemental plane that are considered to be among the "core" ideograms in modern Chinese, and most emojis are in the supplemental plane as well).

Since Javascript uses UTF-16 internally, you have to be prepared to encounter various Unicode problems when dealing with the supplemental plane. It sounds like the TextDocumentContentChangeEvent is passing a UTF-16 string, and the C++ extension is simply assuming, incorrectly, that the surrogate pairs are two separate characters.

jrieken · 2017-02-08T11:09:05Z

Thanks @rmunn for those great details. I agree that this is how JS works, esp. this hurts:

JavaScript treats code units as individual characters, while humans generally think in terms of Unicode characters. This has some unfortunate consequences for Unicode characters outside the BMP. Since surrogate pairs consist of two code units, '𝌆'.length == 2, even though there’s only one Unicode character there. The individual surrogate halves are being exposed as if they were characters: '𝌆' == '\uD834\uDF06'.

sean-mcmanus · 2017-02-08T15:56:51Z

Okay, we have a function to get the UTF16 character length of a UTF8 code point, so using that has fixed the issue for us. So we don't really need a fix from you guys, but if you do decide to change the character/column/ranges values from being UTF16 counts to Unicode character counts then we'd need change our code back to the way it was before.

jrieken · 2017-02-08T16:02:48Z

Ok. We will not change/break the current implementation, but there is a request to 'expose the bytes'. That might be useful for you once it happens: #5735

Tyriar assigned jrieken Feb 8, 2017

jrieken assigned alexdima Feb 8, 2017

jrieken closed this as completed Feb 8, 2017

jrieken added the *question Issue represents a question, should be posted to StackOverflow (VS Code) label Feb 8, 2017

vscodebot bot locked and limited conversation to collaborators Nov 18, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

4-byte UTF8 code point reported as 2 chars in textDocument/didChange #20170

4-byte UTF8 code point reported as 2 chars in textDocument/didChange #20170

sean-mcmanus commented Feb 8, 2017 •

edited

Loading

rmunn commented Feb 8, 2017

jrieken commented Feb 8, 2017

sean-mcmanus commented Feb 8, 2017

jrieken commented Feb 8, 2017

4-byte UTF8 code point reported as 2 chars in textDocument/didChange #20170

4-byte UTF8 code point reported as 2 chars in textDocument/didChange #20170

Comments

sean-mcmanus commented Feb 8, 2017 • edited Loading

rmunn commented Feb 8, 2017

jrieken commented Feb 8, 2017

sean-mcmanus commented Feb 8, 2017

jrieken commented Feb 8, 2017

sean-mcmanus commented Feb 8, 2017 •

edited

Loading