Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

4-byte UTF8 code point reported as 2 chars in textDocument/didChange #20170

Closed
sean-mcmanus opened this issue Feb 8, 2017 · 4 comments
Closed
Assignees
Labels
*question Issue represents a question, should be posted to StackOverflow (VS Code)

Comments

@sean-mcmanus
Copy link
Contributor

sean-mcmanus commented Feb 8, 2017

  • VSCode Version: 1.9
  • OS Version: All.

Steps to Reproduce:

  1. Add a 4-byte UTF8 code point (e.g. 𐍈 or 𠜎 ) to a new file (e.g. C++).
  2. Delete it and put a breakpoint in an extension that handles the textDocument/didChange protocol (e.g. the C++ extension).

Bug: The result from VS Code is that TextDocumentContentChangeEvent rangeLength is 2, so our extension sees that the 1st character is correctly 4 bytes, but then it tries to process a non-existent 2nd character (crashing the process or corrupting data). Other APIs that pass the unexpected character position may also break functionality.

It seems like VS Code should make available the real character length of an edit so extensions can behavior correctly. VS Code reports the 4-byte code point as taking up 2 columns, which is consistent with the rangeLength of 2.

As a workaround, we will assume that all 4-byte code points take up 2 characters and all other code points take up 1 character. Does that seem like a good workaround? I don't know for sure which UTF8 code points count as 2 or more characters.

@rmunn
Copy link
Contributor

rmunn commented Feb 8, 2017

Codepoints encoded in UTF-8 will be:

  • 1 byte if they are between U+0000 and U+007F
  • 2 bytes if they are between U+0080 and U+07FF (most major European and Middle Eastern languages)
  • 3 bytes if they are between U+0800 and U+FFFF (the rest of the Basic Multilingual Plane)
  • 4 bytes if they are between U+10000 and U+10FFFF (all the supplemental planes)

Note that the characters that take 4 bytes in UTF-8 tend to be problematic in UTF-16 too, because they take two characters (a surrogate pair) to represent, and are a common source of problems for programmers who didn't sufficiently test their code. There are plenty of characters from the supplemental plane in common use (for example, here are 62 Chinese ideograms from the supplemental plane that are considered to be among the "core" ideograms in modern Chinese, and most emojis are in the supplemental plane as well).

Since Javascript uses UTF-16 internally, you have to be prepared to encounter various Unicode problems when dealing with the supplemental plane. It sounds like the TextDocumentContentChangeEvent is passing a UTF-16 string, and the C++ extension is simply assuming, incorrectly, that the surrogate pairs are two separate characters.

@jrieken
Copy link
Member

jrieken commented Feb 8, 2017

Thanks @rmunn for those great details. I agree that this is how JS works, esp. this hurts:

JavaScript treats code units as individual characters, while humans generally think in terms of Unicode characters. This has some unfortunate consequences for Unicode characters outside the BMP. Since surrogate pairs consist of two code units, '𝌆'.length == 2, even though there’s only one Unicode character there. The individual surrogate halves are being exposed as if they were characters: '𝌆' == '\uD834\uDF06'.

@sean-mcmanus
Copy link
Contributor Author

Okay, we have a function to get the UTF16 character length of a UTF8 code point, so using that has fixed the issue for us. So we don't really need a fix from you guys, but if you do decide to change the character/column/ranges values from being UTF16 counts to Unicode character counts then we'd need change our code back to the way it was before.

@jrieken
Copy link
Member

jrieken commented Feb 8, 2017

Ok. We will not change/break the current implementation, but there is a request to 'expose the bytes'. That might be useful for you once it happens: #5735

@jrieken jrieken closed this as completed Feb 8, 2017
@jrieken jrieken added the *question Issue represents a question, should be posted to StackOverflow (VS Code) label Feb 8, 2017
@vscodebot vscodebot bot locked and limited conversation to collaborators Nov 18, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
*question Issue represents a question, should be posted to StackOverflow (VS Code)
Projects
None yet
Development

No branches or pull requests

4 participants