4-byte UTF8 code point reported as 2 chars in textDocument/didChange #20170
Comments
Codepoints encoded in UTF-8 will be:
Note that the characters that take 4 bytes in UTF-8 tend to be problematic in UTF-16 too, because they take two code units (a surrogate pair) to represent, and are a common source of problems for programmers who didn't sufficiently test their code. There are plenty of characters from the supplemental plane in common use (for example, here are 62 Chinese ideograms from the supplemental plane that are considered to be among the "core" ideograms in modern Chinese, and most emojis are in the supplemental plane as well). Since JavaScript uses UTF-16 internally, you have to be prepared to encounter various Unicode problems when dealing with the supplemental plane. It sounds like the
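To make the relationship concrete, here is a minimal sketch of how many UTF-8 bytes and UTF-16 code units a single code point needs. The function names `utf8ByteLength` and `utf16UnitLength` are made up for illustration and are not part of any VS Code API; the ranges are the standard ones for valid Unicode scalar values.

```typescript
// Number of bytes one Unicode code point occupies in UTF-8.
function utf8ByteLength(codePoint: number): number {
  if (codePoint <= 0x7f) return 1;   // U+0000..U+007F
  if (codePoint <= 0x7ff) return 2;  // U+0080..U+07FF
  if (codePoint <= 0xffff) return 3; // U+0800..U+FFFF
  return 4;                          // U+10000..U+10FFFF
}

// Number of UTF-16 code units for the same code point.
// Only code points above U+FFFF need a surrogate pair.
function utf16UnitLength(codePoint: number): number {
  return codePoint > 0xffff ? 2 : 1;
}

// U+1F600 written as its surrogate pair; JavaScript's `length` counts
// UTF-16 code units, so this single character reports a length of 2.
const emoji = "\uD83D\uDE00";
console.log(utf8ByteLength(0x1f600), utf16UnitLength(0x1f600), emoji.length);
```

The 4-byte UTF-8 range and the surrogate-pair range are the same set of code points, which is why the two problems tend to show up together.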
Thanks @rmunn for those great details. I agree that this is how JS works, esp. this hurts:
Okay, we have a function to get the UTF16 character length of a UTF8 code point, so using that has fixed the issue for us. So we don't really need a fix from you guys, but if you do decide to change the character/column/range values from being UTF16 counts to Unicode character counts, then we'd need to change our code back to the way it was before.
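For reference, the difference between the two ways of counting can be sketched as follows. This is not the function mentioned above, only an illustration; `utf16OffsetToCodePointIndex` is a hypothetical name, and it assumes an offset that lands in the middle of a surrogate pair should count the whole character.

```typescript
// Count the Unicode code points covered by the first `utf16Offset`
// UTF-16 code units of `text`. A well-formed surrogate pair counts once.
function utf16OffsetToCodePointIndex(text: string, utf16Offset: number): number {
  let count = 0;
  let i = 0;
  while (i < utf16Offset && i < text.length) {
    const unit = text.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    const isPair = isHigh && next >= 0xdc00 && next <= 0xdfff;
    i += isPair ? 2 : 1; // skip both halves of a surrogate pair
    count++;
  }
  return count;
}

// "a", U+1F600 (as a surrogate pair), "b": 4 UTF-16 units, 3 code points.
const sample = "a\uD83D\uDE00b";
console.log(sample.length, utf16OffsetToCodePointIndex(sample, sample.length));
```

With the current behavior an extension receives the first number; a switch to Unicode character counts would hand it the second.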
Ok. We will not change/break the current implementation, but there is a request to 'expose the bytes'. That might be useful for you once it happens: #5735
Steps to Reproduce:
Bug: The result from VS Code is that TextDocumentContentChangeEvent rangeLength is 2, so our extension sees that the 1st character is correctly 4 bytes, but then it tries to process a non-existent 2nd character (crashing the process or corrupting data). Other APIs that pass the unexpected character position may also break functionality.
It seems like VS Code should make available the real character length of an edit so extensions can behave correctly. VS Code reports the 4-byte code point as taking up 2 columns, which is consistent with the rangeLength of 2.
As a workaround, we will assume that all 4-byte code points take up 2 characters and all other code points take up 1 character. Does that seem like a good workaround? I don't know for sure which UTF8 code points count as 2 or more characters.
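For valid UTF-8 the assumption holds: the code points that need 4 bytes in UTF-8 (U+10000 and above) are exactly the ones that need a surrogate pair in UTF-16, and no code point takes more than 2 UTF-16 code units. A minimal sketch of the workaround, assuming the extension keeps each line as a well-formed UTF-8 byte buffer (all function names here are hypothetical):

```typescript
// UTF-16 code units implied by a UTF-8 lead byte:
// a 4-byte sequence (lead byte 11110xxx) maps to 2 units, anything else to 1.
function utf16UnitsForLeadByte(lead: number): number {
  return (lead & 0xf8) === 0xf0 ? 2 : 1;
}

// Length in bytes of the UTF-8 sequence that starts with `lead`.
function utf8SequenceLength(lead: number): number {
  if ((lead & 0x80) === 0x00) return 1;
  if ((lead & 0xe0) === 0xc0) return 2;
  if ((lead & 0xf0) === 0xe0) return 3;
  if ((lead & 0xf8) === 0xf0) return 4;
  throw new Error("not a UTF-8 lead byte: 0x" + lead.toString(16));
}

// Convert a UTF-16 column, as reported by the editor, into a byte
// offset within a UTF-8 encoded line. Assumes well-formed UTF-8.
function utf16ColumnToByteOffset(utf8Line: Uint8Array, column: number): number {
  let byteOffset = 0;
  let units = 0;
  while (units < column && byteOffset < utf8Line.length) {
    const lead = utf8Line[byteOffset];
    units += utf16UnitsForLeadByte(lead);
    byteOffset += utf8SequenceLength(lead);
  }
  return byteOffset;
}

// "a", U+1F600, "b" encoded as UTF-8.
const line = new Uint8Array([0x61, 0xf0, 0x9f, 0x98, 0x80, 0x62]);
console.log(utf16ColumnToByteOffset(line, 3)); // byte offset of "b"
```

Malformed input (a stray continuation byte, for example) makes `utf8SequenceLength` throw rather than silently miscount; whether that is the right failure mode depends on the extension.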