-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix character offset and byte offset problem in No-English language and issuse #6 #8
Conversation
owent
commented
Apr 25, 2016
> detail: microsoft/vscode#5735 > > We should replace this code if microsoft/vscode#5735 is accepted.
@owt5008137 Thanks for correcting the issue! Is there any issue why you didn't use the Node Buffer class, but the |
@owt5008137 Thanks for your help. Very appreciated! I pushed a new version to the Marketplace. |
@ioachim buffer is just the Node Buffer class.
I'm not sure if I can use all the node.js inner modules, so I add it to package.json. Just remove it if it's already included in VSCode.Decoding the string twice means VSCode decode the string once and this plugin decode it again. If the text in VSCode editor is not a utf8 string, the getText api still get a utf8 content, so it must be decoded and encoded to utf8 once. I have read some code in clang-format, it seems only support utf-8.And it only accepts byte offset and the output is also byte offset and byte length.But VSCode's API only accept character offset and length. So we still need to know what's the character offset of the byte offset(a utf8 character has 1-6 bytes), and then I decode it again. What I can only do is remember last relationship of byte offset and character offset to avoid decode the same content more than once. Of course it's better if we can use VSCode's decode result, but I did not find and API to get these.Do you have any idea to do it? |
The decoding is needed in any case, when writing to stdin it is done internally. The codeContent parameter should be the Buffer instead of a string, and the value given to it should be the same written to the clang-format stdin, to avoid decoding the whole source twice. What I started writing was code to get the char-offset from the byte-offset by reading the buffer directly, instead of encoding back to string, by parsing each byte of the slice. That way we avoid allocating a string each time we want to measure the length in characters. |
This code also run only once every time we format the content. Your solution seems a better way by converting byte-offset to char-offset one by one.But I wonder how many memory it will cost to make index of every byte-offset.Or just index every diffrent length? Is it will cost O(log(n)) to get the char-offset from the byte-offset? |
About the first part: yes, slice() makes a reference to the original buffer, not a copy. Passing the buffer is a just a minor improvement. About the string allocation thing, my proposal wasn't to save each position, as you say it doesn't make much sense cost wise, but replacing the |
@ioachim Oh, I see.It's really a good idea. |
@owt5008137 I'll write that changes sometime this week 👍 |
@ioachim OK,I will pull your code then. |