Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix character offset and byte offset problem in No-English language and issuse #6 #8

Merged
merged 2 commits into from
Apr 25, 2016

Conversation

owent
Copy link
Contributor

@owent owent commented Apr 25, 2016

detail: microsoft/vscode#5735

We should replace this code if microsoft/vscode#5735 is accepted.

@owent owent changed the title Fix character offset and byte offset problem in No-English language. Fix character offset and byte offset problem in No-English language and issuse #6 Apr 25, 2016
@owent
Copy link
Contributor Author

owent commented Apr 25, 2016

@xaverh please notice that "buffer": "^4.6.0" is in vscode 1.0.0, I'm noe sure which version is it in vscode 0.10.10. If the dependency of vscode version shoud be changed to 1.0.0?

@ioachim This PR solve #6 and #7.

@xaverh xaverh merged commit 83d8fca into xaverh:master Apr 25, 2016
@ioachim
Copy link
Contributor

ioachim commented Apr 25, 2016

@owt5008137 Thanks for correcting the issue! Is there any issue why you didn't use the Node Buffer class, but the buffer module? That way we could decode the string just once (as you can send the buffer directly to ChildProcess.stdin, instead of the string, avoiding encoding the string twice.

@xaverh
Copy link
Owner

xaverh commented Apr 25, 2016

@owt5008137 Thanks for your help. Very appreciated! I pushed a new version to the Marketplace.

@owent
Copy link
Contributor Author

owent commented Apr 25, 2016

@ioachim buffer is just the Node Buffer class.
Sorry I didn't read the document https://nodejs.org/dist/latest-v5.x/docs/api/buffer.html carefully. It seems not need to use

import buffer = require('buffer')

I'm not sure if I can use all the node.js inner modules, so I add it to package.json. Just remove it if it's already included in VSCode.Decoding the string twice means VSCode decode the string once and this plugin decode it again.

If the text in VSCode editor is not a utf8 string, the getText api still get a utf8 content, so it must be decoded and encoded to utf8 once.

I have read some code in clang-format, it seems only support utf-8.And it only accepts byte offset and the output is also byte offset and byte length.But VSCode's API only accept character offset and length.

So we still need to know what's the character offset of the byte offset(a utf8 character has 1-6 bytes), and then I decode it again. What I can only do is remember last relationship of byte offset and character offset to avoid decode the same content more than once.

Of course it's better if we can use VSCode's decode result, but I did not find and API to get these.Do you have any idea to do it?

@ioachim
Copy link
Contributor

ioachim commented Apr 25, 2016

The decoding is needed in any case, when writing to stdin it is done internally.

The codeContent parameter should be the Buffer instead of a string, and the value given to it should be the same written to the clang-format stdin, to avoid decoding the whole source twice.

What I started writing was code to get the char-offset from the byte-offset by reading the buffer directly, instead of encoding back to string, by parsing each byte of the slice. That way we avoid allocating a string each time we want to measure the length in characters.

@owent
Copy link
Contributor Author

owent commented Apr 25, 2016

@ioachim

var codeBuffer = new Buffer(codeContent);

This code also run only once every time we format the content.
I'm not sure, but all the buffer objects created by slice(start, end) should be reference to the same buffer block.Because them are all readonly.The only problem is I created so many strings.

Your solution seems a better way by converting byte-offset to char-offset one by one.But I wonder how many memory it will cost to make index of every byte-offset.Or just index every diffrent length? Is it will cost O(log(n)) to get the char-offset from the byte-offset?

@ioachim
Copy link
Contributor

ioachim commented Apr 25, 2016

About the first part: yes, slice() makes a reference to the original buffer, not a copy. Passing the buffer is a just a minor improvement.

About the string allocation thing, my proposal wasn't to save each position, as you say it doesn't make much sense cost wise, but replacing the buffer.toString('utf8').length calls with calls to a function that gets the length by checking the bytes in that slice.

@owent
Copy link
Contributor Author

owent commented Apr 25, 2016

@ioachim Oh, I see.It's really a good idea.
It's really no need to allocate so many string objects or make any index.Especially clang-format always has a monotonic increasing output.

@ioachim
Copy link
Contributor

ioachim commented Apr 25, 2016

@owt5008137 I'll write that changes sometime this week 👍

@owent
Copy link
Contributor Author

owent commented Apr 25, 2016

@ioachim OK,I will pull your code then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants