-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
napi: Length result vary in napi_get_value_string_utf8 #225
Comments
In the comment, we have "If bufsize is insufficient, the string will be truncated, including a null terminator." I think v8 does not null terminate when buffer size is insufficient unless we pass in option PRESERVE_ONE_BYTE_NULL. |
I don't understand. We have some tests that check the result value, but probably not coverage of all the cases. We should add a test case for null terminator when bufsize is insufficient; I don't think there is a test for that now. |
For WriteOneByte and WriteTwoByte, v8 is not including null character in the copied count. So I think we should keep our api consistent.
This should behave the same as other two format |
I think it should be consistent with:
|
Null termination makes sense for any encoding, doesn't it? |
In the following code, result of length is different:
But for the latin1 and utf16, this problem doesn't exist. So I propose to exclude null character in the copied count. |
Reading here https://en.wikipedia.org/wiki/Null-terminated_string would suggest otherwise. If the encoding includes 0 (null) as one of its acceptable values, then you can't use it as a terminator |
@mhdawson Based on my reading of that, it doesn't mean that you can't terminate a string in those encodings with null, just that you cannot measure the length of the string by searching for the first null. |
@jasongin, yes you can always have whatever you want following the string if you can get the length otherwise |
I would agree with @sampsongao, the 0 terminator should either always or never be counted (and I’m leaning towards not counting it, too). |
@addaleax the only disadvantage would be that for encodings were null termination can be used to measure the the length, the developer would have to manage that themselves. I don't think it should be counted for any encoding that cannot use it to know when the string terminates. So if we choose always or never I would want it to be never as well. |
Maybe I’m missing on something, but… UTF-8, Latin-1 and UTF-16 all can reasonably use null termination (just that in the case of UTF-16 it’s a uint16 0 value), so it’s the same situation for all supported encodings here? |
I think that assumes that its Modified UTF-8, |
It you look at it that way, yes, but my point is rather that it’s the same situation for all encodings, so I wasn’t sure why you were talking about differences between encodings? (Maybe I’m completely misunderstanding you and should just stop commenting. 😄) |
I think you are on point if v8 uses modified UTF-8 versus UTF-8, just have not been able to find something that confirms that yet. If that is the case then I agree that we could chose to go either way. I do think that we want to be consistent in either including the null termination when it makes sense or not including it. |
Just checked, V8 uses plain UTF-8 and always encodes U+0000 as a plain 0 value. |
So I think that means that null termination does not makes sense for utf-8, but does for Latin-1 and UTF-16. After the whole discussion around termination, it may just be safer to omit the termination in all cases and be consistent for all encodings. |
Since we are removing napi_get_string_length, so the length argument in the napi_get_value_string_XXX is replacing that function. I think at least this result should be consistent in the case of full copy regardless if we null terminate or not. The problem is that in napi_get_value_string_utf8, it is not consistent in the if clause I included above. |
I don't follow that reasoning. Even if U+0000 is a valid character value, it is still safer to ensure UTF-8 strings end with 0 anyway because careless use of many C++ APIs is likely to assume the strings are null-terminated. If the proper string length is known separately, having the null-terminator doesn't cause any harm. |
But V8 also encodes U+0000 as 0x00 in Latin-1 and as 0x0000 in UTF-16? I still don’t get the distinction, sorry… |
I was trying to say (obviously not very clearly since I don't remember the specifics of the encodings) that if an encoding can have a 0x00 in the middle of the string then unless you use the length you can't tell if the one at the end is part of the string or an extra terminator. The potential downside to including the terminator in this case is that people can write code that appears to work ok until you get a 0x00 in the middle of the string. If an encoding cannot have the 0x00 in the middle of the string, then having the 0x00 terminator at the end is useful. I still have to google a bit to refresh myself on whether 0x00 in the middle is possible with any of these encodings. If you know for sure that it is not then I'd go for including the terminator, otherwise I'd lean towards leaving it out. |
Sure, that could result in truncated strings if people are not careful about using the length properly, but that's not so bad. In the rare case when a dev even cares about supporting embedded nulls, they can take care to get the proper length. I think the alternative is far worse: if strings are not always null-terminated, then people can write code that seems to work sometimes (when memory is 0-initialized), but then other times accesses memory outside the bounds of the allocated region! |
I am not saying having null termination is a bad thing, but we should consider excluding null character in the copied characters count for WriteUtf8 case. This is because WriteOneByte and WriteTwoByte exclude null character in the copied characters count and this causes an inconsistency in our string api. The question that I was originally asking here is whether we should always include null character in the count or should always exclude null character in the count. |
IIRC, NAN always null terminates, for the reasons @jasongin mentioned. If you give out a Reported length should exclude the null terminator, as it is not part of the string, cf. |
Closing since this was fixed by nodejs/node#12368 |
Maybe we should pass in the option v8::String::NO_NULL_TERMINATION or substract one from the copied? @mhdawson @aruneshchandra @jasongin
The text was updated successfully, but these errors were encountered: