napi: Length result vary in napi_get_value_string_utf8 #225

sampsongao · 2017-04-10T18:25:33Z

  if (!buf) {
    CHECK_ARG(env, result);
    *result = val.As<v8::String>()->Utf8Length();
    // ex. return 11 for 'hello world'
  } else {
    int copied = val.As<v8::String>()->WriteUtf8(
      buf, bufsize, nullptr, v8::String::REPLACE_INVALID_UTF8);

    if (result != nullptr) {
      *result = copied;
      // ex. return 12 for 'hello world' because of the null character in the end
    }
  }

Maybe we should pass in the option v8::String::NO_NULL_TERMINATION or substract one from the copied? @mhdawson @aruneshchandra @jasongin

The text was updated successfully, but these errors were encountered:

sampsongao · 2017-04-10T18:55:55Z

In the comment, we have "If bufsize is insufficient, the string will be truncated, including a null terminator." I think v8 does not null terminate when buffer size is insufficient unless we pass in option PRESERVE_ONE_BYTE_NULL.

jasongin · 2017-04-10T19:00:12Z

I don't understand. napi_get_value_string_utf8() is documented to return the number of bytes copied, which includes the null terminating byte. I'd consider it unsafe to skip copying the null terminator.

We have some tests that check the result value, but probably not coverage of all the cases. We should add a test case for null terminator when bufsize is insufficient; I don't think there is a test for that now.

sampsongao · 2017-04-10T19:00:37Z

For WriteOneByte and WriteTwoByte, v8 is not including null character in the copied count. So I think we should keep our api consistent.
I think we can do

  if (!buf) {
    CHECK_ARG(env, result);
    *result = val.As<v8::String>()->Utf8Length();
  } else {
    int copied = val.As<v8::String>()->WriteUtf8(
      buf, bufsize, nullptr, v8::String::REPLACE_INVALID_UTF8 |
      v8::String::PRESERVE_ONE_BYTE_NULL);

    if (result != nullptr) {
      *result = copied - 1;
    }
  }

This should behave the same as other two format

mhdawson · 2017-04-10T19:55:36Z

I think it should be consistent with:

If a null character is expected and written it should be included in the count
null termination does not make sense for the encoding then it should be written or included in the count

jasongin · 2017-04-10T20:02:56Z

null termination does not make sense for the encoding

Null termination makes sense for any encoding, doesn't it?

sampsongao · 2017-04-10T20:03:35Z

In the following code, result of length is different:

status = napi_get_value_string_utf8(env, args[0], NULL, 0, &length); // ex. length = Utf8Length = 11
status = napi_get_value_string_utf8(env, args[0], buf, bufsize, &length); // ex. length = copied = 12

But for the latin1 and utf16, this problem doesn't exist. So I propose to exclude null character in the copied count.

mhdawson · 2017-04-10T20:03:40Z

Reading here https://en.wikipedia.org/wiki/Null-terminated_string would suggest otherwise. If the encoding includes 0 (null) as one of its acceptable values, then you can't use it as a terminator

jasongin · 2017-04-10T20:06:25Z

@mhdawson Based on my reading of that, it doesn't mean that you can't terminate a string in those encodings with null, just that you cannot measure the length of the string by searching for the first null.

mhdawson · 2017-04-10T20:07:16Z

@jasongin, yes you can always have whatever you want following the string if you can get the length otherwise

addaleax · 2017-04-10T20:07:33Z

I would agree with @sampsongao, the 0 terminator should either always or never be counted (and I’m leaning towards not counting it, too).

mhdawson · 2017-04-10T20:12:54Z

@addaleax the only disadvantage would be that for encodings were null termination can be used to measure the the length, the developer would have to manage that themselves. I don't think it should be counted for any encoding that cannot use it to know when the string terminates. So if we choose always or never I would want it to be never as well.

addaleax · 2017-04-10T20:16:32Z

I don't think it should be counted for any encoding that cannot use it to know when the string terminates.

Maybe I’m missing on something, but… UTF-8, Latin-1 and UTF-16 all can reasonably use null termination (just that in the case of UTF-16 it’s a uint16 0 value), so it’s the same situation for all supported encodings here?

mhdawson · 2017-04-10T20:22:13Z

I think that assumes that its Modified UTF-8,

addaleax · 2017-04-10T20:24:50Z

It you look at it that way, yes, but my point is rather that it’s the same situation for all encodings, so I wasn’t sure why you were talking about differences between encodings? (Maybe I’m completely misunderstanding you and should just stop commenting. 😄)

mhdawson · 2017-04-10T20:27:53Z

I think you are on point if v8 uses modified UTF-8 versus UTF-8, just have not been able to find something that confirms that yet. If that is the case then I agree that we could chose to go either way. I do think that we want to be consistent in either including the null termination when it makes sense or not including it.

addaleax · 2017-04-10T20:30:09Z

I think you are on point if v8 uses modified UTF-8 versus UTF-8

Just checked, V8 uses plain UTF-8 and always encodes U+0000 as a plain 0 value.

mhdawson · 2017-04-10T20:37:50Z

So I think that means that null termination does not makes sense for utf-8, but does for Latin-1 and UTF-16.

After the whole discussion around termination, it may just be safer to omit the termination in all cases and be consistent for all encodings.

sampsongao · 2017-04-10T20:52:01Z

Since we are removing napi_get_string_length, so the length argument in the napi_get_value_string_XXX is replacing that function. I think at least this result should be consistent in the case of full copy regardless if we null terminate or not. The problem is that in napi_get_value_string_utf8, it is not consistent in the if clause I included above.

jasongin · 2017-04-10T21:31:14Z

So I think that means that null termination does not makes sense for utf-8, but does for Latin-1 and UTF-16.

I don't follow that reasoning. Even if U+0000 is a valid character value, it is still safer to ensure UTF-8 strings end with 0 anyway because careless use of many C++ APIs is likely to assume the strings are null-terminated. If the proper string length is known separately, having the null-terminator doesn't cause any harm.

addaleax · 2017-04-10T21:33:27Z

So I think that means that null termination does not makes sense for utf-8, but does for Latin-1 and UTF-16.

But V8 also encodes U+0000 as 0x00 in Latin-1 and as 0x0000 in UTF-16? I still don’t get the distinction, sorry…

mhdawson · 2017-04-11T01:18:43Z

I was trying to say (obviously not very clearly since I don't remember the specifics of the encodings) that if an encoding can have a 0x00 in the middle of the string then unless you use the length you can't tell if the one at the end is part of the string or an extra terminator. The potential downside to including the terminator in this case is that people can write code that appears to work ok until you get a 0x00 in the middle of the string.

If an encoding cannot have the 0x00 in the middle of the string, then having the 0x00 terminator at the end is useful. I still have to google a bit to refresh myself on whether 0x00 in the middle is possible with any of these encodings. If you know for sure that it is not then I'd go for including the terminator, otherwise I'd lean towards leaving it out.

jasongin · 2017-04-12T01:23:59Z

The potential downside to including the terminator in this case is that people can write code that appears to work ok until you get a 0x00 in the middle of the string.

Sure, that could result in truncated strings if people are not careful about using the length properly, but that's not so bad. In the rare case when a dev even cares about supporting embedded nulls, they can take care to get the proper length.

I think the alternative is far worse: if strings are not always null-terminated, then people can write code that seems to work sometimes (when memory is 0-initialized), but then other times accesses memory outside the bounds of the allocated region!

sampsongao · 2017-04-12T02:56:48Z

I am not saying having null termination is a bad thing, but we should consider excluding null character in the copied characters count for WriteUtf8 case. This is because WriteOneByte and WriteTwoByte exclude null character in the copied characters count and this causes an inconsistency in our string api. The question that I was originally asking here is whether we should always include null character in the count or should always exclude null character in the count.

kkoopa · 2017-04-12T07:30:59Z

IIRC, NAN always null terminates, for the reasons @jasongin mentioned. If you give out a char * or similar, the buffer should be null-terminated. If you do not store a null-terminated buffer in memory you now need to make a copy, null-terminate that and return the pointer to it. To avoid this hassle, always null-terminate.

Reported length should exclude the null terminator, as it is not part of the string, cf. strlen which does not include the null terminator. I made both size and length methods in various places. length gives the length of the string, size gives the number of bytes in the buffer, including the null terminator.

jasongin · 2017-04-18T22:06:59Z

Closing since this was fixed by nodejs/node#12368

sampsongao changed the title ~~Length result vary in napi_get_value_string_utf8~~ napi: Length result vary in napi_get_value_string_utf8 Apr 10, 2017

jasongin closed this as completed Apr 18, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

napi: Length result vary in napi_get_value_string_utf8 #225

napi: Length result vary in napi_get_value_string_utf8 #225

sampsongao commented Apr 10, 2017 •

edited

Loading

sampsongao commented Apr 10, 2017 •

edited

Loading

jasongin commented Apr 10, 2017

sampsongao commented Apr 10, 2017

mhdawson commented Apr 10, 2017

jasongin commented Apr 10, 2017

sampsongao commented Apr 10, 2017

mhdawson commented Apr 10, 2017

jasongin commented Apr 10, 2017

mhdawson commented Apr 10, 2017

addaleax commented Apr 10, 2017

mhdawson commented Apr 10, 2017

addaleax commented Apr 10, 2017

mhdawson commented Apr 10, 2017

addaleax commented Apr 10, 2017

mhdawson commented Apr 10, 2017

addaleax commented Apr 10, 2017

mhdawson commented Apr 10, 2017

sampsongao commented Apr 10, 2017 •

edited

Loading

jasongin commented Apr 10, 2017

addaleax commented Apr 10, 2017

mhdawson commented Apr 11, 2017

jasongin commented Apr 12, 2017

sampsongao commented Apr 12, 2017

kkoopa commented Apr 12, 2017 •

edited

Loading

jasongin commented Apr 18, 2017

napi: Length result vary in napi_get_value_string_utf8 #225

napi: Length result vary in napi_get_value_string_utf8 #225

Comments

sampsongao commented Apr 10, 2017 • edited Loading

sampsongao commented Apr 10, 2017 • edited Loading

jasongin commented Apr 10, 2017

sampsongao commented Apr 10, 2017

mhdawson commented Apr 10, 2017

jasongin commented Apr 10, 2017

sampsongao commented Apr 10, 2017

mhdawson commented Apr 10, 2017

jasongin commented Apr 10, 2017

mhdawson commented Apr 10, 2017

addaleax commented Apr 10, 2017

mhdawson commented Apr 10, 2017

addaleax commented Apr 10, 2017

mhdawson commented Apr 10, 2017

addaleax commented Apr 10, 2017

mhdawson commented Apr 10, 2017

addaleax commented Apr 10, 2017

mhdawson commented Apr 10, 2017

sampsongao commented Apr 10, 2017 • edited Loading

jasongin commented Apr 10, 2017

addaleax commented Apr 10, 2017

mhdawson commented Apr 11, 2017

jasongin commented Apr 12, 2017

sampsongao commented Apr 12, 2017

kkoopa commented Apr 12, 2017 • edited Loading

jasongin commented Apr 18, 2017

sampsongao commented Apr 10, 2017 •

edited

Loading

sampsongao commented Apr 10, 2017 •

edited

Loading

sampsongao commented Apr 10, 2017 •

edited

Loading

kkoopa commented Apr 12, 2017 •

edited

Loading