Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

napi: Length result vary in napi_get_value_string_utf8 #225

Closed
sampsongao opened this issue Apr 10, 2017 · 25 comments
Closed

napi: Length result vary in napi_get_value_string_utf8 #225

sampsongao opened this issue Apr 10, 2017 · 25 comments

Comments

@sampsongao
Copy link
Collaborator

sampsongao commented Apr 10, 2017

  if (!buf) {
    CHECK_ARG(env, result);
    *result = val.As<v8::String>()->Utf8Length();
    // ex. return 11 for 'hello world'
  } else {
    int copied = val.As<v8::String>()->WriteUtf8(
      buf, bufsize, nullptr, v8::String::REPLACE_INVALID_UTF8);

    if (result != nullptr) {
      *result = copied;
      // ex. return 12 for 'hello world' because of the null character in the end
    }
  }

Maybe we should pass in the option v8::String::NO_NULL_TERMINATION or substract one from the copied? @mhdawson @aruneshchandra @jasongin

@sampsongao sampsongao changed the title Length result vary in napi_get_value_string_utf8 napi: Length result vary in napi_get_value_string_utf8 Apr 10, 2017
@sampsongao
Copy link
Collaborator Author

sampsongao commented Apr 10, 2017

In the comment, we have "If bufsize is insufficient, the string will be truncated, including a null terminator." I think v8 does not null terminate when buffer size is insufficient unless we pass in option PRESERVE_ONE_BYTE_NULL.

@jasongin
Copy link
Member

I don't understand. napi_get_value_string_utf8() is documented to return the number of bytes copied, which includes the null terminating byte. I'd consider it unsafe to skip copying the null terminator.

We have some tests that check the result value, but probably not coverage of all the cases. We should add a test case for null terminator when bufsize is insufficient; I don't think there is a test for that now.

@sampsongao
Copy link
Collaborator Author

For WriteOneByte and WriteTwoByte, v8 is not including null character in the copied count. So I think we should keep our api consistent.
I think we can do

  if (!buf) {
    CHECK_ARG(env, result);
    *result = val.As<v8::String>()->Utf8Length();
  } else {
    int copied = val.As<v8::String>()->WriteUtf8(
      buf, bufsize, nullptr, v8::String::REPLACE_INVALID_UTF8 |
      v8::String::PRESERVE_ONE_BYTE_NULL);

    if (result != nullptr) {
      *result = copied - 1;
    }
  }

This should behave the same as other two format

@mhdawson
Copy link
Member

I think it should be consistent with:

  1. If a null character is expected and written it should be included in the count
  2. null termination does not make sense for the encoding then it should be written or included in the count

@jasongin
Copy link
Member

null termination does not make sense for the encoding

Null termination makes sense for any encoding, doesn't it?

@sampsongao
Copy link
Collaborator Author

In the following code, result of length is different:

status = napi_get_value_string_utf8(env, args[0], NULL, 0, &length); // ex. length = Utf8Length = 11
status = napi_get_value_string_utf8(env, args[0], buf, bufsize, &length); // ex. length = copied = 12

But for the latin1 and utf16, this problem doesn't exist. So I propose to exclude null character in the copied count.

@mhdawson
Copy link
Member

Reading here https://en.wikipedia.org/wiki/Null-terminated_string would suggest otherwise. If the encoding includes 0 (null) as one of its acceptable values, then you can't use it as a terminator

@jasongin
Copy link
Member

@mhdawson Based on my reading of that, it doesn't mean that you can't terminate a string in those encodings with null, just that you cannot measure the length of the string by searching for the first null.

@mhdawson
Copy link
Member

@jasongin, yes you can always have whatever you want following the string if you can get the length otherwise

@addaleax
Copy link
Member

I would agree with @sampsongao, the 0 terminator should either always or never be counted (and I’m leaning towards not counting it, too).

@mhdawson
Copy link
Member

@addaleax the only disadvantage would be that for encodings were null termination can be used to measure the the length, the developer would have to manage that themselves. I don't think it should be counted for any encoding that cannot use it to know when the string terminates. So if we choose always or never I would want it to be never as well.

@addaleax
Copy link
Member

I don't think it should be counted for any encoding that cannot use it to know when the string terminates.

Maybe I’m missing on something, but… UTF-8, Latin-1 and UTF-16 all can reasonably use null termination (just that in the case of UTF-16 it’s a uint16 0 value), so it’s the same situation for all supported encodings here?

@mhdawson
Copy link
Member

I think that assumes that its Modified UTF-8,

@addaleax
Copy link
Member

It you look at it that way, yes, but my point is rather that it’s the same situation for all encodings, so I wasn’t sure why you were talking about differences between encodings? (Maybe I’m completely misunderstanding you and should just stop commenting. 😄)

@mhdawson
Copy link
Member

I think you are on point if v8 uses modified UTF-8 versus UTF-8, just have not been able to find something that confirms that yet. If that is the case then I agree that we could chose to go either way. I do think that we want to be consistent in either including the null termination when it makes sense or not including it.

@addaleax
Copy link
Member

I think you are on point if v8 uses modified UTF-8 versus UTF-8

Just checked, V8 uses plain UTF-8 and always encodes U+0000 as a plain 0 value.

@mhdawson
Copy link
Member

So I think that means that null termination does not makes sense for utf-8, but does for Latin-1 and UTF-16.

After the whole discussion around termination, it may just be safer to omit the termination in all cases and be consistent for all encodings.

@sampsongao
Copy link
Collaborator Author

sampsongao commented Apr 10, 2017

Since we are removing napi_get_string_length, so the length argument in the napi_get_value_string_XXX is replacing that function. I think at least this result should be consistent in the case of full copy regardless if we null terminate or not. The problem is that in napi_get_value_string_utf8, it is not consistent in the if clause I included above.

@jasongin
Copy link
Member

So I think that means that null termination does not makes sense for utf-8, but does for Latin-1 and UTF-16.

I don't follow that reasoning. Even if U+0000 is a valid character value, it is still safer to ensure UTF-8 strings end with 0 anyway because careless use of many C++ APIs is likely to assume the strings are null-terminated. If the proper string length is known separately, having the null-terminator doesn't cause any harm.

@addaleax
Copy link
Member

So I think that means that null termination does not makes sense for utf-8, but does for Latin-1 and UTF-16.

But V8 also encodes U+0000 as 0x00 in Latin-1 and as 0x0000 in UTF-16? I still don’t get the distinction, sorry…

@mhdawson
Copy link
Member

I was trying to say (obviously not very clearly since I don't remember the specifics of the encodings) that if an encoding can have a 0x00 in the middle of the string then unless you use the length you can't tell if the one at the end is part of the string or an extra terminator. The potential downside to including the terminator in this case is that people can write code that appears to work ok until you get a 0x00 in the middle of the string.

If an encoding cannot have the 0x00 in the middle of the string, then having the 0x00 terminator at the end is useful. I still have to google a bit to refresh myself on whether 0x00 in the middle is possible with any of these encodings. If you know for sure that it is not then I'd go for including the terminator, otherwise I'd lean towards leaving it out.

@jasongin
Copy link
Member

The potential downside to including the terminator in this case is that people can write code that appears to work ok until you get a 0x00 in the middle of the string.

Sure, that could result in truncated strings if people are not careful about using the length properly, but that's not so bad. In the rare case when a dev even cares about supporting embedded nulls, they can take care to get the proper length.

I think the alternative is far worse: if strings are not always null-terminated, then people can write code that seems to work sometimes (when memory is 0-initialized), but then other times accesses memory outside the bounds of the allocated region!

@sampsongao
Copy link
Collaborator Author

I am not saying having null termination is a bad thing, but we should consider excluding null character in the copied characters count for WriteUtf8 case. This is because WriteOneByte and WriteTwoByte exclude null character in the copied characters count and this causes an inconsistency in our string api. The question that I was originally asking here is whether we should always include null character in the count or should always exclude null character in the count.

@kkoopa
Copy link

kkoopa commented Apr 12, 2017

IIRC, NAN always null terminates, for the reasons @jasongin mentioned. If you give out a char * or similar, the buffer should be null-terminated. If you do not store a null-terminated buffer in memory you now need to make a copy, null-terminate that and return the pointer to it. To avoid this hassle, always null-terminate.

Reported length should exclude the null terminator, as it is not part of the string, cf. strlen which does not include the null terminator. I made both size and length methods in various places. length gives the length of the string, size gives the number of bytes in the buffer, including the null terminator.

@jasongin
Copy link
Member

Closing since this was fixed by nodejs/node#12368

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants