Add header file generator for Unicode normalization and alphanumeric check #2425
I also added is_numeric and is_alphabetic functions in this PR.
gcc/rust/ChangeLog:

	* Make-lang.in: Add rust-unicode.o
	* rust-lang.cc (run_rust_tests): Add test.
	* rust-system.h: Include <array>
	* util/make-rust-unicode.py: Generator of rust-unicode-data.h.
	* util/rust-unicode-data.h: Auto-generated file.
	* util/rust-unicode.cc: New file.
	* util/rust-unicode.h: New file.

Signed-off-by: Raiki Tamura <[email protected]>
string_t
nfc_normalize (string_t s)
{
This function performs Unicode normalization of the given string and is going to be exported via rust-unicode.h. Currently string_t aliases std::vector<uint32_t>, but it will be replaced with the class Utf8String, introduced in PR #2463.
Looks great! Thank you @tamaroning :D
I think this is a very good first implementation of the script :) For future maintainability, I think it would be helpful to add types to the script so we can run it with mypy. But this is already very good, so don't change it in this PR.
template <std::size_t SIZE>
int64_t
binary_search_ranges (
Can we maybe use something like https://en.cppreference.com/w/cpp/algorithm/binary_search?
I'm not sure, because each element of this array represents a range, which is unusual. I added comments in #2463, though.
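For readers following this thread, here is a minimal sketch of how a standard algorithm could still be applied when each element is a range. It is not the code in this PR: `range_t` and `find_range` are hypothetical names, and the sketch assumes the table holds inclusive, non-overlapping ranges sorted by their start. `std::upper_bound` with a custom comparator finds the first range that starts after the target, so the only candidate is the range just before it.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical element type: an inclusive [first, second] code point range.
using range_t = std::pair<std::uint32_t, std::uint32_t>;

// Returns the index of the range containing `target`, or -1 if none does.
// Assumes `ranges` is sorted by start and the ranges do not overlap.
template <std::size_t SIZE>
std::int64_t
find_range (const std::array<range_t, SIZE> &ranges, std::uint32_t target)
{
  // Binary search is only valid on sorted input.
  assert (std::is_sorted (ranges.begin (), ranges.end ()));

  // Find the first range whose start is greater than `target`.
  auto it = std::upper_bound (ranges.begin (), ranges.end (), target,
			      [] (std::uint32_t value, const range_t &r) {
				return value < r.first;
			      });

  // No range starts at or before `target`.
  if (it == ranges.begin ())
    return -1;

  // The only candidate is the range just before that position.
  --it;
  if (target > it->second)
    return -1;
  return it - ranges.begin ();
}
```

With a table of `{0x30, 0x39}`, `{0x41, 0x5A}`, `{0x61, 0x7A}`, looking up `0x41` gives index 1, while `0x40` falls in a gap and gives -1.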
template <std::size_t SIZE>
int64_t
binary_search_sorted_array (const std::array<std::uint32_t, SIZE> &array,
Same here. I've never used std::binary_search, so it might be completely wrong.
Yes! I added comments in #2463.
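One detail relevant to this suggestion: `std::binary_search` only reports whether a value is present (it returns `bool`), so a function that needs a position would more likely use `std::lower_bound`. The following is a rough sketch under that assumption; `find_in_sorted_array` is a hypothetical name, not the function in this PR, and returning -1 for "not found" is an assumption about the intended contract.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns the index of `target` in `array`, or -1 if it is absent.
// Assumes `array` is sorted in ascending order.
template <std::size_t SIZE>
std::int64_t
find_in_sorted_array (const std::array<std::uint32_t, SIZE> &array,
		      std::uint32_t target)
{
  // Binary search is only valid on sorted input.
  assert (std::is_sorted (array.begin (), array.end ()));

  // First element that is not less than `target`.
  auto it = std::lower_bound (array.begin (), array.end (), target);
  if (it == array.end () || *it != target)
    return -1;
  return it - array.begin ();
}
```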
// Starter. Returns zero.
return 0;
is that an error or is that okay?
It is OK.
Each code point has a CCC (Canonical_Combining_Class) value. The CCC of almost all characters is 0, meaning they have the Starter property.
To minimize table size, our table stores only entries whose CCC is not 0.
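The sparse-table idea described above can be sketched as follows. This is an illustration, not the generated rust-unicode-data.h: `CCC_TABLE` and `lookup_ccc` are hypothetical names, and the table is a three-entry excerpt (U+0300 and U+0301 have CCC 230, U+0323 has CCC 220). A code point missing from the table is treated as a Starter, which is why falling through to `return 0` is not an error.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

using codepoint_t = std::uint32_t;
using ccc_entry_t = std::pair<codepoint_t, std::uint8_t>;

// Illustrative excerpt only: (code point, CCC) pairs sorted by code point.
// Only code points with a non-zero CCC are stored.
const std::array<ccc_entry_t, 3> CCC_TABLE = {{
  {0x0300, 230}, // COMBINING GRAVE ACCENT
  {0x0301, 230}, // COMBINING ACUTE ACCENT
  {0x0323, 220}, // COMBINING DOT BELOW
}};

// Returns the CCC of `cp`, or 0 (Starter) when `cp` is not in the table.
std::uint8_t
lookup_ccc (codepoint_t cp)
{
  // Binary search is only valid on sorted input.
  assert (std::is_sorted (CCC_TABLE.begin (), CCC_TABLE.end ()));

  // First entry whose code point is not less than `cp`.
  auto it = std::lower_bound (CCC_TABLE.begin (), CCC_TABLE.end (), cp,
			      [] (const ccc_entry_t &entry, codepoint_t value) {
				return entry.first < value;
			      });
  if (it != CCC_TABLE.end () && it->first == cp)
    return it->second;

  // Not in the table: the code point is a Starter.
  return 0;
}
```

The trade-off is one binary search per lookup in exchange for a table that omits the large majority of code points.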
for (codepoint_t cp : decomped)
  {
    recursive_decomp_cano (cp, buf);
  }
- for (codepoint_t cp : decomped)
-   {
-     recursive_decomp_cano (cp, buf);
-   }
+ for (codepoint_t cp : decomped)
+   recursive_decomp_cano (cp, buf);
GNU style nit
fixed in #2463
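For context, the loop discussed in this thread is the recursive step of canonical decomposition: a decomposition mapping can itself contain characters that decompose further. Below is a minimal sketch of that idea, not the PR's implementation. `DECOMP_TABLE` and `decompose_recursively` are hypothetical names, the table is a two-entry excerpt, and the algorithmic decomposition of Hangul syllables described in UAX #15 is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using codepoint_t = std::uint32_t;
using string_t = std::vector<codepoint_t>;

// Illustrative excerpt of canonical decomposition mappings:
//   U+00C2 (A with circumflex)           -> U+0041 U+0302
//   U+1EA4 (A with circumflex and acute) -> U+00C2 U+0301
const std::map<codepoint_t, string_t> DECOMP_TABLE = {
  {0x00C2, {0x0041, 0x0302}},
  {0x1EA4, {0x00C2, 0x0301}},
};

// Appends the full canonical decomposition of `c` to `buf`.
void
decompose_recursively (codepoint_t c, string_t &buf)
{
  // Valid Unicode code points do not exceed U+10FFFF.
  assert (c <= 0x10FFFF);

  auto it = DECOMP_TABLE.find (c);
  if (it == DECOMP_TABLE.end ())
    {
      // No mapping: the character decomposes to itself.
      buf.push_back (c);
      return;
    }

  // Each mapped character may decompose further, so recurse.
  for (codepoint_t cp : it->second)
    decompose_recursively (cp, buf);
}
```

Calling it on U+1EA4 appends U+0041, U+0302, U+0301 to the buffer, because the intermediate U+00C2 is expanded by the recursive call.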
// int starter_pos = 0; // Assume the first character is Starter. Correct?
// int target_pos = 1;
dead code?
fixed in #2463
// TODO: remove
/*
void
dump_string (std::vector<uint32_t> s)
{
  std::cout << "dump=";
  for (auto c : s)
    {
      std::cout << std::hex << c << ", ";
    }
  std::cout << std::endl;
}
*/
dead code?
fixed in #2463
LGTM
Addresses #2379
This PR adds a header file generator written in Python, which creates rust-unicode-data.h.
This PR also adds an initial implementation of Unicode normalization and the is_numeric and is_alphabetic functions.
Unicode normalization is defined in https://unicode.org/reports/tr15/ (UAX #15).
UAX #15's implementation notes: https://unicode.org/reports/tr15/#Implementation_Notes
I looked at https://www.w3.org/International/charlint/ as a reference implementation.