Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add header file generator for Unicode normalization and alphanumeric check #2425

Merged
merged 1 commit into from
Jul 29, 2023

Conversation

tamaroning
Copy link
Contributor

@tamaroning tamaroning commented Jul 14, 2023

Addresses #2379

This PR adds a header file generator written in python, which creates rust-unicode-data.h.
Also this PR adds initial implementation of Unicode normalization and is_numeric and is_alphabetic functions.

gcc/rust/ChangeLog:

	* Make-lang.in: Add rust-unicode.o
	* rust-lang.cc (run_rust_tests): Add test.
	* util/make-rust-unicode.py: Generater of rust-unicode-data.h.
	* util/rust-unicode-data.h: Auto-generated file.
	* util/rust-unicode.cc: New file.
	* util/rust-unicode.h: New file.

Unicode normalization is defined in https://unicode.org/reports/tr15/ (UAX15)
UAX15's implementation notes: https://unicode.org/reports/tr15/#Implementation_Notes
I looked at https://www.w3.org/International/charlint/ as reference implementation

@tamaroning tamaroning force-pushed the uc-normalize branch 3 times, most recently from d36e4eb to 714f543 Compare July 14, 2023 08:32
@tamaroning tamaroning marked this pull request as ready for review July 14, 2023 08:38
@tamaroning
Copy link
Contributor Author

tamaroning commented Jul 15, 2023

I will add #include <array> to rust-sytem.h because of the error I got.
gcc seems implicitly include it but clang does not.
fixed

2023-07-14T10:21:14.7061060Z In file included from ../../gcc/rust/util/rust-unicode.cc:5:
2023-07-14T10:21:14.7062960Z ../../gcc/rust/util/rust-unicode-data.h:5082:53: error: implicit instantiation of undefined template 'std::array<std::pair<unsigned int, unsigned int>, 74>'
2023-07-14T10:21:14.7081740Z const std::array<std::pair<uint32_t, uint32_t>, 74> ALPHABETIC_RANGES = {{
2023-07-14T10:21:14.7107630Z                                                     ^

@tamaroning tamaroning force-pushed the uc-normalize branch 2 times, most recently from 7a3bc21 to 09403f1 Compare July 21, 2023 00:00
@tamaroning
Copy link
Contributor Author

I also added is_numeric and is_alphabetic functions in this PR.

@tamaroning tamaroning mentioned this pull request Jul 6, 2023
15 tasks
@tamaroning tamaroning changed the title Unicode NFC normalization Add header file generator for Unicode normalization and alphanumeric check Jul 21, 2023
@tamaroning tamaroning force-pushed the uc-normalize branch 5 times, most recently from dc1348f to 0eba85b Compare July 21, 2023 03:01
gcc/rust/ChangeLog:

	* Make-lang.in: Add rust-unicode.o
	* rust-lang.cc (run_rust_tests): Add test.
	* rust-system.h: Include <array>
	* util/make-rust-unicode.py: Generater of rust-unicode-data.h.
	* util/rust-unicode-data.h: Auto-generated file.
	* util/rust-unicode.cc: New file.
	* util/rust-unicode.h: New file.

Signed-off-by: Raiki Tamura <[email protected]>
Comment on lines +206 to +208
string_t
nfc_normalize (string_t s)
{
Copy link
Contributor Author

@tamaroning tamaroning Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function performs Unicode normalization of the given string and is going to be ported via rust-unicode.h.
Currently string_t aliases std::vector<uint32_t>, but it will be replaced with the class Utf8String, introduced in PR #2463

Copy link
Member

@CohenArthur CohenArthur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thank you @tamaroning :D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a very good first implementation of the script :) For future maintainability, I think it would be helpful to add types to the script so we can run it with mypy. But this is already very good, don't change it in this PR


template <std::size_t SIZE>
int64_t
binary_search_ranges (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure because elements of this array represents a range which is unusual
but added comments in #2463


template <std::size_t SIZE>
int64_t
binary_search_sorted_array (const std::array<std::uint32_t, SIZE> &array,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. I've never used std::binary_search so it might be completely wrong

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! added comments in #2463

Comment on lines +81 to +82
// Starter. Returns zero.
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that an error or is that okay?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is OK.
Each codepoints has a CCC value. CCC of almost all characters is 0 (, meaning Starter property).
To minimize table size, our table manages only entries whose CCC is not 0.

Comment on lines +106 to +109
for (codepoint_t cp : decomped)
{
recursive_decomp_cano (cp, buf);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (codepoint_t cp : decomped)
{
recursive_decomp_cano (cp, buf);
}
for (codepoint_t cp : decomped)
recursive_decomp_cano (cp, buf);

GNU style nit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in #2463

Comment on lines +155 to +156
// int starter_pos = 0; // Assume the first character is Starter. Correct?
// int target_pos = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in #2463

Comment on lines +192 to +204
// TODO: remove
/*
void
dump_string (std::vector<uint32_t> s)
{
std::cout << "dump=";
for (auto c : s)
{
std::cout << std::hex << c << ", ";
}
std::cout << std::endl;
}
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dead code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in #2463

Copy link
Member

@philberty philberty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@philberty philberty added this pull request to the merge queue Jul 29, 2023
Merged via the queue into Rust-GCC:master with commit 7ce263e Jul 29, 2023
@tamaroning tamaroning deleted the uc-normalize branch July 30, 2023 08:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants