Add header file generator for Unicode normalization and alphanumeric check #2425

tamaroning · 2023-07-14T05:50:51Z

Addresses #2379

This PR adds a header file generator written in Python, which creates rust-unicode-data.h.
It also adds an initial implementation of Unicode normalization and the is_numeric and is_alphabetic functions.

gcc/rust/ChangeLog:

	* Make-lang.in: Add rust-unicode.o.
	* rust-lang.cc (run_rust_tests): Add test.
	* util/make-rust-unicode.py: Generator of rust-unicode-data.h.
	* util/rust-unicode-data.h: Auto-generated file.
	* util/rust-unicode.cc: New file.
	* util/rust-unicode.h: New file.

Unicode normalization is defined in https://unicode.org/reports/tr15/ (UAX15).
UAX15's implementation notes: https://unicode.org/reports/tr15/#Implementation_Notes
I looked at https://www.w3.org/International/charlint/ as a reference implementation.

tamaroning · 2023-07-15T05:56:18Z

I will add #include <array> to rust-system.h because of the error below.
GCC seems to include it implicitly, but Clang does not. Fixed.

2023-07-14T10:21:14.7061060Z In file included from ../../gcc/rust/util/rust-unicode.cc:5:
2023-07-14T10:21:14.7062960Z ../../gcc/rust/util/rust-unicode-data.h:5082:53: error: implicit instantiation of undefined template 'std::array<std::pair<unsigned int, unsigned int>, 74>'
2023-07-14T10:21:14.7081740Z const std::array<std::pair<uint32_t, uint32_t>, 74> ALPHABETIC_RANGES = {{
2023-07-14T10:21:14.7107630Z                                                     ^

tamaroning · 2023-07-21T00:02:51Z

I also added is_numeric and is_alphabetic functions in this PR.

gcc/rust/ChangeLog:

	* Make-lang.in: Add rust-unicode.o.
	* rust-lang.cc (run_rust_tests): Add test.
	* rust-system.h: Include <array>.
	* util/make-rust-unicode.py: Generator of rust-unicode-data.h.
	* util/rust-unicode-data.h: Auto-generated file.
	* util/rust-unicode.cc: New file.
	* util/rust-unicode.h: New file.

Signed-off-by: Raiki Tamura <[email protected]>

tamaroning · 2023-07-21T04:50:11Z

gcc/rust/util/rust-unicode.cc

+string_t
+nfc_normalize (string_t s)
+{


This function performs Unicode normalization of the given string and is going to be exported via rust-unicode.h.
Currently string_t is an alias for std::vector<uint32_t>, but it will be replaced with the Utf8String class introduced in PR #2463.

CohenArthur

Looks great! Thank you @tamaroning :D

CohenArthur · 2023-07-27T08:53:09Z

gcc/rust/util/make-rust-unicode.py

I think this is a very good first implementation of the script :) For future maintainability, I think it would be helpful to add types to the script so we can run it with mypy. But this is already very good, don't change it in this PR

CohenArthur · 2023-07-27T08:58:24Z

gcc/rust/util/rust-unicode.cc

+
+template <std::size_t SIZE>
+int64_t
+binary_search_ranges (


Can we maybe use something like https://en.cppreference.com/w/cpp/algorithm/binary_search?

I'm not sure, because each element of this array represents a range, which is unusual,
but I added comments in #2463.
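
For illustration only, here is a minimal sketch of how a standard algorithm could cope with range elements. It uses std::upper_bound with a custom comparator rather than std::binary_search. The names SAMPLE_RANGES and in_ranges are hypothetical, and the sketch assumes the ranges are sorted, non-overlapping, and inclusive on both ends, which may not match the layout in rust-unicode-data.h.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <utility>

using range_t = std::pair<std::uint32_t, std::uint32_t>;

// Hypothetical miniature table: sorted, non-overlapping, inclusive
// [first, second] ranges (an assumption, not the generated data).
const std::array<range_t, 3> SAMPLE_RANGES = {{
  {0x41, 0x5A}, // A-Z
  {0x61, 0x7A}, // a-z
  {0xC0, 0xD6}, // U+00C0 to U+00D6
}};

template <std::size_t SIZE>
bool
in_ranges (const std::array<range_t, SIZE> &ranges, std::uint32_t cp)
{
  // Find the first range whose start is strictly greater than cp.
  auto it = std::upper_bound (ranges.begin (), ranges.end (), cp,
			      [] (std::uint32_t value, const range_t &r) {
				return value < r.first;
			      });
  // No range starts at or before cp, so cp cannot be covered.
  if (it == ranges.begin ())
    return false;
  // The only candidate is the range just before `it`.
  --it;
  return cp <= it->second;
}
```

With this table, in_ranges (SAMPLE_RANGES, 0x62) is true and in_ranges (SAMPLE_RANGES, 0x5B) is false. The comparator only looks at the start of each range, which is why the end of the candidate range still has to be checked by hand.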

CohenArthur · 2023-07-27T08:59:01Z

gcc/rust/util/rust-unicode.cc

+
+template <std::size_t SIZE>
+int64_t
+binary_search_sorted_array (const std::array<std::uint32_t, SIZE> &array,


Same here. I've never used std::binary_search so it might be completely wrong

Yes! I added comments in #2463.
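
As a rough sketch of what the suggestion could look like for a plain sorted array: std::binary_search only reports whether the value is present, so std::lower_bound is used below to recover the index as well. The name find_in_sorted, the sample array, and the choice of -1 for "not found" are assumptions made for this example, not the PR's code.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical sample data: a sorted array of code points.
const std::array<std::uint32_t, 4> SAMPLE_SORTED
  = {{0x0300, 0x0301, 0x0308, 0x0327}};

template <std::size_t SIZE>
std::int64_t
find_in_sorted (const std::array<std::uint32_t, SIZE> &array,
		std::uint32_t target)
{
  // lower_bound returns the first element that is not less than target.
  auto it = std::lower_bound (array.begin (), array.end (), target);
  if (it == array.end () || *it != target)
    return -1; // Assumed convention for "not found".
  return it - array.begin ();
}
```

Here find_in_sorted (SAMPLE_SORTED, 0x0308) yields index 2, and a value that is absent, such as 0x0302, yields -1.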

CohenArthur · 2023-07-27T08:59:27Z

gcc/rust/util/rust-unicode.cc

+    // Starter. Returns zero.
+    return 0;


is that an error or is that okay?

It is OK.
Each code point has a CCC (Canonical Combining Class) value. The CCC of almost all characters is 0, meaning they have the Starter property.
To minimize the table size, our table stores only entries whose CCC is not 0.
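
A small sketch of that idea, assuming a table of (code point, CCC) pairs sorted by code point. SAMPLE_CCC and lookup_ccc are hypothetical names and the three entries are just an excerpt chosen for the example; the real table is the one generated into rust-unicode-data.h.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <utility>

using ccc_entry_t = std::pair<std::uint32_t, std::uint8_t>;

// Hypothetical excerpt: only code points with a non-zero CCC are stored.
const std::array<ccc_entry_t, 3> SAMPLE_CCC = {{
  {0x0300, 230}, // COMBINING GRAVE ACCENT
  {0x0323, 220}, // COMBINING DOT BELOW
  {0x0327, 202}, // COMBINING CEDILLA
}};

std::uint8_t
lookup_ccc (std::uint32_t cp)
{
  auto it = std::lower_bound (SAMPLE_CCC.begin (), SAMPLE_CCC.end (), cp,
			      [] (const ccc_entry_t &entry,
				  std::uint32_t value) {
				return entry.first < value;
			      });
  if (it != SAMPLE_CCC.end () && it->first == cp)
    return it->second;
  // Not in the table: the code point is a Starter, so its CCC is 0.
  return 0;
}
```

A lookup miss is therefore not an error: lookup_ccc (0x0041) returns 0 because 'A' is a Starter, while lookup_ccc (0x0300) returns 230.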

CohenArthur · 2023-07-27T09:00:05Z

gcc/rust/util/rust-unicode.cc

+      for (codepoint_t cp : decomped)
+	{
+	  recursive_decomp_cano (cp, buf);
+	}


Suggested change

-      for (codepoint_t cp : decomped)
-	{
-	  recursive_decomp_cano (cp, buf);
-	}
+      for (codepoint_t cp : decomped)
+	recursive_decomp_cano (cp, buf);

GNU style nit

fixed in #2463
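
For readers unfamiliar with the loop under discussion, a self-contained sketch of recursive canonical decomposition follows. SAMPLE_DECOMP and decompose_recursively are hypothetical stand-ins (a std::map with two entries instead of the generated table), so this shows the shape of the recursion rather than the PR's implementation.

```cpp
#include <cstdint>
#include <map>
#include <vector>

using codepoint_t = std::uint32_t;
using string_t = std::vector<codepoint_t>;

// Hypothetical two-entry excerpt of a canonical decomposition table.
const std::map<codepoint_t, string_t> SAMPLE_DECOMP = {
  {0x00DC, {0x0055, 0x0308}}, // U WITH DIAERESIS -> U + COMBINING DIAERESIS
  {0x01D5, {0x00DC, 0x0304}}, // U WITH DIAERESIS AND MACRON -> U+00DC + MACRON
};

void
decompose_recursively (codepoint_t c, string_t &buf)
{
  auto it = SAMPLE_DECOMP.find (c);
  if (it == SAMPLE_DECOMP.end ())
    {
      // No mapping: the code point is already fully decomposed.
      buf.push_back (c);
      return;
    }
  // Each code point of the mapping may itself decompose further.
  for (codepoint_t cp : it->second)
    decompose_recursively (cp, buf);
}
```

Decomposing U+01D5 with this table appends U+0055, U+0308, U+0304 to the buffer, since the intermediate U+00DC is expanded in turn.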

CohenArthur · 2023-07-27T09:00:29Z

gcc/rust/util/rust-unicode.cc

+      // int starter_pos = 0; // Assume the first character is Starter. Correct?
+      // int target_pos = 1;


fixed in #2463

CohenArthur · 2023-07-27T09:00:38Z

gcc/rust/util/rust-unicode.cc

+// TODO: remove
+/*
+void
+dump_string (std::vector<uint32_t> s)
+{
+  std::cout << "dump=";
+  for (auto c : s)
+    {
+      std::cout << std::hex << c << ", ";
+    }
+  std::cout << std::endl;
+}
+*/


fixed in #2463

philberty

LGTM

tamaroning force-pushed the uc-normalize branch 3 times, most recently from d36e4eb to 714f543 Compare July 14, 2023 08:32

tamaroning marked this pull request as ready for review July 14, 2023 08:38

tamaroning mentioned this pull request Jul 14, 2023

Unicode Normalization of Identifiers #2379

Open

10 tasks

tamaroning force-pushed the uc-normalize branch from 714f543 to 66936f7 Compare July 14, 2023 09:47

tamaroning force-pushed the uc-normalize branch 2 times, most recently from 7a3bc21 to 09403f1 Compare July 21, 2023 00:00

tamaroning mentioned this pull request Jul 6, 2023

Unicode support #2287

Open

15 tasks

tamaroning changed the title ~~Unicode NFC normalization~~ Add header file generator for Unicode normalization and alphanumeric check Jul 21, 2023

tamaroning force-pushed the uc-normalize branch 5 times, most recently from dc1348f to 0eba85b Compare July 21, 2023 03:01

tamaroning force-pushed the uc-normalize branch from 0eba85b to 86bfc84 Compare July 21, 2023 04:08

tamaroning mentioned this pull request Jul 21, 2023

Unicode check for crate_name attribute #2463

Merged

tamaroning commented Jul 21, 2023

tamaroning mentioned this pull request Jul 26, 2023

Normalize Hangul #2467

Merged

CohenArthur requested review from tschwinge and philberty July 27, 2023 08:08

CohenArthur added the enhancement label Jul 27, 2023

CohenArthur approved these changes Jul 27, 2023

philberty added this to the AST Pipeline for libcore 1.49 Complete milestone Jul 29, 2023

philberty approved these changes Jul 29, 2023

philberty added this pull request to the merge queue Jul 29, 2023

Merged via the queue into Rust-GCC:master with commit 7ce263e Jul 29, 2023

tamaroning deleted the uc-normalize branch July 30, 2023 08:46

tamaroning mentioned this pull request Aug 9, 2023

Add type annotation to make-rust-unicode-data.py #2529

Merged
