
Does jsdiff work with Chinese (or other non-English script languages)? #377

Closed

lancejpollard opened this issue Jul 1, 2022 · 5 comments

@lancejpollard
If I want to do text differencing to see which Chinese characters were added/removed, would jsdiff work with that? What about other languages with combining characters and such, like Devanagari or Hebrew? I am not super versed in how "text diff" algorithms work, but I imagine it might be very English-/Latin-centric. Is that the case? Or does it work for any other language? If not, what is the general approach to work with other languages like Chinese? Thank you for the help!

@ExplodingCabbage
Collaborator

ExplodingCabbage commented Jan 10, 2024

Currently, jsdiff is English-centric.

diffChars splits texts into UTF-16 code units (since JavaScript strings are sort of arrays of UTF-16 code units; e.g. '𢈘'.length is 2) and diffs the sequences of code units. This works fine for any language where every character is represented by a single UTF-16 code unit (e.g. English), but badly for CJK text, where some characters, like 𢈘, are represented by a "surrogate pair" of two UTF-16 code units.
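To make the failure mode concrete, here is a small illustration (assuming 𢈘 is U+22218; the code unit values shown follow from that):

```js
// '𢈘' (U+22218) lies outside the Basic Multilingual Plane, so JavaScript
// stores it as a surrogate pair of two UTF-16 code units:
console.log('𢈘'.length);                      // 2
console.log([...'𢈘'].length);                 // 1 (string iteration is by code point)
console.log('𢈘'.charCodeAt(0).toString(16));  // 'd848' (high surrogate)
console.log('𢈘'.charCodeAt(1).toString(16));  // 'de18' (low surrogate)

// Because diffChars walks code units, an edit boundary can land between the
// two halves of a pair, producing hunks whose values contain unpaired
// surrogates rather than valid characters.
```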

PRs #395 and #461 both aim to fix this, though neither is merged yet.

diffWords is currently a mess, and multilingual support is one of the ways in which this is so. Roughly speaking, diffWords attempts to split text into an array of tokens where each token is either a "word", a run of whitespace, or a run of punctuation/special characters, and then diff this sequence of tokens. But this tokenization logic is extremely Latin-centric; all non-Latin characters are treated as special characters / punctuation by the tokenizer. There is also a bug in the handling of accents, so even non-English European languages are only dubiously supported. There's also absolutely no support for languages where words are not separated by spaces, like Chinese.
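As a rough illustration of what that Latin-centric tokenization means in practice (this is a simplified stand-in, not jsdiff's actual tokenizer):

```js
// "Words" are runs of Latin letters/digits; everything else falls into
// either a whitespace run or a catch-all special-character run.
const tokenize = (s) => s.match(/[a-zA-Z0-9]+|\s+|[^a-zA-Z0-9\s]+/g) ?? [];

console.log(tokenize('Hello world')); // [ 'Hello', ' ', 'world' ]
console.log(tokenize('我爱北京'));     // [ '我爱北京' ] (one opaque run, no word boundaries)
```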

There are a whole bunch of issues/PRs about this that you may want to track if you're interested in seeing when support gets added.

> what is the general approach to work with other languages like Chinese?

Right now, I would suggest tokenizing into words using an Intl.Segmenter and then diffing using diffArrays. But perhaps I'll improve things soon and there'll be a nicer option using diffWords.
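For concreteness, a minimal sketch of that suggestion (diffByWords is a hypothetical helper name; diffArrays is jsdiff's real export):

```js
const Diff = require('diff');

// Hypothetical helper: split both texts into words with Intl.Segmenter,
// then diff the resulting token arrays with jsdiff's diffArrays.
function diffByWords(oldStr, newStr, locale = 'zh') {
  const seg = new Intl.Segmenter(locale, { granularity: 'word' });
  const tokenize = (s) => Array.from(seg.segment(s), (x) => x.segment);
  return Diff.diffArrays(tokenize(oldStr), tokenize(newStr));
}

// Each hunk's `value` is an array of tokens; join them to rebuild text.
for (const hunk of diffByWords('我喜欢苹果', '我喜欢橙子')) {
  const tag = hunk.added ? '+' : hunk.removed ? '-' : ' ';
  console.log(tag, hunk.value.join(''));
}
```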

@kerams

kerams commented Jun 2, 2024

If performance is not a concern for me, is there any reason at all not to use the segmenter and array diffing instead of char diffing? It seems to me it takes care of all issues related to accents and other kinds of combining characters.

@ExplodingCabbage
Collaborator

Intl doesn't exist in old browsers, so you'd need to shim it if you want broad compatibility. It's also incredibly loosely specced; I spent a while once picking slowly through it to try to find the actual core algorithm for splitting the text into segments and facepalmed when I eventually got to this passage:

> Boundary determination is implementation-dependent, but general default algorithms are specified in Unicode Standard Annex #29. It is recommended that implementations use locale-sensitive tailorings such as those provided by the Common Locale Data Repository (available at https://cldr.unicode.org/).

"Boundary determination" - i.e. determining where the boundaries are between graphemes, words, or sentences - is, like, the whole thing that Intl.Segmenter is for, so this amounts to "different implementations are free to return whatever results they like". Right now, I don't know of any differences between implementations, but in future, the results that Node, Chrome, and Firefox return may diverge not only from what they return today but also from each other. If you need such consistency, this again calls for using a shim (and making sure it's configured so that it's used instead of the platform's native version of Intl.Segmenter, and not merely as a fallback).

But these issues probably fall more into the category of "traps to be aware of and avoid falling into" rather than "reasons to outright not use Intl.Segmenter". It requires a tiny bit more work but will probably just give strictly better results right now, as you say.

The only reason you might want to not do it is that there are some pretty profound problems with including whitespace tokens when diffing, which are fixed in diffWords on master by #497 but not yet fixed in any release on npm. You might want to replicate the new & improved handling of whitespace in whatever logic you do with Intl.Segmenter, but it's a bit complicated, so if you're lazy and don't care much about accents etc then you might instead want to wait for the next release and use diffWords.

(Perhaps I should try to get #438 done in the next release and make it compatible with the new whitespace-handling logic; then you'll be able to have the best of all worlds by simply instantiating an Intl.Segmenter and passing it as a parameter to diffWords.)

@kerams

kerams commented Jun 3, 2024

Fantastic response, much obliged.

I'm only interested in character-by-character diffing, so word boundary issues are not a concern. I have roughly this in place: `if typeof(Intl.Segmenter) = undefined then diffChars(x, y) else diffArrays(segment x, segment y)`. As you say, Firefox annoyingly got Segmenter support only a couple of months ago. So far it seems this approach of using the segmenter for grapheme splitting is superior in every way.
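In JavaScript terms that fallback might look roughly like this (segmentGraphemes is a made-up helper name):

```js
const Diff = require('diff');

// Made-up helper: split a string into grapheme clusters.
function segmentGraphemes(s) {
  const seg = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(seg.segment(s), (x) => x.segment);
}

function diffGraphemes(x, y) {
  // Fall back to plain code-unit diffing on engines without Intl.Segmenter
  // (e.g. older Firefox).
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter === 'undefined') {
    return Diff.diffChars(x, y);
  }
  return Diff.diffArrays(segmentGraphemes(x), segmentGraphemes(y));
}
```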

@ExplodingCabbage
Collaborator

Oh, interesting - sorry, I missed that you were considering this as an alternative to diffChars, not to diffWords. Yeah, assuming you want to split into graphemes (and not either UTF-16 code units or Unicode code points), Intl.Segmenter is probably the way to go. There's also a library called grapheme-splitter that I think predates Intl.Segmenter's existence and hasn't been updated since 2018; I would assume that it is inferior to Intl.Segmenter, but I don't actually know. If you were inclined to be really thorough and investigate the differences between existing Intl.Segmenter implementations and grapheme-splitter, I would be interested in the results; if I had confirmation that grapheme-splitter is simply worse, I'd update my Jan 2020 Stack Overflow answer at https://stackoverflow.com/a/59796758/1709587 that recommends grapheme-splitter for splitting a string into graphemes.
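Such a comparison could be as simple as running both splitters over a corpus of tricky strings and comparing the boundaries, something like this (splitGraphemes is grapheme-splitter's documented method, as far as I recall):

```js
const GraphemeSplitter = require('grapheme-splitter');

const splitter = new GraphemeSplitter();
const seg = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Compare the grapheme boundaries both implementations produce for a string.
function compare(s) {
  const a = splitter.splitGraphemes(s);
  const b = Array.from(seg.segment(s), (x) => x.segment);
  return JSON.stringify(a) === JSON.stringify(b) ? 'same' : { a, b };
}

console.log(compare('e\u0301'));   // combining acute accent: one grapheme?
console.log(compare('👩‍👩‍👦'));   // ZWJ emoji sequence
```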
