Does jsdiff work with Chinese (or other non-English script languages)? #377
Comments
Currently, jsdiff is English-centric.
PRs #395 and #461 both aim to fix this, though neither is merged yet.
There are a whole bunch of issues/PRs about this that you may want to track if you're interested in seeing when support gets added:
Right now, I would suggest tokenizing into words using an |
If performance is not a concern for me, is there any reason at all not to use the segmenter and array diffing instead of char diffing? It seems to me it takes care of all issues related to accents and other kinds of combining characters.
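For anyone following along, here is a rough sketch of the "segmenter plus array diffing" idea. It assumes the segmenter meant here is the built-in `Intl.Segmenter` with `granularity: "grapheme"`. The `graphemes` and `diffTokens` helpers are hypothetical names made up for this illustration, and `diffTokens` is a plain LCS diff standing in for an array diff so that the snippet runs without any dependency. It is not jsdiff's implementation; with jsdiff you would pass the two token arrays to its `diffArrays` function instead.

```javascript
// Split a string into user-perceived characters (grapheme clusters), so that
// a base letter and its combining marks stay together as one token.
function graphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// Hypothetical stand-in for an array diff: a simple LCS-based diff.
// Not jsdiff's algorithm; it only illustrates diffing token arrays.
function diffTokens(oldTokens, newTokens) {
  const n = oldTokens.length;
  const m = newTokens.length;

  // lcs[i][j] = length of the longest common subsequence of
  // oldTokens[i..] and newTokens[j..]
  const lcs = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(0));
  for (let i = n - 1; i >= 0; i--) {
    for (let j = m - 1; j >= 0; j--) {
      lcs[i][j] =
        oldTokens[i] === newTokens[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table from the start and emit one operation per token.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < n && j < m) {
    if (oldTokens[i] === newTokens[j]) {
      ops.push({ type: "equal", value: oldTokens[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ type: "removed", value: oldTokens[i] });
      i++;
    } else {
      ops.push({ type: "added", value: newTokens[j] });
      j++;
    }
  }
  while (i < n) ops.push({ type: "removed", value: oldTokens[i++] });
  while (j < m) ops.push({ type: "added", value: newTokens[j++] });
  return ops;
}

// Usage: diff two short Chinese strings grapheme by grapheme.
const changes = diffTokens(graphemes("我喜欢猫"), graphemes("我喜欢狗"));
console.log(changes);
```

Because the tokens are grapheme clusters rather than UTF-16 code units, a letter followed by a combining accent is compared as a single unit, which is the property the question above is getting at.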
"Boundary determination" - i.e. determining where the boundaries are between graphemes, words, or sentences - is, like, the whole thing that But these issues probably fall more into the category of "traps to be aware of and avoid falling into" rather than "reasons to outright not use The only reason you might want to not do it is that there are some pretty profound problems with including whitespace tokens when diffing, which are fixed in (Perhaps I should try to get #438 done in the next release and make it compatible with the new whitespace-handling logic; then you'll be able to have the best of all worlds by simply instantiating an |
Fantastic response, much obliged. I'm only interested in character-by-character diffing, so word boundary issues are not a concern. I have roughly this in place |
Oh, interesting - sorry, I missed that you were considering this as an alternative to |
If I want to do text differencing to see which Chinese characters were added/removed, would jsdiff work with that? What about other languages with combining characters and such, like Devanagari or Hebrew? I am not super versed in how "text diff" algorithms work, but I imagine it might be very English-/Latin-centric. Is that the case? Or does it work for any other language? If not, what is the general approach to work with other languages like Chinese? Thank you for the help!