Bug on diff words with accent #311

Closed
gleidsonh opened this issue Feb 14, 2021 · 4 comments · Fixed by #494

@gleidsonh

Some Portuguese words have accents or hyphens, or sometimes both, as in this example:

const Diff = require('diff');

const one = 'Para animá-los quando...';
const other = 'Para encorajá-los quando...';

const diff = Diff.diffWordsWithSpace(one, other);

let oldSentence = '';
let newSentence = '';
diff.forEach((part) => {
    if (part.removed) oldSentence += part.value;
    if (part.added) newSentence += part.value;
});

console.log('oldSentence', oldSentence);
console.log('newSentence', newSentence);
// diff yields 'anim' and 'encoraj': the words are truncated at the accent.

// Just for testing: if the accent is removed, the result is
// 'anima' and 'encoraja': truncated at the hyphen.

// If the hyphen is removed instead, the result is
// 'animalos' and 'encorajalos': it works fine.

Is this a known bug? Can I try to fix it and send a PR?

@gleidsonh (Author)

@kpdecker Can you please answer me?

@ExplodingCabbage (Collaborator)

Hmm. Tokenization of text with accents is definitely broken:

> wordDiff.tokenize("Para animá-los quando...")
[
  'Para',   ' ',
  'anim',   '',
  'á-',     '',
  'los',    ' ',
  'quando', '',
  '...'
]

@ExplodingCabbage (Collaborator)

A super-weird thing here is that the tokenizer only breaks the á off the word it belongs to when it appears at the end of the word:

> wordDiff.tokenize("xyzáxyz")
[ 'xyzáxyz' ]
> wordDiff.tokenize("xyzá-foo")
[ 'xyz', '', 'á-', '', 'foo' ]
> wordDiff.tokenize("xyza-foo")
[ 'xyza', '', '-', '', 'foo' ]

@ExplodingCabbage (Collaborator)

Ah, so it's because we split on (among other things) \b in this regex:

let tokens = value.split(/([^\S\r\n]+|[()[\]{}'"\r\n]|\b)/);

\b considers non-ASCII letters to be non-letters:

> 'fooáfoo'.split(/\b/);
[ 'foo', 'á', 'foo' ]
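
For comparison, a boundary built from Unicode property escapes keeps the á attached. This is just a quick REPL sketch, not what the library currently does, and it assumes an engine with \p{...} and lookbehind support (e.g. modern Node):

> 'fooáfoo'.split(/(?<=\p{L})(?=\P{L})|(?<=\P{L})(?=\p{L})/u)
[ 'fooáfoo' ]
> 'fooá-foo'.split(/(?<=\p{L})(?=\P{L})|(?<=\P{L})(?=\p{L})/u)
[ 'fooá', '-', 'foo' ]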

The only reason this doesn't break an á in the middle of a word into its own token is that we have some logic for stitching the words back together. But that logic doesn't work when the á is at the beginning or end of a word:

// Join the boundary splits that we do not consider to be boundaries. This is primarily the extended Latin character set.
for (let i = 0; i < tokens.length - 1; i++) {
  // If we have an empty string in the next field and we have only word chars before and after, merge
  if (!tokens[i + 1] && tokens[i + 2]
        && extendedWordChars.test(tokens[i])
        && extendedWordChars.test(tokens[i + 2])) {
    tokens[i] += tokens[i + 2];
    tokens.splice(i + 1, 2);
    i--;
  }
}

This is a mess IMO and needs a rewrite.
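
One possible direction for that rewrite (an illustrative sketch only, not necessarily the approach taken in the fix for #494): match runs of Unicode letters and digits directly with \p{...} escapes instead of splitting on the ASCII-only \b, so no stitching pass is needed afterwards:

// Sketch of a Unicode-aware tokenizer. This `tokenize` is a standalone
// illustration, not jsdiff's actual implementation.
function tokenize(value) {
  // Match runs of Unicode letters/digits, runs of non-newline whitespace,
  // or any other single character. The `u` flag enables \p{...}.
  return value.match(/[\p{L}\p{N}]+|[^\S\r\n]+|[\s\S]/gu) || [];
}

tokenize('Para animá-los quando...');
// => [ 'Para', ' ', 'animá', '-', 'los', ' ', 'quando', '.', '.', '.' ]

This keeps 'animá' and 'encorajá' intact as single tokens, so a diff built on top of it would no longer truncate at the accent or the hyphen. Note it does not reproduce the empty-string separators the current split-based tokenizer emits.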
