Bug on diff words with accent #311

Closed
gleidsonh opened this issue Feb 14, 2021 · 4 comments · Fixed by #494

@gleidsonh

Some Portuguese words have accents or hyphens, or sometimes both, as in this example:

const Diff = require('diff');

const one = 'Para animá-los quando...';
const other = 'Para encorajá-los quando...';

const diff = Diff.diffWordsWithSpace(one, other);

let oldSentence = '';
let newSentence = '';
diff.forEach((part) => {
    if (part.removed) oldSentence += part.value;
    if (part.added) newSentence += part.value;
});

console.log('oldSentence', oldSentence);
console.log('newSentence', newSentence);
// diff yields 'anim' and 'encoraj': the words are truncated at the accent.

// Just for testing: if the accent is removed, the result is
// 'anima' and 'encoraja': truncated at the hyphen.

// If the hyphen is removed instead, the result is
// 'animalos' and 'encorajalos': it works fine.

Is this a known bug? Can I try to fix it and send a PR?

@gleidsonh (Author)

@kpdecker Can you please answer me?

@ExplodingCabbage (Collaborator)

Hmm. Tokenization of text with accents is definitely broken:

> wordDiff.tokenize("Para animá-los quando...")
[
  'Para',   ' ',
  'anim',   '',
  'á-',     '',
  'los',    ' ',
  'quando', '',
  '...'
]

@ExplodingCabbage (Collaborator)

A super-weird thing here is that the tokenizer only breaks the á off the word it belongs to when it appears at the end of the word:

> wordDiff.tokenize("xyzáxyz")
[ 'xyzáxyz' ]
> wordDiff.tokenize("xyzá-foo")
[ 'xyz', '', 'á-', '', 'foo' ]
> wordDiff.tokenize("xyza-foo")
[ 'xyza', '', '-', '', 'foo' ]

@ExplodingCabbage (Collaborator)

Ah, so it's because we split on (among other things) \b in this regex:

let tokens = value.split(/([^\S\r\n]+|[()[\]{}'"\r\n]|\b)/);

\b considers non-ASCII letters to be non-letters:

> 'fooáfoo'.split(/\b/);
[ 'foo', 'á', 'foo' ]
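
For comparison, a boundary built from Unicode property escapes keeps the á attached. This is just a quick REPL sketch, not what the library currently does, and it assumes an engine with \p{...} and lookbehind support (e.g. modern Node):

> 'fooáfoo'.split(/(?<=\p{L})(?=\P{L})|(?<=\P{L})(?=\p{L})/u)
[ 'fooáfoo' ]
> 'fooá-foo'.split(/(?<=\p{L})(?=\P{L})|(?<=\P{L})(?=\p{L})/u)
[ 'fooá', '-', 'foo' ]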

The only reason this doesn't break an á in the middle of a word into its own token is that we have some logic for stitching the words back together. But that logic doesn't work when the á is at the beginning or end of a word:

// Join the boundary splits that we do not consider to be boundaries. This is primarily the extended Latin character set.
for (let i = 0; i < tokens.length - 1; i++) {
  // If we have an empty string in the next field and we have only word chars before and after, merge
  if (!tokens[i + 1] && tokens[i + 2]
        && extendedWordChars.test(tokens[i])
        && extendedWordChars.test(tokens[i + 2])) {
    tokens[i] += tokens[i + 2];
    tokens.splice(i + 1, 2);
    i--;
  }
}

This is a mess IMO and needs a rewrite.
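
One possible direction for that rewrite (an illustrative sketch only, not necessarily the approach taken in the fix for #494): match runs of Unicode letters and digits directly with \p{...} escapes instead of splitting on the ASCII-only \b, so no stitching pass is needed afterwards:

// Sketch of a Unicode-aware tokenizer. This `tokenize` is a standalone
// illustration, not jsdiff's actual implementation.
function tokenize(value) {
  // Match runs of Unicode letters/digits, runs of non-newline whitespace,
  // or any other single character. The `u` flag enables \p{...}.
  return value.match(/[\p{L}\p{N}]+|[^\S\r\n]+|[\s\S]/gu) || [];
}

tokenize('Para animá-los quando...');
// => [ 'Para', ' ', 'animá', '-', 'los', ' ', 'quando', '.', '.', '.' ]

This keeps 'animá' and 'encorajá' intact as single tokens, so a diff built on top of it would no longer truncate at the accent or the hyphen. Note it does not reproduce the empty-string separators the current split-based tokenizer emits.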
