In Marian, invalid alignments lead to a crash, since the token index bounds are not checked. This breaks training. In addition, if alignments are generated incorrectly on the OpusTrainer side, they will degrade final performance when using guided alignment training. It should be cheap and easy to validate that alignments are within bounds; if they are out of bounds, the sentence pair can be discarded with a warning.
This could be done like the following:
diff --git a/src/opustrainer/alignments.py b/src/opustrainer/alignments.py
index 6c8b316..05f8ca6 100644
--- a/src/opustrainer/alignments.py
+++ b/src/opustrainer/alignments.py
@@ -23,3 +23,19 @@ def format_alignments(pairs:List[Pair]) -> str:
"""Opposite of `parse_alignments`, turns a list of alignments back into the `a-b c-d ...` string
format that most alignment tools expect."""
return ' '.join(f'{pair.src}-{pair.trg}' for pair in pairs)
+
+
+def validate_alignments(source: List[str], target: List[str], alignments: List[Pair]) -> None:
+ for pair in alignments:
+ if (
+ pair.src < 0
+ or pair.src >= len(source)
+ or pair.trg < 0
+ or pair.trg >= len(target)
+ ):
+        raise ValueError(
+ f"Alignments were not valid for:\n"
+ f"Source: {source}\n"
+ f"Target: {target}\n"
+ f"Alignments: {alignments}"
+ )
diff --git a/src/opustrainer/modifiers/placeholders.py b/src/opustrainer/modifiers/placeholders.py
index 4af71b0..9e23d63 100644
--- a/src/opustrainer/modifiers/placeholders.py
+++ b/src/opustrainer/modifiers/placeholders.py
@@ -3,7 +3,7 @@ from operator import attrgetter
from pathlib import Path
from typing import Set, List, Tuple, Optional, TypeVar, Iterable
-from opustrainer.alignments import Pair, parse_alignments, format_alignments
+from opustrainer.alignments import Pair, parse_alignments, format_alignments, validate_alignments
from opustrainer.modifiers import Modifier
from opustrainer.tokenizers import SpaceDetokenizer, SpaceTokenizer, MosesDetokenizer, SentencePieceTokenizer
from opustrainer.modifiers.retokenize import Retokenizer, remap_alignment_pairs
@@ -394,6 +394,7 @@ class PlaceholderTagModifier(Modifier):
if self.print_alignments:
remapped_pairs = remap_alignment_pairs(source_mapping, target_mapping, alignments)
+ validate_alignments(source, target, remapped_pairs)
return source_detok + "\t" + target_detok + "\t" + format_alignments(remapped_pairs)
else:
return source_detok + "\t" + target_detok
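For illustration, the proposed check can be exercised standalone. This is a minimal sketch, not the actual OpusTrainer code: `Pair` is redefined here only for self-containment, matching the `src`/`trg` attributes used in the diff above.

```python
from typing import List, NamedTuple


class Pair(NamedTuple):
    """Stand-in for opustrainer.alignments.Pair: one src-trg token alignment."""
    src: int
    trg: int


def validate_alignments(source: List[str], target: List[str], alignments: List[Pair]) -> None:
    """Raise if any alignment pair points outside the source or target token lists."""
    for pair in alignments:
        if not (0 <= pair.src < len(source) and 0 <= pair.trg < len(target)):
            raise ValueError(
                f"Alignments were not valid for:\n"
                f"Source: {source}\n"
                f"Target: {target}\n"
                f"Alignments: {alignments}"
            )


# In-bounds alignments pass silently; an index past the end of either
# token list raises ValueError.
validate_alignments(['hello', 'world'], ['hallo', 'wereld'], [Pair(0, 0), Pair(1, 1)])
```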
One drawback is that it still won't catch issues where the wrong tokenization strategy is used, as in #53.
It looks like `parse_alignments` already does some validation:
if src_tokens is not None and trg_tokens is not None:
for pair in pairs:
if pair.src < 0 or pair.src >= len(src_tokens) \
or pair.trg < 0 or pair.trg >= len(trg_tokens):
raise ValueError('Out-of-bound alignment pairs')
But I think the alignments should be validated again after they have been modified. I would also prefer that anything failing validation be reported through a logger; it's hard to debug otherwise when you are dealing with tens of millions of sentence pairs.
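The log-and-discard behaviour suggested here could look roughly like the following. This is a hedged sketch, not OpusTrainer's actual API: `Pair` is redefined for self-containment, and the logger name and helper name `filter_valid_alignments` are assumptions.

```python
import logging
from typing import List, NamedTuple

# Hypothetical logger name; OpusTrainer's real logging setup may differ.
logger = logging.getLogger('opustrainer')


class Pair(NamedTuple):
    """Stand-in for opustrainer.alignments.Pair: one src-trg token alignment."""
    src: int
    trg: int


def filter_valid_alignments(source: List[str], target: List[str],
                            alignments: List[Pair]) -> List[Pair]:
    """Return the alignments unchanged if every pair is in bounds.

    If any pair is out of bounds, log a warning with enough context to
    debug the offending sentence pair and return an empty list, signalling
    that the caller should discard the pair instead of crashing Marian.
    """
    for pair in alignments:
        if not (0 <= pair.src < len(source) and 0 <= pair.trg < len(target)):
            logger.warning(
                'Out-of-bound alignment pair %s; discarding sentence pair. '
                'Source: %r Target: %r', pair, source, target)
            return []
    return alignments
```

Returning an empty list (rather than raising) keeps the training pipeline moving while the warning leaves a trail for later inspection.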