Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synthesized alignments should be validated before producing a new sentence pair #54

Open
gregtatum opened this issue Mar 1, 2024 · 1 comment

Comments

@gregtatum
Copy link
Contributor

In Marian, invalid alignments leads to a crash, as the index bounds for tokens is not checked. This breaks training. Plus, if alignments are generated incorrectly on the OpusTrainer side, this will degrade the final performance when using guided alignment training. It should be cheap and easy to validate that the alignments are within bounds. If they are out of bounds, then the sentence pair can be discarded with a warning.

This could be done like the following:

diff --git a/src/opustrainer/alignments.py b/src/opustrainer/alignments.py
index 6c8b316..05f8ca6 100644
--- a/src/opustrainer/alignments.py
+++ b/src/opustrainer/alignments.py
@@ -23,3 +23,19 @@ def format_alignments(pairs:List[Pair]) -> str:
     """Opposite of `parse_alignments`, turns a list of alignments back into the `a-b c-d ...` string
     format that most alignment tools expect."""
     return ' '.join(f'{pair.src}-{pair.trg}' for pair in pairs)
+
+
+def validate_alignments(source: List[str], target: List[str], alignments: List[Pair]):
+    for pair in alignments:
+        if (
+            pair.src < 0
+            or pair.src >= len(source)
+            or pair.trg < 0
+            or pair.trg >= len(target)
+        ):
+            raise Exception(
+                f"Alignments were not valid for:\n"
+                f"Source: {source}\n"
+                f"Target: {target}\n"
+                f"Alignments: {alignments}"
+            )
diff --git a/src/opustrainer/modifiers/placeholders.py b/src/opustrainer/modifiers/placeholders.py
index 4af71b0..9e23d63 100644
--- a/src/opustrainer/modifiers/placeholders.py
+++ b/src/opustrainer/modifiers/placeholders.py
@@ -3,7 +3,7 @@ from operator import attrgetter
 from pathlib import Path
 from typing import Set, List, Tuple, Optional, TypeVar, Iterable
 
-from opustrainer.alignments import Pair, parse_alignments, format_alignments
+from opustrainer.alignments import Pair, parse_alignments, format_alignments, validate_alignments
 from opustrainer.modifiers import Modifier
 from opustrainer.tokenizers import SpaceDetokenizer, SpaceTokenizer, MosesDetokenizer, SentencePieceTokenizer
 from opustrainer.modifiers.retokenize import Retokenizer, remap_alignment_pairs
@@ -394,6 +394,7 @@ class PlaceholderTagModifier(Modifier):
         
         if self.print_alignments:
             remapped_pairs = remap_alignment_pairs(source_mapping, target_mapping, alignments)
+            validate_alignments(source, target, remapped_pairs)
             return source_detok + "\t" + target_detok + "\t" + format_alignments(remapped_pairs)
         else:
             return source_detok + "\t" + target_detok

One draw-back, is that it still won't catch issues where the wrong tokenization strategy is used like in #53.

@gregtatum
Copy link
Contributor Author

It looks like parse_alignments has some validation:

    if src_tokens is not None and trg_tokens is not None:
        for pair in pairs:
            if pair.src < 0 or pair.src >= len(src_tokens) \
            or pair.trg < 0 or pair.trg >= len(trg_tokens):
                raise ValueError('Out-of-bound alignment pairs')

But I think the alignments should be validated again after being modified. I would also prefer to output to a logger for anything that didn't get validated. It's hard to debug when you are dealing with tens of millions of sentence pairs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant