fix: split text by any newline and spaces #178

franluca · 2024-01-24T15:33:41Z

Description of changes: when computing F1 measures for QA, split the text on any run of newlines or spaces (previously the text was split on single spaces only, so newlines were not treated as separators).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

bilalaws · 2024-01-24T16:38:32Z

src/fmeval/eval_algorithms/qa_accuracy.py

@@ -124,7 +134,7 @@ def _f1_score(
    """
    model_output = _normalize_and_strip_text(model_output, normalize_text=normalize_text, strip_text=strip_text)
    target_output = _normalize_and_strip_text(target_output, normalize_text=normalize_text, strip_text=strip_text)
-    ret = f_measure(reference=set(target_output.split(" ")), test=set(model_output.split(" ")))
+    ret = f_measure(reference=set(_split(target_output)), test=set(_split(model_output)))


Do we really need a separate function? Could we not replace split(" ") with split()? Simply using split() gives the desired behavior that you specified in the unit tests.

Ah, makes sense. I didn't know the behavior of split with no separator. Actually, from the docs the behavior is not that clear: https://docs.python.org/3.10/library/stdtypes.html?highlight=str%20split#str.split
There is no mention of newlines and other special characters. But yes, str.split() seems to achieve precisely what we need. Will change.
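
To make the difference discussed in this thread concrete, here is a minimal sketch comparing the two calls. The sample string is made up for illustration and is not taken from the PR's unit tests.

```python
# Illustrative input (hypothetical): a double space, single and double
# newlines, and a trailing space.
text = "the  quick\nbrown\n\nfox "

# split(" ") cuts on every single space only: consecutive spaces produce
# empty strings, and newlines stay embedded inside the tokens.
by_space = text.split(" ")

# split() with no separator cuts on runs of any whitespace and drops
# empty strings at the start and end.
by_whitespace = text.split()

print(by_space)       # ['the', '', 'quick\nbrown\n\nfox', '']
print(by_whitespace)  # ['the', 'quick', 'brown', 'fox']
```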


franluca · 2024-01-25T09:41:38Z

Made changes to documentation, added a test case and corrected lint.

I think it's fine to have a separate function for split even if it only calls str.split. This adds clarity to the code.

bilalaws

Thanks Luca. Left a couple of non-blocking comments.

bilalaws · 2024-01-25T10:09:20Z

src/fmeval/eval_algorithms/qa_accuracy.py

@@ -105,6 +105,17 @@ def _normalize_and_strip_text(text: str, *, normalize_text: bool = False, strip_
    return text


+def _split(text: str) -> List[str]:
+    """
+    Split text matching one or more spaces (\s+) or one or more new lines (\n+) or any other string.whitespace


For the sake of conciseness, consider updating this to: "Split text based on string.whitespace characters."
Reason: both space and newline are included in string.whitespace.
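
As a quick check of the reasoning above, the sketch below verifies that every character in string.whitespace acts as a separator for str.split() with no arguments. The variable names are illustrative only.

```python
import string

# string.whitespace is the ASCII whitespace set:
# space, \t, \n, \r, \x0b (vertical tab), \x0c (form feed).
for ch in string.whitespace:
    # Each whitespace character on its own separates the two tokens.
    assert f"a{ch}b".split() == ["a", "b"]

# A run made of all of them together is still treated as one separator.
mixed = "a" + string.whitespace + "b"
tokens = mixed.split()
print(tokens)  # ['a', 'b']
```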

bilalaws · 2024-01-25T10:15:27Z

src/fmeval/eval_algorithms/qa_accuracy.py

@@ -124,7 +134,7 @@ def _f1_score(
    """
    model_output = _normalize_and_strip_text(model_output, normalize_text=normalize_text, strip_text=strip_text)
    target_output = _normalize_and_strip_text(target_output, normalize_text=normalize_text, strip_text=strip_text)
-    ret = f_measure(reference=set(target_output.split(" ")), test=set(model_output.split(" ")))
+    ret = f_measure(reference=set(_split(target_output)), test=set(_split(model_output)))


The only thing _split does is call the standard library split() function, so consider calling split() directly. It's just one more level that the reader has to go through, and I'm not sure that it adds any clarity, since the reader can simply look at the documentation of the standard library split() to understand what it does.
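
For context on why the tokenization matters for the score, here is a rough sketch of a token-set F1. It is an assumption-laden stand-in: `token_f1` is a hypothetical helper written for this example, not the `f_measure` call used in the diff, and it skips the normalization and stripping step. The sample strings are made up.

```python
from typing import Callable, List


def token_f1(target: str, output: str, tokenize: Callable[[str], List[str]]) -> float:
    """Hypothetical helper: F1 over the sets of tokens in target and output."""
    reference, test = set(tokenize(target)), set(tokenize(output))
    overlap = len(reference & test)
    if overlap == 0:
        return 0.0
    precision = overlap / len(test)       # share of output tokens that are correct
    recall = overlap / len(reference)     # share of target tokens that were found
    return 2 * precision * recall / (precision + recall)


# Same words, but the target separates two of them with a newline.
target = "new york\ncity"
output = "new york city"

old_score = token_f1(target, output, lambda s: s.split(" "))  # single-space split
new_score = token_f1(target, output, str.split)               # any-whitespace split

print(round(old_score, 3))  # 0.4, because "york\ncity" is kept as one token
print(new_score)            # 1.0
```

With the single-space split, the newline glues two words into one token, so identical answers score well below 1.0; splitting on any whitespace removes that penalty.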

fix: split text by any newline and spaces

e70443c

franluca requested review from bilalaws, polaschwoebel and keerthanvasist January 24, 2024 15:33

bilalaws reviewed Jan 24, 2024

franluca added 2 commits January 24, 2024 19:25

fix: split text by any newline and spaces

7648fb6

fix: split text by any newline and spaces

7c274e6

bilalaws previously approved these changes Jan 25, 2024

fix: split text by any newline and spaces

d9af92b

franluca dismissed bilalaws’s stale review via d9af92b January 29, 2024 17:23

keerthanvasist approved these changes Jan 29, 2024

bilalaws approved these changes Jan 30, 2024

Merge branch 'main' into new_lines_qa_patch

32654f9

franluca merged commit 341b3b3 into main Jan 30, 2024
4 checks passed

xiaoyi-cheng deleted the new_lines_qa_patch branch February 20, 2024 20:03
