-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: split text by any newline and spaces #178
Conversation
@@ -124,7 +134,7 @@ def _f1_score( | |||
""" | |||
model_output = _normalize_and_strip_text(model_output, normalize_text=normalize_text, strip_text=strip_text) | |||
target_output = _normalize_and_strip_text(target_output, normalize_text=normalize_text, strip_text=strip_text) | |||
ret = f_measure(reference=set(target_output.split(" ")), test=set(model_output.split(" "))) | |||
ret = f_measure(reference=set(_split(target_output)), test=set(_split(model_output))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need a separate function? Could we not replace split(" ")
with split()
. Simply using split()
gives the desired behavior that you specified in the unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah makes sense. I didn't know the behavior of split with no separator. Actually from the docs the behavior is not that clear. https://docs.python.org/3.10/library/stdtypes.html?highlight=str%20split#str.split
there is no mention about newlines and other special characters. But yes str.split() seems to achieve precisely what we need. Will change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing _split
does is to call the standard library split()
function, so consider calling split()
directly. Its just one more level that the reader has to go through and not sure that it adds any clarity since the user can simply look at the documentation of standard library split()
to understand what it does.
Made changes to documentation, added a test case and corrected lint. I think it's fine to have one separate function for split even if it calls only str.split . This adds clarity to the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Luca. Left a couple of non-blocking comments.
@@ -105,6 +105,17 @@ def _normalize_and_strip_text(text: str, *, normalize_text: bool = False, strip_ | |||
return text | |||
|
|||
|
|||
def _split(text: str) -> List[str]: | |||
""" | |||
Split text matching one or more spaces (\s+) or one or more new lines (\n+) or any other string.whitespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the sake of conciseness, consider updating this to: Split text based on string.whitespace
characters.
Reason: space and newline are included in string.whitespace
characters.
@@ -124,7 +134,7 @@ def _f1_score( | |||
""" | |||
model_output = _normalize_and_strip_text(model_output, normalize_text=normalize_text, strip_text=strip_text) | |||
target_output = _normalize_and_strip_text(target_output, normalize_text=normalize_text, strip_text=strip_text) | |||
ret = f_measure(reference=set(target_output.split(" ")), test=set(model_output.split(" "))) | |||
ret = f_measure(reference=set(_split(target_output)), test=set(_split(model_output))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing _split
does is to call the standard library split()
function, so consider calling split()
directly. Its just one more level that the reader has to go through and not sure that it adds any clarity since the user can simply look at the documentation of standard library split()
to understand what it does.
Description of changes: when computing f1 measures for QA, split the text by any number of newlines or spaces (previously was considering only one new line).
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.