Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: split text by any newline and spaces #178

Merged
merged 5 commits into from
Jan 30, 2024
Merged

Conversation

franluca
Copy link
Contributor

Description of changes: when computing f1 measures for QA, split the text by any number of newlines or spaces (previously was considering only one new line).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@@ -124,7 +134,7 @@ def _f1_score(
"""
model_output = _normalize_and_strip_text(model_output, normalize_text=normalize_text, strip_text=strip_text)
target_output = _normalize_and_strip_text(target_output, normalize_text=normalize_text, strip_text=strip_text)
ret = f_measure(reference=set(target_output.split(" ")), test=set(model_output.split(" ")))
ret = f_measure(reference=set(_split(target_output)), test=set(_split(model_output)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need a separate function? Could we not replace split(" ") with split(). Simply using split() gives the desired behavior that you specified in the unit tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah makes sense. I didn't know the behavior of split with no separator. Actually from the docs the behavior is not that clear. https://docs.python.org/3.10/library/stdtypes.html?highlight=str%20split#str.split
there is no mention about newlines and other special characters. But yes str.split() seems to achieve precisely what we need. Will change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing _split does is to call the standard library split() function, so consider calling split() directly. Its just one more level that the reader has to go through and not sure that it adds any clarity since the user can simply look at the documentation of standard library split() to understand what it does.

@franluca
Copy link
Contributor Author

Made changes to documentation, added a test case and corrected lint.

I think it's fine to have one separate function for split even if it calls only str.split . This adds clarity to the code.

bilalaws
bilalaws previously approved these changes Jan 25, 2024
Copy link
Contributor

@bilalaws bilalaws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Luca. Left a couple of non-blocking comments.

@@ -105,6 +105,17 @@ def _normalize_and_strip_text(text: str, *, normalize_text: bool = False, strip_
return text


def _split(text: str) -> List[str]:
"""
Split text matching one or more spaces (\s+) or one or more new lines (\n+) or any other string.whitespace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of conciseness, consider updating this to: Split text based on string.whitespace characters.
Reason: space and newline are included in string.whitespace characters.

@@ -124,7 +134,7 @@ def _f1_score(
"""
model_output = _normalize_and_strip_text(model_output, normalize_text=normalize_text, strip_text=strip_text)
target_output = _normalize_and_strip_text(target_output, normalize_text=normalize_text, strip_text=strip_text)
ret = f_measure(reference=set(target_output.split(" ")), test=set(model_output.split(" ")))
ret = f_measure(reference=set(_split(target_output)), test=set(_split(model_output)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing _split does is to call the standard library split() function, so consider calling split() directly. Its just one more level that the reader has to go through and not sure that it adds any clarity since the user can simply look at the documentation of standard library split() to understand what it does.

@franluca franluca merged commit 341b3b3 into main Jan 30, 2024
4 checks passed
@xiaoyi-cheng xiaoyi-cheng deleted the new_lines_qa_patch branch February 20, 2024 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants