Ensure that documents are properly segmented into sentences before submitting to the pipeline #22

Open
albertmeronyo opened this issue Oct 4, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@albertmeronyo
Member

This is especially a problem with HTML documents, which don't seem to be stripped correctly.

@dignityc Find an example for this and add it here.

@albertmeronyo albertmeronyo added the bug Something isn't working label Oct 4, 2024
@dignityc
Collaborator

I created a Google spreadsheet to include bad examples of sentence segmentation in ProVe's functionality.
Link: https://docs.google.com/spreadsheets/d/1fNIw7RUNRfyi8Ek6tomhUYDtHL0_Xrg5B47KnD42joI/edit?usp=sharing

Before summarizing the examples, I will explain the sentence segmentation process in ProVe's backend code.

  1. It loads HTML documents from the database; the earlier code that stores them must already have been executed.

  2. An HTML tag cleaner removes most tags and, for some tags, their contents as well. For example, <head> {contents} </head> is removed entirely because such tags typically hold metadata, such as the type of the HTML document or the versions of the scripts it uses. After cleaning, the expected result is plain text with some noise.

  3. Sentence segmentation with the sent_tokenizer model: the current ProVe code uses a pre-trained model from the NLTK package to split the cleaned HTML text into separate sentences.
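
Steps 2 and 3 can be sketched roughly as below. This is a minimal illustration, not ProVe's actual code: `TagCleaner`, `clean_html`, `split_sentences`, and the `DROP_WITH_CONTENTS` list are hypothetical names, and a simple regex splitter stands in for the NLTK model so the example runs with the standard library only.

```python
import re
from html.parser import HTMLParser

# Assumed list of tags whose contents are discarded entirely
# (hypothetical; ProVe's real cleaner may use a different set).
DROP_WITH_CONTENTS = {"head", "script", "style"}


class TagCleaner(HTMLParser):
    """Collects text nodes, skipping everything inside dropped tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENTS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENTS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def clean_html(html):
    """Step 2: strip tags and return whitespace-normalized plain text."""
    cleaner = TagCleaner()
    cleaner.feed(html)
    cleaner.close()
    # Joining with spaces keeps adjacent list items or cells apart,
    # but it also loses every structural boundary between them.
    return re.sub(r"\s+", " ", " ".join(cleaner.parts)).strip()


def split_sentences(text):
    """Step 3 stand-in: split after '.', '!' or '?' followed by whitespace.

    Assumption: this regex only approximates a trained sentence tokenizer.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


sample = (
    "<html><head><title>Demo</title></head><body>"
    "<p>Paris is in France. It is a city.</p>"
    "<ul><li>Population: 2 million</li><li>Area: 105 km2</li></ul>"
    "</body></html>"
)
sentences = split_sentences(clean_html(sample))
```

On this sample the two list items come out as one "sentence", `Population: 2 million Area: 105 km2`, because nothing in the flattened text marks where one item ends and the next begins.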

From the bad examples, my observation is that most issues are caused by structured data in the HTML document. The current ProVe code cannot properly handle structured data such as tables or bullet lists. There may be Python libraries that can handle structured data, which I will explore later.
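
One possible direction, sketched here under assumptions rather than taken from ProVe, is to treat block-level tags as hard boundaries and segment each block separately, so that list items and table cells never merge. The names `BlockExtractor`, `segment_blocks`, `BLOCK_TAGS`, and `DROP_WITH_CONTENTS` are hypothetical, and the regex splitter is again only a stand-in for the NLTK model.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets, chosen for illustration only.
BLOCK_TAGS = {"p", "div", "li", "tr", "td", "th", "br",
              "h1", "h2", "h3", "h4", "h5", "h6"}
DROP_WITH_CONTENTS = {"head", "script", "style"}


class BlockExtractor(HTMLParser):
    """Collects text per block, closing a block at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.current = []
        self.skip_depth = 0

    def _flush(self):
        # Inline tags do not split words, so join without separators.
        text = re.sub(r"\s+", " ", "".join(self.current)).strip()
        if text:
            self.blocks.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENTS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENTS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any trailing text outside a block tag


def split_sentences(text):
    """Regex stand-in for a trained sentence tokenizer (assumption)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def segment_blocks(html):
    """Segment each block on its own so structure is never crossed."""
    extractor = BlockExtractor()
    extractor.feed(html)
    extractor.close()
    sentences = []
    for block in extractor.blocks:
        sentences.extend(split_sentences(block))
    return sentences


sample = (
    "<p>Paris is in <b>France</b>. It is a city.</p>"
    "<ul><li>Population: 2 million</li><li>Area: 105 km2</li></ul>"
)
result = segment_blocks(sample)
```

Here each list item becomes its own unit even without final punctuation. Whether table cells should instead be grouped per row is a design choice that would need checking against the examples in the spreadsheet.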
