Ensure that documents are properly segmented into sentences before submitting to the pipeline #22

albertmeronyo · 2024-10-04T11:17:03Z

This applies especially to HTML documents, whose markup doesn't seem to be stripped correctly.

@dignityc Find an example for this and add it here.

dignityc · 2024-10-24T09:33:24Z

I created a Google spreadsheet to collect bad examples of sentence segmentation in ProVe.
Link: https://docs.google.com/spreadsheets/d/1fNIw7RUNRfyi8Ek6tomhUYDtHL0_Xrg5B47KnD42joI/edit?usp=sharing

Before summarizing the examples, I will explain the sentence segmentation process in ProVe's backend code.

1. Loading: The code loads HTML documents from the database, which must already have been populated by the preceding step of the code.
2. HTML tag cleaning: An HTML tag cleaner removes most tags, and for some tags their contents as well. For example, <head> {contents} </head> is removed entirely because these tags typically contain metadata, such as the HTML document type or the versions of the scripts used. After cleaning, the expected result is plain text with some noise.
3. Sentence segmentation with the sent_tokenizer model: The current ProVe code uses a pre-trained model from the NLTK package for sentence segmentation. This model splits the cleaned HTML text into separate sentences.
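The cleaning and segmentation steps above can be sketched roughly as follows. This is a minimal illustration, not ProVe's actual code: the names `TagCleaner`, `clean_html`, `split_sentences` and the `SKIPPED_TAGS` set are hypothetical, and a naive regex splitter stands in for the NLTK model so that the sketch needs only the standard library.

```python
import re
from html.parser import HTMLParser

# Tags whose contents are dropped entirely (an assumed set, not ProVe's actual list).
SKIPPED_TAGS = {"head", "script", "style"}


class TagCleaner(HTMLParser):
    """Collects visible text and ignores everything inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse whitespace runs so the result is one flat line of plain text.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def clean_html(html):
    cleaner = TagCleaner()
    cleaner.feed(html)
    cleaner.close()
    return cleaner.text()


def split_sentences(text):
    # Naive stand-in for the NLTK model: split after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


html = """<html><head><title>Ignored</title></head><body>
<p>Ada Lovelace was a mathematician. She wrote about the Analytical Engine.</p>
<ul><li>Born 1815</li><li>Died 1852</li></ul>
</body></html>"""

sentences = split_sentences(clean_html(html))
for sentence in sentences:
    print(sentence)
```

In this sketch the two list items come out as a single "sentence" (`Born 1815 Died 1852`), because the flattened text carries no punctuation at the item boundary. That is the kind of failure a punctuation-driven splitter is prone to on structured HTML.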

From the bad examples, my observation is that most issues are caused by structured data in the HTML document. The current ProVe code cannot properly handle structured data such as tables or bullet lists. I suspect there are ways to handle structured data with Python libraries, which I will explore later.
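One possible direction, sketched here with only the standard library, is to treat block-level elements as segment boundaries so that each list item or table row becomes its own segment before sentence splitting. This is an assumption about how the problem could be approached, not something ProVe does today. The names `BlockSegmenter` and `segment_blocks`, and the choice of tags in `BLOCK_TAGS`, are hypothetical.

```python
from html.parser import HTMLParser

# Elements treated as segment boundaries (an assumed, illustrative set).
BLOCK_TAGS = {"p", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Elements whose contents are dropped entirely (also an assumption).
SKIPPED_TAGS = {"head", "script", "style"}


class BlockSegmenter(HTMLParser):
    """Emits one text segment per block-level element instead of one flat string."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Normalize whitespace and store the buffered text as one segment.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any trailing text that follows the last block element


def segment_blocks(html):
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.segments


example = (
    "<ul><li>Born 1815</li><li>Died 1852</li></ul>"
    "<table><tr><td>Field</td><td>Mathematics</td></tr></table>"
)
segments = segment_blocks(example)
for segment in segments:
    print(segment)
```

Each resulting segment could then be passed to the sentence tokenizer separately, so that items and rows are never merged across element boundaries. Table cells within a row are still joined by a space here; whether rows should be split further into cells is an open design question.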

albertmeronyo added the bug label on Oct 4, 2024
