Before summarizing the examples, I will explain the sentence segmentation process in ProVe's backend code.
It loads HTML documents from the database; these must have been fetched and stored by the previous step in the pipeline.
An HTML tag cleaner removes most tags together with their contents. For example, `<head> {contents} </head>` is removed entirely, since the `<head>` element typically holds metadata such as the document type or the versions of scripts the page uses. After cleaning, we expect plain text with some residual noise.
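For illustration, here is a minimal sketch of this kind of tag cleaning, assuming BeautifulSoup; the function name and the exact tag list are mine, and ProVe's actual cleaner may differ:

```python
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    """Strip non-content tags and return the remaining plain text."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop tags whose contents are metadata or code rather than prose.
    # (Illustrative tag list; the real cleaner may target different tags.)
    for tag in soup(["head", "script", "style", "noscript"]):
        tag.decompose()
    # Concatenate the remaining text nodes into one string.
    return soup.get_text(separator=" ", strip=True)
```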
Sentence segmentation with NLTK's sentence tokenizer: the current ProVe code uses a pre-trained model from the NLTK package to split the cleaned HTML text into separate sentences.
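The NLTK call in question looks roughly like this (a sketch of the library API, not ProVe's exact call site). `sent_tokenize` relies on NLTK's pre-trained Punkt model:

```python
import nltk
from nltk.tokenize import sent_tokenize

# Punkt is the pre-trained sentence model sent_tokenize loads under the hood.
# (Newer NLTK versions may require the "punkt_tab" resource instead.)
nltk.download("punkt", quiet=True)

text = "Dr. Smith checks references. ProVe segments pages into sentences."
print(sent_tokenize(text))
# ['Dr. Smith checks references.', 'ProVe segments pages into sentences.']
```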
Looking at the bad examples, most issues are caused by structured data in the HTML documents: the current ProVe code cannot properly handle structures like tables or bullet lists. I suspect there are Python libraries that can handle structured data, which I will explore later.
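As a starting point, one hypothetical approach is to pull tables and list items out as standalone text units before running the sentence tokenizer on the remaining free text. A sketch assuming BeautifulSoup, not something ProVe currently does:

```python
from bs4 import BeautifulSoup

def extract_structured_units(html: str) -> list[str]:
    """Extract table rows and list items as standalone text units,
    removing them from the soup so the sentence tokenizer only
    sees free-running prose afterwards."""
    soup = BeautifulSoup(html, "html.parser")
    units = []
    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            cells = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
            if cells:
                units.append(" | ".join(cells))
        table.decompose()  # keep the table out of the free-text pass
    for item in soup.find_all("li"):
        units.append(item.get_text(strip=True))
        item.decompose()
    return units
```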
This is especially noticeable with HTML documents that don't seem to be stripped correctly.
@dignityc Find an example for this and add it here.