Coref model skipping long documents during training #1465
Aye, that seems like a reasonable request. Let me see if I can make it an option.
AngledLuffa added a commit that referenced this issue on Feb 28, 2025.
AngledLuffa added a commit that referenced this issue on Mar 1, 2025: "Preserve existing models with the new config attribute"
Someone had sent me a coref dataset for Tamil a little while back, so I dusted that off and added the flag you were looking for as part of that.
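The commit message above ("Preserve existing models with the new config attribute") suggests a common backward-compatibility pattern: when loading a checkpoint saved before the new option existed, default the missing attribute to the old behavior so existing models keep working. A minimal sketch of that pattern, with the attribute name `skip_long_documents` chosen for illustration (it is not confirmed to be Stanza's actual name):

```python
def load_config(saved_config: dict) -> dict:
    """Fill in defaults for config attributes added after a model was saved.

    Older checkpoints will not contain the new attribute, so we default it
    to True (the old behavior: skip over-length documents) to preserve
    their original training semantics.  The attribute name is hypothetical.
    """
    config = dict(saved_config)  # avoid mutating the caller's dict
    config.setdefault("skip_long_documents", True)
    return config
```

Newer checkpoints that explicitly store the attribute are left untouched, so opting out survives a save/load round trip.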
Original issue report:
I have been investigating the code for the coreference model to better understand its inner workings. One thing that caught my attention is that, during training, documents longer than 5,000 subtokens are skipped without any warning.
stanza/stanza/models/coref/model.py, lines 458 to 460 at commit af3d42b
I think this behavior should be optional, or the user should at least be notified with a message when a document is skipped, especially since some datasets (e.g. French-Democrat) have quite long documents (>10,000 tokens), which could lead to unexpectedly low performance.
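The requested change can be sketched as a small guard around the training loop: either skip an over-length document with a logged warning, or keep it when skipping is disabled. All names here (`iter_training_docs`, `skip_long_documents`, the document dict layout) are illustrative assumptions, not Stanza's actual API:

```python
import logging

logger = logging.getLogger(__name__)

# The 5,000-subtoken limit comes from the issue report; the flag and
# document structure are hypothetical.
MAX_TRAIN_SUBTOKENS = 5000

def iter_training_docs(docs, skip_long_documents=True,
                       max_len=MAX_TRAIN_SUBTOKENS):
    """Yield training documents, warning about over-length ones.

    With skip_long_documents=True (the old behavior), over-length
    documents are dropped, but now with a visible warning; with False,
    they are kept and a warning is still emitted.
    """
    for doc in docs:
        n_subtokens = len(doc["subwords"])
        if n_subtokens > max_len:
            doc_id = doc.get("document_id", "<unknown>")
            if skip_long_documents:
                logger.warning(
                    "Skipping document %s: %d subtokens exceeds limit %d",
                    doc_id, n_subtokens, max_len)
                continue
            logger.warning(
                "Document %s has %d subtokens (limit %d); training on it anyway",
                doc_id, n_subtokens, max_len)
        yield doc
```

Either way, the user is told when the limit is hit, which addresses the "unexpectedly low performance" concern: a dataset like French-Democrat would no longer lose its longest documents silently.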