
Coref model skipping long documents during training #1465

Open
501Good opened this issue Feb 27, 2025 · 2 comments

Comments


501Good commented Feb 27, 2025

I have been investigating the code for the coreference model to better understand its inner workings.

One thing that caught my attention is that during training, documents longer than 5,000 subtokens are skipped without any warning.

# skip very long documents during training time
if len(doc["subwords"]) > 5000:
continue

I think this behavior should be optional, or the user should at least be notified when a document is skipped, especially since some datasets (e.g. French-Democrat) have quite long documents (>10,000 tokens), and silently dropping them could lead to unexpectedly low performance.
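
For illustration, here is a minimal sketch of how the check could be made both configurable and visible. The parameter name max_train_doc_len and the helper function are assumptions for this sketch, not the project's actual implementation (see the linked PR for the real change):

import logging

logger = logging.getLogger(__name__)

def iter_training_docs(docs, max_train_doc_len=5000):
    """Yield training documents, optionally skipping very long ones.

    max_train_doc_len is a hypothetical option; pass None to disable
    the length check entirely.
    """
    for doc in docs:
        n_subwords = len(doc["subwords"])
        if max_train_doc_len is not None and n_subwords > max_train_doc_len:
            # Warn instead of silently dropping the document.
            logger.warning(
                "Skipping document with %d subwords (limit: %d)",
                n_subwords, max_train_doc_len,
            )
            continue
        yield doc

With something like this, passing max_train_doc_len=None would keep datasets such as French-Democrat intact during training.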

AngledLuffa (Collaborator) commented

Aye, that seems like a reasonable request. Let me see if I can make it an option.

AngledLuffa added a commit that referenced this issue Mar 1, 2025
Preserve existing models with the new config attribute
AngledLuffa (Collaborator) commented

Someone had sent me a coref dataset for Tamil a little while back, so I brushed that off and added the flag you were looking for as part of that.

#1467
