
Coref model skipping long documents during training #1465

Open
501Good opened this issue Feb 27, 2025 · 2 comments

Comments


501Good commented Feb 27, 2025

I have been investigating the code for the coreference model to better understand its inner workings.

One thing that caught my attention is that during training, documents longer than 5,000 subtokens are skipped without any warning.

# skip very long documents during training time
if len(doc["subwords"]) > 5000:
continue

I think this behavior should be optional, or the user should at least be notified when a document is skipped, especially since some datasets (e.g. French-Democrat) have quite long documents (>10,000 tokens), and silently dropping them could lead to unexpectedly low performance.
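
For illustration, here is a minimal sketch of how the check could be made both configurable and visible. The parameter name max_train_doc_len and the helper function are assumptions for this sketch, not the project's actual implementation (see the linked PR for the real change):

import logging

logger = logging.getLogger(__name__)

def iter_training_docs(docs, max_train_doc_len=5000):
    """Yield training documents, optionally skipping very long ones.

    max_train_doc_len is a hypothetical option; pass None to disable
    the length check entirely.
    """
    for doc in docs:
        n_subwords = len(doc["subwords"])
        if max_train_doc_len is not None and n_subwords > max_train_doc_len:
            # Warn instead of silently dropping the document.
            logger.warning(
                "Skipping document with %d subwords (limit: %d)",
                n_subwords, max_train_doc_len,
            )
            continue
        yield doc

With something like this, passing max_train_doc_len=None would keep datasets such as French-Democrat intact during training.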

AngledLuffa (Collaborator) commented

Aye, that seems like a reasonable request. Let me see if I can make it an option.

AngledLuffa added a commit that referenced this issue Mar 1, 2025
Preserve existing models with the new config attribute
AngledLuffa (Collaborator) commented

Someone had sent me a coref dataset for Tamil a little while back, so I brushed that off and added the flag you were looking for as part of that.

#1467
