When I import a single CoNLL-U Document via CoNLL.conll2doc and then run a pipeline with tokenize_pretokenized=True, tokenize_no_ssplit=True on it, it gets processed without problems.
However, when I put several CoNLL-U Documents imported via CoNLL.conll2doc into a list and run bulk_process on that list, I get the error
File "/path/to/…/stanza/pipeline/tokenize_processor.py", line 71, in process_pre_tokenized_text
for sentence in sentences:
UnboundLocalError: local variable 'sentences' referenced before assignment
With Documents created from raw text, as described on https://stanfordnlp.github.io/stanza/getting_started.html#processing-multiple-documents, bulk_process works fine.
Any ideas? Am I using CoNLL files the wrong way? Thanks a lot in advance; any help is much appreciated.
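For reference, a sketch of the raw-text pattern from that page, which does work (the Document([], text=...) wrapping follows the linked documentation; the example texts here are made up):
import stanza
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse')
# Raw-text documents, wrapped as empty Documents per the getting-started page
texts = ["This is a test sentence.", "Here is another one."]
in_docs = [stanza.Document([], text=t) for t in texts]
# bulk_process over raw-text Documents behaves as expected
out_docs = nlp.bulk_process(in_docs)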
To Reproduce
import stanza
from stanza.utils.conll import CoNLL
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse', tokenize_pretokenized=True, tokenize_no_ssplit=True)
conll_str = """
# text = This is a test sentence.
# sent_id = 0
1 This this PRON DT Number=Sing|PronType=Dem 5 nsubj _ start_char=0|end_char=4
2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 5 cop _ start_char=5|end_char=7
3 a a DET DT Definite=Ind|PronType=Art 5 det _ start_char=8|end_char=9
4 test test NOUN NN Number=Sing 5 compound _ start_char=10|end_char=14
5 sentence sentence NOUN NN Number=Sing 0 root _ start_char=15|end_char=23|SpaceAfter=No
6 . . PUNCT . _ 5 punct _ start_char=23|end_char=24|SpaceAfter=No
"""
with open("doc.conllu", "w") as o:
o.write(conll_str)
conll = CoNLL.conll2doc("doc.conllu")
# works fine:
out = nlp(conll)
# throws the error:
conlls = [conll, conll]
out = nlp.bulk_process(conlls)
Expected behavior
nlp.bulk_process(conlls) should return a list of Documents that have been processed by nlp.
Environment:
OS: macOS
Python version: 3.10.14
Stanza version: 1.10.0
Ah, I figured it out. When you create the document via CoNLL.conll2doc, it builds sentences and words from the conll, but doesn't stitch the entire document text together into a text field. Interestingly (some would say wrongly), the pretokenized path in bulk_process tries to whitespace-tokenize the document a second time, but fails because there's no full document text available. The single-document version doesn't run into this problem because it sees that it was passed a Document and assumes it already has sentences, words, etc.
Should be fixed in the multidoc_tokenize branch. If that's no longer there by the time you get this message, it's because I merged it after the unit tests ran. I'll try to make a new version soon; there were a few other small bugfixes recently as well.
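Until a new release is out, a minimal workaround sketch (it relies only on the single-document path working, as described above): run the pipeline over each CoNLL-derived Document in a loop instead of calling bulk_process.
import stanza
from stanza.utils.conll import CoNLL
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse',
                      tokenize_pretokenized=True, tokenize_no_ssplit=True)
# Load each CoNLL-U file into its own Document
conlls = [CoNLL.conll2doc("doc.conllu"), CoNLL.conll2doc("doc.conllu")]
# The single-document path sees a Document that already has sentences and
# words, so processing one at a time avoids the failing re-tokenization
outs = [nlp(doc) for doc in conlls]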