
bulk_process of CoNLL-U Documents throws error in process_pre_tokenized_text() #1464

Open
rohlik-hu opened this issue Feb 26, 2025 · 4 comments
@rohlik-hu

When I import a single CoNLL-U document via CoNLL.conll2doc and then run a pipeline with tokenize_pretokenized=True and tokenize_no_ssplit=True on it, it is processed without problems.

However, when I put several CoNLL-U documents imported via CoNLL.conll2doc into a list and run bulk_process on that list, I get the following error:

File "/path/to/…/stanza/pipeline/tokenize_processor.py", line 71, in process_pre_tokenized_text
for sentence in sentences:
UnboundLocalError: local variable 'sentences' referenced before assignment

With Documents created from raw text, as described at https://stanfordnlp.github.io/stanza/getting_started.html#processing-multiple-documents, bulk_process works fine.
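
For reference, here is roughly what the working raw-text path looks like for me, following the linked docs (the example texts are just placeholders):

import stanza

nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse')

# Wrap each raw text in an empty Document; bulk_process tokenizes and annotates them itself.
texts = ["This is a test sentence.", "This is another document."]
docs = [stanza.Document([], text=t) for t in texts]
out = nlp.bulk_process(docs)  # works as expected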

Any ideas? Am I using CoNLL files the wrong way? Thanks a lot in advance; any help is much appreciated.

To Reproduce

import stanza
from stanza.utils.conll import CoNLL
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse', tokenize_pretokenized=True, tokenize_no_ssplit=True)

conll_str = """
# text = This is a test sentence.
# sent_id = 0
1	This	this	PRON	DT	Number=Sing|PronType=Dem	5	nsubj	_	start_char=0|end_char=4
2	is	be	AUX	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	5	cop	_	start_char=5|end_char=7
3	a	a	DET	DT	Definite=Ind|PronType=Art	5	det	_	start_char=8|end_char=9
4	test	test	NOUN	NN	Number=Sing	5	compound	_	start_char=10|end_char=14
5	sentence	sentence	NOUN	NN	Number=Sing	0	root	_	start_char=15|end_char=23|SpaceAfter=No
6	.	.	PUNCT	.	_	5	punct	_	start_char=23|end_char=24|SpaceAfter=No
"""
with open("doc.conllu", "w") as o:
	o.write(conll_str)
conll = CoNLL.conll2doc("doc.conllu")

# works fine:
out = nlp(conll)

# throws the error:
conlls = [conll, conll]
out = nlp.bulk_process(conlls)

Expected behavior
nlp.bulk_process(conlls) should return a list of Documents that have been run through nlp.

Environment:

  • OS: macOS
  • Python version: 3.10.14
  • Stanza version: 1.10.0
@rohlik-hu rohlik-hu added the bug label Feb 26, 2025
@AngledLuffa
Collaborator

AngledLuffa commented Feb 27, 2025 via email

@AngledLuffa
Collaborator

Ah, I figured it out. When you create the document via CoNLL.conll2doc, it builds sentences and words from the CoNLL, but it doesn't stitch the entire document text together into a text field. Interestingly (some would say wrongly), the pretokenized path in bulk_process tries to whitespace-tokenize the document a second time, and it fails because there is no full document text available. The single-document version doesn't run into this problem because it sees that it was passed a Document and assumes it already has sentences and words.
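
Until a release with the fix is available, one possible workaround (just a sketch, assuming Document.text is settable in your Stanza version and that joining the words with spaces is acceptable) is to fill in the missing document text yourself before calling bulk_process:

# Workaround sketch (hypothetical): rebuild the document-level text that
# CoNLL.conll2doc does not populate, so the pretokenized path has something
# to whitespace-tokenize.
for doc in conlls:
    if not doc.text:
        doc.text = "\n\n".join(
            " ".join(word.text for word in sentence.words)
            for sentence in doc.sentences
        )

out = nlp.bulk_process(conlls)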

@AngledLuffa
Collaborator

Should be fixed in the multidoc_tokenize branch. If that branch is no longer there by the time you read this, it's because I merged it after the unit tests ran. I'll try to make a new release soon; there were a few other small bugfixes recently as well.

@rohlik-hu
Author

So cool! Thanks a lot!
