EMNLP2021-Syntactically-Informed Unsupervised Paraphrasing with Non-Parallel Data

lanse-sir/SUP

Syntactically-Informed Unsupervised Paraphrasing with Non-Parallel Data

Source code for the paper "Syntactically-Informed Unsupervised Paraphrasing with Non-Parallel Data".

Datasets

Quora: download from https://drive.google.com/file/d/1RdIQEoWJbm4HtNYaxFHjleBgX5FIZZtp/view?usp=sharing.

ParaNMT: download the dataset released with the paper "ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations".

Requirements

python >= 3.6
torch == 1.6.0
nltk == 3.4.5
zss == 1.2.0
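
The pins above can be checked before training. The following is a minimal sketch, not part of the repository: the helper names find_mismatches and installed_versions are hypothetical, and the lookup assumes Python 3.8 or newer, because importlib.metadata is not available on 3.6 or 3.7.

```python
import sys

# Pins copied from the Requirements list above.
PINNED = {"torch": "1.6.0", "nltk": "3.4.5", "zss": "1.2.0"}
MIN_PYTHON = (3, 6)


def find_mismatches(installed, pinned=PINNED):
    """Return {package: (wanted, found)} for every pin that is not met.

    `installed` maps package name to version string; a missing package
    is reported with found=None.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        found = installed.get(name)
        # Ignore local build tags such as "1.6.0+cu101".
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


def installed_versions(names):
    """Look up installed versions (assumption: Python 3.8+)."""
    from importlib import metadata

    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # left out of the result, so it is reported as missing
    return versions


if __name__ == "__main__":
    if sys.version_info < MIN_PYTHON:
        print("Python 3.6 or newer is required")
    report = find_mismatches(installed_versions(PINNED))
    for name, (wanted, found) in report.items():
        print(f"{name}: want {wanted}, found {found}")
```

An empty report means every pinned package is installed at the listed version.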

Data Processing

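We use the Stanford Parser, run through Stanford CoreNLP, to obtain the parse tree and template of each sentence. Each line of the input is treated as one sentence, because -ssplit.eolonly splits sentences only at line breaks.

Run one of the following commands from the CoreNLP directory, so that -cp "*" picks up its jar files:

For a single input file:
java -Xmx12g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -threads 1 -annotators tokenize,ssplit,pos,parse -ssplit.eolonly -file file.txt -outputFormat text -outputDirectory /outputdir/
For a list of input files (filenames.txt contains one path per line):
java -Xmx12g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -threads 1 -annotators tokenize,ssplit,pos,parse -ssplit.eolonly -filelist filenames.txt -outputFormat text -outputDirectory /outputdir/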
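
When parsing is driven from a script, the same two commands can be assembled in Python. This is only a sketch under stated assumptions: corenlp_command and run_corenlp are hypothetical helpers, not part of this repository, and corenlp_dir stands for wherever the CoreNLP jar files are unpacked. The flags are copied unchanged from the commands above.

```python
import subprocess


def corenlp_command(source, output_dir, use_filelist=False, memory="12g"):
    """Build the CoreNLP argument list for one file or for a file list."""
    input_flag = "-filelist" if use_filelist else "-file"
    return [
        "java", f"-Xmx{memory}",
        # Java expands the "*" classpath wildcard itself, so no shell is needed.
        "-cp", "*",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-threads", "1",
        "-annotators", "tokenize,ssplit,pos,parse",
        "-ssplit.eolonly",
        input_flag, source,
        "-outputFormat", "text",
        "-outputDirectory", output_dir,
    ]


def run_corenlp(source, output_dir, corenlp_dir, **options):
    """Run the parser from the directory that holds the CoreNLP jars."""
    command = corenlp_command(source, output_dir, **options)
    # check=True raises CalledProcessError if java exits with an error.
    subprocess.run(command, cwd=corenlp_dir, check=True)
```

Because the process runs with corenlp_dir as its working directory, source and output_dir are best given as absolute paths.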

If the data is large, you can use the split command to divide the file into several smaller files and parse them separately. Afterwards, use pos_to_file.py and template.py in the autocg directory to extract the parse trees and templates.
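
The splitting step can also be done in Python, which makes it easy to write the file list for the -filelist option at the same time. The sketch below is an illustration, not a script shipped with this repository: split_for_parsing, the chunk_XXXXX.txt naming, and the default chunk size are all assumptions.

```python
import os


def split_for_parsing(source, out_dir, lines_per_chunk=100000,
                      filelist_name="filenames.txt"):
    """Split `source` into chunks and write a file list for -filelist.

    Returns (filelist_path, chunk_paths). Lines are copied unchanged, so
    the one-sentence-per-line layout is preserved in every chunk.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    chunk = None
    with open(source, encoding="utf-8") as src:
        for index, line in enumerate(src):
            if index % lines_per_chunk == 0:
                # Start a new chunk every `lines_per_chunk` lines.
                if chunk is not None:
                    chunk.close()
                name = f"chunk_{len(chunk_paths):05d}.txt"
                path = os.path.abspath(os.path.join(out_dir, name))
                chunk_paths.append(path)
                chunk = open(path, "w", encoding="utf-8")
            chunk.write(line)
    if chunk is not None:
        chunk.close()

    # One absolute chunk path per line.
    filelist_path = os.path.join(out_dir, filelist_name)
    with open(filelist_path, "w", encoding="utf-8") as listing:
        for path in chunk_paths:
            listing.write(path + "\n")
    return filelist_path, chunk_paths
```

The returned file list can then be passed to the parser in place of filenames.txt.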
