USING SUPERVISED MACHINE LEARNING (cross-validation framework) TO PREDICT PREPRINTS' PUBLICATION STATUS
- As part of my dissertation, I developed an innovative record-linkage program to match preprint records to their final published, peer-reviewed counterparts. The extensive diagnostic tests I performed reveal an F1-score of 0.86, the highest F1-score reported to date for an algorithm matching preprints with their peer-reviewed counterparts. That said, the main approach I developed in my PhD dissertation relies on rule-based methods rather than machine learning. Using rule-based methods such as fuzzy matching is not a problem in itself, but they are extremely cumbersome to maintain and optimize, since the sensitivity analysis needs to be performed manually. Machine-learning approaches, on the other hand, simplify both code optimization and performance analysis. Therefore, in addition to the main method from my PhD, I am currently building a logistic regression as a supervised machine-learning model within a cross-validation framework.
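The rule-based matcher itself is not reproduced here; as a minimal sketch of the kind of fuzzy matching such an approach relies on (the function name and the 0.9 threshold are illustrative, not the dissertation's actual values), using only the Python standard library:

```python
from difflib import SequenceMatcher

def title_similarity(preprint_title: str, article_title: str) -> float:
    """Return a 0-1 similarity ratio between two normalized titles."""
    a = preprint_title.lower().strip()
    b = article_title.lower().strip()
    return SequenceMatcher(None, a, b).ratio()

# A rule-based decision: accept the pair if the titles are similar enough.
# The cutoff is exactly the kind of parameter that must be tuned manually,
# which is what makes this approach cumbersome to maintain.
if title_similarity("COVID-19 vaccine efficacy", "Covid-19 Vaccine Efficacy.") >= 0.9:
    print("candidate match")
```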
- Create thousands of independent variables to predict COVID-19 preprints' publication status (see ML_string_feature_engineering_abstract_temp.py); a minimal sketch of the encoding step follows this list:
    - keyword features (one-hot encoding)
- gender
- institutional affiliations
- country
- university ranking
- etc.
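ML_string_feature_engineering_abstract_temp.py is not reproduced here; a minimal sketch of the keyword one-hot-encoding step, assuming a pandas DataFrame with an abstract column (all column names, keywords, and example rows below are hypothetical):

```python
import pandas as pd

# Hypothetical input: one row per preprint, with its abstract and outcome label.
df = pd.DataFrame({
    "abstract": [
        "A randomized trial of remdesivir in hospitalized patients...",
        "We model SARS-CoV-2 transmission dynamics...",
    ],
    "published": [1, 0],  # 1 = later appeared in a peer-reviewed journal
})

# One binary (one-hot) indicator column per keyword found in the abstract.
keywords = ["trial", "vaccine", "transmission", "mortality"]
for kw in keywords:
    df[f"kw_{kw}"] = df["abstract"].str.contains(kw, case=False).astype(int)

print(df.filter(like="kw_"))
```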
Optimization of Logistic Regression's Hyperparameters (see cross__validation__v3.py)
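cross__validation__v3.py is not reproduced here; a minimal sketch of what hyperparameter optimization for a logistic regression can look like with scikit-learn's GridSearchCV (the toy X and y stand in for the real feature matrix and publication labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy stand-ins for the engineered features and publication labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)

# Search over the inverse regularization strength C with 5-fold CV,
# scoring on F1 to mirror the matching evaluation above.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```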
Machine Learning (see CORD_PP_ML_v2.py)
- Running logistic regression within a cross-validation framework
- Saving the fitted model so it can be applied to the out-of-sample dataset (a sketch of both steps follows)
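CORD_PP_ML_v2.py is not reproduced here; a minimal sketch covering both steps, with toy stand-ins for the engineered features and a hypothetical file name for the persisted model:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-ins for the engineered features and publication labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)

# Step 1: evaluate the logistic regression with 5-fold cross-validation.
model = LogisticRegression(C=1.0, max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print("mean CV F1:", scores.mean())

# Step 2: refit on the full training data and persist the model so it can
# later be applied to the out-of-sample dataset.
model.fit(X, y)
joblib.dump(model, "preprint_publication_model.joblib")
# later: model = joblib.load("preprint_publication_model.joblib")
```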