logs_and_checkpoints/FeaturePredictionExperiment.good/base/checkpoints/last.ckpt should be a relative link #228
Comments
This appears to have been done by Lightning in Lightning-AI/pytorch-lightning#19303; not sure when it will be released.
With the latest version, I'm noticing a few changes in the files being produced in the directory below. In this example I did 3 training runs (finetune_checkpoint): 10 epochs + 10 epochs + 10 epochs.
*** SIDE NOTE: One good thing about how it works now is that resuming is automatic. For example, if a job gets preempted multiple times in a row, it will auto-resume successfully on the cluster until the job reaches the end. The same holds on Trixie, where the maximum job run time might mean multiple runs are needed to reach the end :-)
We could change this file to:

    monitored_ckpt_callback = ModelCheckpoint(
        monitor=monitor,
        mode="min",
        save_top_k=config.training.save_top_k_ckpts,
        every_n_train_steps=config.training.ckpt_steps,
        every_n_epochs=config.training.ckpt_epochs,
        enable_version_counter=True,
    )
If we run three times to 29 epochs:
We discussed as a group and we need to confirm the correct behaviour for checkpointing.
The symlink --> logs_and_checkpoints/FeaturePredictionExperiment.good/base/checkpoints/last.ckpt is created as an absolute link.
It should be created as a relative link, like this:
last.ckpt -> 'epoch=497-step=8466.ckpt'
Instead, what we currently get in my example includes the full path:
last.ckpt -> '/gpfs/fs5/nrc/nrc-fs1/ict/others/u/tes001/TxT2SPEECH/MOH/MULTI-SPEAKER/logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=497-step=8466.ckpt'
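Until the upstream change is available, one possible workaround is to rewrite the link after training. The following is only a minimal sketch using the Python standard library, not something Lightning provides: the helper name `make_symlink_relative` is hypothetical, and it assumes a POSIX filesystem where `os.replace` can atomically swap one symlink for another.

```python
import os
from pathlib import Path


def make_symlink_relative(link: Path) -> str:
    """Repoint `link` at its current target using a relative path.

    Hypothetical helper for illustration; returns the new link text.
    """
    link = Path(link)
    if not link.is_symlink():
        raise ValueError(f"{link} is not a symlink")

    target = os.readlink(link)
    # If the stored target is already relative, interpret it relative to
    # the link's directory; if it is absolute, join() returns it unchanged.
    abs_target = os.path.normpath(os.path.join(link.parent, target))
    rel_target = os.path.relpath(abs_target, start=link.parent)

    # Build the new link next to the old one, then swap it into place so
    # that last.ckpt never disappears in between.
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(rel_target, tmp)
    os.replace(tmp, link)
    return rel_target
```

Calling `make_symlink_relative(Path(".../checkpoints/last.ckpt"))` on a link whose target sits in the same directory should leave the link pointing at just the checkpoint file name, so the `logs_and_checkpoints` tree can be moved or copied without breaking it.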