
logs_and_checkpoints/FeaturePredictionExperiment.good/base/checkpoints/last.ckpt should be a relative link #228

Open
marctessier opened this issue Jan 23, 2024 · 4 comments

@marctessier
Collaborator

The symlink --> logs_and_checkpoints/FeaturePredictionExperiment.good/base/checkpoints/last.ckpt is created as an absolute link.

It should be created as a relative link, like this:

last.ckpt -> 'epoch=497-step=8466.ckpt'

versus what we currently get in my example, which includes the full path:
last.ckpt -> '/gpfs/fs5/nrc/nrc-fs1/ict/others/u/tes001/TxT2SPEECH/MOH/MULTI-SPEAKER/logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=497-step=8466.ckpt'
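
For reference, a minimal sketch (hypothetical helper and paths, not EveryVoice code) of how the link could be written relative to its own directory, so it stays valid if the tree is moved or re-mounted:

    from pathlib import Path

    def relink_last(checkpoint_dir: Path, target_name: str) -> None:
        # Point last.ckpt at a sibling checkpoint using only the file name,
        # so the symlink stays valid if the whole directory is relocated.
        last = checkpoint_dir / "last.ckpt"
        if last.is_symlink() or last.exists():
            last.unlink()
        last.symlink_to(target_name)  # relative target: no leading directories

    relink_last(
        Path("logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints"),
        "epoch=497-step=8466.ckpt",
    )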

@roedoejet
Member

This appears to have been done by Lightning: Lightning-AI/pytorch-lightning#19303 - not sure when it will be released.

@marctessier
Collaborator Author

I'm noticing a few changes now when using the latest version, with the files being produced in the directory below.

In this example I did 3 training runs (finetune_checkpoint): 10 epochs + 10 epochs + 10 epochs.

  1. last.ckpt is now a file and not a symlink to the "real" last "numbered" checkpoint.
    Also, notice below that I had to run md5sum to confirm which of the two (epoch=29-step=420.ckpt or epoch=29-step=420-v1.ckpt) was the "real" last one. Strangely, they have different md5sums but the same file size (which is the reason we want a symlink). See the sketch after the md5sum output below for an alternative check.

  2. v1.ckpt does not increment after finishing a successful run (only the first one does). For example, I would presume that after doing 3 full (not partial) training runs it would be at v3.

  3. I think the system should be saving every "v*.ckpt" checkpoint. The config says to keep the best 5 by default, but it should also keep, and not delete, any "v*.ckpt" produced by a finetune_checkpoint run. (This is kind of related to 2) and might not be an issue once 2) is resolved.)

logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/

(EveryVoice) [U20-GPSC7]:$ ls -lstra
total 1492738
     1 drwxr-x--- 6 tes001 nrc_ict      4096 Feb 16 15:21  ..
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=25-step=364.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=26-step=378.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=27-step=392.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=28-step=406.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=29-step=420.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=29-step=420-v1.ckpt'
     1 drwxr-x--- 2 tes001 nrc_ict      4096 Feb 16 15:22  .
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22  last.ckpt
(EveryVoice) [U20-GPSC7]:$ md5sum last.ckpt 'epoch=29-step=420.ckpt' 'epoch=29-step=420-v1.ckpt'
4bb60cdfb5fa31f55ef6ad8f300478bf  last.ckpt
4106127b6e9e9f51f70da68a61e69819  epoch=29-step=420.ckpt
4bb60cdfb5fa31f55ef6ad8f300478bf  epoch=29-step=420-v1.ckpt
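
As an alternative to md5sum, and assuming these are standard Lightning checkpoints carrying epoch and global_step bookkeeping keys, the true "latest" file can be confirmed by reading those counters directly (a rough sketch):

    import torch

    for name in ("last.ckpt", "epoch=29-step=420.ckpt", "epoch=29-step=420-v1.ckpt"):
        # Lightning checkpoints are ordinary torch files that include training metadata.
        ckpt = torch.load(name, map_location="cpu")
        print(name, "epoch:", ckpt.get("epoch"), "global_step:", ckpt.get("global_step"))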

*** SIDE NOTE: One good thing about how it works now is that if a job gets preempted multiple times in a row, it will auto-resume successfully on the cluster until the job reaches the end; the same goes for Trixie, where the maximum job run time means it may need multiple runs to reach the end :-)

@roedoejet
Member

We could change this file to:

monitored_ckpt_callback = ModelCheckpoint(
    monitor=monitor,
    mode="min",
    save_top_k=config.training.save_top_k_ckpts,
    every_n_train_steps=config.training.ckpt_steps,
    every_n_epochs=config.training.ckpt_epochs,
    enable_version_counter=True,
)
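
And if our pinned Lightning release already includes the symlink change referenced above, it may also be possible to request last.ckpt as a link on the same callback (a hedged sketch, not verified against EveryVoice's pinned version; the import path may be lightning.pytorch instead of pytorch_lightning):

    from pytorch_lightning.callbacks import ModelCheckpoint

    monitored_ckpt_callback = ModelCheckpoint(
        monitor=monitor,
        mode="min",
        save_top_k=config.training.save_top_k_ckpts,
        every_n_train_steps=config.training.ckpt_steps,
        every_n_epochs=config.training.ckpt_epochs,
        enable_version_counter=True,
        save_last="link",  # assumption: newer Lightning can write last.ckpt as a symlink to the newest checkpoint
    )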

@roedoejet
Member

roedoejet commented Feb 19, 2024

If we run three times to 29 epochs:

213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=25-step=364.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=26-step=378.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=27-step=392.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=28-step=406.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=29-step=420.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22  last.ckpt
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22  last-v1.ckpt
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22  last-v2.ckpt

We discussed this as a group, and we need to confirm the correct behaviour for checkpointing.

@roedoejet roedoejet added this to the beta milestone Feb 19, 2024
@roedoejet roedoejet added bug Something isn't working enhancement New feature or request labels Feb 19, 2024