
logs_and_checkpoints/FeaturePredictionExperiment.good/base/checkpoints/last.ckpt should be a relative link #228

Open
marctessier opened this issue Jan 23, 2024 · 4 comments

@marctessier
Collaborator

The symlink --> logs_and_checkpoints/FeaturePredictionExperiment.good/base/checkpoints/last.ckpt is created as an absolute link.

It should be created as a relative link, like this:

last.ckpt -> 'epoch=497-step=8466.ckpt'

versus what we currently get in my example, which includes the full path:
last.ckpt -> '/gpfs/fs5/nrc/nrc-fs1/ict/others/u/tes001/TxT2SPEECH/MOH/MULTI-SPEAKER/logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/epoch=497-step=8466.ckpt'
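
For reference, a minimal sketch (hypothetical helper and paths, not EveryVoice code) of how the link could be written relative to its own directory, so it stays valid if the tree is moved or re-mounted:

    from pathlib import Path

    def relink_last(checkpoint_dir: Path, target_name: str) -> None:
        # Point last.ckpt at a sibling checkpoint using only the file name,
        # so the symlink stays valid if the whole directory is relocated.
        last = checkpoint_dir / "last.ckpt"
        if last.is_symlink() or last.exists():
            last.unlink()
        last.symlink_to(target_name)  # relative target: no leading directories

    relink_last(
        Path("logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints"),
        "epoch=497-step=8466.ckpt",
    )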

@roedoejet
Member

This appears to have been done by Lightning: Lightning-AI/pytorch-lightning#19303 - not sure when it will be released.

@marctessier
Collaborator Author

I'm noticing a few changes now when using the latest version, with the files being produced in the directory below.

In this example I did 3 training runs (finetune_checkpoint): 10 epochs + 10 epochs + 10 epochs.

  1. last.ckpt is now a file and not a symlink to the "real" last "numbered" checkpoint.
    Also, notice below that I had to run md5sum to confirm which of the two (epoch=29-step=420.ckpt or epoch=29-step=420-v1.ckpt) was the "real" last one. Strangely, they have different md5sums but the same file size (which is the reason we want a symlink). See the sketch after the md5sum output below for an alternative check.

  2. v1.ckpt does not increment after finishing a successful run (only the first one does). For example, I would presume that after doing 3 full (not partial) training runs it would be at v3.

  3. I think the system should be saving every "v*.ckpt" checkpoint. The config says to keep the best 5 by default, but it should also keep, and not delete, any "v*.ckpt" produced by a finetune_checkpoint run. (This is kind of related to 2) and might not be an issue once 2) is resolved.)

logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/

(EveryVoice) [U20-GPSC7]:$ ls -lstra
total 1492738
     1 drwxr-x--- 6 tes001 nrc_ict      4096 Feb 16 15:21  ..
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=25-step=364.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=26-step=378.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=27-step=392.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=28-step=406.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=29-step=420.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=29-step=420-v1.ckpt'
     1 drwxr-x--- 2 tes001 nrc_ict      4096 Feb 16 15:22  .
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22  last.ckpt
(EveryVoice) [U20-GPSC7]:$ md5sum last.ckpt 'epoch=29-step=420.ckpt' 'epoch=29-step=420-v1.ckpt'
4bb60cdfb5fa31f55ef6ad8f300478bf  last.ckpt
4106127b6e9e9f51f70da68a61e69819  epoch=29-step=420.ckpt
4bb60cdfb5fa31f55ef6ad8f300478bf  epoch=29-step=420-v1.ckpt
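
As an alternative to md5sum, and assuming these are standard Lightning checkpoints carrying epoch and global_step bookkeeping keys, the true "latest" file can be confirmed by reading those counters directly (a rough sketch):

    import torch

    for name in ("last.ckpt", "epoch=29-step=420.ckpt", "epoch=29-step=420-v1.ckpt"):
        # Lightning checkpoints are ordinary torch files that include training metadata.
        ckpt = torch.load(name, map_location="cpu")
        print(name, "epoch:", ckpt.get("epoch"), "global_step:", ckpt.get("global_step"))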

*** SIDE NOTE: One good thing about how it works now is that if a job gets preempted multiple times in a row, it will auto-resume successfully on the cluster until the job reaches the end; the same goes for Trixie, where the maximum job run time means it may need multiple runs to reach the end :-)

@roedoejet
Member

We could change this file to:

monitored_ckpt_callback = ModelCheckpoint(
    monitor=monitor,
    mode="min",
    save_top_k=config.training.save_top_k_ckpts,
    every_n_train_steps=config.training.ckpt_steps,
    every_n_epochs=config.training.ckpt_epochs,
    enable_version_counter=True,
)
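
And if our pinned Lightning release already includes the symlink change referenced above, it may also be possible to request last.ckpt as a link on the same callback (a hedged sketch, not verified against EveryVoice's pinned version; the import path may be lightning.pytorch instead of pytorch_lightning):

    from pytorch_lightning.callbacks import ModelCheckpoint

    monitored_ckpt_callback = ModelCheckpoint(
        monitor=monitor,
        mode="min",
        save_top_k=config.training.save_top_k_ckpts,
        every_n_train_steps=config.training.ckpt_steps,
        every_n_epochs=config.training.ckpt_epochs,
        enable_version_counter=True,
        save_last="link",  # assumption: newer Lightning can write last.ckpt as a symlink to the newest checkpoint
    )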

@roedoejet
Member

roedoejet commented Feb 19, 2024

If we run three times to 29 epochs:

213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=25-step=364.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=26-step=378.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=27-step=392.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=28-step=406.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22 'epoch=29-step=420.ckpt'
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22  last.ckpt
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22  last-v1.ckpt
213248 -rw-r----- 1 tes001 nrc_ict 218357390 Feb 16 15:22  last-v2.ckpt

We discussed this as a group, and we need to confirm the correct behaviour for checkpointing.

@roedoejet roedoejet added this to the beta milestone Feb 19, 2024
@roedoejet roedoejet added bug Something isn't working enhancement New feature or request labels Feb 19, 2024