
training checkpoint - 5500 (1 hour on 3090) #1

Open
johndpope opened this issue Jun 27, 2024 · 3 comments

@johndpope (Owner) commented on Jun 27, 2024


Screenshot from 2024-06-28 00-43-05

I had to rework the generator to use fewer layers, and to resize images to 64×64.

Screenshot from 2024-06-28 00-45-17
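Roughly what the 64×64 resizing looks like on the data side (a minimal sketch with torchvision transforms; the rest of the pipeline here is an assumption, not the repo's exact code):

```python
from torchvision import transforms

# Train at low resolution first so iterations are cheap on a single 3090.
transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # assumed normalization
])
```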

@johndpope (Owner, Author) commented on Jun 27, 2024

@fenghe12 / @JaLnYn / @ChenyangWang95
This might actually work.

In MegaPortraits I use a custom ResNet-50; it's probably safer to switch that in here, because otherwise the model may just discard the updates. I'll check in the morning.
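For reference, a minimal sketch of swapping in a torchvision ResNet-50 trunk as the feature extractor (the class name and weight choice are assumptions, not the MegaPortraits code itself):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class ResNet50Features(nn.Module):
    """Keep everything up to the last conv stage so we get a spatial feature map."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        self.body = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc

    def forward(self, x):
        # For a 224x224 (or larger) input this returns [B, 2048, H/32, W/32].
        return self.body(x)

print(ResNet50Features()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 2048, 7, 7])
```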

@francqz31 commented

@johndpope is it just me, or is Sonnet 3.5's machine learning code output actually way more readable than Opus's? Feels like actual working code this time!

@johndpope (Owner, Author) commented on Jun 27, 2024

Something may not be quite right. I'm training overnight; this is still epoch 0:

checkpoint-86500
recon_step_86500

Screenshot from 2024-06-28 07-49-49

I changed the code back to 512×512, resumed training, and got this:

recon_step_87000
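Resuming looks roughly like this (a minimal sketch; the checkpoint file name and dict keys are assumptions, not the repo's exact format):

```python
import torch

def resume(model, optimizer, path="checkpoint-86500.pt"):  # hypothetical file name
    ckpt = torch.load(path, map_location="cpu")
    # strict=False tolerates layers whose shapes changed when switching back to 512x512.
    model.load_state_dict(ckpt["model"], strict=False)
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt.get("step", 0)  # continue the step counter from the checkpoint
```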

I'm seeing newer, clearer images as epoch 1 advances, even after a few more cycles; I'll update here later. I think by epoch 4 it will probably be fairly decent.

I added some TensorBoard logging and surfaced the losses.

recon_step_126000.png
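The logging is along these lines (a minimal sketch; the loss names and log directory are assumptions):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/experiment")  # hypothetical log dir

def log_losses(step, losses):
    # losses: dict such as {"recon": ..., "perceptual": ..., "adv": ...}
    for name, value in losses.items():
        writer.add_scalar(f"loss/{name}", value, global_step=step)
```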

UPDATE: my bad, I was overfitting to one image. I just pushed an updated dataloader.
New debug image:
debug_step_164000

Starting training again. I was seeing OOM errors; check your num_of_workers.
debug_step_168000
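A minimal sketch of the DataLoader settings in question (dataset path, batch size, and worker count are hypothetical; too many prefetching workers can tip memory over the edge):

```python
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder

dataset = ImageFolder(
    "data/celeba",  # hypothetical path
    transform=transforms.Compose([transforms.Resize((512, 512)), transforms.ToTensor()]),
)
loader = DataLoader(
    dataset,
    batch_size=8,     # hypothetical
    shuffle=True,     # iterate over the whole dataset, not one cached image
    num_workers=2,    # lower this if you hit OOM or worker crashes
    pin_memory=True,
)
```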

UPDATE: I restarted training and changed the generator to use ResBlocks; maybe that will help it recreate the image better.

Screenshot from 2024-06-28 16-40-33

debug_step_4000
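The residual blocks are along these lines (a minimal sketch; channel counts and normalization choice are assumptions, not the exact generator design):

```python
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # The skip connection lets the block learn a residual correction,
        # which tends to make deeper generators easier to train.
        return self.act(x + self.block(x))
```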

UPDATE (Sunday)
I rebuilt the code to do progressive training with resolution upscaling: 64, 128, 256, 512.
Added TensorBoard losses.
Screenshot from 2024-06-30 05-43-40
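The progressive schedule is roughly this shape (a minimal sketch; the steps-per-stage value is made up for illustration):

```python
import torch.nn.functional as F

RESOLUTIONS = [64, 128, 256, 512]
STEPS_PER_STAGE = 50_000  # hypothetical

def current_resolution(global_step):
    stage = min(global_step // STEPS_PER_STAGE, len(RESOLUTIONS) - 1)
    return RESOLUTIONS[stage]

def resize_batch(images, global_step):
    # Downscale each batch to the resolution of the current training stage.
    size = current_resolution(global_step)
    return F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)
```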

I've given up on training across CelebA for now; I'm overfitting to one pair of images instead....

Training progress so far:
Screenshot from 2024-06-30 14-10-09

UPDATE (Sunday night)

I had some battles with gradient explosions. I ended up having to add gradient accumulation steps, which helped stabilize things; see #3.
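The accumulation plus clipping looks roughly like this (a minimal sketch; accum_steps and max_norm are illustrative values, not the settings from #3):

```python
import torch

def train_with_accumulation(model, optimizer, dataloader, loss_fn,
                            accum_steps=4, max_norm=1.0):
    optimizer.zero_grad()
    for step, (x, y) in enumerate(dataloader):
        loss = loss_fn(model(x), y)
        (loss / accum_steps).backward()  # scale so accumulated grads match one large batch
        if (step + 1) % accum_steps == 0:
            # Clip before stepping to tame the gradient explosions.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            optimizer.step()
            optimizer.zero_grad()
```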

Looks like the learning rate is getting things into a minimum....
debug_step_3500_resolution_64

UPDATE: I switched to 256×256 because ResNet-50 can't return rich 2048×7×7 features for images smaller than 224×224.

Screenshot from 2024-07-01 09-50-51
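A quick check of the claim above: a ResNet-50 trunk halves the spatial size five times, so the feature map is roughly input/32 per side (a minimal sketch):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

trunk = nn.Sequential(*list(resnet50().children())[:-2]).eval()
with torch.no_grad():
    for size in (64, 128, 224, 256):
        out = trunk(torch.randn(1, 3, size, size))
        print(size, tuple(out.shape))
# 64  -> (1, 2048, 2, 2)   too coarse to be useful
# 128 -> (1, 2048, 4, 4)
# 224 -> (1, 2048, 7, 7)   the "rich" 2048x7x7 map
# 256 -> (1, 2048, 8, 8)
```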
