How long does the training take? #11
Comments
On my 1080 Ti, one epoch takes approximately 2-2.5 hours, and you need 30-35 epochs to finish.
My training loss is larger than the loss you reported in michalfaber#45. It is still 740+ after 10 epochs. What might cause this?
To be honest, no idea. Which project are you training? The absolute value of the loss differs between michalfaber's project and mine due to different hdf5 content, but this is a large loss for both versions.
I use your latest project. The only setting that may differ is that I use 2 GPUs for training.
Did you scale the batch size for 2 GPUs? If so, you may need to scale the learning rate as well.
How should I scale the learning rate for multi-GPU training, in your experience?
According to this: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
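For reference, the rule in that paper is the linear scaling rule: when the effective batch size is multiplied by k, multiply the learning rate by k (with a warmup at the start). A minimal sketch of how that could be applied in a Keras training script; the base_lr and batch size values below are illustrative, not the ones used in this repository:

```python
from keras.optimizers import SGD

# Hypothetical single-GPU baseline (illustrative values, not this repo's defaults)
base_lr = 2e-5          # learning rate tuned for the single-GPU batch size
base_batch_size = 10    # batch size that learning rate was tuned for

# Training on 2 GPUs with the same per-GPU batch doubles the effective batch
# size, so the linear scaling rule doubles the learning rate.
n_gpus = 2
effective_batch_size = base_batch_size * n_gpus
scaled_lr = base_lr * effective_batch_size / base_batch_size

optimizer = SGD(lr=scaled_lr, momentum=0.9, nesterov=False)
```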
I did not. It seems to be the reason. |
By the way, did 2 GPUs train significantly faster in your case? In my setup the GPU load (watch nvidia-smi) just jumps from one card to the other, but my GPUs are different (1080 Ti and 1080) and have no NVLink, so I've just been training different models on them.
Yes, it is about 1.7x faster. But the utilization of the two GPUs is not always high at the same time.
@anatolix The data feeding rate is too low when using this code for multi-GPU training.
py_rpme_server_tester.py can test the speed of augmentation. It is approximately 140 images per second on my machine (although the hdf5 file should be on an SSD for that), which is 5 times faster than the C++ implementation and far more than we really need for training (about 10 images per second per GPU). I think the problem is the Keras multi-GPU implementation; it is really new and unfinished.
To be sure, I've just committed a speed test inside train_pose. On my server it reports enough throughput for approximately 6 cards. This is with parallel model training, i.e. it is the second augmentation process running on this server.
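For anyone who wants to reproduce a similar measurement, here is a rough sketch of how augmentation throughput can be timed; the generator name is a placeholder for whatever feeds your training loop, not the exact function in this repository:

```python
import time

def measure_throughput(batch_generator, n_batches=50):
    """Time how many augmented images per second a batch generator yields."""
    images_seen = 0
    start = time.time()
    for _ in range(n_batches):
        batch = next(batch_generator)
        # assume the first element of the yielded tuple holds the image batch
        images_seen += len(batch[0])
    elapsed = time.time() - start
    return images_seen / elapsed

# Usage sketch: train_client is a hypothetical generator yielding (inputs, targets)
# print("%.1f images/sec" % measure_throughput(train_client))
```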
I trained a small model according to the prototxt of the original, and it took about 3 days; however, the result is nearly the same as the original. It is only half the size of the paper's model and 2x faster.
@Ai-is-light What is your result on COCO?
@anatolix Yes, Keras does not support multi-GPU well; the bug with saving a multi-GPU model has not been solved for months. Hoping for a TensorFlow version.
Keras actually runs on top of TensorFlow, so you can use all TF code with Keras as well.
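On the saving bug: a common workaround (not specific to this repository) is to keep a reference to the original single-GPU model and save that one, since `keras.utils.multi_gpu_model` only wraps it for data-parallel training. A minimal sketch, assuming Keras ≥ 2.0.9:

```python
from keras.callbacks import Callback
from keras.utils import multi_gpu_model

class SaveTemplateModel(Callback):
    """Checkpoint callback that saves the underlying single-GPU model
    instead of the multi_gpu_model wrapper."""
    def __init__(self, template_model, path):
        super(SaveTemplateModel, self).__init__()
        self.template_model = template_model
        self.path = path

    def on_epoch_end(self, epoch, logs=None):
        # saving the template model avoids the wrapper-serialization bug
        self.template_model.save_weights(self.path.format(epoch=epoch))

# Usage sketch (model construction and compilation omitted):
# parallel_model = multi_gpu_model(model, gpus=2)
# parallel_model.compile(...)
# parallel_model.fit(..., callbacks=[SaveTemplateModel(model, "weights.{epoch:04d}.h5")])
```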
I find that if the PAF branch doesn't work well or doesn't work at all, the mAP and AR cannot be computed from the output of the network. Have you encountered this, @anatolix?
@tranorrepository @anatolix @Ai-is-light |
This may be naive, but what parameter controls the learning rate, and where can I change it?
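Not from the author, but in Keras training scripts the learning rate is usually set when the optimizer is created and/or through a LearningRateScheduler callback; the variable names and values below are illustrative, so check the training script in this repository (e.g. train_pose) for the actual ones:

```python
from keras.optimizers import SGD
from keras.callbacks import LearningRateScheduler

base_lr = 4e-5   # illustrative value; look for a similarly named variable in the training script
gamma = 0.333    # illustrative decay factor

optimizer = SGD(lr=base_lr, momentum=0.9, nesterov=False)

def step_decay(epoch):
    # step-wise decay: shrink the base learning rate by `gamma` every 10 epochs
    return base_lr * (gamma ** (epoch // 10))

lr_schedule = LearningRateScheduler(step_decay)
# model.compile(optimizer=optimizer, loss='mean_squared_error')
# model.fit(x, y, callbacks=[lr_schedule])
```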