
How long does the training take? #11

Open
trantorrepository opened this issue Jan 20, 2018 · 21 comments

Comments

@trantorrepository

No description provided.

@anatolix
Owner

On my 1080 Ti one epoch takes approximately 2-2.5 hours, and you need 30-35 epochs to finish.

@trantorrepository
Author

My training loss is larger than the loss you reported in michalfaber#45. It is still 740+ after 10 epochs. What might cause this?

@anatolix
Owner

To be honest, no idea. Which project are you training? The absolute value of the loss differs between michalfaber's project and mine due to different hdf5 content, but this is a big loss for both my version and Michal's.

@trantorrepository
Author

I use your latest project. The only different setting may be that I use 2 GPUs for training.

@anatolix
Owner

Did you scale the batch size for 2 GPUs? If yes, you may need to scale the learning rate as well.

@Minotaur-CN

How do you scale the LR for multi-GPU, according to your experience?
Just multiply the LR by the number of GPUs?

@anatolix
Owner

anatolix commented Jan 22, 2018

According to this paper: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
Multiply according to the batch size, i.e. if you raise the batch size from 10 to 20, you should multiply the LR by two.
But I haven't tested this project with that setup.
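
A minimal sketch of that linear scaling rule, assuming a plain Keras SGD optimizer; the numbers and variable names below are illustrative, not the project's actual defaults:

```python
from keras.optimizers import SGD

# Illustrative values only; the project's actual base LR and batch size may differ.
base_lr = 2e-5          # LR tuned for the single-GPU batch size
base_batch_size = 10    # batch size that base_lr was tuned for
new_batch_size = 20     # e.g. doubled when training on 2 GPUs

# Linear scaling rule (Goyal et al., 2017): scale the LR with the batch size.
scaled_lr = base_lr * new_batch_size / base_batch_size  # 4e-5 here

optimizer = SGD(lr=scaled_lr, momentum=0.9)
```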

@trantorrepository
Author

I did not. It seems to be the reason.

@anatolix
Owner

Btw, did 2 GPUs train significantly faster in your case? In my setup the GPU load (`watch nvidia-smi`) just jumps from one GPU to the other, but my GPUs are different (1080 Ti and 1080) and have no NVLink, so I just train different models on them.

@trantorrepository
Author

Yes, it is about 1.7x faster. But the utilization of the two GPUs is not always high at the same time.

@Minotaur-CN

@anatolix
That may be because Python is used for data augmentation.

Data feeding is too slow when using this code for multi-GPU training.

@anatolix
Owner

anatolix commented Jan 24, 2018

py_rpme_server_tester.py can test the speed of augmentation. It is approximately 140 images per second on my machine (although the hdf5 should be on an SSD for that), which is 5 times faster than the C++ implementation and far more than we really need for training (10 per second per GPU). I think the problem is the Keras implementation of multi-GPU: it is really new and unfinished.
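
As a rough sketch of that kind of throughput check (the generator and batch size here are hypothetical placeholders, not the repo's actual API):

```python
import time

def measure_images_per_second(batch_generator, batch_size, n_batches=100):
    """Time how many augmented images per second the pipeline yields."""
    start = time.time()
    for _ in range(n_batches):
        next(batch_generator)          # pull one augmented batch
    elapsed = time.time() - start
    return n_batches * batch_size / elapsed

# Usage with a hypothetical training generator:
# print(measure_images_per_second(train_gen, batch_size=10))
```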

@anatolix
Owner

To be sure, I have just committed a speed test inside train_pose:
7e79fd3#diff-5f9553ab64c88cb242f0b55068ca2e49

On my server it says:
```
batches per second 5.786476637952872
batches per second 5.7619510163686
batches per second 5.842369421224827
batches per second 5.962092320266882
batches per second 5.999656360337656
batches per second 5.951338023827906
batches per second 5.9165966302952695
batches per second 5.906818176697108
batches per second 5.940744568724261
batches per second 5.967964646505151
batches per second 5.970570172200173
batches per second 5.940416025591697
batches per second 5.929933008772442
batches per second 5.9478481273904835
batches per second 5.9353772224932175
batches per second 5.939926683901685
batches per second 5.862215485602886
batches per second 5.87035626635639
batches per second 5.798390861812536
batches per second 5.78362199792317
batches per second 5.7078112813578095
batches per second 5.7466899871438635
batches per second 5.768000631491158
batches per second 5.733300557500513
```

I.e. it is enough for approximately 6 cards. This is with parallel model training, i.e. a second augmentation was running on this server at the same time.

@Ai-is-light

I trained a small model according to the prototxt of the original, and it took about 3 days; however, the result is nearly the same as the original's. It is only half the size of the paper's model and 2x faster.

@trantorrepository
Author

@Ai-is-light what is your result on COCO?

@trantorrepository
Author

@anatolix yes, Keras does not support multi-GPU well; the bug in saving a multi-GPU model has not been solved for months. Hoping for a TensorFlow version.
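
For reference, a common workaround for that save/load problem (a sketch, assuming the network is wrapped with keras.utils.multi_gpu_model and at least 2 GPUs are visible; the toy model is only an illustration): train through the parallel wrapper but checkpoint the original single-GPU template, since the two share weights.

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

# Toy stand-in for the real network, just to show the pattern.
template_model = Sequential([Dense(1, input_shape=(8,))])

# Train through the multi-GPU wrapper (requires >= 2 visible GPUs)...
parallel_model = multi_gpu_model(template_model, gpus=2)
parallel_model.compile(optimizer='sgd', loss='mse')

# ...but save the single-GPU template: it shares weights with the wrapper
# and reloads cleanly, which sidesteps the multi-GPU saving bug.
template_model.save_weights('weights.h5')
```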

@anatolix
Owner

Keras actually runs on top of TensorFlow; you can use all TF code with Keras as well.

@Ai-is-light

@trantorrepository it is shown as follows:
[wechatimg274: screenshot of results]
How about you?

@Ai-is-light

I find that if the PAF branch doesn't work well, or doesn't work at all, the mAP and AR cannot be computed from the output of the network. Did you run into this, @anatolix?

@hellojialee

@trantorrepository @anatolix @Ai-is-light
Hi everyone, if I use multi-GPU and double the batch size, do I need to change the learning rate (i.e. ×2) accordingly? The multi-GPU model in Keras duplicates the model and splits the input data evenly, so I wonder whether we really need to change the base (original) learning rate.

@murrayLuke

murrayLuke commented May 1, 2018

This may be naive, but what parameter controls the learning rate, and where can I change it?
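
In a typical Keras training script the learning rate is set on the optimizer passed to model.compile, and it can be changed afterwards through the Keras backend. A sketch with a toy model (the value and names are illustrative; check the repo's training script for the actual location):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras import backend as K

base_lr = 4e-5  # illustrative value, not necessarily the project's default

# Toy model just to show where the LR is set.
model = Sequential([Dense(1, input_shape=(8,))])
model.compile(optimizer=SGD(lr=base_lr, momentum=0.9), loss='mean_squared_error')

# Change the LR of an already-compiled model:
K.set_value(model.optimizer.lr, base_lr * 0.5)
```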
