
How long does the training take? #11

Open
trantorrepository opened this issue Jan 20, 2018 · 21 comments

Comments

@trantorrepository

No description provided.

@anatolix
Owner

On my 1080 Ti one epoch takes approximately 2-2.5 hours, and you need 30-35 epochs to finish.

@trantorrepository
Author

My training loss is larger than the loss you reported in michalfaber#45. It is still 740+ after 10 epochs. What might cause this?

@anatolix
Owner

To be honest, no idea. Which project are you training? The absolute value of the loss differs between michalfaber's project and mine due to different hdf5 content, but this is a big loss for both my version and Michal's.

@trantorrepository
Author

I use your latest project. The only different setting may be that I use 2 GPUs for training.

@anatolix
Owner

Did you scale the batch size for 2 GPUs? If yes, you may need to scale the learning rate as well.

@Minotaur-CN

How do you scale the LR for multi-GPU, according to your experience?
Just multiply the LR by the number of GPUs?

@anatolix
Owner

anatolix commented Jan 22, 2018

According to this paper: https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
Multiply according to the batch size, i.e. if you raise the batch size from 10 to 20, you should multiply the LR by two.
But I haven't tested this project with that setup.
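
A minimal sketch of that linear scaling rule, assuming a plain Keras SGD optimizer; the numbers and variable names below are illustrative, not the project's actual defaults:

```python
from keras.optimizers import SGD

# Illustrative values only; the project's actual base LR and batch size may differ.
base_lr = 2e-5          # LR tuned for the single-GPU batch size
base_batch_size = 10    # batch size that base_lr was tuned for
new_batch_size = 20     # e.g. doubled when training on 2 GPUs

# Linear scaling rule (Goyal et al., 2017): scale the LR with the batch size.
scaled_lr = base_lr * new_batch_size / base_batch_size  # 4e-5 here

optimizer = SGD(lr=scaled_lr, momentum=0.9)
```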

@trantorrepository
Author

I did not. It seems to be the reason.

@anatolix
Owner

Btw, did 2 GPUs train significantly faster in your case? In my setup the GPU load (`watch nvidia-smi`) just jumps from one GPU to the other, but my GPUs are different (1080 Ti and 1080) and have no NVLink, so I just train different models on them.

@trantorrepository
Author

Yes, it is about 1.7x faster. But the utilization of the two GPUs is not always high at the same time.

@Minotaur-CN

@anatolix
That may be because Python is used for data augmentation.

Data feeding is too slow when using this code for multi-GPU training.

@anatolix
Owner

anatolix commented Jan 24, 2018

py_rpme_server_tester.py can test the speed of augmentation. It is approximately 140 images per second on my machine (although the hdf5 should be on an SSD for that), which is 5 times faster than the C++ implementation and far more than we really need for training (10 per second per GPU). I think the problem is the Keras implementation of multi-GPU: it is really new and unfinished.
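
As a rough sketch of that kind of throughput check (the generator and batch size here are hypothetical placeholders, not the repo's actual API):

```python
import time

def measure_images_per_second(batch_generator, batch_size, n_batches=100):
    """Time how many augmented images per second the pipeline yields."""
    start = time.time()
    for _ in range(n_batches):
        next(batch_generator)          # pull one augmented batch
    elapsed = time.time() - start
    return n_batches * batch_size / elapsed

# Usage with a hypothetical training generator:
# print(measure_images_per_second(train_gen, batch_size=10))
```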

@anatolix
Owner

To be sure, I have just committed a speed test inside train_pose:
7e79fd3#diff-5f9553ab64c88cb242f0b55068ca2e49

On my server it says:
```
batches per second 5.786476637952872
batches per second 5.7619510163686
batches per second 5.842369421224827
batches per second 5.962092320266882
batches per second 5.999656360337656
batches per second 5.951338023827906
batches per second 5.9165966302952695
batches per second 5.906818176697108
batches per second 5.940744568724261
batches per second 5.967964646505151
batches per second 5.970570172200173
batches per second 5.940416025591697
batches per second 5.929933008772442
batches per second 5.9478481273904835
batches per second 5.9353772224932175
batches per second 5.939926683901685
batches per second 5.862215485602886
batches per second 5.87035626635639
batches per second 5.798390861812536
batches per second 5.78362199792317
batches per second 5.7078112813578095
batches per second 5.7466899871438635
batches per second 5.768000631491158
batches per second 5.733300557500513
```

I.e. it is enough for approximately 6 cards. This is with parallel model training, i.e. a second augmentation was running on this server at the same time.

@Ai-is-light

I trained a small model according to the prototxt of the original, and it took about 3 days; however, the result is nearly the same as the original's. It is only half the size of the paper's model and 2x faster.

@trantorrepository
Author

@Ai-is-light what is your result on COCO?

@trantorrepository
Author

@anatolix yes, Keras does not support multi-GPU well; the bug in saving a multi-GPU model has not been solved for months. Hoping for a TensorFlow version.
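
For reference, a common workaround for that save/load problem (a sketch, assuming the network is wrapped with keras.utils.multi_gpu_model and at least 2 GPUs are visible; the toy model is only an illustration): train through the parallel wrapper but checkpoint the original single-GPU template, since the two share weights.

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

# Toy stand-in for the real network, just to show the pattern.
template_model = Sequential([Dense(1, input_shape=(8,))])

# Train through the multi-GPU wrapper (requires >= 2 visible GPUs)...
parallel_model = multi_gpu_model(template_model, gpus=2)
parallel_model.compile(optimizer='sgd', loss='mse')

# ...but save the single-GPU template: it shares weights with the wrapper
# and reloads cleanly, which sidesteps the multi-GPU saving bug.
template_model.save_weights('weights.h5')
```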

@anatolix
Owner

Keras actually runs on top of TensorFlow; you can use all TF code with Keras as well.

@Ai-is-light

@trantorrepository it is shown as follows:
[wechatimg274: screenshot of results]
How about you?

@Ai-is-light

I find that if the PAF branch doesn't work well, or doesn't work at all, the mAP and AR cannot be computed from the output of the network. Did you run into this, @anatolix?

@hellojialee

@trantorrepository @anatolix @Ai-is-light
Hi everyone, if I use multi-GPU and double the batch size, do I need to change the learning rate (i.e. ×2) accordingly? The multi-GPU model in Keras duplicates the model and splits the input data evenly, so I wonder whether we really need to change the base (original) learning rate.

@murrayLuke

murrayLuke commented May 1, 2018

This may be naive, but what parameter controls the learning rate, and where can I change it?
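
In a typical Keras training script the learning rate is set on the optimizer passed to model.compile, and it can be changed afterwards through the Keras backend. A sketch with a toy model (the value and names are illustrative; check the repo's training script for the actual location):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras import backend as K

base_lr = 4e-5  # illustrative value, not necessarily the project's default

# Toy model just to show where the LR is set.
model = Sequential([Dense(1, input_shape=(8,))])
model.compile(optimizer=SGD(lr=base_lr, momentum=0.9), loss='mean_squared_error')

# Change the LR of an already-compiled model:
K.set_value(model.optimizer.lr, base_lr * 0.5)
```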
