
Learning rate decay depends on batch size (was: Training never achieves results of original caffe models) #39

Open
anatolix opened this issue Nov 27, 2017 · 7 comments

@anatolix commented Nov 27, 2017

Update: I've probably found the problem; see the last comments.

I tried to train the model with both the original C++ augmentation (rmpe_server) and my own Python implementation (py_rmpe_server), and it never trains correctly.

To prove my point, this is the demo.py output with weights.best.h5 converted from Caffe:
[image: canonical caffe]
Note the perfect joint matching and the absence of additional unconnected points.

@anatolix (Author)

These are the results of training with the C++ rmpe_server, one picture for every 10 generations:
[images: cpp10, cpp20, cpp30, cpp40, cpp50]

There are perfect skeletons, but also additional points:
[image: cpp60]

Near perfect, but there are double dots on the legs of the center guy in the background:
[image: cpp70]

Later models are more overfitted:
[images: cpp80, cpp90, cpp100, cpp110, cpp120]

@anatolix (Author)

These are my results with py_rmpe_server so far:
[images: py10, py20, py30, py40]

@anatolix (Author) commented Nov 27, 2017

The question is: can this project achieve the results of the original training at all?
Maybe it gets caught somewhere around the 70th generation, or maybe it was never achieved at all?
Do the original authors know some magic trick? Do we have the same learning rate, initialization, etc.?

@anatolix (Author)

I checked the val_stage6_Lx losses, which influence quality. For the C++ augmentation the best are:

gen 117 for the L2 loss
[image: cpp107]

gen 107 for the L1 loss
[image: cpp117]

They are almost perfect again. Maybe the training step decay made them better.

@anatolix (Author) commented Nov 27, 2017

Maybe I've found the problem:

0 4e-05
1 4e-05
2 4e-05
3 4e-05
...
50 4e-05
51 4e-05
52 1.3320000000000001e-05
53 1.3320000000000001e-05
...
102 1.3320000000000001e-05
103 1.3320000000000001e-05
104 4.435560000000001e-06
105 4.435560000000001e-06
...
148 4.435560000000001e-06
149 4.435560000000001e-06

Note: we actually only reach a small enough learning rate after generation 100.

The original code has 25 generations, each twice the size of ours (meta.write_number: 121000); by the way, why? Did they have more images, or did they do more augmentation?
The learning rate changed after epoch 17 (ours would be 36) and training finished by epoch 25 (ours at 50).
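
For reference, here is a minimal sketch of a Caffe-style step decay in which the drop point is counted in optimizer iterations rather than epochs. All constants (base_lr, gamma, stepsize, num_samples) are assumed values for illustration, not the repo's actual ones; the point is only the mechanism.

```python
import math

# Assumed constants, chosen only to illustrate the mechanism.
base_lr = 4e-5        # initial learning rate, as in the schedule printed above
gamma = 0.333         # multiplicative decay factor
stepsize = 136000     # decay step, counted in optimizer iterations (assumed)
num_samples = 52000   # training samples per epoch (assumed)

def step_decay(epoch, batch_size):
    # The lr is dropped every `stepsize` iterations. Since iterations_per_epoch
    # shrinks as the batch size grows, the *epoch* at which the first drop
    # happens moves with the batch size.
    iterations_per_epoch = num_samples // batch_size
    iterations = epoch * iterations_per_epoch
    return base_lr * math.pow(gamma, math.floor(iterations / stepsize))

for bs in (10, 20, 40):
    first_drop = next(e for e in range(1000) if step_decay(e, bs) < base_lr)
    print("batch size %2d -> lr first drops at epoch %d" % (bs, first_drop))
```

With these assumed numbers, doubling the batch size roughly doubles the epoch at which each drop happens, which is why a schedule tuned for the original batch size decays far too late (or too early) when training with a different one.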

@anatolix (Author)

And all of this is affected by the batch size. I have batch size = 20, so this is probably where my training problems come from.

@anatolix changed the title from "Training never achieves results of original caffe models" to "Learning rate decay depends on batch size (was: Training never achieves results of original caffe models)" on Nov 27, 2017
@anatolix (Author) commented Nov 27, 2017

I probably need help here: should the learning rate be changed with the batch size?
On one hand, a larger batch size means a larger sum of gradients (I don't remember whether they use the sum or the mean).
On the other hand, a larger batch means less stochasticity.
I haven't found an answer on Google.
Was the dependency of the lr decay on the batch size intentional?
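
Here is a small numerical sketch of the sum-vs-mean point (NumPy; the per-sample gradient values are made up purely for illustration):

```python
import numpy as np

# With *mean* reduction over the batch, the batch gradient's magnitude stays
# roughly independent of the batch size (only its variance shrinks), while with
# *sum* reduction it grows linearly with the batch size, which effectively
# multiplies the learning rate by the batch size.
rng = np.random.default_rng(0)
per_sample_grads = rng.normal(loc=1.0, scale=1.0, size=80_000)  # fake gradients

for batch_size in (20, 40, 80):
    grads = per_sample_grads.reshape(-1, batch_size)
    print("batch %2d  |mean-reduced| %.2f   |sum-reduced| %.1f"
          % (batch_size,
             np.abs(grads.mean(axis=1)).mean(),
             np.abs(grads.sum(axis=1)).mean()))
```

For mean-reduced losses the common heuristic seems to be the linear scaling rule (Goyal et al. 2017): scale the lr proportionally to the batch size, usually with a warmup. Either way, the decay schedule is probably better expressed in samples seen rather than in raw iteration counts, so that changing the batch size does not move the drops around.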
