When I tried using SSRL for training on a real robot, I encountered NaN values during model training. I suspect it's because the real setup is quite noisy. To verify, I added the same level of noise to the observations in simulation and observed a similar issue. Has this issue ever come up for you?
Model epoch 0: train total loss 16197.876953125, train mean loss 26707.083984375, test mean loss [2.1955036e+14]
Model epoch 1: train total loss 2.241743061862318e+17, train mean loss 3.6961121031788954e+17, test mean loss [3.817644e+12]
Model epoch 2: train total loss nan, train mean loss nan, test mean loss [nan]
Model epoch 3: train total loss nan, train mean loss nan, test mean loss [nan]
Model epoch 4: train total loss nan, train mean loss nan, test mean loss [nan]
Model epoch 5: train total loss nan, train mean loss nan, test mean loss [nan]
Hmm... I'm thinking you're getting NaNs from exploding gradients, either because the data is very noisy or because the Lagrangian dynamics are very stiff. For the former, you could try a lower learning rate or preprocess the data with a zero-phase filter (rough sketch below). For the latter, double-check your robot XML definition -- if, for example, the leg link masses are very small, small changes in contact forces will produce very large changes in acceleration (and state).
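Not sure what your data pipeline looks like, but here's the kind of zero-phase filtering I have in mind, assuming the observations sit in a `(timesteps, obs_dim)` NumPy array; the sampling rate, cutoff, and filter order below are placeholders you'd need to tune:

```python
# Minimal sketch: zero-phase low-pass filtering of noisy observations before
# model training. Sampling rate, cutoff, and order are illustrative only.
import numpy as np
from scipy.signal import butter, filtfilt

def zero_phase_filter(obs: np.ndarray, fs: float = 100.0,
                      cutoff_hz: float = 10.0, order: int = 4) -> np.ndarray:
    """Low-pass filter each observation dimension without introducing phase lag."""
    b, a = butter(order, cutoff_hz / (0.5 * fs), btype="low")
    # filtfilt runs the filter forward and then backward, cancelling the phase shift
    return filtfilt(b, a, obs, axis=0)

# Example usage on a noisy synthetic signal
t = np.linspace(0.0, 1.0, 100)
noisy = np.sin(2 * np.pi * 2 * t)[:, None] + 0.1 * np.random.randn(100, 1)
smoothed = zero_phase_filter(noisy)
```

Because the filter is zero-phase, the smoothed observations stay time-aligned with the actions, which matters when you're fitting a dynamics model.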
You can also check out the debug-NaNs feature of JAX (https://jax.readthedocs.io/en/latest/debugging/flags.html) to see where the NaN first appears. You might have to disable JIT for it to report the exact location where the NaN occurs.
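For reference, a minimal snippet of how to turn those flags on (`train_step` is just a stand-in for wherever your model update happens):

```python
# Sketch: enable JAX's NaN debugging; `train_step`, `params`, and `batch` are
# placeholder names, not part of the SSRL API.
import jax

jax.config.update("jax_debug_nans", True)      # raise as soon as a NaN is produced
# jax.config.update("jax_disable_jit", True)   # optionally disable JIT globally so the
                                               # traceback points at the exact op

# Or disable JIT only around the suspect call:
# with jax.disable_jit():
#     train_step(params, batch)
```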