When the observations are highly noisy, I encounter NaN values during model training. Has this issue ever come up for you? #12

Open
hdadong opened this issue Jan 16, 2025 · 1 comment

Comments

@hdadong

hdadong commented Jan 16, 2025

When I tried training with SSRL on a real robot, I encountered NaN values during model training. I suspect the cause is that the real setup is quite noisy. To verify, I added the same level of noise to the observations in simulation and saw a similar failure. Has this ever come up for you?

Model epoch 0: train total loss 16197.876953125, train mean loss 26707.083984375, test mean loss [2.1955036e+14]
Model epoch 1: train total loss 2.241743061862318e+17, train mean loss 3.6961121031788954e+17, test mean loss [3.817644e+12]
Model epoch 2: train total loss nan, train mean loss nan, test mean loss [nan]
Model epoch 3: train total loss nan, train mean loss nan, test mean loss [nan]
Model epoch 4: train total loss nan, train mean loss nan, test mean loss [nan]
Model epoch 5: train total loss nan, train mean loss nan, test mean loss [nan]

hdadong changed the title from "When the observations are highly noisy, I encounter NaN values during world model training. Has this issue ever come up for you?" to "When the observations are highly noisy, I encounter NaN values during model training. Has this issue ever come up for you?" on Jan 16, 2025
@jake-levy
Member

Hmm... I suspect the NaNs come from exploding gradients, caused either by very noisy data or by very stiff Lagrangian dynamics. For the former, you could try a lower learning rate or preprocess the data with a zero-phase filter. For the latter, double-check your robot XML definition: if, for example, the leg link masses are very small, small changes in contact forces will result in very large changes in acceleration (and state).
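
As a rough illustration of the zero-phase filtering idea, here is a minimal dependency-free sketch. It is not code from SSRL: the names `ema` and `zero_phase_lowpass` and the smoothing factor `alpha` are hypothetical, and the first-order low-pass is just an assumed choice of filter. The trick is to run the filter forward and then backward over the logged signal so the two phase lags cancel. Because it needs future samples, it only applies to offline preprocessing of recorded data.

```python
def ema(signal, alpha):
    """Causal first-order low-pass (exponential moving average)."""
    out = []
    state = signal[0]  # start from the first sample to limit the edge transient
    for x in signal:
        state = alpha * x + (1.0 - alpha) * state
        out.append(state)
    return out


def zero_phase_lowpass(signal, alpha=0.2):
    """Filter forward, then backward, so the phase lags cancel out."""
    forward = ema(signal, alpha)
    backward = ema(forward[::-1], alpha)  # same filter, run in reversed time
    return backward[::-1]


# One observation channel: a slow ramp. The causal filter lags behind it,
# while the forward-backward version tracks it in the interior of the signal.
ramp = [float(n) for n in range(200)]
causal = ema(ramp, 0.2)
smoothed = zero_phase_lowpass(ramp, 0.2)
```

For multi-channel observations the same function would be applied per channel. `scipy.signal.filtfilt` implements the same forward-backward scheme for general filter coefficients and is probably the better choice when SciPy is available.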

You can also use JAX's `jax_debug_nans` flag (https://jax.readthedocs.io/en/latest/debugging/flags.html) to see where the NaN first appears. You might have to disable jit for it to report the exact location where the NaN first occurs.
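
A minimal sketch of that debugging setup, assuming a toy function in place of the real model loss (`toy_loss` and `first_nan_message` are hypothetical names, not part of SSRL). With `jax_debug_nans` enabled, JAX raises a `FloatingPointError` when a computation produces a NaN, and `jax.disable_jit()` runs the code op by op so the error points at the offending operation.

```python
import jax
import jax.numpy as jnp

# Raise an error as soon as a JAX computation produces a NaN.
jax.config.update("jax_debug_nans", True)


def toy_loss(x):
    # Hypothetical stand-in for the model loss: log of a negative input is NaN.
    return jnp.log(x)


def first_nan_message(x):
    """Run toy_loss without jit; return the error message, or None if no NaN."""
    try:
        with jax.disable_jit():  # op-by-op execution for a precise location
            toy_loss(jnp.asarray(x)).block_until_ready()
    except FloatingPointError as err:
        return str(err)
    return None


bad = first_nan_message(-1.0)   # an error message naming the failing op
good = first_nan_message(1.0)   # None, since log(1.0) is finite
```

In a real training run the flag would be set once at startup, and the traceback of the raised error is what identifies the first operation that went bad. Expect training to run noticeably slower with these checks on, so they are best enabled only while debugging.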

@jake-levy jake-levy reopened this Jan 16, 2025