Failure to start from checkpoint (with a different system) #36
Comments
Hi Thede, I tried reproducing your error and couldn't do so within a system test. Is there any chance you can send me the checkpoints that you're trying to boot from?
Hi Thede,

Just briefly looking at your files, it doesn't look like you're doing anything fundamentally incorrect with PV. I'd like to track this issue down in case it is a bug with booting from a checkpoint.

However, if your goal is to continue learning weights, you can manually load them into the connection "V1ToInputError_V1" by setting the parameters weightInitType = "FileWeight" and initWeightsFile = "/home/neuralnet/workspace/OpenPV/Sparsity/output/Checkpoints/batchsweep_00/Checkpoint6000000/V1ToInputError_V1_W.pvp". The weights are shared across a batch, so that particular file should be the same in all batchsweep directories. You can continue training where the previous run left off by setting the Movie parameter start_frame_index to the index of the next desired input.

Note that if all other layers/connections are initialized as in the params file (and not by checkpoint), then the initial conditions (V1 activity, etc.) will be different from what you would get by loading the checkpoint. Additionally, you can load all activities from their respective checkpoint files, but I think all elements of a batch will then have the same activity at initialization (instead of each batch element being initialized with its own previous activity, which is what should happen if you could boot from a checkpoint).

As far as the error with your run:
~Will
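For reference, the parameter changes described in the comment above might look like the following params-file fragment. This is only a sketch: the lines belong inside the group definitions that already exist in your params file, the `//` comment syntax is assumed to match what OpenPV params files accept, and the `start_frame_index` value is a placeholder rather than a value taken from this thread.

```
// Inside the existing "V1ToInputError_V1" connection group:
weightInitType  = "FileWeight";
initWeightsFile = "/home/neuralnet/workspace/OpenPV/Sparsity/output/Checkpoints/batchsweep_00/Checkpoint6000000/V1ToInputError_V1_W.pvp";

// Inside the existing Movie layer group:
// placeholder value; set it to the index of the next input you want presented
start_frame_index = 0;
```

All other parameters in those groups stay as they are. With this approach only the weights come from the checkpoint; the layers start from whatever initial state the params file specifies.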
Hello Will, thank you for the advice, but I still can't get it running. It looks like it stops as soon as it attempts to use the GPU. I'm not sure how to collect the information about the MPI and CUDA errors, but
Is there any indication that you're running out of GPU memory? I'm wondering if your nbatch of 40 is too large for the 750 Ti. Can you try running the same params with updateGpu = false; in "V1" and receiveGpu = false; in "InputError_V1ToV1"? This will run the network without the GPU. It will probably be quite slow, but I want to see if it errors out in the same way.
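A minimal sketch of the CPU-only test suggested above, assuming the two groups are named exactly as quoted and that the connection parameter is spelled `receiveGpu` (treat that spelling as an assumption and check it against your own params file). Only the two flags change; everything else in each group is left untouched.

```
// Inside the existing "V1" layer group:
updateGpu = false;

// Inside the existing "InputError_V1ToV1" connection group:
receiveGpu = false;
```

If the run gets past the first write with these settings, that points toward the GPU path (for example memory limits on the smaller card) rather than the checkpoint files themselves.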
The program should throw an error if there is not enough GPU memory. I got your checkpoints email, so I'll see if I can reproduce the error.
Sheng
First off, I have to apologize for replying this late.
So I know that changing batchwidth and nbatch between runs will indeed break checkpointing, and fixing that is on our to-do list. However, the other parameters you mentioned should be okay to change between checkpoints.
Hello,
In an email, I was once asked to present my questions openly, so I'll do that here.
I'm afraid this one is more about using PetaVision than about improving it.
For a student project I slightly modified your LCACifar params file (from the tutorial) and successfully got the desired results for 4 different conditions.
The computer with a Quadro K6000 had to be handed over to someone else before I could check whether my fourth condition was trained sufficiently, and as it turns out, it wasn't.
For the last remaining steps I've got a GTX 750 Ti.
However, I have problems loading from a checkpoint.
As you'll see in stereoV1-P2.log.txt, the run stops before it writes anything (before the first 1000 steps are done), and I don't really understand the error message in there.
The command I use to run the network is in run_sparsity.sh.txt; the params file is stereoLCA_V1.params.txt.
Here is the log file of the first time I ran that setup (with the K6000): stereoV1.log.txt. The parameters were the same except for the number of iterations and the initial write time.
Personally, I suspect this is my own failure to make PV do what it can, rather than an actual issue within PetaVision.
I want to point out that the way several checkpoints are given to mpirun was not intuitive to me, and the only explanation of how to do that was in the log file after the run aborted. Maybe adding a "(separated with a colon)" to the appropriate section in the tutorial would be an option.
I'll just thank you all in advance for reading this,
Thede