Failure to start from checkpoint (with a different system) #36

Open
twitschel opened this issue Feb 23, 2016 · 7 comments

@twitschel

Hello,
In an email I was once asked to present my questions openly, so I'll do that here.
It is, however, more about using PetaVision than about improving it, I'm afraid.

For a student project I slightly modified your LCACifar params file (from the tutorial) and successfully got the desired results for 4 different conditions.
The computer with the Quadro K6000 had to be handed over to someone else before I could check whether my fourth condition was trained sufficiently, and as it turns out now, it wasn't.
For the last remaining steps I've got a GTX 750 Ti.

However, I have problems loading from a checkpoint.
As you'll see in stereoV1-P2.log.txt, it stops working before it writes anything (before the first 1000 steps are done), but I don't really understand the error message in there.

The command I use to run the network is in run_sparsity.sh.txt; the params file is stereoLCA_V1.params.txt.

Here is the log file from the first time I ran that setup (with the K6000): stereoV1.log.txt. The parameters were the same except for the number of iterations and the initial write time.

Personally, I suspect I'm simply failing to make PV do what it can, rather than hitting an actual issue within PetaVision.
I do want to point out that the way several checkpoints are given to mpirun was not intuitive to me, and the only explanation of how to do that appeared in the log file after the run aborted. Maybe adding "(separated with a colon)" to the appropriate section of the tutorial would be an option.
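
For reference, a hypothetical sketch of what such an invocation can look like, assuming one checkpoint directory per batch process joined with colons. The executable name, process count, flag spellings, and paths below are placeholders, not the exact contents of run_sparsity.sh:

```sh
# Hypothetical sketch: restart from per-batch checkpoints, joining the
# checkpoint directories with colons (one per batch process).
# Executable name, paths, and flag spellings are assumptions here.
mpirun -np 2 ./Release/stereoLCA -t 4 \
    -p stereoLCA_V1.params \
    -c output/Checkpoints/batchsweep_00/Checkpoint6000000:output/Checkpoints/batchsweep_01/Checkpoint6000000 \
    -l stereoV1-P2.log
```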

I'll just thank you all in advance for reading this,
Thede

@slundqui
Contributor

Hi Thede,

I tried reproducing your error and couldn't do so within a system test. Is there any chance you can send me the checkpoints you're trying to boot from?

@wshainin
Member

Hi Thede,

Just from briefly looking at your files, it doesn't look like you're doing anything fundamentally incorrect with PV. I'd like to track this issue down in case it is a bug with booting from a checkpoint. However, if your goal is to continue learning weights, you can manually load them into the connection "V1ToInputError_V1" by setting the parameters weightInitType = "FileWeight" and initWeightsFile = "/home/neuralnet/workspace/OpenPV/Sparsity/output/Checkpoints/batchsweep_00/Checkpoint6000000/V1ToInputError_V1_W.pvp". The weights are shared across the batch, so that particular file should be the same in all batchsweep directories. You can continue training where the previous run left off by setting the Movie parameter start_frame_index to the index of the next desired input (a sketch of how this might look in the params file is below).
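
A rough sketch of how those settings might sit in the params file. The class names and the Movie layer name below are placeholders for whatever the existing params file declares; everything not shown stays as it already is:

```
// Sketch only: class names and the Movie layer name are placeholders;
// keep all other parameters of these groups exactly as they already are.
MomentumConn "V1ToInputError_V1" = {
    // ... existing parameters unchanged ...
    weightInitType  = "FileWeight";
    initWeightsFile = "/home/neuralnet/workspace/OpenPV/Sparsity/output/Checkpoints/batchsweep_00/Checkpoint6000000/V1ToInputError_V1_W.pvp";
};

Movie "Input" = {
    // ... existing parameters unchanged ...
    start_frame_index = 0; // replace with the index of the next desired input frame
};
```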

Note that if all other layers/connections are initialized as in the params file (and not from the checkpoint), then the initial conditions (V1 activity, etc.) will be different from what they would be had you loaded the checkpoint. Additionally, you can load all activities from their respective checkpoint files, but I think all elements of a batch will then have the same activity at initialization (instead of each batch element being initialized with its previous activity, which is what should happen if you could boot from a checkpoint).

As far as the error with your run:

  • The -l input parameter will only capture PV-generated output; anything else sent to stdout/stderr will not end up in the log. Some of that information (such as MPI or CUDA errors) might be useful for debugging.
  • You mentioned that this run is on a different machine. Are you able to check the versions of CUDA, cuDNN, MPI, etc. (a few commands for gathering these are sketched below)? I would like to verify that the dependencies on the new machine are not significantly different from the machine you first ran on.
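
For anyone gathering the same information, a few commands that usually cover it, assuming the CUDA toolkit and MPI binaries are on the PATH; the cuDNN header location below is an assumption and depends on how it was installed:

```sh
nvcc --version                                      # CUDA toolkit version
mpirun --version                                    # MPI implementation and version
nvidia-smi                                          # driver version and GPU model
grep CUDNN /usr/local/cuda/include/cudnn.h | head   # cuDNN version defines, if the header carries them
```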

~Will

@twitschel
Author

Hello Will,

Thank you for the advice, but I still can't get it running. It looks like it stops as soon as it attempts to use the GPU.
stereoLCA_V2.params.txt (here I also direct it to InputToInputScaled_W, but the result is the same if I don't do that)
stereoV2-P2.log.txt

I'm not sure how to collect the information about the MPI and CUDA errors, but Versions.txt contains the nvcc -V, mpirun -V, and nvidia-smi reports, plus the cuDNN version (which is 3.0).

@wshainin
Member

Is there any indication that you're running out of GPU memory? I'm wondering if your nbatch of 40 is too large for the 750 Ti. Can you try running the same params with updateGpu = false; in "V1" and recieveGpu = false; in "InputError_V1ToV1" (a sketch of these changes follows below)? This will run the network without the GPU. It will probably be quite slow, but I want to see if it errors out in the same way.
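
A sketch of those two changes; the class names are placeholders for whatever the params file already declares, and only the parameters being flipped are shown:

```
// Sketch only: class names are placeholders; all other parameters of
// these groups stay exactly as they are in the existing params file.
HyPerLCALayer "V1" = {
    // ...
    updateGpu = false;
};

HyPerConn "InputError_V1ToV1" = {
    // ...
    recieveGpu = false; // parameter name spelled as in the comment above
};
```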

@slundqui
Contributor

The program should throw an error if there is not enough GPU memory. Furthermore, the code should break during initialization if it's a memory issue, as we shouldn't be allocating memory in update steps.

I got your checkpoints email, so I'll see if I can reproduce the error this weekend. I suspect it's a bug in loading checkpoints that are split across multiple batches.

Sheng

@twitschel
Author

First of all, I have to apologize for replying this late.
It does appear to be a hardware problem, as I can start from the checkpoints fine when I run this on the original computer (you've seen the images, I think).
But I did try varying some of the parameters I pass it (number of processes, batchwidth, number of threads, nbatch) and I still couldn't get it to work. If it obviously can't work with different settings I'll leave it at that, but if it helps I'd offer to test and record the logs of the different parameter settings, as well as anything else I can manage and you might suggest.

@slundqui
Contributor

So, I do know that changing batchwidth and nbatch between runs will break checkpointing; that is on our to-do list of things to fix. However, the other parameters you mentioned should be okay to change between checkpoints.
