Implementation of state of the art deep network tools into PetaVision #22

Open · slundqui opened this issue Dec 28, 2015 · 7 comments

@slundqui (Contributor)

Hi all,

As many of you know, I have written a standalone AlexNet implementation in C++, and I would like to incorporate this functionality into PetaVision. However, there are some fundamental differences between my implementation and the current PetaVision implementation that need to be addressed. I'm starting this thread to pick your brains for implementation details, and to give everyone an idea of what to expect.

First off, here are the advantages of my implementation over the current attempt at an AlexNet implementation in PetaVision.

  • All memory lives on the GPU. In contrast, PetaVision requires that the data on the CPU stays in sync with the data on the GPU.
  • Gradient calculations are done via CuDNN. This includes gradients calculated with respect to the bias and weights (analogous to how we calculate the dw buffer), as well as gradients with respect to the activations (for backprop to previous layers in deep networks). A rough sketch of these backward calls follows this list.
  • Max pooling and its gradients are done via CuDNN, with further options for mean pooling.
  • All gradient calculations are encapsulated into a single layer. This means you don't have to follow the confusing "original connection to the backprop error, with a clone of that connection going into the next feedforward layer, and a transpose of the next connection from the next error to this error".
  • All bias connections are encapsulated into a single connection, with a flag to turn it off. No more "connection attached to a constant layer set to all 1's, with the original connected to the error and the clone connected to the feedforward layer, and you probably need a transpose of it somewhere too".
  • Cost functions are encapsulated into a single layer. Currently, options are least squares and softmax multinomial regression.
  • It works.
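For reference, here is a minimal sketch of the kind of cuDNN backward calls the list above refers to (assuming cuDNN v5-style signatures for the algorithm/workspace arguments). Descriptor setup, algorithm selection, and workspace sizing are assumed to happen elsewhere, and all of the names below are hypothetical rather than identifiers from my standalone code or from PetaVision:

```cpp
// Sketch only: per-layer backward passes using cuDNN.
#include <cudnn.h>

// Gradients for a convolution layer: bias, weights (the analogue of PV's dw
// buffer), and the input activations (the gradient handed to the layer below).
void convBackward(cudnnHandle_t handle,
                  cudnnTensorDescriptor_t xDesc, const float *x,    // forward input
                  cudnnTensorDescriptor_t dyDesc, const float *dy,  // gradient from above
                  cudnnFilterDescriptor_t wDesc, const float *w,
                  cudnnConvolutionDescriptor_t convDesc,
                  cudnnConvolutionBwdFilterAlgo_t filterAlgo,
                  cudnnConvolutionBwdDataAlgo_t dataAlgo,
                  void *workspace, size_t workspaceBytes,
                  cudnnTensorDescriptor_t dbDesc, float *db,        // bias gradient
                  cudnnFilterDescriptor_t dwDesc, float *dw,        // weight gradient
                  cudnnTensorDescriptor_t dxDesc, float *dx) {      // gradient to pass down
    const float alpha = 1.0f, beta = 0.0f;
    cudnnConvolutionBackwardBias(handle, &alpha, dyDesc, dy, &beta, dbDesc, db);
    cudnnConvolutionBackwardFilter(handle, &alpha, xDesc, x, dyDesc, dy, convDesc,
                                   filterAlgo, workspace, workspaceBytes, &beta, dwDesc, dw);
    cudnnConvolutionBackwardData(handle, &alpha, wDesc, w, dyDesc, dy, convDesc,
                                 dataAlgo, workspace, workspaceBytes, &beta, dxDesc, dx);
}

// Gradient for a max (or mean) pooling layer; the pooling descriptor selects the mode.
void poolBackward(cudnnHandle_t handle, cudnnPoolingDescriptor_t poolDesc,
                  cudnnTensorDescriptor_t yDesc, const float *y,
                  cudnnTensorDescriptor_t dyDesc, const float *dy,
                  cudnnTensorDescriptor_t xDesc, const float *x,
                  cudnnTensorDescriptor_t dxDesc, float *dx) {
    const float alpha = 1.0f, beta = 0.0f;
    cudnnPoolingBackward(handle, poolDesc, &alpha, yDesc, y, dyDesc, dy,
                         xDesc, x, &beta, dxDesc, dx);
}
```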

I plan on keeping this part of the code encapsulated in the mlearning auxlib, but I can see how many of these features can be useful in the core toolbox as well (pooling on GPUs, updating weights on GPUs, etc). I'll have to figure out which parts go where.

One major design problem I have to solve has to do with the encapsulation of gradient calculations into a single layer. Since AlexNet and the like depend on a feedforward stage and a backprop stage, encapsulating both in a single layer is tricky. Currently, we control stages via the phase parameter. However, backprop must execute in the opposite order of the feedforward stage. When these two computations were separate, a user explicitly set phases to achieve the desired result, which is no longer possible when the computations are combined into a single layer. Here are several possible solutions to this problem.

  • Implement a "backwardsUpdateState" in layers that runs through the phases in the opposite order (see the sketch after this list). While this would be the easiest to implement, we would be putting an AlexNet-specific function into the general HyPerLayer class.
  • Implement an mlearning HyPerCol derived from the base HyPerCol that implements a backwardsUpdateState. However, this may create an unnecessary dependency on this special HyPerCol whenever you use any mlearning layers.
  • Implement an automatic gradient layer generator with correct phases. This is an idea I got from TensorFlow. It's my favorite idea so far, but I'm sure there is some reason why it wouldn't work, which I hope you guys can bring to light.
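To make the first option more concrete, here is a rough sketch of a column timestep that runs the phases forward and then in reverse. The layer interface and method names here are simplified stand-ins, not the actual HyPerLayer/HyPerCol API:

```cpp
// Sketch only: a simplified layer interface with a hypothetical backwardsUpdateState.
#include <vector>

struct Layer {
    int phase = 0;
    virtual void updateState(double t, double dt) = 0;        // feedforward update
    virtual void backwardsUpdateState(double t, double dt) {} // gradient pass; no-op by default
    virtual ~Layer() {}
};

void advanceTimestep(std::vector<Layer*> &layers, int numPhases, double t, double dt) {
    // Feedforward: phases in ascending order, as the column does now.
    for (int phase = 0; phase < numPhases; ++phase) {
        for (Layer *l : layers) {
            if (l->phase == phase) { l->updateState(t, dt); }
        }
    }
    // Backprop: the same phases, walked in descending order.
    for (int phase = numPhases - 1; phase >= 0; --phase) {
        for (Layer *l : layers) {
            if (l->phase == phase) { l->backwardsUpdateState(t, dt); }
        }
    }
}
```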

One final minor thing would be to split the GPU timing info into separate memcpy and computation timings. I know Nvidia provides a timing toolbox with CUDA, which we might be able to integrate into our current timing implementation.
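For example, CUDA events could give us separate memcpy and compute timings per stream; a rough sketch with placeholder buffer names (the actual kernel launches are elided):

```cpp
// Sketch of splitting GPU timing into memcpy vs. compute using CUDA events.
#include <cuda_runtime.h>
#include <cstdio>

void timedStep(float *devBuf, const float *hostBuf, size_t bytes, cudaStream_t stream) {
    cudaEvent_t copyStart, copyStop, computeStop;
    cudaEventCreate(&copyStart);
    cudaEventCreate(&copyStop);
    cudaEventCreate(&computeStop);

    cudaEventRecord(copyStart, stream);
    cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(copyStop, stream);

    // ... launch compute kernels on `stream` here ...
    cudaEventRecord(computeStop, stream);

    cudaEventSynchronize(computeStop);
    float copyMs = 0.0f, computeMs = 0.0f;
    cudaEventElapsedTime(&copyMs, copyStart, copyStop);
    cudaEventElapsedTime(&computeMs, copyStop, computeStop);
    printf("memcpy: %.3f ms, compute: %.3f ms\n", copyMs, computeMs);

    cudaEventDestroy(copyStart);
    cudaEventDestroy(copyStop);
    cudaEventDestroy(computeStop);
}
```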

I'm sure this post is nowhere near a thorough review of all the problems I will run into, but it's a good start, and a good place for further discussion of new issues that come up later as well.

Sheng

slundqui assigned slundqui and peteschultz and unassigned slundqui and peteschultz on Dec 28, 2015
@slundqui (Contributor, Author)

@peteschultz

@slundqui (Contributor, Author)

For more information on the standalone implementation, here are the final writeup and presentation for the project.

https://docs.google.com/document/d/1ahbASGYgRrncbBUHdT38h13L7Vy9bdLJJmGaDrKnL_c/edit?usp=sharing

https://docs.google.com/presentation/d/1eSMSeFLNq2ul-bIx1BEhgC1vTBurL842IUpwTkzBpUk/edit?usp=sharing

@dpaiton (Contributor) commented Dec 28, 2015

This is great, Sheng! Would you mind elaborating more on what you mean by the "automatic gradient layer generator"? I know that TensorFlow and Theano both have the capability of auto-differentiating a function for back prop. Is this what you are describing? I am not making the connection for how this solves your problem of implementing back-prop in PetaVision.

Of the three options you listed, I actually like the first the best. I think adding the ability for any layer & conn to propagate a signal backwards down the network would be valuable. This one upgrade would make implementing standard ML algorithms easier, as well as novel semi-supervised models that combine LCA dynamics with back-propagated label error signals. Of course, the default back-prop functionality would have to allow all of the current models to run unimpeded, which might be difficult to do.

Whatever you choose, make sure you document it well. Let me know if there is anything I can do to help!

@slundqui (Contributor, Author)

I think it's related. In TensorFlow, you build the feedforward net, and the gradients are automatically calculated based on that feedforward net. I assumed it adds an equivalent backwards computation for every feedforward computation as needed, but it could very well be based on an empirical (numerical) calculation of the gradient.
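For what it's worth, here is a toy sketch of what an automatic gradient layer generator could look like on our side: for every feedforward layer the user specifies, the generator emits a gradient layer whose phase mirrors the forward phase, so the backward pass runs in reverse order. All class and function names here are hypothetical:

```cpp
// Sketch only: generate gradient layers whose phases mirror the forward phases.
#include <vector>

struct ForwardLayer { int phase; /* conv, pooling, cost, ... */ };
struct GradientLayer { int phase; const ForwardLayer *source; };

std::vector<GradientLayer> generateGradientLayers(const std::vector<ForwardLayer> &net,
                                                  int numPhases) {
    std::vector<GradientLayer> grads;
    for (const ForwardLayer &layer : net) {
        // The last feedforward phase becomes the first backprop phase, and so on,
        // so gradients flow back through the net in the opposite order.
        grads.push_back({numPhases + (numPhases - 1 - layer.phase), &layer});
    }
    return grads;
}
```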

@slundqui (Contributor, Author) commented Jan 5, 2016

After much thought into how exactly to implement this, I seem to be stuck. While encapsulating backprop into layers and connections would make building deep convolutional nets much easier, such a model does not fit well into the current implementation, which is a simple data-delivery model between layers using connections. By encapsulating backprop into special layers and connections, we create massive dependencies between these layers and connections, which makes it hard for future users to combine the two approaches (for example, backpropagating a classification error along with a sparse reconstruction error, although plasticCloneConns may still make this possible). The alternative, however, is the current way we're trying to do deep networks in PetaVision, with the complicated set of connections needed to achieve backprop, which also makes it very hard to use cuDNN's gradient calculations.

From here, I see two options. We can separate the backprop architecture from the core part of PV (and create lots of dependencies between specific backprop layers and connections), or we can try to extend the current implementation to do backprop (with very complicated networks that would probably get even more complicated once we incorporate the GPU). What do you guys think?

@slundqui (Contributor, Author) commented Jan 5, 2016

After talking to Will, it seems that a third option is to incorporate backprop functionality into all of PetaVision (where the default is to not do any backprop). I'm thinking this could actually work. All plastic connections would have the option of learning either off of the activity (what is being done now) or off of the gradient (backprop). This way, the user has the option of building a backprop network using the layers and connections we have now. Furthermore, each layer would incorporate the gradient calculations as well. This would also mean we would adopt the backwardsUpdateState that Dylan backed.
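A minimal sketch of what that connection-level choice might look like, using a toy fully connected update and hypothetical names (learnFromGradient, updateWeights):

```cpp
// Sketch only: a plastic connection that learns off of post-synaptic activity
// by default, or off of the backpropagated gradient when a flag is set.
#include <cstddef>

struct PlasticConn {
    bool learnFromGradient = false;  // default: current activity-based learning, no backprop

    // Toy fully connected update; w is laid out as [numPost][numPre].
    // Learning-rate sign conventions are assumed to be folded into `rate`
    // and the gradient buffer.
    void updateWeights(float *w, int numPre, int numPost,
                       const float *preActivity,
                       const float *postActivity,  // used by the current rule
                       const float *postGradient,  // used in backprop mode
                       float rate) {
        for (int j = 0; j < numPost; ++j) {
            const float post = learnFromGradient ? postGradient[j] : postActivity[j];
            for (int i = 0; i < numPre; ++i) {
                w[j * numPre + i] += rate * preActivity[i] * post;
            }
        }
    }
};
```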

I like this option, and unless someone says otherwise or brings up any caveats to this plan, I will start fleshing out the details and start implementing.

@dpaiton (Contributor) commented Jan 5, 2016

This sounds like the best option. Weights can implement a forward learning rule and/or a backward learning rule. The conns receive forward activity and backward gradients. Love it! I can't wait for you to get this done :-D Let me know what I can do to help.
