example not running #1
Hi Soren,
Apparently you have a problem in casting. Try to run it with this flag:
Thanks. Your code seems to run fine without blocks. I do have floatX=float32, but isn't that necessary when you use the GPU? What is the license on the code? I'm planning to include a CTC example, using your code, in the theano lasagne library. Best regards, Søren
Happy to hear that you want to use it in lasagne. Good luck,
Good to hear. I'll probably see if I can reproduce Alex Graves' handwritten digit recognition results. I have to admit that I haven't looked closely at the implementation yet, but I'll do that in the coming days. Out of curiosity, have you tested your private repo code on some "real" datasets? Would you be willing to put up the unclean code? I could then clean it up in a PR to this repo. I'll of course attribute you, Rakesh Var, Shawn Tan and the other people who contributed. Your help is appreciated :)
@skaae, the new changes to the code were made by Phil Brakel.
Great! Thanks
Hi, thanks for sharing. I started to work my way through the CTC code and came across some differences between the formulation in http://www.machinelearning.org/proceedings/icml2006/047_Connectionist_Tempor.pdf (your reference) and in Alex Graves' book: http://www.cs.toronto.edu/~graves/preprint.pdf

The differences are in the initial states of the backward pass. In the paper, eq. 9, they are specified as [...], but in the book eq. 7.13 specifies them as 1. From the definition of the beta values I believe that 1 is the correct value? I haven't fully understood how you define the recursion with a matrix, but given that you calculate the backward pass as the reverse of the forward pass, I don't believe that they handle the initial states differently?
Additionally, equation 10 in the paper uses y_t while eq. 7.15 in the book uses y_{t+1}?
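For reference, here are the two backward-pass initializations being compared, as I read them from the cited paper (eq. 9) and book (eq. 7.13); please double-check against the originals:

% ICML 2006 paper, eq. 9 (backward initialization):
\beta_T(|l'|) = y^T_b, \quad \beta_T(|l'|-1) = y^T_{l_{|l|}}, \quad \beta_T(s) = 0 \;\; \forall s < |l'|-1
% Book preprint, eq. 7.13 (backward initialization):
\beta_T(|l'|) = \beta_T(|l'|-1) = 1, \quad \beta_T(u) = 0 \;\; \forall u < |l'|-1

The y_t versus y_{t+1} difference between eq. 10 and eq. 7.15 seems to come from the same choice: the paper folds the local emission y^t_{l'_s} into beta_t(s), while the book's beta does not include it.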
@skaae, my pleasure 👍
Do you train with the pseudo cost or the cost function? Also, from eq. 15 in http://www.cs.toronto.edu/~graves/icml_2006.pdf it seems that you need both alpha and beta?
I see. There are new changes in the recent version that I don't know much about.
I'm working on some tests for the forward and backward matrices, if you are interested. I just need to figure out the initial states for beta, which I'm fairly sure should be 1 and not the y probs.
Hey @skaae,

More tests are always nice, and if you find bugs please let us know! First of all, not all the functions in my version of the code might be correct anymore because I just focused on the higher level log domain ones. The tests are messy as well.

We train using the pseudo cost function because for some reason the gradient of the normal cost function is unstable. The pseudo cost simply computes the CTC gradient directly without using automated differentiation. To turn this gradient into a cost that can be used for automated differentiation through the rest of your model, I either use the cross entropy between the output of your model and the CTC targets (i.e., label probabilities after summing over all the paths that are compatible with the target sequence), or the sum of the element-wise product of the gradient with respect to the softmax inputs and the pre-softmax activation of your model. The latter variant is more stable because it skips the softmax gradient and prevents the computation of [...].

For ease of implementation, I simply computed beta in exactly the same way as alpha (except for some mask related issues). This is not the same as in some formulations of the algorithm, where beta(t) doesn't include a multiplication with the local softmax output y_hat(t). This is why in the thesis the likelihood is defined as sum_u(alpha(t, u)beta(t, u)) while in the paper it's sum_u(alpha(t, u)beta(t, u)/y_hat(t, u)).

Hopefully this clarifies things a bit.

Cheers
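To illustrate the second variant (the element-wise product trick) described above, here is a minimal Theano sketch; it assumes the CTC gradient with respect to the softmax inputs is already available as a tensor ctc_grad, and the variable names are illustrative rather than this repository's API:

import theano.tensor as T
from theano import gradient

# linear_out: pre-softmax activations of the model, shape (time, batch, classes + 1)
# ctc_grad:   CTC gradient w.r.t. the softmax inputs, same shape, computed
#             directly rather than by Theano's automatic differentiation

# Treat the precomputed gradient as a constant so Theano does not try to
# differentiate through its computation.
ctc_grad_const = gradient.disconnected_grad(ctc_grad)

# The gradient of this scalar w.r.t. linear_out is exactly ctc_grad, so
# T.grad(pseudo_cost, params) propagates the CTC gradient through the rest
# of the model while skipping the softmax gradient.
pseudo_cost = T.sum(ctc_grad_const * linear_out)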
Thanks for the reply. I have a few more questions. "This is not the same as in some formulations of the algorithm where beta(t) doesn't include a multiplication with the local softmax output y_hat(t)." Do you then refer to the different initial states in the book and in the paper? I see that equation 7.26 in the book and equation 14 in the paper differ only by the division with y^t_{l_s}? I don't follow your description of how to use the [...]. From what you write I should use [...]. In the docs for [...]:
Does that mean that y_hat could be in the log domain, or that it should be in the log domain? Secondly, I have no clue what you mean with this line :) "...or the sum of the element-wise product of the gradient with respect to the softmax inputs and the pre-softmax activation of your model." Could you give an example? Say I have the following:
How would I then get the gradients for the parameters in the model?
I tried to write an example using lasagne. It's mostly copied from the ctc_test file. I try to do what you described here:
I'm not sure I correctly understood how to combine the CTC gradients and the gradients from the rest of the network.
import lasagne
from lasagne.layers import RecurrentLayer, InputLayer, DenseLayer,\
    NonlinearityLayer, ReshapeLayer, EmbeddingLayer
import theano
import theano.tensor as T
import numpy as np
import ctc_cost  # the ctc_cost module from this repository
floatX = theano.config.floatX  # dtype used for the numpy arrays below
num_batch, input_seq_len = 10, 45
num_classes = 10
target_seq_len = 5
Y_hat = np.asarray(np.random.normal(
    0, 1, (input_seq_len, num_batch, num_classes + 1)), dtype=floatX)
Y = np.zeros((target_seq_len, num_batch), dtype='int64')
Y[25:, :] = 1
Y_hat_mask = np.ones((input_seq_len, num_batch), dtype=floatX)
Y_hat_mask[-5:] = 0
# default blank symbol is the highest class index (num_classes = 10 in this case)
Y_mask = np.asarray(np.ones_like(Y), dtype=floatX)
X = np.random.random(
    (num_batch, input_seq_len)).astype('int32')
input_mask = T.matrix('features_mask')
y_hat_mask = input_mask
y = T.lmatrix('phonemes')
y_mask = T.matrix('phonemes_mask')
x = T.imatrix() # batchsize, input_seq_len
# setup Lasagne Recurrent network
# The output from the network is:
# a) output_lin_ctc is the activation before softmax, shape (input_seq_len, batch_size, num_classes + 1)
# b) output_softmax is the output after softmax, shape (batch_size, input_seq_len, num_classes + 1)
l_inp = InputLayer((num_batch, input_seq_len))
l_emb = EmbeddingLayer(l_inp, input_size=num_classes, output_size=15)
l_rnn = RecurrentLayer(l_emb, num_units=10)
l_rnn_shp = ReshapeLayer(l_rnn, (num_batch*input_seq_len, 10))
l_out = DenseLayer(l_rnn_shp, num_units=num_classes+1,
                   nonlinearity=lasagne.nonlinearities.identity)  # + blank
l_out_shp = ReshapeLayer(l_out, (num_batch, input_seq_len, num_classes+1))
# dimshuffle to shape format (input_seq_len, batch_size, num_classes + 1)
l_out_shp_ctc = lasagne.layers.DimshuffleLayer(l_out_shp, (1, 0, 2))
l_out_softmax = NonlinearityLayer(
    l_out, nonlinearity=lasagne.nonlinearities.softmax)
l_out_softmax_shp = ReshapeLayer(
    l_out_softmax, (num_batch, input_seq_len, num_classes+1))
output_lin_ctc = lasagne.layers.get_output(l_out_shp_ctc, x)
output_softmax = lasagne.layers.get_output(l_out_softmax_shp, x)
all_params = lasagne.layers.get_all_params(l_out_shp)
###############
# GRADIENTS #
###############
# the CTC cross entropy between y and linear output network
pseudo_cost = ctc_cost.pseudo_cost(
    y, output_lin_ctc, y_mask, y_hat_mask,
    skip_softmax=True)
# calculate the gradients of the CTC cost wrt. the linear output of the network
pseudo_cost_sum = pseudo_cost.sum()
pseudo_cost_grad = T.grad(pseudo_cost_sum, output_lin_ctc)
# multiply CTC gradients with RNN output activation before softmax
output_to_grad = T.sum(pseudo_cost_grad * output_lin_ctc)
# calculate the gradients
all_grads = T.grad(output_to_grad, all_params)
updates = lasagne.updates.rmsprop(all_grads, all_params, learning_rate=0.0001)
train = theano.function([x, y, y_hat_mask, y_mask],
                        [output_lin_ctc, output_softmax, pseudo_cost_sum],
                        updates=updates)
test_val = train(X, Y, Y_hat_mask, Y_mask)
print test_val[0].shape
print test_val[1].shape
# Create test dataset
num_samples = 1000
np.random.seed(1234)
# create simple dataset of format
# input [5,5,5,5,5,2,2,2,2,2,3,3,3,3,3,....,1,1,1,1]
# targets [5,2,3,...,1]
# etc...
input_lst, output_lst = [], []
for i in range(num_samples):
    this_input = []
    this_output = []
    prev_class = -1
    for j in range(target_seq_len):
        # draw a class that differs from the previous one
        this_class = np.random.randint(num_classes)
        while prev_class == this_class:
            this_class = np.random.randint(num_classes)
        prev_class = this_class
        this_len = np.random.randint(1, 10)
        this_input += [this_class]*this_len
        this_output += [this_class]
    # pad the input sequence with its last symbol up to input_seq_len
    this_input += (input_seq_len - len(this_input))*[this_input[-1]]
    input_lst.append(this_input)
    output_lst.append(this_output)
input_arr = np.concatenate([input_lst]).astype('int32')
y_arr = np.concatenate([output_lst]).astype('int64')
y_mask_arr = np.ones((target_seq_len, num_batch), dtype='float32')
input_mask_arr = np.ones((input_seq_len, num_batch), dtype='float32')
for nn in range(200):
    for i in range(num_samples//num_batch):
        idx = range(i*num_batch, (i+1)*num_batch)
        _, _, cost = train(
            input_arr[idx],
            np.transpose(y_arr[idx]),
            input_mask_arr,
            y_mask_arr)
        print cost
Hey Søren,

While the pseudo cost is not the same as the CTC cost, it should have the same gradient and already does the multiplication with the outputs internally, so you don't have to compute the gradient with respect to the outputs separately and can just treat it as you would with any other cost. You can use the actual CTC cost function for performance monitoring.

When you use the skip_softmax option, the function expects the linear activations. I see you implemented this correctly. Internally, it still computes the softmax, but it makes sure Theano doesn't try to compute its gradient. The skip_softmax variant should be far more reliable because it can deal with very large input values, and I'm guessing it might be a bit faster too, but I didn't test that.

I'll try to answer your earlier questions when I find more time.

Best,
Thanks. I think I'm getting there. I changed these lines and printed the cost:
pseudo_cost_grad = T.grad(pseudo_cost.mean(), all_params)
true_cost = ctc_cost.cost(y, output_softmax.dimshuffle(1, 0, 2), y_mask, y_hat_mask)
cost = T.mean(true_cost)
updates = lasagne.updates.rmsprop(pseudo_cost_grad, all_params, learning_rate=0.0001)
The cost seems to go down on my test data.
Hello Søren,

Did you get the CTC code working with lasagne (recurrent)? Could you share that code? It would save me a lot of time ;-)

Cheers,
I did set it up, but I didn't get around to testing it on TIMIT. I can share my code on Monday; please email me if I forget :) On what dataset do you plan to use it?
Cool, thank you!
Awesome, I'm very interested in the results. Do you have a script for the preprocessing?
Well, he does not completely specify his preprocessing in his paper. He uses HTK for his "fourier-based filter banks", which is explained here: http://www.ee.columbia.edu/ln/LabROSA/doc/HTKBook21/node54.html For a first try, I am using the complete frequency range from 200 Hz to 8 kHz. For the other parameters I just use the standard values (25 ms, 10 ms, 2).
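Not HTK itself, but a rough sketch of comparable log filter-bank features in Python using the python_speech_features package (the package choice, the 40 filters and the delta computation are my assumptions, not necessarily what Richi91 or HTK produce):

import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import logfbank, delta

# 25 ms window, 10 ms step, band limited to roughly 200 Hz - 8 kHz
rate, signal = wav.read('utterance.wav')
fbank = logfbank(signal, samplerate=rate, winlen=0.025, winstep=0.01,
                 nfilt=40, lowfreq=200, highfreq=8000)
d1 = delta(fbank, 2)   # first-order deltas over a window of +/- 2 frames
d2 = delta(d1, 2)      # second-order deltas
features = np.hstack([fbank, d1, d2]).astype('float32')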
I put up the code here: https://github.com/skaae/Lasagne-CTC. I'm very interested in your progress :)
Hello @pbrakel, I am still/again working with your CTC code, but I cannot get it working correctly. During training, I get both positive and negative values for the cost. This shouldn't be possible, should it? Training my net with a cross-entropy error (at each timestep) worked fine, so the problem must be the CTC cost. Kind regards
Hey @Richi91, I just wrote an explanation of what [...]. If you show me an example of your code I can look at it. Perhaps these couple of lines will be helpful as well: [...]
Hey @pbrakel, thank you for your answer, it helped me to understand the need for pseudo_cost. Here is a snippet of my code (using lasagne):
@Richi91 hi, I am running into the same problem as you... The CTC loss is negative and the trained network outputs all blanks. Did you figure out these problems?
@Michlong hi, sorry for the late reply. Don't forget gradient clipping, especially with a high LR ;-)
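For completeness, here is a minimal sketch of one way to add gradient clipping in Lasagne, reusing all_grads and all_params from the example earlier in this thread; the max_norm value is an arbitrary choice:

import lasagne

# Rescale the gradients so their joint norm never exceeds max_norm before
# handing them to the optimizer; this guards against exploding CTC gradients.
clipped_grads = lasagne.updates.total_norm_constraint(all_grads, max_norm=5.0)
updates = lasagne.updates.rmsprop(clipped_grads, all_params,
                                  learning_rate=0.0001)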
@Richi91 Are you willing to share your code?
I no longer have an implementation of an RNN with CTC in Lasagne. Actually, all you need to do is use the cost function like this: [...] Then write a theano function for the training loop with cost_train and a function for validation (without updates) with cost_monitor. y: targets (e.g. words or phonemes; this is not frame-wise). Best regards
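A minimal sketch of what that setup might look like, reusing the variable names from the Lasagne example earlier in this thread (x, y, y_mask, y_hat_mask, output_lin_ctc, output_softmax, all_params); treat it as an illustration rather than the repository's documented usage:

import theano
import theano.tensor as T
import lasagne
import ctc_cost  # CTC cost module from this repository

# Differentiable surrogate used for the parameter updates; with
# skip_softmax=True it expects the linear, pre-softmax network outputs.
cost_train = ctc_cost.pseudo_cost(y, output_lin_ctc, y_mask, y_hat_mask,
                                  skip_softmax=True).mean()
grads = T.grad(cost_train, all_params)
updates = lasagne.updates.rmsprop(grads, all_params, learning_rate=0.0001)
train_fn = theano.function([x, y, y_hat_mask, y_mask], cost_train,
                           updates=updates)

# True CTC negative log-likelihood, used only for monitoring; it expects the
# softmax outputs in (input_seq_len, batch, classes + 1) order.
cost_monitor = ctc_cost.cost(y, output_softmax.dimshuffle(1, 0, 2),
                             y_mask, y_hat_mask).mean()
monitor_fn = theano.function([x, y, y_hat_mask, y_mask], cost_monitor)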
I'm trying to run your ctc example but I get the following error:

Which I think I can work around by setting allow_input_downcast=True in line 224 in blocks/algorithms/__init__.py. But then I get another error:

Can you add a few notes explaining what S, T, B, D, L, C and F are? Maybe you could also explain the input format for apply(cls, y, y_hat, y_mask, y_hat_mask, scale='log_scale')? Is it correct that:

y: one-hot encoded labels, LABEL_LENGTH x BATCH_SIZE
y_hat: predictions, INPUT_SEQUENCE_LENGTH x BATCH_SIZE x {num_classes + blank}
y_mask: binary mask? shape? I assume that one is used for included sequences?
y_hat_mask: used to mask if the input is not INPUT_SEQUENCE_LENGTH?

Where INPUT_SEQUENCE_LENGTH is the length of the input sequences (30 for the example data) and LABEL_LENGTH is the label sequence length for each target. Is LABEL_LENGTH padded if the true label lengths vary?
-Søren