Understanding the Models
The first model built is based on the Google Translate API. The code for converting the requests into Spanish can be found here in the file google_translate.ipynb. The model is built using the googletrans library.
Note: Only the 4.0.0.rc1 version works because of the ever-changing Google API rules. The command for the installation of this particular version can be seen in the code and the documentation.
- The library contains the Translator class, which calls the Google Translate API.
- The translate method within the class takes the text to be translated, the source language, and the target language as parameters.
- It returns a result object whose text attribute holds the translated text as a string.
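The call described above can be sketched as a small helper. This is a minimal sketch, not the notebook's code: the function name translate_to_spanish and its translator parameter are hypothetical, and the install command in the comment assumes the release candidate is published on PyPI as 4.0.0rc1.

```python
def translate_to_spanish(text, translator=None, src="en"):
    """Translate text to Spanish with a googletrans-style translator."""
    if translator is None:
        # Imported lazily so the helper can also be used with a stand-in object.
        # Assumes googletrans is installed, e.g. with
        #   pip install googletrans==4.0.0rc1
        from googletrans import Translator
        translator = Translator()
    # translate() takes the text plus the source and destination language codes
    result = translator.translate(text, src=src, dest="es")
    # The translated string is stored in the result's text attribute
    return result.text
```

Calling translate_to_spanish("Good morning") without a translator argument contacts the Google Translate service, so it needs network access. Passing in a translator object makes the helper easy to exercise offline.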
The second model built is transformer-based. The code for converting the requests into Spanish can be found here in the file english_to_spanish.ipynb. The model is mainly built using the tensorflow library. The sequence-to-sequence Transformer consists of a TransformerEncoder and a TransformerDecoder chained together. To make the model aware of word order, a PositionalEmbedding layer is also used.
Note: The complete list of requirements can be found in the libraries required section in this part of the Wiki.
The code loosely follows a Keras tutorial that can be found here. The basic steps that constitute the model are as follows:
- Text Vectorization
- TransformerEncoder Layer Implementation
- TransformerDecoder Layer Implementation
- PositionalEmbedding Layer Implementation
Using the TextVectorization layer from Keras, we implement two TextVectorization layers (one for English and one for Spanish). These layers transform each original string into a sequence of integers, where each integer is the index of a word in a vocabulary.
The English layer will use the default string standardization (strip punctuation characters) and splitting scheme (split on whitespace). In contrast, the Spanish layer will use a custom standardization, where we add the character "¿" to the set of punctuation characters to be stripped.
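To make the idea concrete, here is a plain-Python illustration of what such a vectorization step does. It is not the Keras layer itself: the helper names (standardize, build_vocab, vectorize) are hypothetical, the vocabulary is ordered by first appearance rather than by frequency, and reserving index 0 for padding and index 1 for unknown words is an assumption modelled on the Keras convention.

```python
import string

def standardize(text, extra_strip_chars=""):
    """Lowercase the text and drop punctuation plus any extra characters."""
    strip_chars = set(string.punctuation + extra_strip_chars)
    return "".join(ch for ch in text.lower() if ch not in strip_chars)

def build_vocab(texts, extra_strip_chars=""):
    """Map each word to an integer index; 0 = padding, 1 = unknown word."""
    vocab = {"": 0, "[UNK]": 1}
    for text in texts:
        for word in standardize(text, extra_strip_chars).split():
            vocab.setdefault(word, len(vocab))
    return vocab

def vectorize(text, vocab, sequence_length, extra_strip_chars=""):
    """Turn one string into a fixed-length list of word indices."""
    words = standardize(text, extra_strip_chars).split()
    ids = [vocab.get(word, 1) for word in words][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# The Spanish side additionally strips the inverted question mark
spanish_texts = ["¿Dónde está la biblioteca?", "La biblioteca está cerrada."]
spanish_vocab = build_vocab(spanish_texts, extra_strip_chars="¿")
spanish_ids = vectorize("¿Está cerrada la puerta?", spanish_vocab, 6, "¿")
print(spanish_ids)  # [3, 6, 4, 1, 0, 0]
```

The word "puerta" never appeared while building the vocabulary, so it maps to the unknown-word index 1, and the two trailing zeros are padding up to the requested length.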
The source sequence is passed to the TransformerEncoder, which produces a new representation of the sequence.
The representation created in the previous layer is now passed to the TransformerDecoder layer together with the target sequence so far (target words 0 to N). The TransformerDecoder will then seek to predict the following words in the target sequence (N+1 and beyond).
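The relationship between "target words 0 to N" and "word N+1" can be illustrated by shifting the target sequence by one position. This is a sketch under assumptions: the helper name make_decoder_pairs is hypothetical, and the [start] and [end] markers are a common convention that the notebook may or may not use in exactly this form.

```python
def make_decoder_pairs(target_tokens):
    """Split a target sequence into decoder inputs and next-token labels."""
    decoder_inputs = target_tokens[:-1]  # what the decoder is shown
    labels = target_tokens[1:]           # what it should predict, shifted by one
    return decoder_inputs, labels

target = ["[start]", "buenos", "días", "[end]"]
decoder_inputs, labels = make_decoder_pairs(target)

# At each step the decoder may use the tokens so far to predict the next one
for step, label in enumerate(labels):
    print(decoder_inputs[: step + 1], "->", label)
```

Each printed line pairs the target prefix available at that step with the single token the decoder is expected to produce next.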
The PositionalEmbedding layer makes the model aware of the word order in a particular sequence. This is necessary because attention on its own treats the tokens as an unordered set. In addition, the TransformerDecoder sees the entire target sequence simultaneously, so causal masking is applied to make sure that it only uses information from target tokens 0 to N when predicting token N+1 (otherwise, it could use information from the future, which would result in a model that cannot be used at inference time).
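The restriction that token N+1 may only depend on tokens 0 to N is commonly expressed as a lower-triangular mask over attention positions. The sketch below builds such a mask in plain Python; the function name causal_mask is hypothetical, and the notebook may construct its mask with TensorFlow operations instead.

```python
def causal_mask(sequence_length):
    """Return a 0/1 matrix in which row i may attend to columns 0..i only."""
    return [
        [1 if j <= i else 0 for j in range(sequence_length)]
        for i in range(sequence_length)
    ]

mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

A 1 means the position is visible and a 0 means it is hidden, so the first target position sees only itself while the last one sees the whole prefix.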
To understand the Transformer model, it is essential to understand the concept and mechanism of attention. The Transformer architecture follows an encoder-decoder structure but does not rely on recurrence or convolutions to generate an output.
In a nutshell, the task of the encoder, on the left half of the Transformer architecture, is to map an input sequence to a sequence of continuous representations, which is then fed into a decoder.
The decoder, on the right half of the architecture, receives the output of the encoder together with the decoder output at the previous time step, to generate an output sequence.
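Since attention is the core mechanism in both halves, a small worked example may help. The following is a minimal plain-Python sketch of scaled dot-product attention, assuming no learned projection matrices and a single head; the function names are hypothetical, and a real model would use a library layer rather than nested lists.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs

# Self-attention: the same three 2-dimensional vectors act as queries, keys, and values
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = scaled_dot_product_attention(x, x, x)
```

Each row of context is a mixture of all three input vectors, weighted by how similar they are to the corresponding query. This is what lets every position draw on the whole sequence without recurrence.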
The Transformer model runs as follows:
- Each word forming an input sequence is transformed into a d-dimensional embedding vector.
- Each embedding vector representing an input word is augmented by adding to it (element-wise) a positional encoding vector of the same length, which introduces positional information into the input.
- The augmented embedding vectors are fed into the encoder block, which consists of two sublayers: multi-head self-attention and a fully connected feed-forward network. Because the encoder attends to all words in the input sequence, regardless of whether they precede or follow the word under consideration, the Transformer encoder is bidirectional.
- The decoder receives as input its own predicted output word at time step t - 1.
- The input to the decoder is also augmented by positional encoding, in the same manner as this is done on the encoder side.
- The augmented decoder input is fed into the three sublayers that make up the decoder block: masked multi-head self-attention, encoder-decoder attention, and a fully connected feed-forward network. Masking is applied in the first sublayer to stop the decoder from attending to succeeding words. At the second sublayer, the decoder also receives the output of the encoder, which allows the decoder to attend to all of the words in the input sequence.
- The output of the decoder finally passes through a fully connected layer, followed by a softmax layer, to generate a prediction for the next word of the output sequence.
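The element-wise addition of positional information in the second step can be sketched as follows. This example assumes the fixed sinusoidal encoding from the original Transformer paper; the notebook's PositionalEmbedding layer may compute its position vectors differently (for example, by learning them), and the function names here are hypothetical.

```python
import math

def positional_encoding(sequence_length, d_model):
    """Sinusoidal position vectors: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(sequence_length):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one wavelength
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the position vector to each embedding vector, element by element."""
    table = positional_encoding(len(embeddings), len(embeddings[0]))
    return [
        [e + p for e, p in zip(emb_row, pos_row)]
        for emb_row, pos_row in zip(embeddings, table)
    ]

# Three all-zero 4-dimensional "embeddings" make the added position signal visible
embeddings = [[0.0] * 4 for _ in range(3)]
augmented = add_positions(embeddings)
print(augmented[0])  # [0.0, 1.0, 0.0, 1.0]
```

Because the position vector has the same length as the embedding, the sum keeps the d-dimensional shape that the encoder and decoder blocks expect.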
For a detailed understanding, refer to this blog by Stefania Cristina.