Commit

Update quickstart documentation
Waino committed Dec 2, 2024
1 parent 1182e6d commit 6092641
Showing 3 changed files with 547 additions and 193 deletions.
250 changes: 250 additions & 0 deletions docs/source/old_quickstart.md


# Old quickstart

**Note: this is an older version of the quickstart documentation. It describes a manual procedure that may be difficult to adapt to your use case.**

MAMMOTH is specifically designed for distributed training of modular systems in multi-GPU SLURM environments.

In the example below, we will show you how to configure Mammoth to train a machine translation model with language-specific encoders and decoders.

### Step 0: Install mammoth

```bash
pip install mammoth-nlp
```

Check out the [installation guide](install) for cluster-specific installation instructions.
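
To verify the installation, you can check that pip sees the package; the importable module name is assumed here to be `mammoth` (adjust if it differs in your version):

```bash
pip show mammoth-nlp
# The module name `mammoth` below is an assumption, not something this guide confirms
python3 -c "import mammoth; print('MAMMOTH import OK')"
```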

### Step 1: Prepare the data

Before running the training, we will download data for the chosen language pairs and create a SentencePiece tokenizer for the model.

**Refer to the data preparation [tutorial](prepare_data) for more details.**

In the following steps, we assume that you already have an encoded dataset of `*.sp` files for the `europarl` dataset and the languages `cs` and `bg`. Your data directory `europarl_data/encoded` should therefore contain 8 files following the format `{train/valid}.{bg/cs}-en.{side}.sp`, where `{side}` is the source or target language of the pair (e.g. `train.bg-en.bg.sp` and `train.bg-en.en.sp`). If you use other datasets, please update the paths in the configurations below.
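
As a quick sanity check, the following loop (a minimal sketch, assuming the directory layout described above) reports which of the 8 expected files are present:

```bash
# Check that all expected encoded Europarl files exist
for split in train valid; do
  for pair in bg cs; do
    for side in "$pair" en; do
      f="europarl_data/encoded/${split}.${pair}-en.${side}.sp"
      if [ -f "$f" ]; then echo "found:   $f"; else echo "MISSING: $f"; fi
    done
  done
done
```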

### Step 2: Configurations

Mammoth uses configuration files to build a new Transformer model and to define your training settings, such as which modules are trained on data from which languages.

Below are a few examples of training configurations that will work out of the box in a one-node, two-GPU environment.

<details>
<summary>Task-specific encoders and decoders</summary>

In this example, we create a model with encoders and decoders that are **unshared**, i.e. specific to each of the specified languages. This is defined by `enc_sharing_group` and `dec_sharing_group`.
Note that the configs expect you to have access to 2 GPUs.

```yaml
# TRAINING CONFIG
world_size: 2
gpu_ranks: [0, 1]

batch_type: tokens
batch_size: 4096

# INPUT/OUTPUT VOCABULARY CONFIG

src_vocab:
  bg: vocab/opusTC.mul.vocab.onmt
  cs: vocab/opusTC.mul.vocab.onmt
  en: vocab/opusTC.mul.vocab.onmt
tgt_vocab:
  cs: vocab/opusTC.mul.vocab.onmt
  en: vocab/opusTC.mul.vocab.onmt

# MODEL CONFIG

model_dim: 512

tasks:
  train_bg-en:
    src_tgt: bg-en
    enc_sharing_group: [bg]
    dec_sharing_group: [en]
    node_gpu: "0:0"
    path_src: europarl_data/encoded/train.bg-en.bg.sp
    path_tgt: europarl_data/encoded/train.bg-en.en.sp
  train_cs-en:
    src_tgt: cs-en
    enc_sharing_group: [cs]
    dec_sharing_group: [en]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.cs.sp
    path_tgt: europarl_data/encoded/train.cs-en.en.sp
  train_en-cs:
    src_tgt: en-cs
    enc_sharing_group: [en]
    dec_sharing_group: [cs]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.en.sp
    path_tgt: europarl_data/encoded/train.cs-en.cs.sp

enc_layers: [6]
dec_layers: [6]
```
</details>
<details>
<summary>Arbitrarily shared layers in encoders and task-specific decoders</summary>

The training and vocab config is the same as in the previous example. The difference is that each encoder now has two groups of layers, a task-specific one (`bg`, `cs`, or `en`) and one shared across all tasks (`all`), as specified by `enc_sharing_group` together with `enc_layers`.

```yaml
# TRAINING CONFIG
world_size: 2
gpu_ranks: [0, 1]

batch_type: tokens
batch_size: 4096

# INPUT/OUTPUT VOCABULARY CONFIG

src_vocab:
  bg: vocab/opusTC.mul.vocab.onmt
  cs: vocab/opusTC.mul.vocab.onmt
  en: vocab/opusTC.mul.vocab.onmt
tgt_vocab:
  cs: vocab/opusTC.mul.vocab.onmt
  en: vocab/opusTC.mul.vocab.onmt

# MODEL CONFIG

model_dim: 512

tasks:
  train_bg-en:
    src_tgt: bg-en
    enc_sharing_group: [bg, all]
    dec_sharing_group: [en]
    node_gpu: "0:0"
    path_src: europarl_data/encoded/train.bg-en.bg.sp
    path_tgt: europarl_data/encoded/train.bg-en.en.sp
  train_cs-en:
    src_tgt: cs-en
    enc_sharing_group: [cs, all]
    dec_sharing_group: [en]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.cs.sp
    path_tgt: europarl_data/encoded/train.cs-en.en.sp
  train_en-cs:
    src_tgt: en-cs
    enc_sharing_group: [en, all]
    dec_sharing_group: [cs]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.en.sp
    path_tgt: europarl_data/encoded/train.cs-en.cs.sp

enc_layers: [4, 4]
dec_layers: [4]
```
</details>
<details>
<summary>Non-modular multilingual system</summary>

In this example, we share the input/output vocabulary over all languages. Hence, we define a vocabulary for the pseudo-language `all`, which we use in the definition of the model.

```yaml
# TRAINING CONFIG
world_size: 2
gpu_ranks: [0, 1]
batch_type: tokens
batch_size: 4096
# INPUT/OUTPUT VOCABULARY CONFIG
src_vocab:
  all: vocab/opusTC.mul.vocab.onmt
tgt_vocab:
  all: vocab/opusTC.mul.vocab.onmt
# MODEL CONFIG
model_dim: 512
tasks:
  train_bg-en:
    src_tgt: all-all
    enc_sharing_group: [shared_enc]
    dec_sharing_group: [shared_dec]
    node_gpu: "0:0"
    path_src: europarl_data/encoded/train.bg-en.bg.sp
    path_tgt: europarl_data/encoded/train.bg-en.en.sp
  train_cs-en:
    src_tgt: all-all
    enc_sharing_group: [shared_enc]
    dec_sharing_group: [shared_dec]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.cs.sp
    path_tgt: europarl_data/encoded/train.cs-en.en.sp
  train_en-cs:
    src_tgt: all-all
    enc_sharing_group: [shared_enc]
    dec_sharing_group: [shared_dec]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.en.sp
    path_tgt: europarl_data/encoded/train.cs-en.cs.sp
enc_layers: [6]
dec_layers: [6]
```
</details>

**To proceed, copy-paste one of these configurations into a new file named `my_config.yaml`.**
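
Before training, it can be worth confirming that the file is valid YAML, for example with a one-liner like the following (assuming PyYAML is available in your environment):

```bash
# Parses my_config.yaml and fails loudly if the YAML is malformed
python3 -c "import yaml; yaml.safe_load(open('my_config.yaml')); print('my_config.yaml parses OK')"
```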

For further information, check out the documentation of all parameters in **[train.py](options/train)**.

For more complex scenarios, we recommend generating your configurations with our [automatic configuration generation tool](config_config).

### Step 3: Start training

You can start training on a single machine by simply running the Python script `train.py`, optionally specifying which GPUs to use.
Note that the example configs above assume two GPUs available on one machine.

```shell
CUDA_VISIBLE_DEVICES=0,1 python3 train.py -config my_config.yaml -save_model output_dir -tensorboard -tensorboard_log_dir log_dir
```
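
On a SLURM cluster, the same run can be wrapped in a batch script. The sketch below is a minimal example; the resource requests, module loads, and environment activation are placeholders that you will need to adapt to your cluster:

```bash
#!/bin/bash
#SBATCH --job-name=mammoth-quickstart
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=24:00:00

# Activate your environment here, e.g.
# source venv/bin/activate

python3 train.py -config my_config.yaml -save_model output_dir \
    -tensorboard -tensorboard_log_dir log_dir
```

Save it e.g. as `train_quickstart.sh` and submit it with `sbatch train_quickstart.sh`.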

Note that when running `train.py`, you can pass any of the parameters from [train.py](options/train) as command-line arguments. In the case of duplicate arguments, the command-line parameters override the ones found in your config file.
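
For example, the following run keeps everything from `my_config.yaml` but overrides the batch size on the command line:

```bash
# -batch_size given here takes precedence over batch_size: 4096 in the config
CUDA_VISIBLE_DEVICES=0,1 python3 train.py -config my_config.yaml \
    -save_model output_dir -batch_size 2048
```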



### Step 4: Translate

Now that you have successfully trained your multilingual machine translation model using Mammoth, it's time to put it to use for translation.

```bash
python3 -u $MAMMOTH/translate.py \
--config "my_config.yaml" \
--model "$model_checkpoint" \
--task_id "train_$src_lang-$tgt_lang" \
--src "$path_to_src_language/$lang_pair.$src_lang.sp" \
--output "$out_path/$src_lang-$tgt_lang.hyp.sp" \
--gpu 0 --shard_size 0 \
--batch_size 512
```
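
For example, to translate the Bulgarian-to-English validation set with the `train_bg-en` task from the first configuration, the placeholder variables could be set as follows (the checkpoint name is hypothetical; use whichever checkpoint file `train.py` actually saved under `output_dir`):

```bash
src_lang=bg
tgt_lang=en
lang_pair=valid.bg-en                        # yields valid.bg-en.bg.sp as the source file
path_to_src_language=europarl_data/encoded
out_path=translations
model_checkpoint=output_dir_step_50000.pt    # hypothetical name; substitute your own checkpoint
MAMMOTH=/path/to/mammoth                     # location of your MAMMOTH checkout
mkdir -p "$out_path"
```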

The options of the command above are as follows.

- Provide the necessary details using the following options:
  - Configuration file: `--config "my_config.yaml"`
  - Model checkpoint: `--model "$model_checkpoint"`
  - Translation task: `--task_id "train_$src_lang-$tgt_lang"`

- Point to the source language file for translation: `--src "$path_to_src_language/$lang_pair.$src_lang.sp"`
- Define the path for saving the translated output: `--output "$out_path/$src_lang-$tgt_lang.hyp.sp"`
- Adjust GPU and batch size settings based on your requirements: `--gpu 0 --shard_size 0 --batch_size 512`
- We provide a model checkpoint trained using the encoder-sharing scheme described in [this tutorial](examples/sharing_schemes.md):
```bash
wget https://mammoth-share.a3s.fi/encoder-shared-models.tar.gz
```

Congratulations! You've successfully translated text using your Mammoth model. Adjust the parameters as needed for your specific translation tasks.

### Further reading
A complete example of training on the Europarl dataset is available at [MAMMOTH101](examples/train_mammoth_101.md), and a complete example for configuring different sharing schemes is available at [MAMMOTH sharing schemes](examples/sharing_schemes.md).