.gitignore
Assignment-02.pdf
README.md
clean_corpus.py
cut_corpus.py
download_data.py
get_data.sh
preprocess-ted-bilingual.sh
preprocess-ted-flores-bilingual.sh
preprocess-ted-multilingual.sh
reproduce_results.txt
requirements.txt
score.py
ted_reader.py
traineval_az_en.sh
traineval_aztr_en.sh
traineval_be_en.sh
traineval_beru_en.sh
traineval_en_az.sh
traineval_en_aztr.sh
traineval_en_be.sh
traineval_en_beru.sh
traineval_flores_az_en.sh
traineval_flores_be_en.sh
traineval_flores_en_az.sh
traineval_flores_en_be.sh

README.md

Assignment-02 (Multilingual Translation)

Getting this assignment to run takes a lot of debugging, both of the code and of the provided scripts.

Debugging

Environment

Follow the steps on the assignment webpage. Make sure the installed NumPy version is below 1.24.
Install COMET and Fairseq v0.10.2 correctly and export the installation directory. This is an important step.
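
A quick way to confirm the NumPy constraint before launching any script is a small version check. The sketch below is only illustrative: the helper names parse_version and numpy_is_compatible are hypothetical and not part of the assignment code, and it assumes version strings that start with major.minor.

```python
import re
from importlib import metadata

NUMPY_LIMIT = (1, 24)  # the README asks for a NumPy version below 1.24


def parse_version(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '1.23.5'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def numpy_is_compatible(version: str) -> bool:
    """Return True when the given NumPy version is strictly below 1.24."""
    return parse_version(version) < NUMPY_LIMIT


if __name__ == "__main__":
    try:
        installed = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        print("numpy is not installed")
    else:
        status = "OK" if numpy_is_compatible(installed) else "too new, install numpy<1.24"
        print(f"numpy {installed}: {status}")
```

Running the file as a script reports the installed version; the two helpers can also be imported into a setup check of your own.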

Data

The file download_data.py does not work correctly, so skip it.
Run bash get_data.sh in the terminal to download the data from the main repo together with the preprocessing file.
For each pair in the experiment, set src_lang to the low-resource language and trg_lang to the high-resource language.

Bash Files

Once the data is in place, all the bash files need to be fixed before they work properly.

Change the language names to ISO 639-1 codes, both in the code and in the file names.
Fix the source and target language paths. Specifically, remove every .orig and ted- from all files.
Make sure to provide the correct directory for the COMET module.
Each experiment should now run correctly without errors.
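
The path fixes described above can be sketched as a small helper. This is a minimal illustration, not the code used in the repo: the function name fix_path and the example path are hypothetical, and the sketch assumes the original scripts use three-letter ISO 639-3 codes (aze, bel, tur, rus, eng), which this README does not state.

```python
import re

# Assumption: the unfixed scripts use three-letter ISO 639-3 codes.
ISO_639_3_TO_1 = {"aze": "az", "bel": "be", "tur": "tr", "rus": "ru", "eng": "en"}

# Match a code only when it is not embedded in a longer word
# (so "tur" in "turkey" or "bel" in "labels" is left alone).
_CODE_PATTERN = re.compile(
    r"(?<![A-Za-z])(" + "|".join(ISO_639_3_TO_1) + r")(?![A-Za-z])"
)


def fix_path(path: str) -> str:
    """Drop '.orig' and 'ted-' and switch language codes to ISO 639-1."""
    cleaned = path.replace(".orig", "").replace("ted-", "")
    return _CODE_PATTERN.sub(lambda m: ISO_639_3_TO_1[m.group(1)], cleaned)


if __name__ == "__main__":
    # Hypothetical path, used only to show the transformation.
    print(fix_path("data/ted-train.orig.aze-eng"))
```

For the hypothetical input above, the helper returns data/train.az-en. Applying the same substitutions by hand in each bash file works just as well; the helper only makes the rule explicit.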

Reproducing Results

Bilingual Baselines

Running the experiments produces the following results.

Pair	az - en	en - az	be - en	en - be
BLEU	3.01	20.21	4.67	11.46
COMET	-1.4159	-1.2864	-1.3589	-1.3506

As the results show, scores are much higher when en is the source language: 20.21 vs. 3.01 for az and 11.46 vs. 4.67 for be. The reason remains to be investigated.

Note: The reported results are on the test sets. For extended results, see reproduce_results.txt.

Multilingual Training

First, run pip install --upgrade lxml.

Pair	az - en	en - az	be - en	en - be
BLEU	14.31	5.92	18.90	9.40
COMET	-0.2684	-0.0691	-0.3816	-0.4756

Note: The az data is enriched with tr, and the be data is enriched with ru.

Finetuning Pretrained Multilingual Models

In this experiment, we fine-tune the pretrained FLORES-101 model on each of our language pairs.

Pair	az - en	en - az	be - en	en - be
BLEU	17.56	7.25	22.82	14.84
COMET	-0.0318	0.0368	-0.0674	0.0593

Once again, the complete results are reported in reproduce_results.txt.
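
To compare the three settings side by side, the reported test-set scores can be put into a small script. The sketch below only restates the numbers from the tables in this README; the names PAIRS, BLEU, and bleu_gain are hypothetical helpers introduced for illustration.

```python
# Test-set BLEU scores copied from the three tables in this README.
PAIRS = ["az-en", "en-az", "be-en", "en-be"]
BLEU = {
    "bilingual": [3.01, 20.21, 4.67, 11.46],
    "multilingual": [14.31, 5.92, 18.90, 9.40],
    "finetuned": [17.56, 7.25, 22.82, 14.84],
}


def bleu_gain(baseline: str, system: str) -> dict:
    """Per-pair BLEU difference (system minus baseline), rounded to 2 decimals."""
    return {
        pair: round(new - old, 2)
        for pair, old, new in zip(PAIRS, BLEU[baseline], BLEU[system])
    }


if __name__ == "__main__":
    for system in ("multilingual", "finetuned"):
        print(system, bleu_gain("bilingual", system))
```

Relative to the bilingual baseline, fine-tuning gains 14.55 BLEU on az-en and 18.15 on be-en, while en-az drops by 12.96. This is consistent with the unusually high bilingual en-az score flagged for investigation earlier.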

Final note

If you come across this repository and would like to run the assignment, you are likely to hit issues that you don't know how to solve. When that happens, feel free to reach out instead of spending valuable time searching the internet for fixes.
