allies_llmt_data

Data for the ALLIES Lifelong Learning Machine Translation evaluation campaign

Description

This dataset contains the documents from Europarl v10 and NewsCommentary v15. Each document has been renamed to the following format: corpus-date-time1-time2-suffix

corpus is either 'ep' for Europarl or 'nc' for NewsCommentary.
date is the document date in the formnat YYYYMMDD.
time1 and time2 (both optional) are integer values making sure that each document has a unique date-time1-time2.
suffix is the remaining of the originial document name.

For Europarl, the renaming simply consists in merging year, month and day values from original name and add a 'time1' tag to avoid multiple files with same timestamps. For NewsCommentary, the timestamps correspond to the 'article:published_time' extracted from the original html files.

Recreating the data

Recreate the tgz file then extract it:

For EN-FR:

cat en-fr.tgz.part* > en-fr.tgz
tar xzvf en-fr.tgz

For EN-DE:

cat en-de.tgz.part* > en-de.tgz
tar xzvf en-de.tgz

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

allies_llmt_data

Description

Recreating the data

For EN-FR:

For EN-DE:

Files

README.md

Latest commit

History

README.md

File metadata and controls

allies_llmt_data

Description

Recreating the data

For EN-FR:

For EN-DE: