We treat the timeline as a physical dimension. Of course it is one in DL model training, LOL!
Requirements
- conda environment (conda>=4.12.0)
- python>=3.8
Run the installation script
bash install.sh
After cloning this repository, create a folder named data and download the TempLAMA dataset into it.
There are a total of 50,310 samples after combining the data.
Number of data samples that have changed over time (unique queries) = 5,823
More statistics are available in the Google Colab.
Use the following scripts:
- The first command converts the dataset to a diff format (it contains only the entries where the subject changed over the years)
- The second command converts the dataset to CSV, in question, answer format
python src/dataset_prep/combine_restructure_data.py --input ./data/train.json,./data/test.json,./data/val.json
python src/dataset_prep/finetuning_data_jsonl_to_csv.py --dataset_path ./data/restructured_data.json --year 2010-2018
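As a rough illustration of what a "diff format" could look like, the sketch below keeps a record only when its answer set differs from the previous year's answer for the same query. This is one plausible reading of the step, not the repository's actual code: the function name keep_changes is hypothetical, and the record fields (query, date, answer with a name key) are assumed from the public TempLAMA layout.

```python
from collections import defaultdict


def keep_changes(records):
    """Keep, per query, only the years where the answer set differs from the previous kept year.

    Assumption: each record is a dict with "query", "date" (a year string)
    and "answer" (a list of dicts that carry a "name").
    """
    by_query = defaultdict(list)
    for rec in records:
        by_query[rec["query"]].append(rec)

    kept = []
    for recs in by_query.values():
        recs.sort(key=lambda r: int(r["date"]))  # chronological order per query
        previous = None
        for rec in recs:
            answers = frozenset(a["name"] for a in rec["answer"])
            if answers != previous:  # first year, or the answer changed
                kept.append(rec)
                previous = answers
    return kept


# Toy records, made up for illustration only.
records = [
    {"query": "Tom Brady plays for _X_.", "date": "2010", "answer": [{"name": "New England Patriots"}]},
    {"query": "Tom Brady plays for _X_.", "date": "2011", "answer": [{"name": "New England Patriots"}]},
    {"query": "Tom Brady plays for _X_.", "date": "2020", "answer": [{"name": "Tampa Bay Buccaneers"}]},
]
diff = keep_changes(records)
print([r["date"] for r in diff])
```

In this toy input the 2011 record is dropped because it repeats the 2010 answer.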
Note: You can add more data by separating the data paths in the --input parameter with commas.
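A minimal sketch of how a comma-separated --input value can be turned into a list of paths with argparse. The helper name parse_input_paths is hypothetical and the real script may parse its arguments differently; only the --input flag itself comes from the command above.

```python
import argparse


def parse_input_paths(argv=None):
    """Parse --input and split it on commas (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True,
                        help="comma-separated list of JSON files")
    args = parser.parse_args(argv)
    # Drop empty pieces so a trailing comma does not produce an empty path.
    return [p.strip() for p in args.input.split(",") if p.strip()]


paths = parse_input_paths(
    ["--input", "./data/train.json,./data/test.json,./data/val.json"]
)
print(paths)
```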
The list of paraphrased relations for the TempLAMA dataset can be seen in the file utils/templama_relation_rephrase.jsonl.
Number of samples in train: 9149
Number of samples in validation: 3000
For zeroshot there are two cases:
- Entirely unseen data: information newly added in that particular year
- Previously seen data where the year in the query has changed
python src/dataset_prep/valdata_zeroshot.py --dataset-path ./data/restructured_data.json --val-year 2019-2020
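The two zeroshot cases can be sketched as a simple split of the validation-year records, depending on whether the same query already appeared in the finetuning years. This is an illustrative assumption about how "seen" is decided, not the logic of valdata_zeroshot.py; the function name split_zeroshot and the record fields (query, date) are hypothetical.

```python
def split_zeroshot(records, train_years, val_years):
    """Split validation-year records into (unseen, seen_new_year).

    Assumption: a query counts as "previously seen" if it occurs in any
    record whose year falls inside train_years.
    """
    seen_queries = {r["query"] for r in records if int(r["date"]) in train_years}

    unseen, seen_new_year = [], []
    for r in records:
        if int(r["date"]) not in val_years:
            continue  # only validation years are of interest here
        if r["query"] in seen_queries:
            seen_new_year.append(r)  # case 2: known query, new year
        else:
            unseen.append(r)         # case 1: newly added information
    return unseen, seen_new_year


# Toy records, made up for illustration only.
records = [
    {"query": "A plays for _X_.", "date": "2018"},
    {"query": "A plays for _X_.", "date": "2019"},
    {"query": "B plays for _X_.", "date": "2020"},
]
unseen, seen_new_year = split_zeroshot(records, range(2010, 2019), range(2019, 2021))
print(len(unseen), len(seen_new_year))
```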
For oneshot there is one case
- Sampled from the finetuning dataset - 1000 samples
python src/dataset_prep/valdata_oneshot.py --dataset-path ./data/ft-2010-2018.csv --val-sample 1000
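Sampling a fixed number of rows from the finetuning CSV can be sketched with the standard library as below. The column names question and answer are assumed from the "question, answer format" mentioned earlier, and sample_rows is a hypothetical helper, not the script's actual implementation; the sketch reads from an in-memory string so that it runs without the dataset.

```python
import csv
import io
import random


def sample_rows(csv_text, n, seed=0):
    """Return up to n rows drawn without replacement from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(rows, min(n, len(rows)))


# Toy CSV content, made up for illustration only.
csv_text = "question,answer\nq1,a1\nq2,a2\nq3,a3\n"
sampled = sample_rows(csv_text, 2)
print(len(sampled))
```

With the real file, the text would come from reading ./data/ft-2010-2018.csv and n would be 1000.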
- For the T5 model: We replace the _X_ mask provided in the dataset with <extra_id_0>, which is the default mask token for the T5 model.
- For the GPT2 model:
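For the T5 case above, the mask replacement amounts to a plain string substitution. The sketch below shows only that step; the helper name to_t5_input is hypothetical, and how targets are formatted is not covered here.

```python
def to_t5_input(query: str) -> str:
    """Replace the TempLAMA _X_ placeholder with T5's first sentinel token."""
    return query.replace("_X_", "<extra_id_0>")


# Toy query, made up for illustration only.
print(to_t5_input("Tom Brady plays for _X_."))
# Tom Brady plays for <extra_id_0>.
```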
Finetune the model using the following command
python src/seq2seq.py --model t5-base --train ./data/ft-2010-2018.csv --val ./data/ft-val-2010-2018.csv --cuda 3
python -m tensorboard.main --logdir ./logs --port 9000
python -m tensorboard.main --logdir ./results --port 8000