Implementation of Large-scale, Language-agnostic Discourse Classification of Tweets During COVID-19 - Gencoglu O. (2020)
This repository provides the full implementation in Python 3.7 (a Twitter developer account is required for fetching the data). It utilizes Language-agnostic BERT Sentence Embeddings (LaBSE) to analyze 28 million COVID-19-related tweets in 109 languages.
Follow steps 1-5 below.
See directory_info in the data directory for the expected files.
1.1 - Download the 30+ million tweet IDs and hydrate them into timestamps and tweet texts (requires a Twitter developer account).
Jan 17,tweet_text_string
Jan 27,tweet_text_string
...
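Hydration specifics depend on your tooling; below is a minimal sketch assuming tweepy 3.x, a hypothetical `tweet_ids.txt` file (one tweet ID per line), and your own API credentials:

```python
# Minimal hydration sketch (tweepy 3.x assumed; file names are illustrative).
import csv

import tweepy

# Fill in your own Twitter developer credentials.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

with open("tweet_ids.txt") as f:
    tweet_ids = [line.strip() for line in f if line.strip()]

with open("tweets.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    # statuses_lookup accepts at most 100 IDs per request
    for i in range(0, len(tweet_ids), 100):
        for status in api.statuses_lookup(tweet_ids[i:i + 100], tweet_mode="extended"):
            writer.writerow([status.created_at.strftime("%b %d"), status.full_text])
```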
Once tweets.csv is in the example format above, preprocess it by running:
python3.7 preprocess.py
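The exact cleaning steps live in preprocess.py; purely as an illustration of typical tweet preprocessing (not necessarily this repository's pipeline), one might strip URLs, mentions, and redundant whitespace:

```python
# Illustrative tweet cleaning only; the authoritative steps are in preprocess.py.
import re

import pandas as pd

def clean(text: str) -> str:
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"@\w+", "", text)          # drop user mentions
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

df = pd.read_csv("data/tweets.csv", names=["timestamp", "text"])  # hypothetical path
df["text"] = df["text"].astype(str).map(clean)
df.to_csv("data/tweets_clean.csv", index=False, header=False)
```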
1.2 - Download Intent and Questions datasets
- Intent Dataset Link
- Questions Dataset Link
2.1 - BERT
python3.7 extract_BERT_embeddings.py -m intent
python3.7 extract_BERT_embeddings.py -m questions
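For reference, mBERT sentence representations can be obtained with Hugging Face transformers and the `bert-base-multilingual-uncased` model named in configs.py; mean pooling over token states is one common choice and not necessarily the script's exact strategy:

```python
# Sketch of mBERT embedding extraction; the pooling strategy is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = AutoModel.from_pretrained("bert-base-multilingual-uncased")
model.eval()

sentences = ["How does the virus spread?", "Wie verbreitet sich das Virus?"]
batch = tokenizer(sentences, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state       # (batch, seq_len, 768)

mask = batch["attention_mask"].unsqueeze(-1)        # ignore padding tokens
embeddings = (hidden * mask).sum(1) / mask.sum(1)   # mean-pooled sentence vectors
```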
2.2 - Language-agnostic BERT Sentence Embeddings (LaBSE)
python3.7 extract_LaBSE_embeddings.py -m tweets
python3.7 extract_LaBSE_embeddings.py -m intent
python3.7 extract_LaBSE_embeddings.py -m questions
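configs.py points these scripts at the TF Hub model; as a self-contained alternative sketch, the sentence-transformers port of LaBSE yields comparable embeddings:

```python
# Alternative LaBSE sketch via sentence-transformers (the scripts themselves use TF Hub).
import numpy as np
from numpy.linalg import norm
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")
sentences = ["The vaccine is safe.", "La vacuna es segura."]
e = model.encode(sentences)  # shape (2, 768)

# Cross-lingual cosine similarity of the two sentences.
print(float(np.dot(e[0], e[1]) / (norm(e[0]) * norm(e[1]))))
```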
Relevant configurations are defined in configs.py, e.g.:
model_url = 'https://tfhub.dev/google/LaBSE/1'
max_seq_length = 128
bert_model = 'bert-base-multilingual-uncased'
3 - Hyperparameter Optimization
python3.7 train.py -m hyper_opt -c "model_identifier" -e "embeddings_identifier"
4 - Training
python3.7 train.py -m train -c "model_identifier"
5 - Inference
python3.7 inference.py -c "model_identifier"
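train.py and inference.py drive the actual pipeline; the general idea, a classifier fitted on precomputed sentence embeddings, can be sketched as follows (the model family, file names, and search space here are illustrative assumptions):

```python
# Illustration only: hyperparameter search + training on precomputed embeddings.
# The actual model family and search space are defined in train.py / configs.py.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X = np.load("data/intent_embeddings.npy")  # hypothetical embedding file
y = np.load("data/intent_labels.npy")      # hypothetical label file

# Step 3 analogue: search over regularization strength.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)

# Step 4/5 analogue: refit with the best setting, then predict on new embeddings.
clf = LogisticRegression(max_iter=1000, **search.best_params_).fit(X, y)
```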
Source directory tree:
├── configs.py
├── extract_BERT_embeddings.py
├── extract_LaBSE_embeddings.py
├── inference.py
├── LaBSE.py
├── preprocess.py
├── train.py
├── umap_vis.py
└── utils.py
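umap_vis.py produces the repository's visualizations; a minimal sketch of the underlying technique with umap-learn (the embedding file name is hypothetical):

```python
# Minimal UMAP projection sketch; umap_vis.py defines the actual plots.
import matplotlib.pyplot as plt
import numpy as np
import umap

embeddings = np.load("data/tweet_embeddings.npy")  # hypothetical embedding file
projection = umap.UMAP(n_components=2, metric="cosine").fit_transform(embeddings)

plt.scatter(projection[:, 0], projection[:, 1], s=1)
plt.title("LaBSE tweet embeddings (UMAP)")
plt.show()
```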
Cite as:

@article{gencoglu2020large,
  title   = {Large-scale, Language-agnostic Discourse Classification of Tweets During COVID-19},
  author  = {Gencoglu, Oguzhan},
  journal = {Machine Learning and Knowledge Extraction},
  volume  = {2},
  number  = {4},
  pages   = {603--616},
  year    = {2020},
  doi     = {10.3390/make2040032}
}

or
Gencoglu, Oguzhan. "Large-scale, Language-agnostic Discourse Classification of Tweets During COVID-19." Machine Learning and Knowledge Extraction. 2020; 2(4):603-616.