Implementation of our 2nd place solution to the Twitter RecSys Challenge 2021. The goal of the competition was to predict user engagement with 1 billion tweets selected by Twitter. An additional challenge was the test phase: models were evaluated in a highly constrained environment, with a single CPU core, no GPU, and a 24-hour limit for all predictions, which leaves about 6 ms per tweet prediction.
The challenge focuses on the real-world task of tweet engagement prediction in a dynamic environment, considering four engagement types: Like, Retweet, Quote, and Reply.
- Register and download the training and validation sets from the competition website.
- Set up the configuration file `config.yaml` (a sample is sketched below):
  - `working_dir` - path where all preprocessed files will be saved
  - `recsys_data` - path to the directory with the uncompressed training data parts
  - `validation_part` - path to the uncompressed validation part
  - `max_n_parts` - maximum number of training parts used for training; limit it to speed up training
  - `max_n_parts_in_memory` - number of training parts loaded into memory at the same time; limiting it caps RAM usage
  - `authors_similarity_top_N` - maximum number of users considered similar to the current one
  - `authors_similarity_threshold` - user similarity threshold
  - `validation_percentage` - percentage of the validation set used for testing; the rest is used for fine-tuning
  - `num_validation_chunks` - number of validation chunks
  - `sketch_width` - sketch width for the tweet sketch
  - `sketch_depth` - sketch depth for the tweet sketch
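For orientation, a filled-in `config.yaml` could look like the following; every path and value here is an illustrative placeholder, not a setting used by the solution:

```yaml
working_dir: /data/recsys2021/work           # preprocessed files are saved here
recsys_data: /data/recsys2021/train_parts    # uncompressed training data parts
validation_part: /data/recsys2021/valid/part-00000
max_n_parts: 20               # cap on training parts; lower for faster runs
max_n_parts_in_memory: 2      # parts held in RAM at once
authors_similarity_top_N: 100
authors_similarity_threshold: 0.5
validation_percentage: 50     # the rest of the validation set goes to fine-tuning
num_validation_chunks: 10
sketch_width: 128
sketch_depth: 10
```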
- Finetune DistilBERT and precompute token sketches:
python bert_finetuning.py
BERT checkpoints will be saved periodically, and you can run the sketch computation on any chosen checkpoint. To prepare token sketches from a checkpoint trained with the above script, replace ./distilbert_checkpoints/checkpoint-1000 with the path to your most recent checkpoint:
python prepare_token_embeddings.py --checkpoint-path ./distilbert_checkpoints/checkpoint-1000
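For orientation, the core of such an extraction could look like the sketch below, using the Hugging Face transformers API; the output file name is illustrative, and the actual script may extract different (e.g. contextual) embeddings:

```python
import numpy as np
from transformers import AutoModel

# A minimal sketch of the embedding-extraction step (the real logic lives in
# prepare_token_embeddings.py): load the fine-tuned DistilBERT checkpoint and
# dump its token-embedding matrix for the EMDE step below.
checkpoint = "./distilbert_checkpoints/checkpoint-1000"
model = AutoModel.from_pretrained(checkpoint)

# DistilBERT keeps its input token embeddings here; shape (vocab_size, hidden_dim).
token_embeddings = model.embeddings.word_embeddings.weight.detach().cpu().numpy()
np.save("token_embeddings.npy", token_embeddings)  # hypothetical output path
```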
- Apply EMDE to compute sketches:
python emde.py
Steps 2 and 3 can be run simultaneously.
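EMDE (Efficient Manifold Density Estimator) turns embeddings into count-sketch-like vectors: `sketch_depth` independent partitionings of the embedding space, each with `sketch_width` buckets, and each item lights up one bucket per partitioning. Below is a minimal illustration of that idea using plain random-hyperplane LSH; the actual `emde.py`, which uses its own partitioning scheme, is the reference:

```python
import numpy as np

def emde_sketch(embeddings: np.ndarray, sketch_depth: int, sketch_width: int,
                seed: int = 0) -> np.ndarray:
    """Map each embedding to one bucket per depth via random-hyperplane LSH,
    then one-hot encode the buckets into (sketch_depth, sketch_width) sketches."""
    rng = np.random.default_rng(seed)
    n, dim = embeddings.shape
    n_planes = int(np.ceil(np.log2(sketch_width)))  # sign bits needed per bucket index
    sketches = np.zeros((n, sketch_depth, sketch_width), dtype=np.float32)
    for d in range(sketch_depth):
        planes = rng.standard_normal((dim, n_planes))
        bits = (embeddings @ planes > 0).astype(int)                 # (n, n_planes)
        buckets = bits @ (1 << np.arange(n_planes)) % sketch_width   # bucket per item
        sketches[np.arange(n), d, buckets] = 1.0
    return sketches

# A tweet-level sketch can then be the sum of its tokens' sketches (additive,
# count-sketch style); dummy token embeddings stand in for the real ones here.
token_embs = np.random.randn(30, 768)
tweet_sketch = emde_sketch(token_embs, sketch_depth=10, sketch_width=128).sum(axis=0)
```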
- Preprocess the dataset and compute user interactions (the idea is sketched below):
python interactions_extraction.py
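The exact features are defined in `interactions_extraction.py`; conceptually, this step aggregates past engagements per (user, author) pair. A toy, purely illustrative pandas sketch of such an aggregation (column names are hypothetical):

```python
import pandas as pd

# Illustrative toy frame; the real data parts hold one engagement record per row
# with (at least) the engaging user, the tweet author, and engagement timestamps.
records = pd.DataFrame({
    "engaging_user_id": ["u1", "u1", "u2", "u1"],
    "author_id":        ["a1", "a2", "a1", "a1"],
    "like_timestamp":   [1623000000, None, 1623000100, 1623000200],
})

# Count past engagements per (user, author) pair; aggregates like these can feed
# the user-similarity features controlled by authors_similarity_* in config.yaml.
interactions = (
    records.dropna(subset=["like_timestamp"])
           .groupby(["engaging_user_id", "author_id"])
           .size()
           .rename("n_likes")
           .reset_index()
)
print(interactions)
```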
- Train the model:
python train.py