# Competition: Quora Question Pairs
https://www.kaggle.com/c/quora-question-pairs
Key points:
- In real life it is unacceptable to run classification on all pairs with a classifier as heavy as BERT. Instead, I would use BERT to produce good encodings for the questions (see the sketch after this list).
- The train dataset has selection bias, and the private set seems to have the same issue. I prefer not to use these leakage features to improve the score, but such a train dataset leads to an overfitting problem, which I have not yet resolved.
- To make the BERT <CLS> encodings more suitable for the final task, I fine-tune them with metric learning on triplets. Given a good triplet generator, this procedure also helps with the selection bias problem.
- Token embeddings are also important for training a good model. To preserve information from the non-<CLS> tokens, I use a classifier head with an extra input: the sum of all token embeddings of one question attended to the other question.
- I didn't use any ensembles; it was more interesting for me to experiment with embedding learning using a single heavy model.
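As a rough illustration of the first point: encode each question once, then compare pairs with cheap vector operations instead of running BERT per pair. A minimal sketch using the Hugging Face `transformers` API (not part of this repo; the model name and helper are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice; the repo uses a locally downloaded BERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode(questions):
    """Return the <CLS> embedding for each question (one BERT pass per question)."""
    batch = tokenizer(questions, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)
    return hidden[:, 0]                        # <CLS> token embedding

emb = encode(["How do I learn Python?", "What is the best way to learn Python?"])
similarity = torch.cosine_similarity(emb[0], emb[1], dim=0)  # cheap pairwise score
```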
## Requirements
- check `requirements.txt` and install missing packages
- download a pre-trained BERT model and place it in this folder; specify the chosen model in `config.yml` (`path_to_pretrained_model`; see the example config sketch after this list). For example, for a medium uncased model that would be:

  ```
  wget -q https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
  unzip uncased_L-12_H-768_A-12.zip
  ```

- place your CSV files in the input folder (`path_to_data` in `models/bert_finetuning/config.yml`)
- specify the batch size in `models/bert_finetuning/config.yml` and `models/bert_finetuning/config_triplets.yml`
- install apex or change `apex_mixed_precision` to `False`
- debug with the option `toy=True`; get a real submission with `toy=False`
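For orientation, the relevant config fields might be laid out roughly like this. The field names come from the text above; the exact structure and values of the repo's `config.yml` are an assumption:

```yaml
# Hypothetical layout -- the real config.yml in this repo may differ.
path_to_pretrained_model: ./uncased_L-12_H-768_A-12  # downloaded BERT checkpoint
path_to_data: ./input                                # folder with the CSV files
batch_size: 32                                       # tune to fit your GPU memory
apex_mixed_precision: false                          # requires NVIDIA apex if true
toy: true                                            # small subset for debugging
```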
## Run the whole training process and get a submission file

```
python submission.py
```
Metric learning on triplets (a sketch of this phase follows the list):
- split the train pair dataset with stratification on `train.csv` data (validation part = 0.1)
- build positive and negative connection graphs on the train set
- collect buckets of duplicated questions
- detect all negative connections from each bucket to other buckets
- generate triplets: for each pair of questions in a bucket, 3 negative examples
- encode the anchor, positive and negative questions of a triplet with BERT separately
- train with Triplet Loss on the 3 encoded <CLS> tokens
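A condensed sketch of this phase under my own naming (`build_buckets`, `generate_triplets` and `encoder` are hypothetical, not the repo's actual code): buckets are connected components of the positive-pair graph, and training applies a margin-based triplet loss to the three <CLS> encodings:

```python
import networkx as nx
import torch
import torch.nn as nn

def build_buckets(positive_pairs):
    """Connected components of the positive-pair graph = buckets of duplicates."""
    graph = nx.Graph()
    graph.add_edges_from(positive_pairs)  # edges: (question_id, question_id)
    return [list(c) for c in nx.connected_components(graph)]

def generate_triplets(buckets, negatives_of, n_neg=3):
    """For each (anchor, positive) pair inside a bucket, take n_neg questions
    that have a negative connection from this bucket to other buckets."""
    triplets = []
    for bucket in buckets:
        for anchor in bucket:
            for positive in bucket:
                if anchor == positive:
                    continue
                for negative in negatives_of(bucket)[:n_neg]:
                    triplets.append((anchor, positive, negative))
    return triplets

# Training step: triplet loss over the three separately encoded <CLS> embeddings.
triplet_loss = nn.TripletMarginLoss(margin=1.0)

def training_step(encoder, anchor_batch, positive_batch, negative_batch):
    # encoder: a BERT wrapper returning one <CLS> embedding per question
    a, p, n = encoder(anchor_batch), encoder(positive_batch), encoder(negative_batch)
    return triplet_loss(a, p, n)
```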
Pair classification (a sketch of the head follows the list):
- split the train pair dataset the same way as for metric learning to reduce data leakage
- load the metric-learned BERT encoder
- encode the left and right questions with BERT separately
- pool all encoded tokens of one question attended to the encoded <CLS> token of the other question
- concatenate the attended embedding with the <CLS> embedding for each question
- take the elementwise product of the two question embeddings
- do binary classification with a 2-layer classifier head
- freeze the BERT layers for the first epoch; use different learning rates for the head and the BERT layers
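A minimal sketch of such a head, assuming hidden size 768 and batch-first token tensors; all names here are mine, not the repo's. The attention pooling weights one question's tokens by their similarity to the other question's <CLS> encoding:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairClassifierHead(nn.Module):
    """Hypothetical 2-layer head over two separately encoded questions."""

    def __init__(self, hidden=768):
        super().__init__()
        self.fc1 = nn.Linear(2 * hidden, hidden)  # input: elementwise product of 2H vectors
        self.fc2 = nn.Linear(hidden, 1)

    @staticmethod
    def attend(tokens, other_cls):
        # tokens: (B, T, H) token embeddings of one question
        # other_cls: (B, H) <CLS> embedding of the other question
        scores = torch.einsum("bth,bh->bt", tokens, other_cls)  # per-token similarity
        weights = F.softmax(scores, dim=1)
        return torch.einsum("bt,bth->bh", weights, tokens)      # weighted sum of tokens

    def forward(self, left_tokens, right_tokens):
        left_cls, right_cls = left_tokens[:, 0], right_tokens[:, 0]
        left = torch.cat([self.attend(left_tokens, right_cls), left_cls], dim=-1)
        right = torch.cat([self.attend(right_tokens, left_cls), right_cls], dim=-1)
        return self.fc2(torch.relu(self.fc1(left * right)))     # logit for "duplicate"

# Freezing and per-group learning rates, as in the last list item.
# `bert` here is a stand-in module; in the pipeline it is the metric-learned encoder.
head = PairClassifierHead()
bert = nn.Linear(768, 768)          # placeholder for the real BERT encoder
for p in bert.parameters():
    p.requires_grad = False         # frozen for the first epoch, unfrozen afterwards
optimizer = torch.optim.Adam([
    {"params": head.parameters(), "lr": 1e-3},  # higher LR for the fresh head
    {"params": bert.parameters(), "lr": 2e-5},  # lower LR for pretrained layers
])
```

Keeping a higher learning rate on the fresh head while the pretrained layers move slowly is a common recipe for stable fine-tuning.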
Kaggle link: https://www.kaggle.com/mfside
- implemented in `submission.py`: 2 epochs of metric learning + 1 epoch of pair classification with frozen BERT + 3 epochs of pair classification with unfrozen BERT. Private = 0.38092, Public = 0.37776
- fine-tuned one more epoch with a decreased learning rate. Private = 0.37830, Public = 0.37406
- fine-tuned another epoch with a decreased learning rate and a weighted loss (trying to increase precision, since the share of duplicate questions in the test set is lower; see the sketch after this list). Private = 0.36893, Public = 0.36465
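The weighted loss can be expressed with a positive-class weight below 1, which penalizes false positives relatively more and pushes precision up. A generic sketch; the actual weight used in the experiment is not stated in this repo:

```python
import torch
import torch.nn as nn

# pos_weight < 1 down-weights duplicates, matching their lower share in the test set.
# The value 0.5 is illustrative only.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([0.5]))

logits = torch.randn(8, 1)                    # stand-in for classifier head outputs
labels = torch.randint(0, 2, (8, 1)).float()  # stand-in for duplicate labels
loss = criterion(logits, labels)
```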
Main unresolved problem: overfitting.
The BERT model learns well, but over many epochs an overfitting effect shows up on both the validation and test sets.
Ways to resolve:
- detailed data analysis: study the influence of selection bias and explore ways to select representative validation datasets
- make metric learning more efficient so the model generalizes better (hard to train, but could be optimized with N-Pair Loss; see the sketch after this list)
- change the sampling from the train dataset for pair classification (class imbalance, test class imbalance); the simple method with a weighted loss already gives better results (in the last experiment)
- hyperparameter tuning and reduced model capacity (DistilBERT, fewer linear layers, smaller hidden states); I did not have enough time for these experiments, and they are hard to run without good GPUs
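For reference, one common formulation of N-Pair Loss (Sohn, 2016), which reuses every other pair in the batch as negatives instead of sampling one negative per triplet. A generic sketch, not code from this repo:

```python
import torch
import torch.nn.functional as F

def n_pair_loss(anchors, positives):
    """Multi-class N-pair loss: each anchor treats the other anchors' positives
    in the batch as negatives, so one batch of N pairs yields N*(N-1) negative
    comparisons instead of one per triplet."""
    # anchors, positives: (N, D) embeddings, e.g. BERT <CLS> encodings
    logits = anchors @ positives.t()  # (N, N) similarity matrix
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)  # diagonal entries are the positives

# Illustrative usage with random embeddings:
loss = n_pair_loss(torch.randn(16, 768), torch.randn(16, 768))
```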
Make experiments cleaner:
- detect the influence of each phase and each architecture part, and take the best variant
- Private = 0.33768, Public = 0.33507
- Private = 0.36800, Public = 0.36714