Results from PAN CLEF19 Test Datasets

| Dataset | lang | type | gender |
|---|---|---|---|
| 1 | es | 0.8611 | 0.7556 |
| 1 | en | 0.9280 | 0.7652 |
| 2 | es | 0.8839 | 0.7261 |
| 2 | es | 0.9227 | 0.7583 |

PAN Author Identification (Bots and Gender Profiling)
Identify the author of a text on the basis of their stylometry and writing style.
Use the package manager pip to install the dependencies listed in the requirements file.
pip install -r requirments.txt
python train.py -i 'trainingdatapath'
python train.py -i '/input/train/data/'
python test.py -i 'testdatapath' -o 'outputpath'
python test.py -i '/input/test/data/' -o '/output/'
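The two commands above take an input path via -i, and the test script additionally takes an output path via -o. As a rough illustration of how such flags could be parsed, here is a minimal sketch using Python's standard argparse module. The helper name build_parser and the attribute names input_path and output_path are hypothetical; the actual train.py and test.py may handle their arguments differently.

```python
import argparse


def build_parser(needs_output: bool) -> argparse.ArgumentParser:
    """Build a parser with -i (and optionally -o), mirroring the commands above.

    Hypothetical helper: the real scripts may be structured differently.
    """
    parser = argparse.ArgumentParser(description="PAN bots and gender profiling")
    parser.add_argument("-i", dest="input_path", required=True,
                        help="directory containing the input data")
    if needs_output:
        parser.add_argument("-o", dest="output_path", required=True,
                            help="directory where predictions are written")
    return parser


# Simulate the two example invocations shown above.
train_args = build_parser(needs_output=False).parse_args(
    ["-i", "/input/train/data/"])
test_args = build_parser(needs_output=True).parse_args(
    ["-i", "/input/test/data/", "-o", "/output/"])

print(train_args.input_path)
print(test_args.input_path, test_args.output_path)
```

Marking both flags as required makes the script fail early with a usage message when a path is missing, instead of failing later while reading data.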
1. emoji_count -> Count all kinds of emojis
2. face_smiling -> Count 😀😃😄😁😆😅🤣😂🙂🙃😉😊😇
3. face_affection -> Count 🥰😍🤩😘😗☺😚😙
4. face_tongue -> Count 😋😛😜🤪😝🤑
5. face_hand -> Count 🤗🤭🤫🤔
6. face_neutral_skeptical -> Count 🤐🤨😐😑😶😏😒🙄😬🤥
7. face_concerned -> Count 😕😟🙁☹😮😯😲😳🥺😦😧😨😰😥😢😭😱😖😣😞
8. monkey_face -> Count 🙈🙉🙊
9. emotions -> Count 💋💌💘💝💖💗💓💞💕💟❣💔❤🧡💛💚💙💜🤎🖤
10. url_count -> Count all kinds of links/URLs
11. space_count -> Spaces count
12. capital_count -> Capital letter count
13. text_length -> Total length of message
14. curly_brackets_count -> Count { }
15. round_brackets_count -> Count ( )
16. underscore_count -> Count _
17. question_mark_count -> Count ?
18. exclamation_mark_count -> Count !
19. dollar_mark_count -> Count $
20. ampersand_mark_count -> Count &
21. hash_count -> Count #
22. tag_count -> Count @
23. slashes_count -> Count Slashes // / \
24. operator_count -> Count Operators +-*/%<>^|
25. punc_count -> Count punctuation '",.:;`
26. line_count -> Count newlines \n
27. word_count -> Count Words A-Za-z
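A few of the features listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name extract_features, the URL pattern, and the choice to count features per message are not taken from the project, and the real implementation may define each count differently.

```python
import re

# Assumed URL pattern; the project may detect links differently.
URL_RE = re.compile(r"https?://\S+")

# Emoji group copied from the face_smiling entry in the list above.
FACE_SMILING = "😀😃😄😁😆😅🤣😂🙂🙃😉😊😇"


def extract_features(text: str) -> dict:
    """Count a handful of the surface features listed above for one message."""
    return {
        "face_smiling": sum(text.count(emoji) for emoji in FACE_SMILING),
        "url_count": len(URL_RE.findall(text)),
        "space_count": text.count(" "),
        "capital_count": sum(1 for ch in text if ch.isupper()),
        "text_length": len(text),
        "question_mark_count": text.count("?"),
        "exclamation_mark_count": text.count("!"),
        "hash_count": text.count("#"),
        "tag_count": text.count("@"),
        "line_count": text.count("\n"),
        # Words are runs of A-Za-z, as in the word_count entry.
        "word_count": len(re.findall(r"[A-Za-z]+", text)),
    }


sample = "Hello @user! Check https://example.com #fun 😀😀\nBye?"
features = extract_features(sample)
for name, value in features.items():
    print(name, value)
```

Because word_count matches runs of A-Za-z, a URL such as https://example.com contributes three words (https, example, com) under this sketch.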
Results for English Train Test Split Dataset:
| Classifier | Accuracy |
|---|---|
| LogisticRegression | 0.9158576051779935 |
| RandomForestClassifier | 0.9757281553398058 |
| LinearSVC | 0.8770226537216829 |
| BernoulliNB | 0.9239482200647249 |
| MultinomialNB | 0.8236245954692557 |
| SVC | 0.5056634304207119 |

Best model: RandomForestClassifier

| Author | precision | recall | f1-score | support |
|---|---|---|---|---|
| bot | 0.98 | 0.97 | 0.98 | 622 |
| human | 0.97 | 0.98 | 0.98 | 614 |
| micro avg | 0.98 | 0.98 | 0.98 | 1236 |
| macro avg | 0.98 | 0.98 | 0.98 | 1236 |
| weighted avg | 0.98 | 0.98 | 0.98 | 1236 |

| Classifier | Accuracy |
|---|---|
| LogisticRegression | 0.7265372168284789 |
| RandomForestClassifier | 0.8106796116504854 |
| LinearSVC | 0.6019417475728155 |
| BernoulliNB | 0.616504854368932 |
| MultinomialNB | 0.616504854368932 |
| SVC | 0.4967637540453074 |

Best model: RandomForestClassifier

| Gender | precision | recall | f1-score | support |
|---|---|---|---|---|
| female | 0.79 | 0.85 | 0.82 | 311 |
| male | 0.83 | 0.77 | 0.80 | 307 |
| micro avg | 0.81 | 0.81 | 0.81 | 618 |
| macro avg | 0.81 | 0.81 | 0.81 | 618 |
| weighted avg | 0.81 | 0.81 | 0.81 | 618 |

Results for Spanish Train Test Split Dataset:
| Classifier | Accuracy |
|---|---|
| LogisticRegression | 0.8433333333333334 |
| RandomForestClassifier | 0.9288888888888889 |
| LinearSVC | 0.7488888888888889 |
| BernoulliNB | 0.8188888888888889 |
| MultinomialNB | 0.7644444444444445 |
| SVC | 0.4888888888888889 |

Best model: RandomForestClassifier

| Author | precision | recall | f1-score | support |
|---|---|---|---|---|
| bot | 0.93 | 0.93 | 0.93 | 440 |
| human | 0.93 | 0.93 | 0.93 | 460 |
| micro avg | 0.93 | 0.93 | 0.93 | 900 |
| macro avg | 0.93 | 0.93 | 0.93 | 900 |
| weighted avg | 0.93 | 0.93 | 0.93 | 900 |

| Classifier | Accuracy |
|---|---|
| LogisticRegression | 0.6844444444444444 |
| RandomForestClassifier | 0.7844444444444445 |
| LinearSVC | 0.5666666666666667 |
| BernoulliNB | 0.6066666666666667 |
| MultinomialNB | 0.6355555555555555 |
| SVC | 0.48444444444444446 |

Best model: RandomForestClassifier

| Gender | precision | recall | f1-score | support |
|---|---|---|---|---|
| female | 0.77 | 0.83 | 0.80 | 232 |
| male | 0.80 | 0.74 | 0.77 | 218 |
| micro avg | 0.78 | 0.78 | 0.78 | 450 |
| macro avg | 0.79 | 0.78 | 0.78 | 450 |
| weighted avg | 0.79 | 0.78 | 0.78 | 450 |