Sentiment Analysis on movie reviews. The aim is for a neural network to classify reviews left by users.
Workload (approximation):
- 40% coding
- 60% Analysis of methods and results
Dataset: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data (Reviews labeled in 5 classes, from Rottentomatoes)
(Eventually use IMDB dataset)
Pre-Processing in juypter-notebook Exploration.ipynb:
- Keep only full sentences
- Binarize Sentiment (1 or 0)
- Tokenize sentences
- Remove stopwords and similar (***** CHANGE THIS! KEEP IMPORTANT ONES *********)
- ...
To Do:
- Choose vocabulary size (tf-id? "by-hand" select few good features?...)
- simple word2Vec, GloVe, ... pre-processing
- Compare performance with different pre-processing
- Compare performance with different algorithms (simple perceptron, Neural Nets, basic ML algo?)
- Best performance: upload on Kaggle to compare with existing solutions?
Preprocessing:
- Tokenizer, without stopwords : 15273 unique words
- Tokenizer and Stemmer: 10493 unique words!
Sentiment Analysis on a binarized version of the RottenTomatoes dataset.
Two binarization strategies:
- Naive: Every reviews with y <= 2 mapped to 0 otherwise 1.
- Ambiguous reviews removed: Removal of every reviews where y = 2 (neutral). Otherwise, f(y) = 0 if y < 2 and f(y) if y > 2.
Classification Model:
- pre-processing: Tokenization, GloVe embedding
- Convolutional Neural Network
Related files:
- binary_mapping_01.ipynb, binary_mapping_02.ipynb: ipython notebooks used to visualize and map the reviews sentiments into binary values.
- Data/binary_naive.csv, Data/binary_2removed.csv: csv files containing the two binarized datasets after mapping.
- binary_cnn.py: python script training and running the CNN model on either one of the latter dataset. See comment of scripts for execution.