Skip to content

Latest commit

 

History

History
49 lines (39 loc) · 1.86 KB

README.md

File metadata and controls

49 lines (39 loc) · 1.86 KB

Sentiment-Analysis

Sentiment Analysis on movie reviews. The aim is for a neural network to classify reviews left by users.

Workload (approximation):

  • 40% coding
  • 60% Analysis of methods and results

Dataset: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data (Reviews labeled in 5 classes, from Rottentomatoes)

(Eventually use IMDB dataset)

Pre-Processing in juypter-notebook Exploration.ipynb:

  • Keep only full sentences
  • Binarize Sentiment (1 or 0)
  • Tokenize sentences
  • Remove stopwords and similar (***** CHANGE THIS! KEEP IMPORTANT ONES *********)
  • ...

To Do:

  • Choose vocabulary size (tf-id? "by-hand" select few good features?...)
  • simple word2Vec, GloVe, ... pre-processing
  • Compare performance with different pre-processing
  • Compare performance with different algorithms (simple perceptron, Neural Nets, basic ML algo?)
  • Best performance: upload on Kaggle to compare with existing solutions?

Preprocessing:

  • Tokenizer, without stopwords : 15273 unique words
  • Tokenizer and Stemmer: 10493 unique words!

Binary Classification task

Sentiment Analysis on a binarized version of the RottenTomatoes dataset.

Two binarization strategies:

  • Naive: Every reviews with y <= 2 mapped to 0 otherwise 1.
  • Ambiguous reviews removed: Removal of every reviews where y = 2 (neutral). Otherwise, f(y) = 0 if y < 2 and f(y) if y > 2.

Classification Model:

  • pre-processing: Tokenization, GloVe embedding
  • Convolutional Neural Network

Related files:

  • binary_mapping_01.ipynb, binary_mapping_02.ipynb: ipython notebooks used to visualize and map the reviews sentiments into binary values.
  • Data/binary_naive.csv, Data/binary_2removed.csv: csv files containing the two binarized datasets after mapping.
  • binary_cnn.py: python script training and running the CNN model on either one of the latter dataset. See comment of scripts for execution.