This project describes a model that predicts whether a movie line belongs to one or more emotion classes. After the model is trained on one dataset of movie lines, it is used for character analysis on a second dataset of Marvel movie lines. This part explores which emotions the most important characters experience over the course of a movie. The model uses features derived from word and character n-grams, part-of-speech tags, word embeddings, and the Opinion Lexicon.
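As a rough illustration of the word and character n-gram features mentioned above, here is a minimal sketch in plain Python. The helper names `word_ngrams` and `char_ngrams`, the lowercasing and whitespace tokenization, and the sample line are all assumptions made for this example; the project's actual preprocessing is described in the PDF and may differ.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the list of word n-grams of a line (hypothetical helper)."""
    # Assumption: lowercase and split on whitespace; real tokenization may differ.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the list of character n-grams of a line (hypothetical helper)."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


# Made-up sample line, used only to show the shape of the features.
line = "I am Iron Man"

# Count how often each n-gram occurs; such counts can serve as sparse features.
word_bigram_counts = Counter(word_ngrams(line, 2))
char_trigram_counts = Counter(char_ngrams(line, 3))
```

In practice these counts would usually be turned into a fixed-length vector (for example with a vocabulary built on the training set) before being passed to a classifier.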
-
The XED dataset consists of emotion-annotated movie subtitles (data/en-annotated.tsv). Movie lines in this dataset have the following distribution:
-
The Marvel Universe dataset was created from the transcripts of Marvel Universe movies (data/mcu.csv). It contains lines from over 600 characters. In this project, only the most important ones are considered:
-
GloVe - Global Vectors for Word Representation
Two classification algorithms are compared: LinearRegression and LinearSVC (Support Vector Classifier). To apply them to the multi-label problem, OneVsRestClassifier was used. This estimator implements the binary relevance method, which trains one binary classifier independently for each label.
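The binary relevance setup described above can be sketched with scikit-learn as follows. This is a minimal example under stated assumptions, not the project's actual pipeline: the toy lines and labels are invented for illustration (they are not taken from XED), and a plain `TfidfVectorizer` stands in for the richer feature set used in the project. Only `LinearSVC` is shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only (not real XED annotations).
lines = [
    "I am so happy to see you",
    "What a wonderful, happy day",
    "I am scared and angry",
    "This makes me so angry",
    "I am scared of the dark",
    "Happy but a little scared",
]
labels = [
    {"joy"},
    {"joy"},
    {"fear", "anger"},
    {"anger"},
    {"fear"},
    {"joy", "fear"},
]

# Turn the label sets into a binary indicator matrix, one column per emotion.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

# OneVsRestClassifier fits one independent LinearSVC per label column.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(lines, Y)

# Each row of the prediction is a 0/1 vector over binarizer.classes_.
pred = model.predict(["I am happy"])
predicted_emotions = binarizer.inverse_transform(pred)
```

Because each label gets its own classifier, a line can be assigned several emotions at once or none at all, which is what distinguishes this setup from ordinary multi-class classification.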
The file Sentiment_multi_label_MARVEL.pdf contains a detailed project description, covering preprocessing, feature extraction, and a presentation of the results.