The task of this competition is to predict whether a tweet message originally contained a positive :) or negative :( smiley, by considering only the remaining text.
ID: 25279 - by Jonas Jäggi
├── data <-Empty folder, copy .txt files here
├── embeddings <-Empty folder, copy vocabulary and embedding files here
├── Report
│ ├── Machine_Learning_AEJ_Sentiment.pdf <-Report in .pdf format
│ └── Machine_Learning_AEJ_Sentiment.tex <-Report in LaTeX format
├── build_vocab.sh <-Creates a vocab.txt vocabulary from the .txt files
├── CNN_CPU.ipynb <-Python notebook to run the Convolutional neural network locally
├── CNN_Cuda.ipynb <-Python notebook to run the Convolutional neural network on Google Colab
├── cnns.py <-Contains the Convolutional neural network methods
├── cooc_memory_friendly.py <-Memory-friendly extraction of co-occurrences from a .pickle file
├── cooc.py <-Co-occurrences from a .pickle file
├── cut_vocab.sh <-Cuts from a vocab.txt vocabulary file into a vocab_cut.txt
├── glove_solution.py <-Generates embeddings from a vocabulary
├── helpers.py <-General custom helper methods
├── Logistic_regression.ipynb <-Python notebook to run a logistic regression prediction
├── logreg.py <-Contains the logistic regression methods
├── pickle_vocab.py <-Converts from a vocab_cut.txt file to a vocab.pkl file
├── pre_processing.ipynb <-Python notebook that runs the pre-processing steps on the *.txt data files
├── pre_processing.py <-Helper methods for the pre-processing
├── run.py <-Runs the Convolutional neural network
├── README.md <-The README file of the project
You will need to have Anaconda installed; the following tutorials will guide you through the installation:
- Mac: https://docs.anaconda.com/anaconda/install/mac-os/
- Windows: https://docs.anaconda.com/anaconda/install/windows/
- Ubuntu: https://docs.anaconda.com/anaconda/install/linux/
Once you have it installed, try running the following command on a terminal:
python
You should see a message similar to the following:
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
For language processing corpora and lexical resources, we need to install NLTK (Natural Language Toolkit) by running the command:
pip install -U nltk
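If the stop-word removal or lemmatization steps complain about missing NLTK data, the resources can be downloaded from Python. Which corpora are actually required depends on pre_processing.py, so the list below (stopwords, wordnet, punkt) is an assumption:

```python
import nltk

# Fetch corpora commonly needed for stop-word removal and lemmatization;
# adjust the list if pre_processing.py reports other missing resources.
for resource in ("stopwords", "wordnet", "punkt"):
    nltk.download(resource)
```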
For the pre-processing steps, the TextBlob library is used. It is installed by running the following commands:
pip install -U textblob
python -m textblob.download_corpora
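As a quick illustration of what TextBlob is used for here (the exact calls in pre_processing.py may differ), spelling correction and lemmatization look like this; the cost of `.correct()` per word is why the spelling step takes hours on the full dataset:

```python
from textblob import TextBlob, Word

# Spelling correction: TextBlob picks the most probable correction per word.
print(TextBlob("I havv goood speling").correct())  # -> I have good spelling

# Lemmatization of a single word, treated as a verb.
print(Word("running").lemmatize("v"))              # -> run
```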
You will also need to have PyTorch installed, since we are using this ML library for the base neural network functionality and building on top of it. To install PyTorch, please follow the instructions found on their web page: Install PyTorch.
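After installing, you can verify that PyTorch is importable and whether a CUDA GPU is visible:

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if a CUDA-capable GPU can be used
```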
- In a terminal, change your directory to the location of the compressed zip file of the project.
  cd {path_of_zip_file}
- Unzip the zip file of the project.
  unzip -a mlProject2.zip
- Download the provided datasets twitter-datasets.zip from here, and extract the *.txt files from the zip.
  unzip -a twitter-datasets.zip
- Move the downloaded *.txt files inside the /data folder.
- Download the pretrained vocabulary and embeddings from here. Also unzip the embeddings.zip file.
  unzip -a embeddings.zip
- Move the downloaded vocab_pretrained.pkl and embeddings200_pretrained.npy files inside the /embeddings folder.
- If your terminal is not in the location of the project files, change your directory to that location.
  cd {path_of_project_files}
- Run the run.py script in the terminal.
  python run.py
  NOTE: If you are using your own vocabulary/embeddings, make sure the script is pointing to the correct filenames.
  A new file called "submissionX.csv" will be generated, which contains the predictions.
- The "submissionX.csv" file can be uploaded to crowdAI, where you can verify the obtained classification score.
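For reference, a crowdAI submission for this task is a small two-column CSV. The sketch below assumes the usual Id/Prediction layout with labels in {-1, 1}; check helpers.py or run.py for the exact format this project writes:

```python
import csv

def write_submission(ids, predictions, path="submission.csv"):
    """Write a crowdAI-style submission file (assumed Id/Prediction layout)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for tweet_id, label in zip(ids, predictions):
            writer.writerow([tweet_id, label])

# Example with ten dummy predictions in {-1, 1}.
write_submission(range(1, 11), [1, -1] * 5)
```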
In order to pre-process the tweets, you should edit and run the pre_processing.ipynb Jupyter notebook. Please follow these instructions:
- Start Jupyter Notebook
  Run this command:
  jupyter notebook
  This will launch a browser window with a file explorer open.
- Open notebook
  Navigate through the folders and open the pre_processing.ipynb file.
- Set Feature treatment methods
  Inside the second cell you can modify the following variables:
  - already_cleaned_neg = True/False
  - already_cleaned_pos = True/False
  - already_cleaned_neg_full = True/False
  - already_cleaned_pos_full = True/False
  - already_cleaned_test = True/False
  These variables determine which of the files will be pre-processed. If you set them to False and run the next cells in the notebook, the corresponding files will be pre-processed again; pre-processing is skipped for the ones set to True.
  If you want to try different combinations of the pre-processing steps, you can change the following variables:
  - duplicates = True
  - emojis = False
  - punctuaction = False
  - handle_number = False
  - special_symbols = True
  - moreLetters = False
  - contractions = False
  - clean_stopwords = False
  - spelling = False
  - lemmatize = False
  If you set them to True and run the next cells, the tweets will be pre-processed according to these choices (see the sketch after these steps). Please keep in mind that if you set spelling to True, the pre-processing will take a long time to finish (depending on the specifics of your machine, 6-8 hours).
- Run the cells of the notebook
  Now open the Cell menu and click on the option Run All, or manually run every cell.
  You will see the results/outputs between cells, and in the data/pre_processed/ path you will find the new pre-processed .txt and .pkl files.
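As referenced above, here is a minimal sketch of how such boolean flags typically gate a pre-processing pipeline. The helper functions are illustrative stand-ins, not the actual implementations in pre_processing.py:

```python
import re

# Illustrative stand-ins for the project's pre-processing helpers;
# the real implementations live in pre_processing.py.
def remove_duplicates(tweets):
    return list(dict.fromkeys(tweets))          # keep first occurrence, preserve order

def remove_punctuation(tweet):
    return re.sub(r"[^\w\s<>]", " ", tweet)

def replace_numbers(tweet):
    return re.sub(r"\d+", "<number>", tweet)

def preprocess(tweets, duplicates=True, punctuaction=False, handle_number=False):
    """Apply only the steps whose flag is True, mirroring the notebook's options."""
    if duplicates:
        tweets = remove_duplicates(tweets)
    cleaned = []
    for tweet in tweets:
        if punctuaction:
            tweet = remove_punctuation(tweet)
        if handle_number:
            tweet = replace_numbers(tweet)
        cleaned.append(" ".join(tweet.split()))  # collapse extra whitespace
    return cleaned

print(preprocess(["good day !!!", "good day !!!", "i won 100 times"],
                 duplicates=True, punctuaction=True, handle_number=True))
```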
If you want to build your own vocabulary and embeddings, follow these instructions:
NOTE: These scripts have the .txt data filenames set to train_pos_full.txt and train_neg_full.txt. If you wish to generate embeddings from a smaller dataset, you should modify those filenames in the code.
In order to load the training tweets and construct a vocabulary list of words appearing at least 5 times, run the following commands.
```bash
bash build_vocab.sh
bash cut_vocab.sh
python pickle_vocab.py
```
This will output a vocab.pkl vocabulary file.
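The conversion done by pickle_vocab.py is essentially turning the cut vocabulary into a word-to-index dictionary; a minimal sketch of that step (the provided script may differ in details):

```python
import pickle

# Map each word in vocab_cut.txt to its line index and pickle the dictionary;
# this is roughly what ends up in vocab.pkl.
vocab = {}
with open("vocab_cut.txt", encoding="utf-8") as f:
    for idx, line in enumerate(f):
        vocab[line.rstrip("\n")] = idx

with open("vocab.pkl", "wb") as f:
    pickle.dump(vocab, f, protocol=pickle.HIGHEST_PROTOCOL)
```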
To generate the word embeddings we have two options:
You can train your own GloVe word embeddings, that is, compute an embedding vector for each word in the vocabulary, by using the following commands.
NOTE: The cooc.py script provided to us with the project description is prone to crash, so we are using cooc_memory_friendly.py, which is an optimized version of it. They should be interchangeable.
You might need to modify the script to have the correct filenames for the vocabulary and data.
```bash
python cooc_memory_friendly.py
python glove_solution.py
```
This will output an embeddings*.npy file.
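For intuition, glove_solution.py factorizes the co-occurrence counts in the GloVe style: each word vector is adjusted so that dot products of co-occurring words approach the log of their count. A compact sketch of that update, assuming cooc.pkl holds a SciPy sparse matrix in COO format and using illustrative hyperparameters:

```python
import pickle
import numpy as np

# Assumes cooc.pkl contains a SciPy sparse matrix in COO format.
with open("cooc.pkl", "rb") as f:
    cooc = pickle.load(f)

embedding_dim, eta, nmax, alpha, epochs = 20, 0.001, 100, 0.75, 10
xs = np.random.normal(size=(cooc.shape[0], embedding_dim)) / embedding_dim
ys = np.random.normal(size=(cooc.shape[1], embedding_dim)) / embedding_dim

for _ in range(epochs):
    for ix, jy, n in zip(cooc.row, cooc.col, cooc.data):
        weight = min(1.0, (n / nmax) ** alpha)   # damp very frequent pairs
        err = xs[ix] @ ys[jy] - np.log(n)        # fit dot product to log count
        step = 2 * eta * weight * err
        xs[ix], ys[jy] = xs[ix] - step * ys[jy], ys[jy] - step * xs[ix]

np.save("embeddings.npy", xs)
```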
Alternatively, for better results, we used an embedding matrix which was created at Stanford University and is based on 2 billion tweets (27 billion tokens). From this embedding matrix, we will extract the embeddings that we need for our vocabulary.
- Download Stanford's embedding matrix from here.
- Extract the content of the zip file.
  unzip -a glove.twitter.27B.zip
- Modify glove_txt_to_npy.py so it points to the correct vocabulary file on line 11, then run the script:
  python glove_txt_to_npy.py
  This will output an embeddings*.npy file as well as a new vocab file, vocab_pretrained.pkl; you should use both for running the prediction.
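In outline, the conversion keeps only the GloVe vectors for words that appear in our vocabulary and stores them as a NumPy matrix plus a matching vocabulary pickle. A sketch of that idea (the filenames, paths and the 200-dimensional variant are assumptions; glove_txt_to_npy.py may differ):

```python
import pickle
import numpy as np

with open("embeddings/vocab.pkl", "rb") as f:
    vocab = pickle.load(f)                               # word -> index from our corpus

words, vectors = [], []
with open("glove.twitter.27B.200d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        if parts[0] in vocab:                            # keep only words we actually use
            words.append(parts[0])
            vectors.append(np.asarray(parts[1:], dtype=np.float32))

np.save("embeddings/embeddings200_pretrained.npy", np.vstack(vectors))
with open("embeddings/vocab_pretrained.pkl", "wb") as f:
    pickle.dump({w: i for i, w in enumerate(words)}, f)  # new vocab matching the rows
```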
For training the logistic regression and creating a logistic regression submission file, please follow these instructions:
- Start Jupyter Notebook
  Run this command:
  jupyter notebook
  This will launch a browser window with a file explorer open.
- Open notebook
  Navigate through the folders and open the Logistic_regression.ipynb file.
- Check filenames and paths
  Inside the second cell you should check if the paths are pointing to the correct files.
- Run the cells of the notebook
  Now open the Cell menu and click on the option Run All, or manually run every cell.
  You will see the results/outputs between cells, and in the data/pre_processed/ path you will find the new pre-processed .txt and .pkl files.
- A new file called "logreg_submission.csv" will be generated, which contains the predictions.
- The "logreg_submission.csv" file can be uploaded to crowdAI, where you can verify the obtained classification score.
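For orientation, a common baseline of this kind averages the word embeddings of each tweet and fits a logistic regression on those features. The sketch below follows that pattern with scikit-learn; the actual feature construction in logreg.py and the notebook may differ:

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

with open("embeddings/vocab_pretrained.pkl", "rb") as f:
    vocab = pickle.load(f)                              # word -> embedding row
embeddings = np.load("embeddings/embeddings200_pretrained.npy")

def tweet_features(path):
    """Average the embeddings of the known words of each tweet in a file."""
    feats = []
    with open(path, encoding="utf-8") as f:
        for tweet in f:
            rows = [vocab[w] for w in tweet.split() if w in vocab]
            feats.append(embeddings[rows].mean(axis=0) if rows
                         else np.zeros(embeddings.shape[1]))
    return np.vstack(feats)

pos = tweet_features("data/train_pos.txt")
neg = tweet_features("data/train_neg.txt")
X = np.vstack([pos, neg])
y = np.array([1] * len(pos) + [-1] * len(neg))

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))
```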
To run the Convolutional neural network we have two options: run it locally, or run it on Google Colab.
Run it locally
- Start Jupyter Notebook
  Run this command:
  jupyter notebook
  This will launch a browser window with a file explorer open.
- Open notebook
  Navigate through the folders and open the CNN_CPU.ipynb file.
- Check filenames and paths
  Inside the second cell you should check if the paths are pointing to the correct files.
- Run the cells of the notebook
  Now open the Cell menu and click on the option Run All, or manually run every cell. You will see the results/outputs between cells.
- A new file called "submission_firstnet.csv" will be generated, which contains the predictions.
- The "submission_firstnet.csv" file can be uploaded to crowdAI, where you can verify the obtained classification score.
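To give an idea of the kind of model cnns.py builds, here is a minimal 1D-convolutional text classifier in PyTorch over pretrained embeddings; the real architecture, hyperparameters and training loop in this project may differ:

```python
import numpy as np
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Convolutions of several widths over embedded tokens, max-pooled and classified."""

    def __init__(self, embeddings, num_classes=2, kernel_sizes=(3, 4, 5), channels=100):
        super().__init__()
        embeddings = torch.as_tensor(embeddings, dtype=torch.float32)
        self.embedding = nn.Embedding.from_pretrained(embeddings, freeze=False)
        embed_dim = embeddings.shape[1]
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, channels, k) for k in kernel_sizes])
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                          # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)      # (batch, embed_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))

# Smoke test with random embeddings and a dummy batch of padded token ids.
model = TextCNN(np.random.randn(5000, 200))
logits = model(torch.randint(0, 5000, (8, 40)))
print(logits.shape)                                        # torch.Size([8, 2])
```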
Run it on Google Colab
Colaboratory is a research tool for machine learning education and research. It is a Jupyter notebook environment that requires no setup to use and allows you to use a GPU on a virtual machine on Colab servers.
- Upload the notebook to your Google Drive
  Log in to Google Drive using a Google Account: https://drive.google.com
  Upload the CNN_Cuda.ipynb file to your drive by clicking on the New -> File Upload option and selecting the file from your computer. You can also drag and drop the file to upload it. You will also need to upload the embeddings (embeddings200_pretrained.npy), vocabulary (vocab_pretrained.pkl) and data (*.txt) files.
- Check filenames and paths
  Inside the seventh cell you should check if the paths are pointing to the correct files.
- Configure the Runtime to use GPU
  Open the Runtime menu and click on Change Runtime Type. A popup will appear, where you can change the Hardware accelerator from CPU to GPU. Then click on the Save button.
- Run the cells of the notebook
  Now open the Runtime menu and click on the option Run All, or manually run every cell. You will see the results/outputs between cells.
- A new file called "submission_firstnet.csv" will be generated in your Google Drive, which contains the predictions.
- The "submission_firstnet.csv" file can be uploaded to crowdAI, where you can verify the obtained classification score.
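If the notebook reads the uploaded files from Google Drive, the usual first cells mount the drive and select the GPU device; whether CNN_Cuda.ipynb does exactly this is an assumption, but the pattern looks like:

```python
# Typical first cells of a Colab notebook that reads data from Google Drive.
from google.colab import drive
import torch

drive.mount('/content/drive')    # uploaded files appear under /content/drive/My Drive/
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)                    # should print "cuda" once the GPU runtime is enabled
```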
- Andres Ivan Montero Cassab
- Jonas Florian Jäggi
- Elias Manuel Poroma Wiri