This project aims to perform topic modeling on song lyrics from Genius. The project consists of the following components:
- data_collection.py: Python script for collecting song lyrics from the Last FM API and the Genius API.
- basic_cleaning.py: Python script for some basic cleaning of the lyrics data.
- modeler.py: Python script for performing topic modeling by preprocessing lyrics data and visualizing the outputs.
To collect the song lyrics data, the data_collection.py script uses the Last FM API to get the most popular tracks and then the Genius one to collect the lyrics. The collected data is stored in a JSON file in the data folder.
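As a rough illustration of the first half of this step, here is a minimal sketch of how the most popular tracks for a genre could be requested from Last FM and saved as JSON. It is not the actual content of data_collection.py: the function names, the default limit, and the output layout are assumptions made for the example, and it only relies on the public tag.getTopTracks method. The Genius lookup for each (title, artist) pair is left out.

```python
import json
import urllib.parse
import urllib.request

# Public Last FM API root; the API key must be supplied by the user.
LASTFM_URL = "http://ws.audioscrobbler.com/2.0/"


def build_top_tracks_url(genre, api_key, limit=50, page=1):
    """Build a tag.getTopTracks request URL for one genre tag."""
    params = {
        "method": "tag.gettoptracks",
        "tag": genre,
        "api_key": api_key,
        "format": "json",
        "limit": limit,
        "page": page,
    }
    return LASTFM_URL + "?" + urllib.parse.urlencode(params)


def parse_top_tracks(payload):
    """Extract (title, artist) pairs from a decoded tag.getTopTracks response."""
    tracks = payload.get("tracks", {}).get("track", [])
    return [(t["name"], t["artist"]["name"]) for t in tracks]


def fetch_top_tracks(genre, api_key, limit=50, page=1):
    """Query Last FM and return (title, artist) pairs. Needs network access."""
    url = build_top_tracks_url(genre, api_key, limit, page)
    with urllib.request.urlopen(url) as response:
        return parse_top_tracks(json.load(response))


def save_records(records, path):
    """Store the collected records in a JSON file (hypothetical layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)
```

Each returned pair could then be used to search Genius for the matching lyrics before calling save_records.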
I decided to get 1000 tracks for 17 genres.
The collected data is very noisy and contains irrelevant characters. As a basic cleaning step, the cleaning.py script removes some special elements from each set of lyrics collected from Genius.
I decided to leave the remaining preprocessing steps (removing stop words, punctuation...) to the last Python file.
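To give an idea of what the basic cleaning could involve, here is a minimal sketch. The exact elements removed by the real script are not listed here, so the two patterns below are assumptions: bracketed section markers such as [Chorus], and a trailing "Embed" label that sometimes ends scraped Genius lyrics. The function name clean_lyrics is also hypothetical.

```python
import re

# Assumed noise patterns, not necessarily the ones the real script handles.
SECTION_TAG = re.compile(r"\[[^\]]*\]")       # e.g. "[Verse 1]", "[Chorus]"
TRAILING_EMBED = re.compile(r"\d*Embed\s*$")  # e.g. "12Embed" at the very end


def clean_lyrics(raw):
    """Remove section markers, the trailing embed label, and blank lines."""
    text = SECTION_TAG.sub("", raw)
    text = TRAILING_EMBED.sub("", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Stop words and punctuation are deliberately left untouched at this stage, since they are handled later during preprocessing.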
The modeler.py script defines a Python class TopicModeler with several methods for topic modeling, including preprocessing, training, and visualization, all performed with OCTIS.
The class has the following attributes:
- model: an LDA model with 5 topics
- preprocessor: a Preprocessing object with some default settings
- dataset, top_words_topics, topics, topic_word_matrix, and topic_document_matrix: empty DataFrames and dictionaries for later use
- diversity_score and coherence_score: initially set to 0
The class has the following methods:
- __init__(): the constructor for the class, which initializes the attributes described above.
- modelize(corpus_path): a method that preprocesses a given corpus, trains the LDA model on it, and stores the output. It also evaluates the topics using coherence and diversity scores.
- visualize(genre): a method to visualize the top words and frequencies of the trained topics for a given genre.
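To make the diversity score mentioned above more concrete, here is a library-free sketch of one common definition of topic diversity: the share of unique words among the top words of all topics. In the project the score is computed with OCTIS; this standalone function is only an illustration under the assumption that the metric follows this definition, and the name topic_diversity and the topk default are chosen for the example.

```python
def topic_diversity(topics, topk=10):
    """Return the fraction of unique words among the top-k words of all topics.

    `topics` is a list of word lists, each ordered from most to least
    relevant. A value near 1 means the topics share few words; a value
    near 0 means they largely repeat the same words.
    """
    top_words = [word for topic in topics for word in topic[:topk]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Two toy topics sharing the word "love": 5 unique words out of 6.
example_topics = [
    ["love", "heart", "night"],
    ["love", "money", "street"],
]
score = topic_diversity(example_topics, topk=3)
```

A low score on real lyrics would suggest that the 5 topics overlap heavily and that the number of topics or the preprocessing may need adjusting.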
Here is what the visualization looks like:
The data folder contains the following types of files:
- genre_data.json: The raw song lyrics collected from Genius for a specific genre.
- genre_data.csv: The preprocessed and cleaned lyrics data used for topic modeling for each genre.