Coursework for the Advanced Machine Learning course at NJUST.
This project targets the WSDM Cup - Multilingual Chatbot Arena competition on Kaggle. In short, given conversations between a user and two LLMs, the competition challenges us to predict which response the user will prefer.
## Data files

Related data files are placed under the `data/` folder. However, they are not uploaded here due to GitHub's file size limit; refer to the competition page to download them.
We referenced an existing competition notebook, WSDM || AbdBase || V2. Thank you, Sheikh Muhammad Abdullah, for your dedication!
Here's a visualization of the source code, organized into Python modules:
```mermaid
graph TD
    ROOT[Project Root]
    P[Preprocessing]
    O[Optuning]
    S[Solvers]
    LGBM[LGBMSolver]
    LR[LinearRegressionSolver]
    ROOT --> P
    ROOT --> O
    ROOT --> S
    S --> LGBM
    S --> LR
```
## Preprocessing

Preprocessing is the first operation we should perform on the data once it's loaded. The original data is provided as a `pandas.DataFrame`, which contains the raw text of the conversation between the user and the two LLMs, the conversation ID, which answer the user prefers, and so on.

To convert the original data into a vectorized representation that most machine learning models can be trained on more easily, preprocessing pipelines are applied, including but not limited to:
- `MapColumnValues`: maps the user's choice (`"model_a"` or `"model_b"`) to numbers (`0` or `1`).
- `DropColumns`: drops columns considered irrelevant as input features.
- `Compute`-series pipelines: compute manually picked features (length, word count, lexical diversity, etc.) of a column.
- `VectorizationByTfidf`: turns a text paragraph into a vector of numbers (i.e. vectorization). In this project, TF-IDF vectorization is adopted.
After those preprocessing steps, the original data is transformed into a `DataFrame` containing only numbers in each column.
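To make this concrete, here is a minimal sketch of the kind of transformation these steps perform. The helper name `compute_text_features` and the use of scikit-learn's `TfidfVectorizer` are illustrative assumptions, not necessarily the project's actual code:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def compute_text_features(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Compute manually picked features of a text column (hypothetical helper)."""
    out = df.copy()
    words = out[column].str.split()
    out[f"{column}_length"] = out[column].str.len()  # character count
    out[f"{column}_word_count"] = words.str.len()    # word count
    # Lexical diversity: ratio of unique words to total words.
    out[f"{column}_lexical_diversity"] = words.apply(
        lambda ws: len(set(ws)) / len(ws) if ws else 0.0
    )
    return out

# TF-IDF turns each text into a sparse numeric vector.
vectorizer = TfidfVectorizer(max_features=3000)
matrix = vectorizer.fit_transform(
    ["The quick brown fox.", "The slow brown fox."]
)
print(matrix.shape)  # (2, number_of_distinct_terms)
```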
## Pipelines

Instead of writing redundant column-assignment code everywhere, the pipeline design pattern (i.e. the Chain of Responsibility pattern) is adopted to make the preprocessing code tidier, more extensible, and more maintainable.
The pattern can be described using the graph below:
```mermaid
graph LR
    I[Input Data]
    O[Output Data]
    subgraph Pipelines
        direction LR
        D[Drop Column]
        C[Compute]
        E[...]
        D --> C
        C --> E
    end
    I --> D
    E --> O
```

In addition to processing the data, each pipeline passes the processed data further along the chain. The data travels along the chain until every pipeline has had a chance to process it. Pipelines can be added, removed, reordered, and reused in other processing steps (e.g. the preprocessing pipeline in `LGBMSolver` can easily be reused in `LinearRegressionSolver`), without writing column-assignment boilerplate.
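A minimal sketch of the pattern in Python; the constructors and the way steps are linked here are assumptions for illustration, not the project's exact classes:

```python
import pandas as pd

class Pipeline:
    """One processing step; hands its output to the next step in the chain."""
    def __init__(self, next_pipeline=None):
        self.next = next_pipeline

    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        df = self.apply(df)
        return self.next.process(df) if self.next else df

    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

class MapColumnValues(Pipeline):
    def __init__(self, column, mapping, next_pipeline=None):
        super().__init__(next_pipeline)
        self.column, self.mapping = column, mapping

    def apply(self, df):
        df = df.copy()
        df[self.column] = df[self.column].map(self.mapping)
        return df

class DropColumns(Pipeline):
    def __init__(self, columns, next_pipeline=None):
        super().__init__(next_pipeline)
        self.columns = columns

    def apply(self, df):
        return df.drop(columns=self.columns)

# Steps can be added, removed, or reordered without touching the others.
chain = MapColumnValues("winner", {"model_a": 0, "model_b": 1},
                        next_pipeline=DropColumns(["id"]))
df = pd.DataFrame({"id": ["c1"], "winner": ["model_a"]})
print(chain.process(df))  # "winner" mapped to 0, "id" dropped
```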
## Solvers

The solvers module is the core abstraction of this project. A `ProblemSolver` is an object that stores data at creation, takes a corresponding `Params` object when `solve()` is called, and returns a `ProblemSolution` object when done solving the problem.
```mermaid
graph LR
    D[Data]
    S[Problem Solution]
    subgraph SOLVER[Problem Solver]
        direction TB
        P[Params]
        F["solve()"]
        P --> F
    end
    D --> SOLVER
    SOLVER --> S
```
Different solvers implement the `solve()` method differently. For example, `LGBMSolver` trains multiple models using a stratified K-fold strategy, then uses those models together to make predictions.
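The contract can be sketched as follows; the field names and signatures are assumptions inferred from the description above, not the project's verbatim code:

```python
from dataclasses import dataclass, field

@dataclass
class Params:
    """Hyperparameters handed to a solver on every solve() call."""
    lgbm_params: dict = field(default_factory=dict)
    n_folds: int = 5

@dataclass
class ProblemSolution:
    """What a solver returns: a validation score plus predictions."""
    score: float
    predictions: list

class ProblemSolver:
    """Stores the data once at creation; can be solved repeatedly with different Params."""
    def __init__(self, data):
        self.data = data

    def solve(self, params: Params) -> ProblemSolution:
        raise NotImplementedError

class LGBMSolver(ProblemSolver):
    def solve(self, params: Params) -> ProblemSolution:
        # Train one LightGBM model per stratified fold, then
        # combine the fold models' predictions on the test set.
        ...
```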
## Optuning

This module utilizes the `optuna` package to choose the best hyperparameters automatically.

Solvers are designed to be called multiple times, with hyperparameters passed in the `Params` object. This means we can easily generate hyperparameters and test how they perform on a specific problem solver using `optuna`.
```mermaid
flowchart LR
    D[Data]
    B[Best Parameters]
    subgraph O[Optuna]
        direction LR
        H[Hyperparameters]
        S[Solver]
        C{After N Trials?}
        H --> S
        S --> C
        C --No, generate new--> H
    end
    D --> O
    C --Yes, return--> B
```
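In code, the loop in the diagram maps onto `optuna`'s standard objective/study API. This sketch reuses the hypothetical `Params`/`LGBMSolver` from the Solvers section, and the searched parameter names are illustrative:

```python
import optuna

# Assumes `train_data` has already been loaded and preprocessed.
solver = LGBMSolver(train_data)

def objective(trial: optuna.Trial) -> float:
    # optuna proposes hyperparameters for this trial...
    params = Params(lgbm_params={
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
    })
    # ...and the solver reports how well those hyperparameters perform.
    return solver.solve(params).score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)  # the "N trials" in the diagram above
print(study.best_params)
```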
## Build system

This project adopts PDM as its build system. Modern build systems (like PDM) are preferred over the vanilla `pip` tool, because PDM handles virtual environments & package management automatically and correctly.

For example, the vanilla `pip` tool won't remove indirectly referenced dependencies when removing the direct ones: when installing `torchvision`, pip will install `pillow` and other packages required by `torchvision`, but will leave them untouched when `torchvision` is removed.
To install & use PDM, see the official installation guide.
For compatibility, a `requirements.txt` file is provided along with the modern Python package config file (`pyproject.toml`). To perform a vanilla installation of dependencies, run:

```bash
pip install -r requirements.txt
```
> NOTE: This `requirements.txt` is automatically generated by PDM using the `pdm export` command. No guarantee is made when it comes to full `pip` compatibility.
Virtual environments are recommended to keep the global Python environment clean.