RaceFit

The main goal of RaceFit is to help coaches of professional cycling teams allocate cyclists to races. An additional goal is to model the coach's decision-making process and understand the most important factors in the decision. I invite you to read the paper "Modelling Coach Decisions in Professional Cycling Teams", which describes the method in more depth.

You can find the poster and the presentation given at the ECML-PKDD 2022 conference in the deliverables directory.

Also, the paper is available here and the poster here.

Installation

A Python environment containing the necessary packages is required. For your convenience, a file named 'requirements.txt' is included, from which you can install the libraries easily.

To install the packages from the file, run the following command:

  pip install -r requirements.txt

Data

Most of the data is accessible and can be downloaded here. The data CSV files should be placed in a "db" folder in the working directory.

A description of the data can be found in the paper under the "The IPT’s Cyclists’ Workouts and Races Dataset" section and in the README file contained in the ZIP file.

*The Training Peaks cyclist workouts cannot be published; the STRAVA workouts are available in the ZIP file.

Running instructions

The system supports the following actions (tasks):

Create the team's cyclist participation matrices by race and by stage
Create examples and labels input from raw data
Preprocessing
Evaluate popularity baselines
Train the clustering model
Train and evaluate RaceFit

In the next sections I will describe the parameters for each one of the tasks with actual usage examples.

Create teams cyclist participation in race and in stage

This task creates binary cyclist-race and cyclist-stage matrices. A cell holds 1 if the cyclist participated in the race (or stage) and 0 otherwise. The team PCS IDs:

Team	PCS Id
AG2R Citroën Team	1060
CCC Team	1103
Lotto Soudal	2088
Team Jumbo-Visma	1330
Israel - Premier Tech	2738
Cofidis	1136
Groupama - FDJ	1187
Movistar Team	2040
UAE Team Emirates	1253
Trek - Segafredo	1258

mandatory parameters:

-a create_matrix
-ti <team pcs id>

optional:

-o 1 (overwrite the existing files)

Usage example

python -a create_matrix -ti 1258 -o 1
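
To illustrate what such a binary participation matrix looks like, here is a minimal sketch in plain Python. It is not the repository's implementation: the function name `build_participation_matrix` and the toy cyclist and race names are hypothetical, chosen only for illustration.

```python
def build_participation_matrix(participations, cyclists, races):
    """Return a dict-of-dicts binary matrix: matrix[cyclist][race] is 1 or 0."""
    took_part = set(participations)
    return {
        cyclist: {race: int((cyclist, race) in took_part) for race in races}
        for cyclist in cyclists
    }


# Hypothetical toy data: each pair means "this cyclist rode this race".
participations = [
    ("cyclist_a", "race_1"),
    ("cyclist_a", "race_2"),
    ("cyclist_b", "race_2"),
]
cyclists = ["cyclist_a", "cyclist_b", "cyclist_c"]
races = ["race_1", "race_2"]

matrix = build_participation_matrix(participations, cyclists, races)
for cyclist in cyclists:
    # One row per cyclist, one column per race.
    print(cyclist, [matrix[cyclist][race] for race in races])
```

The same construction applies to the cyclist-stage matrix by replacing races with stages.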

Create examples and labels input from raw data

Given a race, a cyclist and the cyclist's workouts from the weeks before the race (the workouts time window), this process creates examples. Each example consists of a race, a cyclist and a summarized workout vector; the label of the example is the cyclist's participation in the race or stage. The input for the model can be created by stages or by races; the training section describes this choice.

Possible parameters:

Imputation: without, SimpleImputer, KNNImputer, IterativeImputer
Time Window Size: int, number of weeks
Data Source: STRAVA, TP
Workouts Aggregation Function: SmartAgg (use both AVG and SUM), Average

mandatory parameters:

-a create_input
-ti <team pcs id>            # same team ID as in the create_matrix step
-iw <workouts imputer>       # imputation method for the workouts table, or "without" to skip imputation
-t <time window size> 
-ws <data source> 
-af <aggregation function>

optional:

-rp 1                        # predict races instead of stages (stage prediction is the default)
-o 1                         # overwrite the existing files

Usage example

python -a create_input -ti 2738 -iw without -t 5 -ws STRAVA -af SmartAgg -rp 1 -o 1
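
The idea of summarizing the workouts inside the time window can be sketched as follows. This is an illustration only: the function `summarize_workouts`, the feature names and the dates are hypothetical, and I assume here that SmartAgg means computing both the average and the sum of every feature, which may differ from what the repository actually does.

```python
from datetime import date, timedelta
from statistics import mean


def summarize_workouts(workouts, race_date, window_weeks):
    """Aggregate each numeric workout feature over the weeks before the race."""
    window_start = race_date - timedelta(weeks=window_weeks)
    # Keep only workouts inside [window_start, race_date).
    in_window = [w for w in workouts if window_start <= w["date"] < race_date]
    summary = {}
    if not in_window:
        return summary
    features = [key for key in in_window[0] if key != "date"]
    for feature in features:
        values = [w[feature] for w in in_window]
        # Assumption: SmartAgg keeps both the average and the sum.
        summary[f"{feature}_avg"] = mean(values)
        summary[f"{feature}_sum"] = sum(values)
    return summary


# Hypothetical workouts of one cyclist.
workouts = [
    {"date": date(2021, 4, 1), "distance_km": 100.0, "duration_h": 3.0},  # outside the window
    {"date": date(2021, 5, 10), "distance_km": 80.0, "duration_h": 2.5},
    {"date": date(2021, 5, 20), "distance_km": 120.0, "duration_h": 4.0},
]

vector = summarize_workouts(workouts, race_date=date(2021, 6, 1), window_weeks=5)
print(vector)
```

The resulting vector, together with the race and cyclist attributes, would form one example; its label is the corresponding cell of the participation matrix.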

Preprocessing

Data cleaning and preprocessing using several methods: dropping example features with a high missing ratio, encoding categorical features, scaling the data and imputing missing values.

Possible parameters:

Imputation: without, SimpleImputer, KNNImputer, IterativeImputer
Time Window Size: int, number of weeks
Data Source: STRAVA, TP
Workouts Aggregation Function: SmartAgg (use both AVG and SUM), Average
Examples Features Non-Missing Ratio: without, float (a value of 0.4 drops example features whose missing ratio is 60% or greater)
Standardization: StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler

mandatory parameters:

-a preprocessing
-iw <workouts imputer>                      # same workouts imputer as in the create_input step
-t <time window size>                       # same number of weeks as in the create_input step
-ti <team pcs id>                           # same team ID as in the create_input step
-ws <data source>                           # same data source as in the create_input step
-af <aggregation function>                  # same aggregation function as in the create_input step
-i <examples imputer>                       # imputation method for the examples, or "without" to skip imputation
-c <examples features non-missing ratio>

optional:

-rp 1                                       # same rp value as in the create_input step
-o 1                                        # overwrite the existing files
-s <scaler>

Usage example

python -a preprocessing -iw without -t 5 -ti 2738 -ws STRAVA -af SmartAgg  -i SimpleImputer -c 0.4 -o 1 -rp 1 -s StandardScaler
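
The effect of the non-missing ratio (-c) can be sketched in plain Python as below. This is a simplified illustration, not the repository's code: the function `drop_sparse_features` and the feature names are hypothetical, and missing values are represented by None.

```python
def drop_sparse_features(rows, min_non_missing_ratio):
    """Drop features whose share of non-missing (non-None) values is too low."""
    features = list(rows[0].keys())
    n_rows = len(rows)
    kept = []
    for feature in features:
        non_missing = sum(row[feature] is not None for row in rows)
        # With a threshold of 0.4, a feature with a missing ratio of 60% or
        # greater (non-missing share of 0.4 or less) is dropped.
        if non_missing / n_rows > min_non_missing_ratio:
            kept.append(feature)
    return [{feature: row[feature] for feature in kept} for row in rows]


# Hypothetical examples: "cadence" is missing in 4 of 5 rows (80% missing).
rows = [
    {"power": 250, "heart_rate": 140, "cadence": None},
    {"power": 260, "heart_rate": None, "cadence": None},
    {"power": 240, "heart_rate": 150, "cadence": 90},
    {"power": 255, "heart_rate": None, "cadence": None},
    {"power": 245, "heart_rate": 145, "cadence": None},
]

cleaned = drop_sparse_features(rows, min_non_missing_ratio=0.4)
print(list(cleaned[0].keys()))
```

In this toy data "heart_rate" is kept because only 40% of its values are missing; the remaining gaps would then be handled by the chosen examples imputer.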

Evaluate popularity baselines

This task evaluates the popularity baselines. The popularity values are computed as features in the examples.

Possible parameters:

Imputation: without, SimpleImputer, KNNImputer, IterativeImputer
Time Window Size: int, number of weeks
Data Source: STRAVA, TP
Workouts Aggregation Function: SmartAgg (use both AVG and SUM), Average
Examples Features Non-Missing Ratio: without, float (a value of 0.4 drops example features whose missing ratio is 60% or greater)
Standardization: StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler

mandatory parameters:

-a eval_baselines
-iw <workouts imputer>                      # same workouts imputer as in the preprocessing step
-t <time window size>                       # same number of weeks as in the preprocessing step
-ti <team pcs id>                           # same team ID as in the preprocessing step
-ws <data source>                           # same data source as in the preprocessing step
-af <aggregation function>                  # same aggregation function as in the preprocessing step
-i <examples imputer>                       # same examples imputer as in the preprocessing step
-c <examples features non-missing ratio>    # same c value as in the preprocessing step

optional:

-rp 1                                       # same rp value as in the preprocessing step
-o 1                                        # overwrite the existing files
-s <scaler>                                 # same scaler as in the preprocessing step
-oi 1                                       # evaluate only important races (taken from PCS dropdown races list)

Usage example

python -a eval_baselines -iw without -i SimpleImputer -t 5 -ti 2738 -o 1 -ws STRAVA -af SmartAgg -c 0.4 -oi 1
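
As a rough illustration of what a popularity baseline does, the sketch below ranks candidate cyclists by how often they raced in the past and recommends the top ones. The exact popularity definitions used in RaceFit are given in the paper; the counting rule, the function `popularity_ranking` and the toy data here are my assumptions for the example.

```python
from collections import Counter


def popularity_ranking(past_participations, candidates, top_i):
    """Recommend the top_i candidates with the most past race participations."""
    counts = Counter(cyclist for cyclist, _race in past_participations)
    # Sort by descending count; ties are broken by name to keep the order stable.
    ranked = sorted(candidates, key=lambda cyclist: (-counts[cyclist], cyclist))
    return ranked[:top_i]


# Hypothetical history of (cyclist, race) participations.
past_participations = [
    ("a", "r1"), ("a", "r2"),
    ("b", "r1"),
    ("c", "r1"), ("c", "r2"), ("c", "r3"),
]
candidates = ["a", "b", "c", "d"]

recommended = popularity_ranking(past_participations, candidates, top_i=2)
print(recommended)
```

Such a baseline ignores workouts and race characteristics, which is what makes it a useful reference point for the learned model.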

Train Clustering Model

The method can optionally use clustering when aggregating cyclist-stage scores into cyclist-race scores. In that case, the clustering models must be trained before training RaceFit.

Possible parameters:

Number of Clusters: int, specify how many types of stages (clusters) to make

mandatory parameters:

-a clustering
-kc <number of clusters>

Usage example

python -a clustering -kc 3

Train and Evaluate RaceFit

This task runs the RaceFit algorithm and its evaluation, including the training of the models the algorithm uses. The Base Classifier is the cyclist-stage ranker; by default, the cyclist-race ranking is computed by averaging the stage scores. It is also possible to learn the function that aggregates stage scores into a race score. The Scores Classifier learns the weights of that function, the Split Fraction is the fraction of the training data used for this second-level classifier, and preparing the Scores Classifier's input requires a pretrained clustering model, selected by the number of clusters.

Possible parameters:

Action: train_model, eval_model or train_eval that combines both
Imputation: without, SimpleImputer, KNNImputer, IterativeImputer
Time Window Size: int, number of weeks
Data Source: STRAVA, TP
Workouts Aggregation Function: SmartAgg (use both AVG and SUM), Average
Examples Features Non-Missing Ratio: without, float (a value of 0.4 drops example features whose missing ratio is 60% or greater)
Standardization: StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
Base Classifier: CatBoost, AdaBoost, Logistic, DecisionTree, RandomForest, KNN, SVC, XGBoost, LGBM, GaussianNB, GradientBoosting
Scores Classifier: CatBoost, AdaBoost, Logistic, DecisionTree, RandomForest, KNN, SVC, XGBoost, LGBM, GaussianNB, GradientBoosting
Split Fraction: float, fraction of the training data used for the scores classifier
Number of Clusters: int, specify which one of the pretrained clustering models to use

mandatory parameters:

-a <action>
-iw <workouts imputer>                      # same workouts imputer as in the preprocessing step
-t <time window size>                       # same number of weeks as in the preprocessing step
-ti <team pcs id>                           # same team ID as in the preprocessing step
-ws <data source>                           # same data source as in the preprocessing step
-af <aggregation function>                  # same aggregation function as in the preprocessing step
-i <examples imputer>                       # same examples imputer as in the preprocessing step
-c <examples features non-missing ratio>    # same c value as in the preprocessing step
-m <base classifier>

optional:

-rp 1                                       # same rp value as in the preprocessing step
-o 1                                        # overwrite the existing files
-s <scaler>                                 # same scaler as in the preprocessing step
-oi 1                                       # evaluate only important races (taken from PCS dropdown races list)
-sm <scores classifier>
-sms <split fraction>
-kc <number of clusters>

Usage example - Base model training

python -a train_eval -iw without -i SimpleImputer -t 5 -ti 2738 -o 1 -ws STRAVA -af SmartAgg -c 0.4 -oi 1 -m CatBoost

Usage example - Scores model training

python -a train_eval -iw without -i SimpleImputer -t 5 -ti 2738 -o 1 -ws STRAVA -af SmartAgg -c 0.4 -oi 1 -m CatBoost -sm DecisionTree -sms 0.2 -kc 3
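
The default aggregation, averaging the cyclist-stage scores into a cyclist-race score and ranking by it, can be sketched as follows. The scores below are made-up numbers standing in for the base classifier's output, and the function `rank_cyclists_for_race` is a hypothetical name, not part of the repository.

```python
from statistics import mean


def rank_cyclists_for_race(stage_scores, top_i):
    """Average each cyclist's stage scores and return the top_i cyclists."""
    race_scores = {cyclist: mean(scores) for cyclist, scores in stage_scores.items()}
    ranked = sorted(race_scores, key=lambda cyclist: race_scores[cyclist], reverse=True)
    return ranked[:top_i], race_scores


# Hypothetical base-classifier scores, one per stage of the race.
stage_scores = {
    "a": [0.9, 0.7, 0.8],
    "b": [0.2, 0.4, 0.3],
    "c": [0.6, 0.5, 0.7],
}

selected, race_scores = rank_cyclists_for_race(stage_scores, top_i=2)
print(selected)
```

When a scores classifier is used, this plain average is replaced by a learned weighting of the stage scores.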

Plot Results

The analysis tools I used are the following:

Precision@i and Recall@i, where i is the number of cyclists recommended (for each parameter, with the results of the other parameters averaged)
Recall@(n+k), where n is the number of cyclists who participated and k is the gap left for the coach to choose from
Plot feature importance bar plots
Generate feature importance CSV files, sorted by importance ranking, for all teams
Plot Decision Tree top nodes
Plot CatBoost tree top nodes
Plot the learning curve of the model over time (Time graph)
AUC of the Precision@i-Recall@i graph (for each parameter, with the results of the other parameters averaged)
AUC of the Precision@i-Recall@i interaction between 2 parameters, based on the chronological order of use
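
For reference, the ranking metrics listed above can be computed as in the following minimal sketch. The function names and the toy ranking are hypothetical; the definitions follow the standard Precision@i and Recall@i, with Recall@(n+k) taken as Recall@i for i = n + k.

```python
def precision_at(ranked, participants, i):
    """Share of the top-i recommended cyclists who actually participated."""
    hits = sum(cyclist in participants for cyclist in ranked[:i])
    return hits / i


def recall_at(ranked, participants, i):
    """Share of the actual participants found among the top-i recommendations."""
    hits = sum(cyclist in participants for cyclist in ranked[:i])
    return hits / len(participants)


def recall_at_n_plus_k(ranked, participants, k):
    """Recall when recommending n + k cyclists, n being the number of participants."""
    return recall_at(ranked, participants, len(participants) + k)


# Hypothetical model ranking and the cyclists who actually rode the race.
ranked = ["a", "b", "c", "d", "e", "f"]
participants = {"a", "c", "e"}

print(precision_at(ranked, participants, 2))
print(recall_at(ranked, participants, 3))
print(recall_at_n_plus_k(ranked, participants, 2))
```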

You can adjust the configuration of the graphs to plot in the file "results_consts".

Main configs:

SINGLE_RACE_TYPE: ONE_DAY_RACES, MAJOR_TOURS, GRAND_TOURS (show results for only one type of race)
WORKOUTS_SRC: STRAVA, TP
SINGLE_RACE_TYPE: None for generating plots for all teams, or the name of a team (e.g. "Israel - Premier Tech")
with_baseline: plot the baseline lines
top_i: when plotting the Time graph, the number of cyclists recommended

Support

For support, you can contact me by email at [email protected] or reach me on LinkedIn.
