The main goal of RaceFit is to help coaches of professional cycling teams allocate cyclists to races. A secondary goal is to model the coach's decision-making process and to understand the most important factors in the decision. For a more detailed description of the method, I invite you to read the paper "Modelling Coach Decisions in Professional Cycling Teams".
The poster and the presentation given at the ECML-PKDD 2022 conference are in the deliverables directory.
Also, the paper is available here and the poster here.
A Python environment containing the necessary packages is required. For your convenience, a file named 'requirements.txt' is included, from which you can install the libraries easily.
To install the environment from the file, run the following command:
pip install -r requirements.txt
Most of the data is accessible and can be downloaded here. The data CSV files should be placed in a "db" folder in the working directory.
A description of the data can be found in the paper, in the section "The IPT’s Cyclists’ Workouts and Races Dataset", and in the README file contained in the ZIP file.
*The Training Peaks cyclist workouts cannot be published; the STRAVA workouts are available in the ZIP file.
These are the actions (tasks) the system allows:
- Create the team's cyclist participation matrices for races and for stages
- Create examples and labels input from raw data
- Preprocessing
- Evaluate popularity baselines
- Training clustering model
- Training and evaluation of RaceFit
In the next sections I describe the parameters of each task, with actual usage examples.
This task creates binary cyclist-race and cyclist-stage matrices. An entry is 1 if the cyclist participated in the race (or stage) and 0 otherwise. The team PCS IDs are:
Team | PCS Id |
---|---|
AG2R Citroën Team | 1060 |
CCC Team | 1103 |
Lotto Soudal | 2088 |
Team Jumbo-Visma | 1330 |
Israel - Premier Tech | 2738 |
Cofidis | 1136 |
Groupama - FDJ | 1187 |
Movistar Team | 2040 |
UAE Team Emirates | 1253 |
Trek - Segafredo | 1258 |
mandatory parameters:
-a create_matrix
-ti <team pcs id>
optional:
-o 1 (overwrite the existing files)
Usage example
python -a create_matrix -ti 1258 -o 1
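The binary participation matrix described in this section can be illustrated with a minimal sketch in plain Python. The cyclist and race identifiers below are made up for illustration, and the function name is hypothetical; this is not the repository's implementation, which reads the CSV files in "db".

```python
# Hypothetical participation records: (cyclist_id, race_id) pairs.
# These identifiers are invented for illustration only.
participations = [
    ("cyclist_a", "race_1"),
    ("cyclist_a", "race_3"),
    ("cyclist_b", "race_2"),
]
cyclists = ["cyclist_a", "cyclist_b", "cyclist_c"]
races = ["race_1", "race_2", "race_3"]


def build_participation_matrix(cyclists, races, participations):
    """Return {cyclist: {race: 0 or 1}} with 1 where the cyclist rode the race."""
    rode = set(participations)
    return {
        cyclist: {race: int((cyclist, race) in rode) for race in races}
        for cyclist in cyclists
    }


matrix = build_participation_matrix(cyclists, races, participations)
for cyclist in cyclists:
    # One row per cyclist, one column per race.
    print(cyclist, [matrix[cyclist][race] for race in races])
```

The same construction applies to the cyclist-stage matrix by replacing race identifiers with stage identifiers.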
Given a race, a cyclist and the cyclist's workouts from the last weeks before the race (we call this the workouts time window), this process creates examples. Each example consists of a race, a cyclist and a summarized workout vector; the label of the example is the cyclist's participation in the race/stage. The input for the model can be created by stages or by races; the training section describes this choice.
Possible parameters:
- Imputation: without, SimpleImputer, KNNImputer, IterativeImputer
- Time Window Size: int, number of weeks
- Data Source: STRAVA, TP
- Workouts Aggregation Function: SmartAgg (use both AVG and SUM), Average
mandatory parameters:
-a create_input
-ti <team pcs id> # use the same team id as in the matrix creation step
-iw <workouts imputer> # imputation method for the workouts table ("without" disables imputation)
-t <time window size>
-ws <data source>
-af <aggregation function>
optional:
-rp 1 # predict races instead of stages (stage prediction is the default)
-o 1 # overwrite the existing files
Usage example
python -a create_input -ti 2738 -iw without -t 5 -ws STRAVA -af SmartAgg -rp 1 -o 1
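To make the time window and the aggregation function concrete, here is a minimal sketch of summarizing a cyclist's workouts before a race. It assumes that SmartAgg emits both the sum and the average of each feature, as the parameter list above suggests; the feature names, the function name and the window boundaries (start inclusive, race day exclusive) are assumptions, not the repository's code.

```python
from datetime import date, timedelta


def summarize_workouts(workouts, race_date, weeks):
    """Aggregate the workouts of the `weeks` weeks before `race_date`.

    `workouts` is a list of dicts with a "date" key and numeric feature keys.
    Returns the sum and the average of every feature found in the window.
    """
    window_start = race_date - timedelta(weeks=weeks)
    # Assumption: the window includes its first day and excludes race day.
    in_window = [w for w in workouts if window_start <= w["date"] < race_date]
    features = sorted({k for w in in_window for k in w if k != "date"})
    summary = {}
    for feature in features:
        values = [w[feature] for w in in_window if w.get(feature) is not None]
        if not values:
            continue  # no observed value for this feature in the window
        summary[f"{feature}_sum"] = sum(values)
        summary[f"{feature}_avg"] = sum(values) / len(values)
    return summary


# Hypothetical workouts with made-up feature names.
workouts = [
    {"date": date(2022, 2, 1), "distance_km": 200.0, "duration_h": 7.0},
    {"date": date(2022, 4, 10), "distance_km": 80.0, "duration_h": 3.0},
    {"date": date(2022, 4, 24), "distance_km": 120.0, "duration_h": 4.0},
]
summary = summarize_workouts(workouts, race_date=date(2022, 5, 1), weeks=5)
print(summary)
```

With a five-week window, the February workout falls outside the window and does not contribute to the summarized vector.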
Data cleaning and preprocessing using several methods: dropping example features with a high missing-value ratio, encoding categorical features, scaling the data, and imputing missing values.
Possible parameters:
- Imputation: without, SimpleImputer, KNNImputer, IterativeImputer
- Time Window Size: int, number of weeks
- Data Source: STRAVA, TP
- Workouts Aggregation Function: SmartAgg (use both AVG and SUM), Average
- Examples Features Non-Missing Ratio: without, float (a value of 0.4 drops example features whose missing ratio is 60% or greater)
- Standardization: StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
mandatory parameters:
-a preprocessing
-iw <workouts imputer> # use the same workouts imputer as in the create_input step
-t <time window size> # use the same number of weeks as in the create_input step
-ti <team pcs id> # use the same team id as in the create_input step
-ws <data source> # use the same data source as in the create_input step
-af <aggregation function> # use the same aggregation function as in the create_input step
-i <examples imputer> # imputation method for the examples ("without" disables imputation)
-c <examples features non-missing ratio>
optional:
-rp 1 # use the same rp value as in the create_input step
-o 1 # overwrite the existing files
-s <scaler>
Usage example
python -a preprocessing -iw without -t 5 -ti 2738 -ws STRAVA -af SmartAgg -i SimpleImputer -c 0.4 -o 1 -rp 1 -s StandardScaler
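The effect of the non-missing ratio (-c) can be sketched as follows. This is an illustrative version in plain Python, assuming missing values are represented as None; the feature names and the function name are hypothetical, and the boundary rule (drop at a missing ratio of 1 - c or greater) follows the parameter description above.

```python
def drop_sparse_features(examples, non_missing_ratio):
    """Drop features whose missing ratio is >= 1 - non_missing_ratio.

    `examples` is a list of dicts; None (or an absent key) marks a missing value.
    Returns the filtered examples and the list of kept feature names.
    """
    n = len(examples)
    features = sorted({k for ex in examples for k in ex})
    kept = []
    for feature in features:
        missing = sum(1 for ex in examples if ex.get(feature) is None)
        if missing / n < 1 - non_missing_ratio:
            kept.append(feature)
    return [{f: ex.get(f) for f in kept} for ex in examples], kept


# Hypothetical examples: power_avg is missing in 3 of 4 rows (75%),
# hr_avg in 1 of 4 rows (25%), distance_sum in none.
examples = [
    {"power_avg": 250.0, "hr_avg": 140.0, "distance_sum": 300.0},
    {"power_avg": None, "hr_avg": 150.0, "distance_sum": 280.0},
    {"power_avg": None, "hr_avg": None, "distance_sum": 310.0},
    {"power_avg": None, "hr_avg": 145.0, "distance_sum": 295.0},
]
filtered, kept = drop_sparse_features(examples, non_missing_ratio=0.4)
print(kept)
```

With -c 0.4, only power_avg is dropped here, since 75% of its values are missing; the remaining gaps in hr_avg would then be handled by the examples imputer.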
This task evaluates the popularity baselines. The popularity values are computed as features in the examples.
Possible parameters:
- Imputation: without, SimpleImputer, KNNImputer, IterativeImputer
- Time Window Size: int, number of weeks
- Data Source: STRAVA, TP
- Workouts Aggregation Function: SmartAgg (use both AVG and SUM), Average
- Examples Features Non-Missing Ratio: without, float (a value of 0.4 drops example features whose missing ratio is 60% or greater)
- Standardization: StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
mandatory parameters:
-a eval_baselines
-iw <workouts imputer> # use the same workouts imputer as in the preprocessing step
-t <time window size> # use the same number of weeks as in the preprocessing step
-ti <team pcs id> # use the same team id as in the preprocessing step
-ws <data source> # use the same data source as in the preprocessing step
-af <aggregation function> # use the same aggregation function as in the preprocessing step
-i <examples imputer> # use the same examples imputer as in the preprocessing step
-c <examples features non-missing ratio> # use the same c value as in the preprocessing step
optional:
-rp 1 # use the same rp value as in the preprocessing step
-o 1 # overwrite the existing files
-s <scaler> # use the same scaler as in the preprocessing step
-oi 1 # evaluate only important races (taken from PCS dropdown races list)
Usage example
python -a eval_baselines -iw without -i SimpleImputer -t 5 -ti 2738 -o 1 -ws STRAVA -af SmartAgg -c 0.4 -oi 1
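For intuition, one common form of a popularity baseline ranks cyclists by how often they raced in the past and recommends the most frequent ones. The sketch below illustrates that idea only; it is an assumption about what "popularity" may look like, the baselines actually evaluated are defined in the paper, and all names are hypothetical.

```python
from collections import Counter


def popularity_ranking(past_participations):
    """Rank cyclists by their number of past races, most frequent first."""
    counts = Counter(cyclist for cyclist, _race in past_participations)
    # Sort by count (descending), then by name so that ties are deterministic.
    return sorted(counts, key=lambda c: (-counts[c], c))


# Hypothetical history of (cyclist, race) participations.
history = [
    ("cyclist_a", "race_1"),
    ("cyclist_a", "race_2"),
    ("cyclist_a", "race_3"),
    ("cyclist_b", "race_1"),
    ("cyclist_c", "race_2"),
    ("cyclist_c", "race_3"),
]
ranking = popularity_ranking(history)
top_2 = ranking[:2]  # recommend the two most "popular" cyclists
print(ranking, top_2)
```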
The method can optionally include a clustering step when aggregating the cyclist-stage scores into cyclist-race scores. In that case, the clustering models must be trained before the RaceFit training.
Possible parameters:
- Number of Clusters: int, specify how many types of stages (clusters) to make
mandatory parameters:
-a clustering
-kc <number of clusters>
Usage example
python -a clustering -kc 3
This task runs the RaceFit algorithm and its evaluation, including the training of the models the algorithm uses. The Base Classifier is the cyclist-stage ranker; by default, the cyclist-race ranking is computed by averaging the stage scores. Alternatively, the aggregation function from stages to race can be learned. The Scores Classifier learns the weights of this function, the Split Fraction is the fraction of the training data used for this second-level classifier, and preparing the Scores Classifier's input requires a pretrained clustering model, specified by the number of clusters.
Possible parameters:
- Action: train_model, eval_model or train_eval that combines both
- Imputation: without, SimpleImputer, KNNImputer, IterativeImputer
- Time Window Size: int, number of weeks
- Data Source: STRAVA, TP
- Workouts Aggregation Function: SmartAgg (use both AVG and SUM), Average
- Examples Features Non-Missing Ratio: without, float (a value of 0.4 drops example features whose missing ratio is 60% or greater)
- Standardization: StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
- Base Classifier: CatBoost, AdaBoost, Logistic, DecisionTree, RandomForest, KNN, SVC, XGBoost, LGBM, GaussianNB, GradientBoosting
- Scores Classifier: CatBoost, AdaBoost, Logistic, DecisionTree, RandomForest, KNN, SVC, XGBoost, LGBM, GaussianNB, GradientBoosting
- Split Fraction: float, fraction of the training data used for the scores classifier (e.g. 0.2)
- Number of Clusters: int, specify which one of the pretrained clustering models to use
mandatory parameters:
-a <action>
-iw <workouts imputer> # use the same workouts imputer as in the preprocessing step
-t <time window size> # use the same number of weeks as in the preprocessing step
-ti <team pcs id> # use the same team id as in the preprocessing step
-ws <data source> # use the same data source as in the preprocessing step
-af <aggregation function> # use the same aggregation function as in the preprocessing step
-i <examples imputer> # use the same examples imputer as in the preprocessing step
-c <examples features non-missing ratio> # use the same c value as in the preprocessing step
-m <base classifier>
optional:
-rp 1 # use the same rp value as in the preprocessing step
-o 1 # overwrite the existing files
-s <scaler> # use the same scaler as in the preprocessing step
-oi 1 # evaluate only important races (taken from PCS dropdown races list)
-sm <scores classifier>
-sms <split fraction>
-kc <number of clusters>
Usage example - Base model training
python -a train_eval -iw without -i SimpleImputer -t 5 -ti 2738 -o 1 -ws STRAVA -af SmartAgg -c 0.4 -oi 1 -m CatBoost
Usage example - Scores model training
python -a train_eval -iw without -i SimpleImputer -t 5 -ti 2738 -o 1 -ws STRAVA -af SmartAgg -c 0.4 -oi 1 -m CatBoost -sm DecisionTree -sms 0.2 -kc 3
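The default aggregation described in this section, averaging the cyclist-stage scores into a cyclist-race score and ranking by it, can be sketched as below. The scores and names are invented for illustration; in the real pipeline the stage scores would come from the trained Base Classifier, and this is not the repository's code.

```python
def rank_cyclists_for_race(stage_scores):
    """Average per-stage scores into a race score and rank the cyclists.

    `stage_scores` maps each cyclist to a list of scores, one per stage.
    Returns the ranking (best first) and the averaged race scores.
    """
    race_scores = {
        cyclist: sum(scores) / len(scores)
        for cyclist, scores in stage_scores.items()
        if scores  # skip cyclists without any stage score
    }
    # Higher score first; ties are broken by name for determinism.
    ranking = sorted(race_scores, key=lambda c: (-race_scores[c], c))
    return ranking, race_scores


# Hypothetical stage scores for a three-stage race.
stage_scores = {
    "cyclist_a": [0.75, 0.5, 1.0],
    "cyclist_b": [0.25, 0.5, 0.0],
    "cyclist_c": [0.5, 0.5, 0.5],
}
ranking, race_scores = rank_cyclists_for_race(stage_scores)
print(ranking, race_scores)
```

When a Scores Classifier is used, this fixed average is replaced by learned weights, so the sketch covers only the default behaviour.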
The analysis tools I used are the following:
- Precision@i, Recall@i, where i is the number of cyclists recommended (for each parameter, with the results of the other parameters averaged)
- Recall@(n+k), where n is the number of cyclists who participated and k is a margin of extra candidates for the coach to choose from
- Plot Feature Importance bar plots
- Generate feature importance csv files sorted by ranking of importance for all teams
- Plot Decision Tree top nodes
- Plot CatBoost tree top nodes
- Plot the learning curve of the model over time (Time graph)
- AUC of the Precision@i-Recall@i graph (for each parameter, with the results of the other parameters averaged)
- AUC of the Precision@i-Recall@i graph for the interaction between two parameters, based on their chronological order of use
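The ranking metrics in the list above can be written down in a few lines. The sketch below uses the standard definitions of Precision@i and Recall@i and derives Recall@(n+k) from them; the function names and the example ranking are hypothetical and not taken from the repository.

```python
def precision_recall_at(ranking, participants, i):
    """Precision@i and Recall@i for a ranked list of recommended cyclists."""
    if i <= 0:
        raise ValueError("i must be a positive integer")
    hits = sum(1 for cyclist in ranking[:i] if cyclist in participants)
    precision = hits / i
    recall = hits / len(participants) if participants else 0.0
    return precision, recall


def recall_at_n_plus_k(ranking, participants, k):
    """Recall when recommending n + k cyclists, n being the number who raced."""
    n = len(participants)
    return precision_recall_at(ranking, participants, n + k)[1]


# Hypothetical model ranking and actual race participants.
ranking = ["cyclist_a", "cyclist_b", "cyclist_c", "cyclist_d", "cyclist_e"]
participants = {"cyclist_a", "cyclist_c", "cyclist_e"}

p_at_2, r_at_2 = precision_recall_at(ranking, participants, 2)
r_n_plus_1 = recall_at_n_plus_k(ranking, participants, k=1)
print(p_at_2, r_at_2, r_n_plus_1)
```

Here n is 3, so Recall@(n+1) looks at the top 4 recommendations, two of which are actual participants.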
The configuration of the plots can be adjusted in the file "results_consts".
Main configs:
- SINGLE_RACE_TYPE: ONE_DAY_RACES, MAJOR_TOURS, GRAND_TOURS (show results only for one type of race)
- WORKOUTS_SRC: STRAVA, TP
- SINGLE_RACE_TYPE: None for generating plots for all teams, or the name of a team (e.g. "Israel - Premier Tech")
- with_baseline: plot baselines lines
- top_i: when plotting the Time graph, defines the number of cyclists recommended
For support, you can contact me by email at [email protected] or reach me on LinkedIn.