Adding references
jasonfan1997 committed Oct 31, 2024
1 parent 462874a commit 12e2a38
Showing 2 changed files with 11 additions and 12 deletions.
7 changes: 7 additions & 0 deletions paper/paper.bib
@@ -186,6 +186,13 @@ @article{taquet2022mapie
year={2022}
}

@article{uncertaintyToolbox,
title={Uncertainty Toolbox: an Open-Source Library for Assessing, Visualizing, and Improving Uncertainty Quantification},
author={Chung, Youngseog and Char, Ian and Guo, Han and Schneider, Jeff and Neiswanger, Willie},
journal={arXiv preprint arXiv:2109.10254},
year={2021}
}

@Manual{ResourceSelection,
title = {ResourceSelection: Resource Selection (Probability) Functions for Use-Availability Data},
author = {Subhash R. Lele and Jonah L. Keim and Peter Solymos},
16 changes: 4 additions & 12 deletions paper/paper.md
@@ -53,13 +53,13 @@ Classification is one of the most fundamental tasks in machine learning. Classif

We refer to calibration as the agreement between the predicted probability and the true posterior probability of a class-of-interest, $P(D=1|\hat{p}=p) = p$. This is also termed moderate calibration by @Calster_weak_cal.

In the `calzone` package, we provide a set of functions and classes for calibration visualization and metrics computation. Existing libraries such as `scikit-learn` are often not dedicated to calibration metrics computation and don't provide calibration metrics computation that are widely used in the statistical literature. Other libraries are focused on implementing calibration methods instead of ways to evaluate calibration [TODO: cite].
In the `calzone` package, we provide a set of functions and classes for calibration visualization and metrics computation. Existing libraries such as `scikit-learn` are generally not dedicated to calibration metrics computation and do not provide the calibration metrics that are widely used in the statistical literature. Other libraries such as `uncertainty-toolbox` focus on implementing calibration methods and visualization rather than on ways to evaluate calibration [@uncertaintyToolbox].

# Functionality

## Reliability Diagram

The reliability diagram is a graphical representation of the calibration of a classification model [@Brocker_reldia]. It groups the predicted probabilities into bins and plots the mean predicted probability against the empirical frequency in each bin. The reliability diagram can be used to assess the calibration of the model and to identify any systematic errors in the predictions. In addition, `calzone` gives the option to also plot the confidence interval of the empirical frequency in each bin. The confidence intervals are calculated using Wilson's score interval [@wilson_interval]. We provide an example analsis in the `example_data` folder using beta-binomial distribution [@beta-binomial]. Users can generate simulated data using the `fake_binary_data_generator` class in the `utils` module.
The reliability diagram is a graphical representation of the calibration of a classification model [@Brocker_reldia]. It groups the predicted probabilities into bins and plots the mean predicted probability against the empirical frequency in each bin. The reliability diagram can be used to assess the calibration of the model and to identify any systematic errors in the predictions. In addition, `calzone` gives the option to also plot the confidence interval of the empirical frequency in each bin. The confidence intervals are calculated using Wilson's score interval [@wilson_interval]. We provide an example analysis in the `example_data` folder using a beta-binomial distribution [@beta-binomial]. Users can generate simulated data using the `fake_binary_data_generator` class in the `utils` module.
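
As an illustration of the per-bin confidence intervals mentioned above, the Wilson score interval for a bin with $k$ observed events out of $n$ cases can be computed directly from its closed form. The following is a minimal NumPy/SciPy sketch, independent of the `calzone` implementation:

```python
import numpy as np
from scipy.stats import norm

def wilson_interval(k, n, alpha=0.05):
    """Wilson score interval for an observed frequency of k events in n cases."""
    z = norm.ppf(1 - alpha / 2)
    p_hat = k / n
    center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
    half_width = (z / (1 + z**2 / n)) * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width

# e.g., 35 events observed among the 50 predictions falling in one bin
print(wilson_interval(35, 50))
```

Unlike the normal-approximation interval, the Wilson interval stays within $[0, 1]$ and behaves reasonably for bins with few cases.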

```python
from calzone.utils import reliability_diagram
@@ -91,7 +91,7 @@ plot_reliability_diagram(
`calzone` provides functions to compute various calibration metrics. The `CalibrationMetrics()` class allows the user to compute the calibration metrics in a more convenient way. The following are metrics that are currently supported in `calzone`:

### Expected Calibration Error (ECE) and Maximum Calibration Error (MCE)
Expected Calibration Error (ECE), Maximum Calibration Error (MCE) and other binning-based methods [@guo_calibration;@Naeini_ece] aim to measure the average deviation between predicted probability and true probability. We provide the option to use equal-width binning or equal-count binning, labeled as ECE-H and ECE-C respectively. Users can also choose to compute the metrics for the class-of-interest or the top-class. In the case of class-of-interest, `calzone` will evaluate the calibration of a one-vs-rest classification problem. The following snipped demonstrates how these metrics are calculated in our package:
Expected Calibration Error (ECE), Maximum Calibration Error (MCE) and other binning-based methods [@guo_calibration;@Naeini_ece] aim to measure the average deviation between predicted probability and true probability. We provide the option to use equal-width binning or equal-count binning, labeled as ECE-H and ECE-C respectively. Users can also choose to compute the metrics for the class-of-interest or the top-class. In the case of class-of-interest, `calzone` will evaluate the calibration of a one-vs-rest classification problem. The following snippet demonstrates how these metrics are calculated in our package:

```python
from calzone.metrics import calculate_ece_mce
@@ -113,11 +113,7 @@ ece_h_classone, mce_h_classone = calculate_ece_mce(


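For reference, the equal-width variants of ECE and MCE can also be computed directly from their definitions. The following NumPy sketch is a conceptual illustration only and is not the `calzone` implementation provided by `calculate_ece_mce`:

```python
import numpy as np

def ece_mce_equal_width(y_true, y_prob, n_bins=10):
    """Equal-width binning estimate of ECE (count-weighted mean gap) and MCE (largest gap)."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # assign each prediction to a bin; clip so that p = 1.0 falls in the last bin
    bin_idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    ece, mce, n = 0.0, 0.0, len(y_prob)
    for b in range(n_bins):
        mask = bin_idx == b
        if not mask.any():
            continue
        gap = abs(y_prob[mask].mean() - y_true[mask].mean())
        ece += mask.sum() / n * gap
        mce = max(mce, gap)
    return ece, mce

rng = np.random.default_rng(0)
p = rng.uniform(size=1000)
y = rng.binomial(1, p)  # simulated outcomes that are well calibrated by construction
print(ece_mce_equal_width(y, p))
```
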
### Hosmer-Lemeshow statistic (HL)
<<<<<<< HEAD
Hosmer-Lemeshow statistic (HL) is a statistical test for the calibration of a probabilistic model [@hl_test]. It is a chi-square based test that compares the observed and expected number of events in each bin. The null hypothesis is that the model is well calibrated. HL-test first bins data into predicted probability bins (equal-width $H$ or equal-count $C$) and the test statistic is calculated as:
=======
The Hosmer-Lemeshow (HL) statistical test is for evaluating the calibration of a probabilistic model. It is a chi-square-based test that compares the observed and expected number of events in each bin. The null hypothesis is that the model is well calibrated. HL-test first bins data into predicted probability bins (equal-width $H$ or equal-count $C$) and the test statistic is calculated as:
>>>>>>> cdd3f8ac405788615653f94d7b4144af7e9acab0
$$
\text{HL} = \sum_{m=1}^{M} \frac{(O_{1,m}-E_{1,m})^2}{E_{1,m}(1-\frac{E_{1,m}}{N_m})} \sim \chi^2_{M-2}
$$
@@ -131,11 +127,7 @@ HL_H_ts, HL_H_p, df = hosmer_lemeshow_test(
bin_count=bin_counts
)
```
<<<<<<< HEAD
When performing the HL test on validation sets that are not used in training, the degree of freedom of the HL test changes from $M-2$ to $M$. Intuitively, $\frac{(O_{1,m}-E_{1,m})^2}{E_{1,m}(1-\frac{E_{1,m}}{N_m})}$ is the difference squared divided by the variance of a binomial distribution and follows a chi-square distribution with 1 degree of freedom. Hence, the sum of $M$ chi-square distributions with 1 degree of freedom is a chi-square distribution with $M$ degrees of freedom if the data has no effect on the model [@hosmer2013applied]. The increase in degree of freedom for validation samples has often been overlooked but it is crucial for the test to maintain the correct type 1 error rate. In `calzone`, users can specify the degree of freedom of the HL test by setting the `df` parameter.
=======
In `calzone`, user can sepecify the degree of freedom of the HL test by setting the `df` parameter.This is useful because when performing the HL test on validation sets that are not used in training, the degree of freedom of the HL test changes from $M-2$ to $M$ [TODO: cite]. Intuitively, $\frac{(O_{1,m}-E_{1,m})^2}{E_{1,m}(1-\frac{E_{1,m}}{N_m})}$ is the difference squared divided by the variance of a binomial distribution and follows a chi-square distribution with 1 degree of freedom. Hence, the sum of $M$ chi-square distributions with 1 degree of freedom is a chi-square distribution with $M$ degrees of freedom if the data has no effect on the model.
>>>>>>> cdd3f8ac405788615653f94d7b4144af7e9acab0
When performing the HL test on validation sets that are not used in training, the degrees of freedom of the HL test change from $M-2$ to $M$ [@hosmer2013applied]. Intuitively, $\frac{(O_{1,m}-E_{1,m})^2}{E_{1,m}(1-\frac{E_{1,m}}{N_m})}$ is the squared difference divided by the variance of a binomial distribution and follows a chi-square distribution with 1 degree of freedom. Hence, if the data had no effect on the model, the sum of $M$ such terms follows a chi-square distribution with $M$ degrees of freedom. The increase in degrees of freedom for validation samples has often been overlooked, but it is crucial for the test to maintain the correct type I error rate. In `calzone`, users can specify the degrees of freedom of the HL test by setting the `df` parameter.
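
The statistic and the two degree-of-freedom conventions can be sketched directly from the formula above. This NumPy/SciPy illustration is not the `hosmer_lemeshow_test` implementation; the per-bin inputs are assumed to be observed events, expected events, and bin sizes:

```python
import numpy as np
from scipy.stats import chi2

def hl_statistic(observed, expected, counts, validation=False):
    """Hosmer-Lemeshow statistic from per-bin observed events, expected events and bin sizes."""
    observed, expected, counts = (np.asarray(a, dtype=float) for a in (observed, expected, counts))
    hl = np.sum((observed - expected) ** 2 / (expected * (1 - expected / counts)))
    m = len(counts)
    df = m if validation else m - 2  # external validation data: df = M instead of M - 2
    return hl, chi2.sf(hl, df), df

# toy example with 5 equal-count bins of 40 cases each
obs = [3, 8, 14, 22, 35]
exp = [4.0, 9.0, 13.0, 24.0, 33.0]
n = [40, 40, 40, 40, 40]
print(hl_statistic(obs, exp, n, validation=True))
```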

### Cox's calibration slope/intercept
Cox's calibration slope/intercept is a regression analysis method for assessing the calibration of a probabilistic model [@Cox]. A new logistic regression model is fitted to the data, with the log odds (logit) of the predicted probability, $\log\frac{p}{1-p}$, as the independent variable and the outcome as the dependent variable. The fitted slope and intercept are then used to assess the calibration of the model: a slope of 1 and an intercept of 0 indicate perfect calibration. To test whether the model is calibrated, fix the slope to 1 and fit the intercept; if the intercept is significantly different from 0, the model is not calibrated. Then fix the intercept to 0 and fit the slope; if the slope is significantly different from 1, the model is not calibrated.
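
A minimal sketch of this regression with `statsmodels`, under the standard log-odds (logit) formulation, is shown below. It is an illustration rather than the `calzone` implementation, and the helper name `cox_slope_intercept` is introduced here only for demonstration:

```python
import numpy as np
import statsmodels.api as sm

def cox_slope_intercept(y_true, y_prob, eps=1e-10):
    """Fit logit(P(D=1)) = a + b * logit(p_hat); b is the calibration slope, a the intercept."""
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    logit_p = np.log(p / (1 - p))
    fit = sm.GLM(y_true, sm.add_constant(logit_p), family=sm.families.Binomial()).fit()
    intercept, slope = fit.params
    return slope, intercept

# The fixed-slope test described above (slope held at 1, intercept fitted) can be
# sketched by passing logit_p as an offset and fitting an intercept-only model:
# sm.GLM(y_true, np.ones_like(logit_p), family=sm.families.Binomial(), offset=logit_p).fit()
```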
