diff --git a/paper/paper.md b/paper/paper.md
index 249b42f..9a7fd4d 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -48,6 +48,7 @@ bibliography: paper.bib
# Statement of need

Classification is one of the most common applications in machine learning. Classification models are often evaluated by a proper scoring rule - a scoring function that assigns the best score when the predicted probabilities match the true probabilities - such as cross-entropy or mean square error [@gneiting2007strictly]. Examination of discrimination performance (resolution), via metrics such as AUC or sensitivity/specificity (Se/Sp), is also used to evaluate model performance. However, the reliability, or calibration, performance of the model is often overlooked.

+@DIAMOND199285 showed that the resolution performance of a model does not indicate its reliability. @Brocker_decompose later showed that any proper scoring rule can be decomposed into resolution and reliability components. Thus, even a model with high resolution (high AUC) may not be reliable or well calibrated. In many high-risk machine learning applications, such as medical diagnosis, the reliability of the model is of paramount importance.

We define calibration as the agreement between the predicted probability and the true posterior probability of a class-of-interest, $P(D=1|\hat{p}=p) = p$. This has been defined as moderate calibration by @Calster_weak_cal.

@@ -58,7 +59,7 @@ In the `calzone` package, we provide a set of functions and classes for calibrat
## Reliability Diagram

-The reliability diagram (also referred to as a calibration plot) is a graphical representation of the calibration of a classification model [@Brocker_reldia;steyerberg2010assessing]. It groups the predicted probabilities into bins and plots the mean predicted probability against the empirical frequency in each bin. The reliability diagram can be used to assess the calibration of the model and to identify any systematic errors in the predictions. In addition, `calzone` gives the option to also plot the confidence interval of the empirical frequency in each bin. The confidence intervals are calculated using Wilson's score interval [@wilson_interval]. We provide example data in the `example_data` folder which are simulated using a beta-binomial distribution [@beta-binomial]. The predicted probabilities are sampled from a beta distribution and the true labels are assigned using a Bernoulli trial with the sampled probabilities. Users can generate simulated data using the `fake_binary_data_generator` class in the `utils` module.
+The reliability diagram (also referred to as a calibration plot) is a graphical representation of the calibration of a classification model [@Brocker_reldia; @steyerberg2010assessing]. It groups the predicted probabilities into bins and plots the mean predicted probability against the empirical frequency in each bin. The reliability diagram can be used to assess the calibration of the model and to identify any systematic errors in the predictions. In addition, `calzone` also gives the option to plot the confidence interval of the empirical frequency in each bin. The confidence intervals are calculated using Wilson's score interval [@wilson_interval]. We provide example data in the `example_data` folder which are simulated using a beta-binomial distribution [@beta-binomial]. The predicted probabilities are sampled from a beta distribution and the true labels are assigned by performing Bernoulli trials with the sampled probabilities.
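For reference, the standard form of the Wilson score interval for a bin containing $n$ samples with empirical frequency $\hat{p}$, at normal quantile $z$, is

$$
\frac{\hat{p} + \frac{z^2}{2n}}{1 + \frac{z^2}{n}} \;\pm\; \frac{z}{1 + \frac{z^2}{n}} \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}.
$$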
Users can generate simulated data using the `fake_binary_data_generator` class in the `utils` module.

```python
from calzone.utils import reliability_diagram
```
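As a minimal sketch of the computation the reliability diagram performs, the snippet below simulates beta-binomial data as described above, bins the predictions, and computes a Wilson score interval for the empirical frequency in each bin. It is written in plain NumPy and deliberately avoids the `calzone` API (whose exact signatures are not shown in this excerpt); all names below are illustrative only.

```python
# Illustrative sketch only: this mirrors the computation described in the
# text using plain NumPy; it is NOT the calzone API, and every name below
# (y_proba, y_true, n_bins, ...) is hypothetical.
import numpy as np

rng = np.random.default_rng(seed=0)

# Beta-binomial style simulation: probabilities from a beta distribution,
# labels from per-sample Bernoulli trials with those probabilities.
n_samples = 5000
y_proba = rng.beta(a=2.0, b=5.0, size=n_samples)
y_true = rng.binomial(n=1, p=y_proba)

# Bin the predictions and compare mean prediction to empirical frequency.
n_bins = 10
edges = np.linspace(0.0, 1.0, n_bins + 1)
bin_idx = np.clip(np.digitize(y_proba, edges) - 1, 0, n_bins - 1)

z = 1.96  # normal quantile for an ~95% interval
for b in range(n_bins):
    mask = bin_idx == b
    n = int(mask.sum())
    if n == 0:
        continue  # empty bin: nothing to plot
    mean_pred = y_proba[mask].mean()  # x-coordinate in the diagram
    p_hat = y_true[mask].mean()       # y-coordinate (empirical frequency)
    # Wilson score interval for the empirical frequency in this bin.
    denom = 1.0 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * np.sqrt(p_hat * (1.0 - p_hat) / n + z**2 / (4 * n**2))
    print(f"bin {b}: mean_pred={mean_pred:.3f}, freq={p_hat:.3f} "
          f"[{center - half:.3f}, {center + half:.3f}] (n={n})")
```

Because the labels are generated by Bernoulli trials with the predicted probabilities themselves, this synthetic data is perfectly calibrated by construction, so the per-bin empirical frequencies should track the mean predictions to within the Wilson intervals.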