Reference Drift Metrics #426

Open
emrynHofmannElephant opened this issue Oct 4, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@emrynHofmannElephant

When calculating univariate drift, you "fit" the drift calculator on the reference data. How are the drift metrics for the chunks in the reference data then calculated? Are they compared to the overall distribution of the reference data?

@jakubnml
Contributor

jakubnml commented Oct 4, 2024

Yes, that's how it is done currently, and we are aware that it is not the optimal way. Good job spotting that, though 👏

So the correct way is this: when calculating a drift metric for a chunk that is a subset of the reference data, the observations that belong to that chunk should be "removed" from the reference data before the comparison, just like in cross-validation. Otherwise some of the drift metrics come out lower than they really should, because one dataset (the reference chunk) is a subset of the other (the whole reference). As a result, in an extreme situation, one may have perfectly iid data and still see lower drift metrics on reference chunks than on monitored (analysis) data, even though with iid data they shouldn't be.
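To make the effect concrete, here is a minimal sketch in plain Python. It is not the library's code: the two-sample Kolmogorov-Smirnov statistic is used only as a stand-in drift metric, and the helper names (`ecdf`, `ks_statistic`, `chunk_vs_reference`) are made up for illustration. Each chunk of iid reference data is compared once against the whole reference and once against the reference with that chunk held out.

```python
import bisect
import random


def ecdf(sorted_sample, x):
    """Fraction of observations <= x (right-continuous empirical CDF)."""
    return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a - ECDF_b|."""
    sa, sb = sorted(a), sorted(b)
    # The maximum gap between two step functions occurs at an observed point.
    return max(abs(ecdf(sa, x) - ecdf(sb, x)) for x in sa + sb)


def chunk_vs_reference(reference, chunk_size):
    """For each chunk, return (metric vs. full reference, metric vs. held-out rest)."""
    rows = []
    for start in range(0, len(reference), chunk_size):
        chunk = reference[start:start + chunk_size]
        rest = reference[:start] + reference[start + chunk_size:]
        rows.append((ks_statistic(chunk, reference), ks_statistic(chunk, rest)))
    return rows


# Illustrative data only: 1000 iid standard-normal draws, 10 chunks of 100.
random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(1000)]
rows = chunk_vs_reference(reference, chunk_size=100)

for in_sample, held_out in rows:
    print(f"vs full reference: {in_sample:.4f}   vs held-out rest: {held_out:.4f}")
```

For this particular statistic the shrinkage is exact: the ECDF of the full reference is a weighted average of the chunk's ECDF and the rest's ECDF, so the in-sample value equals the held-out value multiplied by (1 - chunk_size / reference_size). Other drift metrics will not shrink by the same factor, but they are biased in the same direction.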

We plan to fix this, either by enforcing the new, correct way, or by making it the default while keeping the old way as an option, since it may sometimes be preferable because of its lower computational cost. I can't say exactly when, because our current focus is on research related to performance estimation methods.

Until we fix it, if you really want to, you can hack it yourself by fitting the calculator multiple times, each time on a subset of the reference data that does not contain the reference chunk of interest.
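A rough sketch of that loop, under loud assumptions: `ToyDriftCalculator` is a hypothetical stand-in with a `fit`/`calculate` shape and is not the library's calculator, the data is a plain list rather than a DataFrame, and `leave_chunk_out_drift` is a name invented here. The point is only the pattern of refitting once per reference chunk on everything except that chunk.

```python
import random
import statistics


class ToyDriftCalculator:
    """Hypothetical stand-in for a drift calculator with fit/calculate methods."""

    def fit(self, reference):
        # Remember summary statistics of whatever data we were fitted on.
        self._mean = statistics.fmean(reference)
        self._std = statistics.pstdev(reference)
        return self

    def calculate(self, chunk):
        # Absolute shift of the chunk mean, in units of the fitted standard deviation.
        return abs(statistics.fmean(chunk) - self._mean) / self._std


def leave_chunk_out_drift(reference, chunk_size, make_calculator):
    """Refit a fresh calculator per chunk on the reference data minus that chunk."""
    results = []
    for start in range(0, len(reference), chunk_size):
        chunk = reference[start:start + chunk_size]
        rest = reference[:start] + reference[start + chunk_size:]
        calculator = make_calculator().fit(rest)
        results.append(calculator.calculate(chunk))
    return results


# Illustrative data only: 500 iid draws split into 5 chunks of 100.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]
scores = leave_chunk_out_drift(reference, chunk_size=100, make_calculator=ToyDriftCalculator)
print(scores)
```

This costs one fit per reference chunk instead of a single fit, which is the extra computational cost mentioned above.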

@nnansters nnansters added the enhancement New feature or request label Oct 4, 2024

stale bot commented Dec 5, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 5, 2024
@nnansters nnansters removed the stale label Dec 10, 2024