Release 0.10.3 #371
nnansters announced in Announcements
Hey there!
We hope you've enjoyed the end-of-year festivities and are ready for an exciting new year. We at NannyML sure are! We've kicked off the year with a couple of small releases. We'll provide a quick overview here; check out the full release notes in our changelog!
So, without further ado, let's dive into NannyML 0.10.3!
Installing/upgrading
You can get this latest version by using pip:
pip install -U nannyml
Or conda:
conda install -c conda-forge nannyml
What’s new?
Domain classifier
We're very happy to announce that we now support using a Domain Classifier for multivariate drift detection!
Here's how it works, in a nutshell. For each chunk, we combine the reference data with the current chunk data. We then use cross-validation to train a model to discriminate between chunk rows and reference rows. The model's predictions on the validation folds are used to measure its performance (via AUROC).
A high score means the two are easy to tell apart: there are significant differences between the reference data and the chunk data, so drift was detected!
A low score indicates the opposite: it is difficult to tell whether a row belongs to the reference data or the chunk data, so the two must be very much alike.
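To make that mechanism concrete, here's a rough sketch of the idea using scikit-learn. This is an illustration only, not NannyML's actual implementation; the classifier choice and fold count are placeholders, and it assumes numeric feature columns.

```python
# Illustration of the Domain Classifier idea (not NannyML's implementation):
# label reference rows 0 and chunk rows 1, train a discriminator with
# cross-validation, and read the AUROC of the out-of-fold predictions
# as a drift score.
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def discrimination_auroc(reference: pd.DataFrame, chunk: pd.DataFrame) -> float:
    X = pd.concat([reference, chunk], ignore_index=True)
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(chunk))])
    # Out-of-fold probabilities: every row is scored by a model that never saw it.
    proba = cross_val_predict(
        HistGradientBoostingClassifier(), X, y, cv=5, method="predict_proba"
    )[:, 1]
    # ~0.5 means reference and chunk are indistinguishable; close to 1.0 means drift.
    return roc_auc_score(y, proba)
```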
The following snippet illustrates how to use the Domain Classifier:
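A minimal sketch along those lines; the synthetic dataset and column names below are placeholders for your own data, and only a subset of the available parameters is shown, so check the tutorial for the full API.

```python
import nannyml as nml

# Placeholder data: NannyML ships synthetic datasets for experimenting.
reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()

feature_column_names = [
    'car_value', 'salary_range', 'debt_to_income_ratio', 'loan_length',
    'repaid_loan_on_prev_car', 'size_of_downpayment', 'driver_tenure',
]

calc = nml.DomainClassifierCalculator(
    feature_column_names=feature_column_names,
    timestamp_column_name='timestamp',
    chunk_size=5000,
)
calc.fit(reference_df)                 # learn what "reference" looks like
results = calc.calculate(analysis_df)  # score each analysis chunk

print(results.filter(period='analysis').to_df())
results.plot().show()
```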
Find out more about the DomainClassifierCalculator in the tutorial and the "how it works" documentation.
Distribution calculators
Our distribution plots have been part of NannyML since our very first release, as part of the UnivariateDriftCalculator. We felt the time had come to optimize them a bit and give them a space of their own in the library.
We've added the ContinuousDistributionCalculator and the CategoricalDistributionCalculator. They calculate the distributions for a list of continuous or categorical features, surprisingly.
We've also tweaked the implementation to be more resource-efficient than the previous version: the calculators no longer store the entire reference data set during fitting, only some properties of the reference data distribution. This improves both computation speed and memory usage.
The results of these calculators also support plotting, yielding the same nice "joyplots over time" or "bars over time" visualizations as before!
Here's an example of how to use them:
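A rough sketch, assuming both calculators are exposed on the top-level nannyml namespace and accept the same column_names and chunking parameters as the other calculators; the dataset and column names are placeholders, so check the tutorial for the exact import path and arguments.

```python
import nannyml as nml

# Placeholder data and column names; substitute your own.
reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()

continuous_calc = nml.ContinuousDistributionCalculator(
    column_names=['car_value', 'debt_to_income_ratio'],
    timestamp_column_name='timestamp',
    chunk_size=5000,
)
continuous_calc.fit(reference_df)
continuous_calc.calculate(analysis_df).plot().show()   # "joyplots over time"

categorical_calc = nml.CategoricalDistributionCalculator(
    column_names=['salary_range', 'repaid_loan_on_prev_car'],
    timestamp_column_name='timestamp',
    chunk_size=5000,
)
categorical_calc.fit(reference_df)
categorical_calc.calculate(analysis_df).plot().show()  # "bars over time"
```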
The old distribution implementation embedded in the UnivariateDriftCalculator was left untouched for now, so you can continue using it as before. We'll be evaluating its role and implementation in the future.
What's changed?
We've made a lot of fixes; here are the highlights:
When a calculation raises an exception for a given chunk, we now emit an np.NaN value for that chunk and then proceed with the next chunk. Previously this kind of exception would just shut down the calculator.
We've dropped the p-value-based thresholds for Chi2 univariate drift detection. This was the only place where p-values were being used. They caused a lot of confusion in the plots because the alerts would not visually align with "crossing the threshold". All univariate drift methods now use standard deviation-based thresholds.
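If the default sensitivity doesn't suit your data, thresholds can still be tuned per method. Here's a rough sketch, assuming the StandardDeviationThreshold helper from nannyml.thresholds and the thresholds argument on the UnivariateDriftCalculator; the multipliers and column names are illustrative only.

```python
import nannyml as nml
from nannyml.thresholds import StandardDeviationThreshold

reference_df, analysis_df, _ = nml.load_synthetic_car_loan_dataset()

calc = nml.UnivariateDriftCalculator(
    column_names=['car_value', 'salary_range'],
    timestamp_column_name='timestamp',
    continuous_methods=['jensen_shannon'],
    categorical_methods=['chi2'],
    # Widen the alert band for Chi2 to 4 standard deviations around the reference mean.
    thresholds={'chi2': StandardDeviationThreshold(std_lower_multiplier=4,
                                                   std_upper_multiplier=4)},
)
calc.fit(reference_df)
results = calc.calculate(analysis_df)
results.plot(kind='drift').show()
```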
Some honorary mentions
A big thank you to @Kishan Savant for having our documentation's back on every release. Implying we forget about something with every release, whoops.
What's next?
We've been pretty busy working on our NannyML Cloud product, featuring some novel algorithms like the improved, multi-calibrated version of CBPE and our reverse concept drift algorithms to estimate the effect of concept drift on your model.
In the meantime, we're "weighing down" on an alternative for performance estimation and implementing a very popular (or should I say "population") drift detection method.
We hope you enjoy this new release. Any feedback is, as always, most welcome!
All the best,
Niels