Release 0.8.1 #164

nnansters · 2022-12-02T12:41:26Z

nnansters
Dec 2, 2022
Maintainer

Hi all,

Niels from engineering here to announce our new 0.8.1 release!
It is a packed one again, so let's dive right into it.

Installing / upgrading

You can get this latest version by using pip:

pip install -U nannyml

Or conda:

conda install -c conda-forge nannyml

What's new?

Hellinger distance for univariate drift detection

We're adding another univariate drift detection method to our list: the Hellinger distance. It is a distance metric that quantifies how similar two probability distributions are by measuring the overlap between the probabilities that the reference and analysis samples assign to the same event. It ranges from 0 (identical distributions) to 1 (no overlap at all). You can read more about it in the univariate drift documentation.

The following snippet illustrates how to calculate the Hellinger distance:

import nannyml as nml
from IPython.display import display

# load the dataset once; the third element (analysis targets) is not needed here
reference_df, analysis_df, _ = nml.load_synthetic_binary_classification_dataset()

column_names = [
    'distance_from_office', 'salary_range', 'gas_price_per_litre',
    'public_transportation_cost', 'wfh_prev_workday', 'workday', 'tenure',
    'y_pred_proba', 'y_pred',
]
calc = nml.UnivariateDriftCalculator(
    column_names=column_names,
    timestamp_column_name='timestamp',
    continuous_methods=['hellinger'],
    categorical_methods=['hellinger'],
)

calc.fit(reference_df)
results = calc.calculate(analysis_df)
display(results.to_df())
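To make the metric itself more concrete, here is a minimal standalone sketch of the Hellinger distance for a categorical column. It is not NannyML's implementation: the helper names `hellinger_distance` and `category_probabilities` and the toy data are made up for illustration, and it assumes both samples have already been reduced to probabilities over the same set of categories.

```python
import math
from collections import Counter


def hellinger_distance(p, q):
    """Hellinger distance between two aligned discrete probability lists."""
    if len(p) != len(q):
        raise ValueError("p and q must cover the same categories")
    squared_diffs = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(0.5 * squared_diffs)


def category_probabilities(values, categories):
    """Relative frequency of each category, in the given category order."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


# toy reference and analysis samples for a single categorical feature
reference = ["low", "low", "medium", "high"]
analysis = ["high", "high", "medium", "low"]

# align both samples on one shared, ordered set of categories
categories = sorted(set(reference) | set(analysis))
p = category_probabilities(reference, categories)
q = category_probabilities(analysis, categories)

distance = hellinger_distance(p, q)
print(f"Hellinger distance: {distance:.4f}")
```

Identical distributions give 0 and distributions without any overlap give 1. For a continuous column the same formula applies once the values have been binned; how to choose those bins is a separate decision that this sketch leaves out.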

A guide to picking a univariate drift method

We've been adding a lot of univariate drift methods lately, and it might not always be clear when to use which one. We've written an extensive guide that explains where each univariate drift method works best, so you can decide which ones fit your use case.

Grab a cup of coffee or tea, and start reading!

Ranking (drifting) features according to their impact on model performance

We've supported ranking drifting features since NannyML went public, albeit with a very naive implementation. We're happy to release the CorrelationRanker today. It calculates the correlation between the result of the univariate drift calculation and the absolute change in (realized or estimated) performance. The higher the correlation, the higher the rank of the feature. You can use the ranker to figure out which features might be responsible for changes in performance due to covariate shift.

The following snippet shows you how to use it:

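import nannyml as nml

# assumes `univariate_drift_results` and `realized_performance_results` were
# calculated earlier and filtered down to a single method and a single metric
ranker = nml.CorrelationRanker()

# the ranker fits on one metric and on reference period data only
ranker.fit(realized_performance_results.filter(period='reference'))

# the ranker ranks on one drift method and one performance metric
correlation_ranked_features = ranker.rank(
    univariate_drift_results,
    realized_performance_results,
    only_drifting=False,
)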
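The idea behind correlation ranking can be sketched in plain Python. This is an illustration of the description above, not the CorrelationRanker's actual code: it assumes Pearson correlation, measures the performance change against a single reference value, and uses made-up per-chunk numbers. The names `pearson` and `rank_by_correlation` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(drift_by_feature, performance, reference_performance):
    """Sort features by how strongly their drift tracks the absolute performance change."""
    abs_change = [abs(value - reference_performance) for value in performance]
    scores = {
        feature: pearson(drift_values, abs_change)
        for feature, drift_values in drift_by_feature.items()
    }
    # highest correlation first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# made-up drift values per chunk for two features
drift_by_feature = {
    "salary_range": [0.1, 0.2, 0.4, 0.8],
    "tenure": [0.3, 0.1, 0.3, 0.2],
}
# made-up performance metric per chunk and its (assumed) reference level
performance = [0.95, 0.93, 0.90, 0.82]
reference_performance = 0.96

ranking = rank_by_correlation(drift_by_feature, performance, reference_performance)
for rank, (feature, score) in enumerate(ranking, start=1):
    print(rank, feature, round(score, 3))
```

In this toy data the drift in `salary_range` grows as performance degrades, so it ends up ranked first, while the drift in `tenure` does not follow the performance change.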

Refactoring the plotting interface

I must warn you: this is a bit of an engineering one. Did you notice how all our plots only show a single metric or line? Our way of plotting was causing us some growing pains, so we decided to rewrite the plotting modules from scratch. Our modular approach lets us construct far more complex plots in less time.

What does it mean for you? We've dropped the plotting parameters for data selection. If you want to limit the data being visualized, apply the filter functionality first.
Note that you can also use filtering when retrieving data from NannyML for further processing. You can check the documentation on results here.

The following snippet illustrates how to render a plot of a calculation result:

import nannyml as nml

# load the dataset once; the third element (analysis targets) is not needed here
reference_df, analysis_df, _ = nml.load_synthetic_binary_classification_dataset()

column_names = [
    'distance_from_office', 'salary_range', 'gas_price_per_litre',
    'public_transportation_cost', 'wfh_prev_workday', 'workday', 'tenure',
    'y_pred_proba', 'y_pred',
]
calc = nml.UnivariateDriftCalculator(
    column_names=column_names,
    timestamp_column_name='timestamp',
    continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
    categorical_methods=['chi2', 'jensen_shannon'],
)

calc.fit(reference_df)
results = calc.calculate(analysis_df)

# plot only the Jensen-Shannon distance for continuous columns
drift_fig = results.filter(
    column_names=results.continuous_column_names,
    methods=['jensen_shannon'],
).plot(kind='drift')
drift_fig.show()

What's changed?

We've fixed some small issues:

Usage logging was not yet disabled when running our build pipelines on GitHub. We've adjusted the workflows to disable that behavior.
Nikos kept forgetting that the metrics parameter for result filtering is a list, so we just made it accept a single string too. That was the easiest option.
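
Accepting either a single string or a list is a common convenience pattern. A minimal sketch of how such normalization could look is shown here; the helper name `normalize_to_list` is hypothetical and this is not the code that was actually changed in NannyML.

```python
from typing import Iterable, List, Optional, Union


def normalize_to_list(value: Optional[Union[str, Iterable[str]]]) -> Optional[List[str]]:
    """Return None unchanged, wrap a single string in a list, copy any other iterable."""
    if value is None:
        return None
    # check for str first: a string is itself iterable, character by character
    if isinstance(value, str):
        return [value]
    return list(value)


print(normalize_to_list("roc_auc"))
print(normalize_to_list(["roc_auc", "f1"]))
```

The string check has to come before the generic iterable handling, otherwise `"roc_auc"` would be split into individual characters.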

What's next?

We're at the end of our internal development cycle and will finalize the plans for our next one soon. Some topics on our radar include integrations, supporting "big data" and visualizing comparisons.

Enjoy NannyML 0.8.1 and stay tuned for more!
