Release 0.7.0 #139

nnansters (Maintainer) · 2022-11-07T17:34:06Z

Hi all,

Niels from engineering here to announce our new 0.7.0 release!

We're focusing on eliminating some technical debt we've built up over time and have some exciting developments waiting for you. Let's dive in!

Installing / upgrading

You can get this latest version by using pip:

pip install -U nannyml

Or conda:

conda install -c conda-forge nannyml

What's new?

Refactoring the drift module

We created an elaborate structure of calculators when we first implemented drift detection. We had calculators for model inputs, scores, predictions, and targets. We thought it would make things simple. User research showed us it didn't.

We have now created a univariate drift calculator that is simpler and more streamlined. It lets you detect drift on model inputs, scores, predictions, and targets. It supports multiple methods to do so.

This new design is not only more user-friendly, but it is also extensible. It lets us introduce new drift detection methods smoothly. But I'm getting ahead of myself.

The following snippet shows off the new univariate drift calculator.

import nannyml as nml

reference_df, analysis_df, _ = nml.load_synthetic_binary_classification_dataset()

column_names = [col for col in reference_df.columns if col not in ['timestamp', 'identifier', 'period', 'work_home_actual']]

calc = nml.UnivariateDriftCalculator(
    column_names=column_names,
    timestamp_column_name='timestamp',
    continuous_methods=['kolmogorov_smirnov'],
    categorical_methods=['chi2'],
)

calc.fit(reference_df)
results = calc.calculate(analysis_df)

You can read more in the univariate drift calculator documentation.

Introducing Jensen-Shannon

We haven't only been refactoring: we also added the Jensen-Shannon distance as a new univariate drift detection method. According to our experiments, Jensen-Shannon can detect drift that the Kolmogorov-Smirnov (KS) or chi-squared (chi2) tests would miss. It also works for both continuous and categorical variables! You can read more on this in the docs.

The following snippet shows how to use it.

import nannyml as nml

reference_df, analysis_df, _ = nml.load_synthetic_binary_classification_dataset()

column_names = [col for col in reference_df.columns if col not in ['timestamp', 'identifier', 'period', 'work_home_actual']]

calc = nml.UnivariateDriftCalculator(
    column_names=column_names,
    timestamp_column_name='timestamp',
    continuous_methods=['kolmogorov_smirnov', 'jensen_shannon'],
    categorical_methods=['chi2', 'jensen_shannon'],
)

calc.fit(reference_df)
results = calc.calculate(analysis_df)

You can find out more in the univariate drift calculator documentation.
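To make the idea behind the Jensen-Shannon distance more concrete, here is a minimal, standard-library sketch of the textbook definition for two discrete distributions (for example, category counts in the reference and analysis periods). It is not NannyML's implementation: the function name is hypothetical, and for continuous columns the values would first have to be binned, which this sketch does not cover.

```python
import math


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (log base 2) between two discrete distributions.

    `p` and `q` are sequences of non-negative counts or weights over the
    same bins. Hypothetical helper for illustration only.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")

    # Normalize counts into probabilities.
    p_total, q_total = sum(p), sum(q)
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]

    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; bins where a_i == 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # The distance is the square root of the divergence.
    return math.sqrt(max(divergence, 0.0))
```

With log base 2 the result lies between 0 (identical distributions) and 1 (distributions with no overlap), which makes the value easy to compare across columns.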

Refactoring results

The Result classes were another bit of debt to tackle. Storing the output of a calculator with multiple metrics, sometimes even for many columns, was a challenge. We used to encode the names of metrics and features within the column names of the DataFrames we use for storage, but when plotting these results we had to decode all of them again. To make things worse, the naming conventions were not consistent across calculators, which made it difficult for users to understand where to find specific data.

The refactor aims to solve these problems. We introduced multilevel indexes to deal elegantly with hierarchies of many columns and metrics. We also ensure consistency across calculators with a new paradigm for filtering result data and turning it into DataFrames. And if you don't like multilevel indexes, you can always turn them off.

The following snippet shows how it works.

# continuing from the last snippet

# filter to see only results on the analysis period, for all methods applied to the 'distance_from_office' column
# we ask NannyML to collapse the multilevel index into a single level
results.filter(period='analysis', column_names=['distance_from_office']).to_df(multilevel=False)

# results now take over calculator properties useful for looping
for column_name in results.continuous_column_names:
    drift_fig = results.plot(
        kind='drift',
        column_name=column_name,
        method='kolmogorov_smirnov',
        plot_reference=True
    )

You can read more about it in the working with results documentation.
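As a rough illustration of why hierarchical keys are easier to work with than names encoded into a single string, the sketch below models results as a plain dictionary keyed by tuples. This is only an analogy, not how NannyML stores results: the (column, method, metric) hierarchy, the helper names, the `some_category` column, and all values are made up for the example.

```python
# Made-up values keyed by (column, method, metric); illustration only.
results_by_level = {
    ("distance_from_office", "kolmogorov_smirnov", "value"): 0.10,
    ("distance_from_office", "kolmogorov_smirnov", "alert"): False,
    ("distance_from_office", "jensen_shannon", "value"): 0.05,
    ("some_category", "chi2", "value"): 3.20,
}


def filter_results(results, column_names=None, methods=None):
    """Keep entries whose column and method match; None means 'keep all'."""
    return {
        key: value
        for key, value in results.items()
        if (column_names is None or key[0] in column_names)
        and (methods is None or key[1] in methods)
    }


def flatten(results, sep="_"):
    """Collapse the hierarchical keys into single-level string names."""
    return {sep.join(key): value for key, value in results.items()}


selected = filter_results(
    results_by_level,
    column_names=["distance_from_office"],
    methods=["jensen_shannon"],
)
flat = flatten(selected)
```

Filtering on tuple keys needs no string parsing. Going the other way is harder: a flattened name such as `distance_from_office_jensen_shannon_value` cannot be split back into its parts reliably, because the separator also occurs inside the column and method names.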

Exporting results

We envision NannyML as one of many tools in an MLOps toolchain. To achieve that, results should be able to live outside of NannyML.
We could already write results to disk. We've now added exporting to a pickle file or a database.

We've already created some fun integration scenarios that combine the NannyML container with a database and Grafana. Check out our examples repository for more information.

The following snippet shows how to export results to a database from Python.

database_writer = nml.DatabaseWriter(connection_string='sqlite:///nml.db')
database_writer.write(results)

You can read more about it in the API reference and CLI documentation.
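Once results have been exported, other tools can read them without NannyML. As a small sketch, the helper below lists the tables in a SQLite database using only the Python standard library. The helper name is hypothetical, and it makes no assumption about which tables NannyML creates; it simply reports whatever is there.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in a SQLite database, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Usage, assuming the 'nml.db' file from the connection string above exists:
#
#     connection = sqlite3.connect("nml.db")
#     print(list_tables(connection))
#     connection.close()
```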

What's changed?

Updated Poetry to 1.2.0. There are some breaking changes in the pyproject.toml. Be sure to upgrade Poetry if you want to build from source locally.
We've improved how the SizeBasedChunker deals with leftover data. You can now choose to drop it, allocate it to a new chunk, or append it to the last complete one. The default behavior has changed from drop to append.
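The three ways of handling leftover data can be sketched in plain Python as follows. This is not the SizeBasedChunker implementation: the function, its `incomplete` parameter, and the option names `drop`, `keep`, and `append` are assumptions made for illustration, so check the chunking documentation for the actual argument.

```python
def chunk_by_size(rows, chunk_size, incomplete="append"):
    """Split `rows` into chunks of `chunk_size` and handle the leftover rows.

    incomplete (hypothetical option names):
      "drop"   - discard the leftover rows
      "keep"   - put the leftover rows in a new, smaller chunk
      "append" - add the leftover rows to the last complete chunk
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if incomplete not in ("drop", "keep", "append"):
        raise ValueError(f"unknown option: {incomplete!r}")

    n_full = len(rows) // chunk_size
    chunks = [rows[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]
    leftover = rows[n_full * chunk_size:]

    if not leftover or incomplete == "drop":
        return chunks
    if incomplete == "keep" or not chunks:
        # Assumption of this sketch: with no complete chunk to append to,
        # the leftover rows form a chunk of their own.
        chunks.append(leftover)
    else:
        chunks[-1] = chunks[-1] + leftover
    return chunks


rows = list(range(10))
dropped = chunk_by_size(rows, 4, incomplete="drop")
kept = chunk_by_size(rows, 4, incomplete="keep")
appended = chunk_by_size(rows, 4, incomplete="append")
```

With ten rows and a chunk size of four, dropping leaves two chunks of four rows, keeping adds a third chunk holding the last two rows, and appending yields two chunks where the last one holds six rows.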

What's next?

As we continue paying off technical debt, we'll tackle plotting next. Our current implementation lacks the flexibility we envision, but we have some ideas to improve it.

We hope you're excited about these new changes. Don't hesitate to give us your feedback and help us build a better NannyML!
