Stratified sampling for chunks and binary classification #429

Open
j-adamczyk opened this issue Nov 25, 2024 · 6 comments
Labels: enhancement (New feature or request)

@j-adamczyk

Motivation: describe the problem to be solved

For binary classification I get a lot of "Too few unique values present in 'true_label', returning NaN as realized f1 score." errors. Selecting chunks based on stratified sampling (e.g. stratified K-fold in scikit-learn) would help with this.

Describe the solution you'd like
An argument to enable stratified sampling, probably True by default to avoid such errors.

@nnansters (Contributor)

Paging doctor @nikml for an expert opinion!

Thanks for bringing this up @j-adamczyk !

@nikml (Contributor) commented Nov 26, 2024

Hello @j-adamczyk

Stratified sampling cannot be used for chunking because it does not preserve the order in which the data was provided. Chunking takes a slice of the provided dataframe, not a sample in the statistical sense. For example, when you choose size-based chunking with 1000 rows, the first 1000 rows are chunk 1, rows 1001 to 2000 are chunk 2, and so on. No consideration is given to what is inside the data when performing chunking.

A stratified sampling chunking method induces artificial drift on the selected dataset and can lead to misleading (or even plain wrong) results. I can think of cases where it may be appropriate, but those are edge cases where the user would need to ensure everything is right. If you need something like that, you can sample your dataset yourself and then concatenate the samples so that the chunking method you use reproduces them. For example, if you use SizeBasedChunker with chunk_size=1000, you can manually create the 1000-row stratified samples and concatenate them into an analysis dataset to calculate/estimate on. This is not an officially supported use case, but the library is flexible enough to support it if you need to.
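
For illustration, a minimal sketch of that workaround, assuming a plain pandas dataframe with a binary `true_label` column; the helper name and the use of scikit-learn's `StratifiedKFold` are my own choices here, not part of NannyML:

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def stratified_analysis_dataset(analysis_df: pd.DataFrame, chunk_size: int = 1000,
                                label_col: str = "true_label") -> pd.DataFrame:
    # Works cleanly when len(analysis_df) is a multiple of chunk_size; otherwise
    # the folds are only approximately chunk_size rows and the size-based chunks
    # will not line up exactly with the stratified samples.
    n_chunks = len(analysis_df) // chunk_size
    skf = StratifiedKFold(n_splits=n_chunks, shuffle=True, random_state=42)
    # Each test fold is one stratified ~chunk_size-row sample; concatenating the
    # folds in order means SizeBasedChunker(chunk_size=chunk_size) slices the
    # resulting dataframe back into (roughly) those same samples.
    folds = [analysis_df.iloc[test_idx]
             for _, test_idx in skf.split(analysis_df, analysis_df[label_col])]
    return pd.concat(folds).reset_index(drop=True)
```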

I guess our docs could be improved to better explain that chunking is about how to slice a dataset rather than about sampling it statistically.

@j-adamczyk (Author)

@nikml however, if the data has no natural order, or we assume the order does not matter, it can be stratified. I mean, if I calculate the F1-score, currently most of the data is actually thrown away, as chunks without the positive class are marked as NaN and ignored, right? In the case of any heavily imbalanced problem this will crop up. Essentially, to make the current solution work, I have to do what you described, i.e. manually reorder my data for each chunk using stratification. This is less than convenient.

How about a separate chunker that uses stratification? This would basically be reordering (with stratification) + size chunker.

@nikml (Contributor) commented Nov 26, 2024

@j-adamczyk Thank you for the reply.

To help me understand: since the data has no natural order, why even chunk it into separate chunks instead of having it all in one chunk? How does this support your analysis? In order to consider adding such a feature, we would need a good understanding of the use case.

For a more immediate solution, in case you are inclined to tinker a bit, you could try creating a new chunker in your code using SizeBasedChunker as a base, where you use a stratified shuffle to reorder the dataset before splitting it.
You could, for example, name your chunker StratifiedShuffleSizeBasedChunker, copy the SizeBasedChunker code, and add the shuffling part below line 377. After the dataset gets reordered you can let the code run its course.
Then you provide the chunker object to the chunker argument of whichever calculator you want to use.
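
For illustration, a rough sketch of what such a chunker could look like. The class name follows the suggestion above; the `_split` override and the `nannyml.chunk` import are assumptions about NannyML internals, so check the actual `SizeBasedChunker` source (around the line mentioned) before relying on it:

```python
import numpy as np
import pandas as pd

from nannyml.chunk import SizeBasedChunker  # assumed import path; verify in your NannyML version


class StratifiedShuffleSizeBasedChunker(SizeBasedChunker):
    """Reorders rows with a stratified shuffle, then slices them size-based.

    Sketch only: assumes SizeBasedChunker exposes a _split(data) method that
    slices an already-ordered dataframe.
    """

    def __init__(self, chunk_size: int, stratify_column: str = "true_label",
                 random_state: int = 42, **kwargs):
        super().__init__(chunk_size=chunk_size, **kwargs)
        self.stratify_column = stratify_column
        self.random_state = random_state

    def _split(self, data: pd.DataFrame):
        # Shuffle, then sort rows by their normalised rank within each class.
        # This interleaves the classes so that every contiguous chunk_size
        # slice keeps roughly the global class ratio.
        shuffled = data.sample(frac=1.0, random_state=self.random_state)
        within_class_rank = shuffled.groupby(self.stratify_column).cumcount()
        class_size = shuffled.groupby(self.stratify_column)[self.stratify_column].transform("size")
        order = (within_class_rank / class_size).to_numpy()
        reordered = shuffled.iloc[np.argsort(order, kind="stable")].reset_index(drop=True)
        # Let the regular size-based slicing run its course on the reordered data.
        return super()._split(reordered)
```

You would then pass an instance to the calculator, e.g. `chunker=StratifiedShuffleSizeBasedChunker(chunk_size=1000, stratify_column="true_label")`.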

@j-adamczyk
Copy link
Author

@nikml well... good question, actually. So basically for unordered data, I should just use one large chunk with production data gathered over a longer period? This actually makes sense, but it never occurred to me. And this should just work with one chunk?

@nikml (Contributor) commented Nov 26, 2024

Depends on what we mean by one chunk. One chunk in the analysis data is fine. If you only have one chunk in the reference data, you may have issues with things like threshold calculations. You can overcome the threshold issue by using constant thresholds. To ensure you only have one chunk in analysis, you can use either a big chunk size or chunk_number=1, depending on your data and use case.
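
For illustration, a sketch of what that one-chunk setup could look like; the column names and threshold value are placeholders, and the `thresholds` / `ConstantThreshold` usage should be double-checked against the NannyML docs for your version:

```python
import nannyml as nml
from nannyml.thresholds import ConstantThreshold  # assumed import path; verify in your version

# reference_df / analysis_df are your own dataframes.
calc = nml.PerformanceCalculator(
    y_true="true_label",
    y_pred="prediction",
    y_pred_proba="predicted_probability",
    problem_type="classification_binary",
    metrics=["f1"],
    chunk_number=1,  # a single chunk over the whole period
    thresholds={"f1": ConstantThreshold(lower=0.6)},  # avoids thresholds derived from a single reference chunk
)
calc.fit(reference_df)
results = calc.calculate(analysis_df)
```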
