
fix: Properly assign the chunks to the right worker #449

Open · wants to merge 7 commits into base: main

Conversation

@tchaton (Collaborator) commented Jan 14, 2025

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not needed for typo and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #442

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@robmarkcole (Contributor) commented Jan 15, 2025

Repeated testing on this branch:

  • with batch_size 8 and num_workers 8, I no longer get the error ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256, 1, 1]) - SOLVED
  • with batch_size 4 and num_workers 4, I get Test batch size: 4, but only 20 of 24 images are processed - the final batch is missing.
  • with batch_size 8 and num_workers 4, I get Test batch size: 8, and all 24 images are processed.
  • with batch_size 4 and num_workers 2, I get Test batch size: 4, but only 20 of 24 images are processed - the final batch is missing.
  • with batch_size 8 and num_workers 2, I get Test batch size: 8, but only 16 of 24 images are processed - the final batch is missing.

So there is still an outstanding issue of a missing final batch.

I've also noticed that (for batch_size 8) the Train dataset length is now 192, whereas it was 200 before. I also get 196 for batch_size 4 (val and test set lengths are similarly affected).
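
For intuition about the shrunken dataset length, here is a rough, purely illustrative sketch (this is not litdata's actual code, and the exact rounding may differ): if every worker keeps only whole batches, trailing items that cannot fill a complete batch on every worker are dropped.

def usable_items(num_items: int, batch_size: int, world_size: int, num_workers: int) -> int:
    # Keep only full batches, then only as many batches as divide evenly
    # across all (rank, worker) slots.
    max_batches = num_items // batch_size
    batches_per_slot = max_batches // (world_size * num_workers)
    return batches_per_slot * world_size * num_workers * batch_size

# 200 items, batch_size 8, 1 node, 8 workers -> 192, matching the observation above.
print(usable_items(200, 8, 1, 8))  # 192

This reproduces 192 for batch_size 8 but not the 196 reported for batch_size 4, so the real accounting is clearly more involved; the point is only that truncating to full batches per worker shrinks the reported length.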


chunks_per_workers: List[List[int]] = [[] for _ in range(world_size)]
intervals_per_workers: List[List[List[int]]] = [[] for _ in range(world_size)]
max_batches = num_items // batch_size
Contributor: If there is a remainder here, that's a "partial" batch.

Collaborator (Author): That is why we are casting things to int, so it is an exact number.
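
As a minimal illustration of that remainder being a "partial" batch (illustrative only, not the code under review):

num_items, batch_size = 26, 8
max_batches, remainder = divmod(num_items, batch_size)
# 3 full batches; the 2 leftover items form a partial batch, which is exactly
# what drop_last semantics decide to keep or discard.
print(max_batches, remainder)  # 3 2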

tmp_arr = [0 for _ in range(num_workers)]

index = 0
for _ in range(int(max_batches // distributed_env.world_size)):
Contributor: What happens to the leftover batches here, i.e. the leftover full batches?

Collaborator (Author): They are given to the last machine and last worker.
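
To make "given to the last machine and last worker" concrete, here is a minimal sketch under assumed names (not the actual litdata implementation): full batches are dealt out evenly across ranks and workers, and any leftover full batches are appended to the last worker of the last rank.

def distribute_batches(max_batches: int, world_size: int, num_workers: int) -> list:
    per_slot = max_batches // (world_size * num_workers)
    assignment = [[per_slot] * num_workers for _ in range(world_size)]
    leftover = max_batches - per_slot * world_size * num_workers
    assignment[-1][-1] += leftover  # leftover full batches go to the last machine / last worker
    return assignment

# 25 batches over 2 ranks x 4 workers -> 3 each, plus 1 extra batch on the last worker.
print(distribute_batches(25, 2, 4))  # [[3, 3, 3, 3], [3, 3, 3, 4]]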

@robmarkcole (Contributor)

o1 suggests using PyTorch’s built-in distribution features (e.g. DistributedSampler) to avoid reinventing the wheel.

@tchaton (Collaborator, Author) commented Jan 15, 2025

Hey @robmarkcole. Yes, DistributedSampler won't work here.
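
For context, this is what standard DistributedSampler usage looks like for a map-style (indexable) dataset. It shards indices across ranks, which assumes random access to any sample; that model presumably does not fit a chunked streaming dataset whose workers read whole chunks sequentially. Illustrative sketch only, not litdata code:

import torch
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

# 24 dummy samples; DistributedSampler splits *indices* across ranks.
dataset = TensorDataset(torch.arange(24))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False, drop_last=True)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

for (batch,) in loader:
    print(batch)  # rank 0 receives indices 0, 2, 4, ... in batches of 4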


num_items_per_workers: Any = []

for rank in range(distributed_env.world_size):
Member: Why only the distributed world size here and not the global number of workers? Same question for below.

Collaborator (Author): Because we want to ensure we fill up all the workers for each process rank in the same way.
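
A minimal sketch of that idea, with assumed names (not the actual litdata code): iterating over ranks first and then rotating within each rank's workers means worker w is filled in the same pattern on every rank.

def assign_batches(num_batches: int, world_size: int, num_workers: int) -> dict:
    assignment = {(rank, worker): [] for rank in range(world_size) for worker in range(num_workers)}
    for batch_idx in range(num_batches):
        rank = batch_idx % world_size        # batches alternate across process ranks first
        local_idx = batch_idx // world_size  # position of this batch within its rank
        worker = local_idx % num_workers     # then rotate over that rank's workers
        assignment[(rank, worker)].append(batch_idx)
    return assignment

# 2 ranks x 2 workers, 8 batches: every rank's worker 0 and worker 1 end up with
# the same number of batches, filled in the same order:
# {(0, 0): [0, 4], (0, 1): [2, 6], (1, 0): [1, 5], (1, 1): [3, 7]}
print(assign_batches(8, world_size=2, num_workers=2))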

@robmarkcole (Contributor)

OK all my checks got the desired results now!

@tchaton (Collaborator, Author) commented Jan 15, 2025

Quoting @robmarkcole: "OK all my checks got the desired results now!"

Perfect!

@lantiga (Contributor) left a comment

Looks great! Just a minor suggestion.

src/litdata/utilities/shuffle.py (review comment outdated and resolved)
@robmarkcole (Contributor)

I'd be grateful for a release after this bugfix.

Merging this pull request may close the following issue: drop_last is not respected.