num_workers in data_loader from torch does not seem to parallelize batch loading #1207
I think there is a bug in your code and you are not measuring the time it takes to load a batch. Here is a simpler reprex that shows that the parallel loading is working:

library(torch)
dataset_slow <- dataset("slow",
  initialize = function() NULL,
  .getitem = function(i) {
    # simulate a slow sample load
    Sys.sleep(0.1)
    1L
  },
  .length = function() 1000L
)

ds <- dataset_slow()

run <- function(ds, num_workers, pin_memory = TRUE) {
  dl <- torch::dataloader(ds, batch_size = 10L, shuffle = TRUE,
                          num_workers = num_workers, pin_memory = pin_memory)
  iter <- torch::dataloader_make_iter(dl)
  x <- 1
  while (!is.null(batch <- torch::dataloader_next(iter))) {
    # use the batch so the loop cannot be optimized away
    x <- x + batch
  }
  x
}
bench::mark(
run(ds, 0, pin_memory = FALSE),
run(ds, 10, pin_memory = FALSE),
run(ds, 0, pin_memory = TRUE),
run(ds, 10, pin_memory = TRUE),
max_iterations = 1
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 run(ds, 0, pin_memory = FALSE) 1.7m 1.7m 0.00980 7.77MB 0
#> 2 run(ds, 10, pin_memory = FALSE) 19.78s 19.78s 0.0506 18.71MB 0.101
#> 3 run(ds, 0, pin_memory = TRUE) 1.71m 1.71m 0.00977 3.95KB 0
#> 4 run(ds, 10, pin_memory = TRUE) 20.08s 20.08s 0.0498 12.38MB 0.0996

Created on 2025-01-10 with reprex v2.1.1

With 1000 items at 0.1 s each, the serial run takes about 100 s (~1.7 min), while 10 workers bring it down to roughly 20 s, so the parallel loading clearly works. Unless I am mistaken, I think we can close this @dfalbel
Is it surprising, however, that the
I want to train CNNs on a big dataset via transfer learning using torch in R. Since my dataset is too big to be loaded all at once, I have to load each sample from the SSD in the dataloader. But loading one batch from my SSD takes about 5-10x as long as processing it (forward pass, backprop, optimizer step). Therefore asynchronous parallel data loading would be advisable.
As far as I understand torch, this can be done in the dataloader via the num_workers parameter. But using it did not decrease the loading time of a batch in the training loop; it only introduced a big overhead before the first batch is gathered (probably because that is where the workers are created). Now I need advice on whether this can be done in torch at all and whether I implemented anything wrong.
Example:
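A minimal sketch of the kind of per-batch timing loop described above; the folder path, the use of pre-saved tensor files, the batch size, and the worker count are hypothetical placeholders, not the original code:

library(torch)

# hypothetical on-disk dataset: one pre-saved tensor file per sample
image_files <- list.files("data/images", full.names = TRUE)

disk_dataset <- dataset("disk",
  initialize = function(files) self$files <- files,
  .getitem = function(i) torch_load(self$files[i]),
  .length = function() length(self$files)
)

ds <- disk_dataset(image_files)
dl <- dataloader(ds, batch_size = 32, shuffle = TRUE, num_workers = 4)

iter <- dataloader_make_iter(dl)
repeat {
  t0 <- Sys.time()
  batch <- dataloader_next(iter)   # the part that should get faster with more workers
  if (is.null(batch)) break
  cat("batch load time:", round(as.numeric(Sys.time() - t0, units = "secs"), 3), "s\n")
  # forward pass, backward pass, and optimizer step would go here
}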
To my understanding, the time it takes to load a batch should (after the first few batches) decrease significantly with parallel batch loading through num_workers compared to num_workers = 0.
But the printed time stays the same no matter how many workers are used.
I would be glad if anyone could help me!