num_workers in data_loader from torch does not seem to parallelize batch loading #1207
I think there is a bug in your code and you are not measuring the time it takes to load a batch. Here is a simpler reprex that shows that the parallel loading is working:

library(torch)
dataset_slow <- dataset("slow",
  initialize = function() NULL,
  .getitem = function(i) {
    # simulate a slow sample load
    Sys.sleep(0.1)
    1L
  },
  .length = function() 1000L
)

ds <- dataset_slow()

run <- function(ds, num_workers, pin_memory = TRUE) {
  dl <- torch::dataloader(ds, batch_size = 10L, shuffle = TRUE,
                          num_workers = num_workers, pin_memory = pin_memory)
  iter <- torch::dataloader_make_iter(dl)
  x <- 1
  while (!is.null(batch <- torch::dataloader_next(iter))) {
    # use the batch so the loop cannot be optimized away
    x <- x + batch
  }
  x
}
bench::mark(
run(ds, 0, pin_memory = FALSE),
run(ds, 10, pin_memory = FALSE),
run(ds, 0, pin_memory = TRUE),
run(ds, 10, pin_memory = TRUE),
max_iterations = 1
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 run(ds, 0, pin_memory = FALSE) 1.7m 1.7m 0.00980 7.77MB 0
#> 2 run(ds, 10, pin_memory = FALSE) 19.78s 19.78s 0.0506 18.71MB 0.101
#> 3 run(ds, 0, pin_memory = TRUE) 1.71m 1.71m 0.00977 3.95KB 0
#> 4 run(ds, 10, pin_memory = TRUE) 20.08s 20.08s 0.0498 12.38MB 0.0996

Created on 2025-01-10 with reprex v2.1.1

With 1000 items at 0.1 s each, the serial run takes about 100 s (~1.7 min), while 10 workers bring it down to roughly 20 s, so the parallel loading clearly works. Unless I am mistaken, I think we can close this @dfalbel
Is it surprising, however, that the
I want to train CNNs on a big dataset via transfer learning using torch in R. Since my dataset is too big to be loaded all at once, I have to load each sample from the SSD in the dataloader. But loading one batch from my SSD takes about 5-10x as long as processing it (forward pass, backprop, optimizer step). Therefore asynchronous parallel data loading would be advisable.
As far as I understand torch, this can be done in the dataloader via the num_workers parameter. But using it did not decrease the loading time of a batch in the training loop; it only introduced a big overhead before the first batch is gathered (probably because that is where the workers are created). Now I need advice on whether this can be done in torch at all and whether I implemented anything wrong.
Example:
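A minimal sketch of the kind of per-batch timing loop described above; the folder path, the use of pre-saved tensor files, the batch size, and the worker count are hypothetical placeholders, not the original code:

library(torch)

# hypothetical on-disk dataset: one pre-saved tensor file per sample
image_files <- list.files("data/images", full.names = TRUE)

disk_dataset <- dataset("disk",
  initialize = function(files) self$files <- files,
  .getitem = function(i) torch_load(self$files[i]),
  .length = function() length(self$files)
)

ds <- disk_dataset(image_files)
dl <- dataloader(ds, batch_size = 32, shuffle = TRUE, num_workers = 4)

iter <- dataloader_make_iter(dl)
repeat {
  t0 <- Sys.time()
  batch <- dataloader_next(iter)   # the part that should get faster with more workers
  if (is.null(batch)) break
  cat("batch load time:", round(as.numeric(Sys.time() - t0, units = "secs"), 3), "s\n")
  # forward pass, backward pass, and optimizer step would go here
}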
To my understanding, the time it takes to load a batch should (after the first few batches) decrease significantly with parallel batch loading through num_workers compared to num_workers = 0.
But the printed time stays the same no matter how many workers are used.
I would be glad if anyone could help me!