num_workers in data_loader from torch does not seem to parallelize batch loading #1207

Open
D-Maar opened this issue Nov 6, 2024 · 2 comments

Comments

@D-Maar

D-Maar commented Nov 6, 2024

I want to train CNNs on a large dataset via transfer learning using torch in R. Since my dataset is too big to be loaded all at once, I have to load each sample from the SSD in the dataloader. But loading one batch from my SSD takes about 5-10x as long as processing it (forward pass, backpropagation, optimizer step). Asynchronous, parallel data loading would therefore be advisable.

As far as I understand torch, this can be done in the dataloader via the num_workers parameter. But using it did not decrease the loading time of a batch in the training loop; it only introduced a large overhead before the first batch is returned (probably because the workers are created at that point). I would appreciate advice on whether this can be done in torch and whether I implemented something incorrectly.

Example:

library(torchvision)
library(torch)

dl<-torchvision::image_folder_dataset(
  root="./data/processed/satalite_images/to_use",
  loader=function(path){
    # I have images of size 299x299 with 13 channels.
    # Optimizing this loading step yielded no significant improvement.
    return(array(readRDS(path), dim=c(13,299,299))*1.0)
  },
  target_transform = function(x){a<-c(0.0,1.0)[x];dim(a)<-1;return(a)}
)
# Here I set num_workers to different values, but that did not change the loading time
dl2<-torch::dataloader(dl, batch_size=110L, shuffle = TRUE, num_workers = 15L, pin_memory=TRUE)
# just a random pretrained model for transfer learning
model_torch = torchvision::model_alexnet(pretrained = TRUE)
model_torch$parameters |>
  purrr::walk(function(param) param$requires_grad_(FALSE))

# replacing the last layer to my desired classifier

inFeat =model_torch$classifier$'6'$in_features
model_torch$classifier$'6' = nn_linear(inFeat, out_features = 1L)

# I have 13 input channels, so I replace the first conv layer with an equivalent one that takes 13 input channels
conv1<-torch::nn_conv2d(in_channels=13L, out_channels=model_torch[[1]]$`0`$out_channels, 
                        kernel_size =model_torch[[1]]$`0`$kernel_size , 
                        stride = model_torch[[1]]$`0`$stride,
        padding =model_torch[[1]]$`0`$padding, 
        dilation = model_torch[[1]]$`0`$dilation, groups = model_torch[[1]]$`0`$groups, bias = TRUE)
model_torch[[1]]$`0`<-conv1

model_torch<-model_torch$to(device = "cuda")
opt = optim_adam(params = model_torch$parameters, lr = 0.01)

# training loop
for(e in 1:1){
  losses = c()
#storing the time which the loop uses for computing and data loading
  end<-Sys.time()
  coro::loop(
    for(batch in dl2){
      start<-Sys.time()
      #this is the time it takes to load a batch
      print(start-end)
      print("computing")
      opt$zero_grad()
      pred = model_torch(batch[[1]]$to(device="cuda"))
      res=batch[[2]]$to(device = "cuda")
      loss = nnf_binary_cross_entropy(input=torch_sigmoid(pred),target=res)
      loss$backward()
      opt$step()
      losses = c(losses, loss$item())
      end<-Sys.time()
      #this is the time it takes to process a batch
      print(end-start)
      print("loading")
    }
  )
}

To my understanding the time it takes to load a batch should (after the first few batches) decrease significantly if I use parallel batch loading through num_workers compared to num_workers = 0.

But the printed time stays the same no matter the number of workers used.

I would be glad if anyone could help me!

@sebffischer
Collaborator

sebffischer commented Jan 10, 2025

I think there is a bug in your code and you are not measuring the time it takes to load a batch.
I believe that when you run coro::loop(for (batch in loader) {t1 <- Sys.time(); ... ; t2 <- t1 - Sys.time()}), the batches are (in every iteration) loaded before you enter the body of the loop.

Here is a simpler reprex that shows that the parallel loading is working:

library(torch)

dataset_slow = dataset("slow",
  initialize = function() NULL,
  .getitem = function(i) {
    Sys.sleep(0.1)
    1L
  },
  .length = function() 1000L
)

ds = dataset_slow()

run = function(ds, num_workers, pin_memory = TRUE) {
  dl2<-torch::dataloader(ds, batch_size=10L, shuffle = TRUE, num_workers = num_workers, pin_memory = pin_memory)
  iter = torch::dataloader_make_iter(dl2)

  x = 1
  while (!is.null(batch <<- dataloader_next(iter))) {
    # avoid any optimization
    x = x + batch
  }
  x
}

bench::mark(
  run(ds, 0, pin_memory = FALSE),
  run(ds, 10, pin_memory = FALSE),
  run(ds, 0, pin_memory = TRUE),
  run(ds, 10, pin_memory = TRUE),
  max_iterations = 1
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 run(ds, 0, pin_memory = FALSE)      1.7m     1.7m   0.00980    7.77MB   0     
#> 2 run(ds, 10, pin_memory = FALSE)   19.78s   19.78s   0.0506    18.71MB   0.101 
#> 3 run(ds, 0, pin_memory = TRUE)      1.71m    1.71m   0.00977    3.95KB   0     
#> 4 run(ds, 10, pin_memory = TRUE)    20.08s   20.08s   0.0498    12.38MB   0.0996

Created on 2025-01-10 with reprex v2.1.1

Unless I am mistaken, I think we can close this @dfalbel
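
To make the measurement question above concrete, here is a minimal sketch that times only the call that fetches a batch, using the same dataloader_make_iter() / dataloader_next() functions as the reprex. The toy dataset, the helper name time_batches and the 10 ms sleep are assumptions made up for illustration; they are not part of the original code.

```r
library(torch)

# Hypothetical toy dataset: each item takes roughly 10 ms to "load".
toy_ds <- dataset(
  "toy",
  initialize = function() NULL,
  .getitem = function(i) {
    Sys.sleep(0.01)
    torch_tensor(i)
  },
  .length = function() 40L
)

# Hypothetical helper: returns the seconds spent waiting for each batch.
time_batches <- function(num_workers) {
  dl <- dataloader(toy_ds(), batch_size = 10L, num_workers = num_workers)
  iter <- dataloader_make_iter(dl)
  load_times <- c()
  repeat {
    t0 <- Sys.time()
    batch <- dataloader_next(iter)  # only this call is timed
    if (is.null(batch)) break       # NULL signals that the iterator is exhausted
    elapsed <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
    load_times <- c(load_times, elapsed)
    # The forward and backward pass would go here, outside the timed region.
  }
  load_times
}

# One entry per batch: 40 items with batch_size 10 gives 4 batches.
print(time_batches(0))
```

Calling time_batches() with a positive num_workers and comparing the per-batch waits against the num_workers = 0 run should show whether the workers reduce the time the main process spends waiting; the first batch is expected to include the worker start-up overhead.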

@sebffischer
Collaborator

Is it surprising, however, that pin_memory has no positive effect?
