Biocparallel batchtoolsparam rmd #219

Open
wants to merge 4 commits into
base: devel
6 changes: 5 additions & 1 deletion DESCRIPTION
@@ -12,7 +12,11 @@ Authors@R: c(
person("Ryan", "Thompson", email="rct@thompsonclan.org", role="aut"),
person("Nitesh", "Turaga", role="aut"),
person("Aaron", "Lun", role = "ctb"),
person("Henrik", "Bengtsson", role = "ctb"))
person("Henrik", "Bengtsson", role = "ctb"),
person("Paul", "Villafuerte",
role = "ctb",
comment = "'BiocParallel_BatchtoolsParam' vignette translation from Sweave to Rmarkdown / HTML"
))
Description: This package provides modified versions and novel
implementation of functions for parallel evaluation, tailored to
use with Bioconductor objects.
231 changes: 231 additions & 0 deletions vignettes/BiocParallel_BatchtoolsParam.Rmd
@@ -0,0 +1,231 @@
---
title: "Introduction to *BatchtoolsParam*"
author:
- name: "Nitesh Turaga"
  email: Nitesh.Turaga@RoswellPark.org
- name: "Martin Morgan"
  email: Martin.Morgan@RoswellPark.org
- name: "Paul Villafuerte"
  affiliation: "Vignette translation from Sweave to Rmarkdown / HTML"
date: "Edited: `r format(Sys.time(), '%B %d, %Y')`; Compiled: `r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteIndexEntry{2. Introduction to BatchtoolsParam}
  %\VignetteKeywords{parallel, Infrastructure}
  %\VignettePackage{BiocParallel}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: true
    toc: true
    toc_depth: 4
---

# Introduction

The `BatchtoolsParam` class is an interface to the `r CRANpkg('batchtools')` package from within `r Biocpkg('BiocParallel')`. It aims to
replace `BatchJobsParam` as `r Biocpkg('BiocParallel')`'s class for computing on a high-performance
cluster managed by a scheduler such as SGE, TORQUE, LSF, SLURM, or OpenLava.

# Quick start

This example demonstrates the easiest way to launch jobs using
batchtools. The first step is to create a `BatchtoolsParam` object. You can then compute
with `bplapply`, which returns the results as a list; here, ten tasks each draw 10e5 random points.

```{r intro}
library(BiocParallel)

## Pi approximation
piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}

piApprox(1000)

## Apply piApprox over ten tasks of 10e5 points each
param <- BatchtoolsParam()
result <- bplapply(rep(10e5, 10), piApprox, BPPARAM=param)
mean(unlist(result))
```
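
Because `bplapply` follows the calling convention of base R's `lapply`, the quick-start computation can be sanity-checked serially without any cluster. The chunk below is a minimal sketch of that check; the seed and the variable name `serial` are illustrative choices, not part of the original example.

```{r serial_check}
## Same Monte Carlo estimator as in the quick start, repeated here so
## the chunk is self-contained.
piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}

## Serial equivalent of the bplapply() call: ten tasks of 10e5 points each.
set.seed(123)  # arbitrary seed, only for reproducibility of this sketch
serial <- lapply(rep(10e5, 10), piApprox)

## Averaging the ten estimates should land close to pi.
mean(unlist(serial))
```

Swapping `lapply` for `bplapply(..., BPPARAM = param)` is then the only change needed to send the same tasks through a `BatchtoolsParam`.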

# *BatchtoolsParam* interface

The `BatchtoolsParam` interface replaces the `BatchJobsParam` interface and
allows more intuitive use of your high-performance cluster with `r Biocpkg('BiocParallel')`.

The `BatchtoolsParam` class allows the user to specify many arguments to customize their
jobs. These arguments apply to clusters with formal schedulers.

- `workers` The number of workers used by the job.

- `cluster` We currently support SGE, SLURM, LSF, TORQUE and OpenLava. The
`cluster` argument is supported only if the R environment knows how
to find the job scheduler. Each cluster type uses a template to pass
the job to the scheduler. If the template is not given, we use the
default templates provided in the `batchtools` package. The cluster
can be accessed with `bpbackend(param)`.

- `registryargs` The `registryargs` argument takes a list of arguments used to create a
new job registry for your `BatchtoolsParam`. The job registry is a data.table which
stores all the required information to process your jobs. The
arguments we support for `registryargs` are:

| `file.dir` Path where all files of the registry are saved. Note that some templates do
| not handle relative paths well. If nothing is given, a temporary directory will be
| used in your current working directory.
<p>
| `work.dir` Working directory for R process for
running jobs.
<p>
| `packages` Packages that will be loaded on each node.
<p>
| `namespaces` Namespaces that will be loaded on each
node.
<p>
| `source` Files that are sourced before executing a
job.
<p>
| `load` Files that are loaded before executing a job.
<p>
```{r}
registryargs <- batchtoolsRegistryargs(
    file.dir = "mytempreg",
    work.dir = getwd(),
    packages = character(0L),
    namespaces = character(0L),
    source = character(0L),
    load = character(0L)
)
param <- BatchtoolsParam(registryargs = registryargs)
param
```

- `resources` A named list of key-value pairs to be substituted into the template
file; see `?batchtools::submitJobs`.
<p>
- `template` The template argument is unique to the class `BatchtoolsParam`. It is required by the
job scheduler. It defines how the jobs are submitted to the job
scheduler. If the template is not given and the cluster is chosen, a
default template is selected from the batchtools package.
<p>
- `log` The `log` option is logical, TRUE/FALSE. If it is set to TRUE, then
the logs stored in the registry are copied to the directory given by
the user in the `logdir` argument.
<p>
- `logdir` Path to the directory for the logs. It is used only if `log=TRUE`.
<p>
- `resultdir` Path to a directory, given when the job has files to be saved in
a directory.
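
As a rough illustration of how these arguments combine, the chunk below sketches a constructor call for a SLURM cluster. The resource keys (`walltime`, `memory`, `ncpus`) and the directory names are assumptions made for this sketch: the keys must match whatever placeholders your template actually uses, and the chunk is not evaluated because it needs a working scheduler.

```{r param_arguments_sketch, eval=FALSE}
library(BiocParallel)

param <- BatchtoolsParam(
    workers = 4,
    cluster = "slurm",
    ## Assumed keys; they must match the placeholders in the template.
    resources = list(walltime = 3600, memory = 1024, ncpus = 1),
    ## Copy registry logs to a user-chosen directory (hypothetical name).
    log = TRUE,
    logdir = "bt_logs",
    ## Hypothetical directory for files saved by the job.
    resultdir = "bt_results"
)
param
```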

# Defining templates

The job submission template controls how the job is processed by the job
scheduler on the cluster. Obviously, the format of the template will
differ depending on the type of job scheduler. Let's look at the default
SLURM template as an example:

```{r}
fname <- batchtoolsTemplate("slurm")
cat(readLines(fname), sep="\n")
```

The `<%= %>` blocks are automatically replaced by the values of the elements in
the `resources` argument in the `BatchtoolsParam` constructor. Failing to specify critical parameters
properly (e.g., wall time or memory limits too low) will cause jobs to
crash, usually rather cryptically. We suggest setting parameters
explicitly to provide robustness to changes to system defaults. Note
that the `<%= %>` blocks themselves do not usually need to be modified in the
template.
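
To make the substitution concrete, here is a toy base-R sketch of what filling a template with `resources` values amounts to. It is only an illustration: batchtools performs the real expansion with the brew package, `fill_template` is a hypothetical helper that exists nowhere in BiocParallel or batchtools, and the two template lines and resource names are made up for the example.

```{r fill_template_sketch}
## Hypothetical helper: literal replacement of "<%= resources$KEY %>"
## placeholders. This mimics the effect of template expansion only.
fill_template <- function(lines, resources) {
    for (key in names(resources)) {
        placeholder <- sprintf("<%%= resources$%s %%>", key)
        lines <- gsub(placeholder, as.character(resources[[key]]),
                      lines, fixed = TRUE)
    }
    lines
}

## Made-up template fragment and resource values.
template_lines <- c(
    "#SBATCH --time=<%= resources$walltime %>",
    "#SBATCH --mem-per-cpu=<%= resources$memory %>"
)
resources <- list(walltime = 3600, memory = 1024)

filled <- fill_template(template_lines, resources)
cat(filled, sep = "\n")
```

A placeholder whose key is missing from `resources` is left untouched by this sketch, which is one way to see why omitted resources can produce confusing scheduler errors.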

The part of the template that is most likely to require explicit
customization is the last line containing the call to `Rscript`. A more
customized call may be necessary if the R installation is not standard,
e.g., if multiple versions of R have been installed on a cluster. For
example, one might use instead:

````{bash doJobCollection, eval=FALSE}
echo 'batchtools::doJobCollection("<%= uri %>")' |\
ArbitraryRcommand --no-save --no-echo
````

If such customization is necessary, we suggest making a local copy of
the template, modifying it as required, and then constructing a `BatchtoolsParam` object
with the modified template using the `template` argument. However, we find that the
default templates accessible with `batchtoolsTemplate` are satisfactory in most cases.
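
The copy-and-modify workflow could look roughly like the following sketch. The file name `my-slurm.tmpl` is a hypothetical choice, and the chunk is not evaluated because constructing the param requires a SLURM scheduler.

```{r local_template_sketch, eval=FALSE}
library(BiocParallel)

## Copy the default SLURM template to a local, editable file.
## The file name is a hypothetical choice for this sketch.
local_tmpl <- file.path(getwd(), "my-slurm.tmpl")
file.copy(batchtoolsTemplate("slurm"), local_tmpl)

## ... edit local_tmpl by hand, e.g. the final Rscript line ...

## Point the param at the modified template.
param <- BatchtoolsParam(cluster = "slurm", template = local_tmpl)
```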

# Use cases

As an example of a `BatchtoolsParam` job being run on an SGE cluster, we
use the same `piApprox` function as defined earlier. The example runs the function
on 5 workers and submits 100 jobs to the SGE cluster.

Example of SGE with minimal code:

```{r simple_sge_example, eval=FALSE}
library(BiocParallel)

## Pi approximation
piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}

template <- system.file(
    package = "BiocParallel",
    "unitTests", "test_script", "test-sge-template.tmpl"
)
param <- BatchtoolsParam(workers=5, cluster="sge", template=template)

## Run parallel job
result <- bplapply(rep(10e5, 100), piApprox, BPPARAM=param)
```

Example of SGE demonstrating some of the `BatchtoolsParam` methods:

```{r demo_sge, eval=FALSE}
library(BiocParallel)

## Pi approximation
piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}

template <- system.file(
    package = "BiocParallel",
    "unitTests", "test_script", "test-sge-template.tmpl"
)
param <- BatchtoolsParam(workers=5, cluster="sge", template=template)

## start param
bpstart(param)

## Display param
param

## To show the registered backend
bpbackend(param)

## Register the param
register(param)

## Check the registered param
registered()

## Run parallel job
result <- bplapply(rep(10e5, 100), piApprox)

bpstop(param)
```

# `sessionInfo`

```{r sessionInfo}
sessionInfo()
```
283 changes: 0 additions & 283 deletions vignettes/BiocParallel_BatchtoolsParam.Rnw

This file was deleted.