finish updates of grch38 and grch37 data frames

Katsevich-Lab · Apr 30, 2024 · 84142ba
1 parent 4320b34
commit 84142ba

Showing 10 changed files with 31 additions and 20 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -27,7 +27,6 @@ Imports:
     cowplot,
     crayon,
     data.table,
-    R.utils,
     dplyr,
     ggplot2,
     Matrix,

diff --git a/R/data.R b/R/data.R
@@ -1,8 +1,13 @@
-#' Gene position data frame
+#' Gene position data frames
 #'
-#' `gene_position_data_frame_grch38` maps each gene to the chromosome on which it is located and the position of its transcription start site on that chromosome. The data frame was constructed from the GRCh38 reference genome that has shipped with CellRanger since 2020.
+#' `gene_position_data_frame_grch38` and `gene_position_data_frame_grch37` contain the chromosome and transcription start site position of each gene relative to the GRCh38 and GRCh37 reference genomes, respectively. Both data frames were constructed from reference genomes available on the 10x Genomics website. The GRCh38 reference genome has been used by 10x Cell Ranger since 2020.
 #'
 #' @usage data(gene_position_data_frame_grch38)
 #' @examples
 #' head(gene_position_data_frame_grch38)
+#' head(gene_position_data_frame_grch37)
 "gene_position_data_frame_grch38"
+
+#' @rdname gene_position_data_frame_grch38
+#' @usage data(gene_position_data_frame_grch37)
+"gene_position_data_frame_grch37"
diff --git a/R/pair_constructor_functs.R b/R/pair_constructor_functs.R
@@ -1,6 +1,6 @@
 #' Construct *cis* pairs
 #'
-#' `construct_cis_pairs()` is a helper function to facilitate construction the *cis* pairs. `construct_cis_pairs()` returns the set of target-response pairs for which the target and response are located on the same chromosome and in close physical proximity to one another. `construct_cis_pairs()` is a useful pair constructor function for screens that aim to map noncoding regulatory elements (e.g., enhancers or noncoding GWAS variants) to target genes in *cis*. `construct_cis_pairs()` assumes that the columns `chr`, `start`, and `stop` are present in the `grna_target_data_frame`, giving the chromosome, start position, and end position, respectively, of the region that each gRNA targets. `construct_cis_pairs()` takes several arguments: `sceptre_object` (required) `distance_threshold` (optional), `positive_control_pairs` (optional), and `response_position_data_frame` (optional). By default, `construct_cis_pairs()` pairs each gRNA target to the set of responses on the same chromosome as that target and within `distance_threshold` bases of that target. (The default value of `distance_threshold` is 500,000 bases, or half a megabase.) The `positive_control_pairs` data frame optionally can be passed to `construct_cis_pairs()`, in which case the positive control targets (i.e., the entries within the `grna_target` column of `positive_control_pairs`) are excluded from the *cis* pairs. One may want to exclude these from the discovery analysis if these targets are intended for positive control purposes only. See \href{https://timothy-barry.github.io/sceptre-book/set-analysis-parameters.html#sec-set-analysis-parameters_construct_cis_pairs}{Section 2.2.2 of the manual} for more detailed information about this function.
+#' `construct_cis_pairs()` is a helper function to facilitate construction of the *cis* pairs. `construct_cis_pairs()` returns the set of target-response pairs for which the target and response are located on the same chromosome and in close physical proximity to one another. `construct_cis_pairs()` is a useful pair constructor function for screens that aim to map noncoding regulatory elements (e.g., enhancers or noncoding GWAS variants) to target genes in *cis*. `construct_cis_pairs()` assumes that the columns `chr`, `start`, and `stop` are present in the `grna_target_data_frame`, giving the chromosome, start position, and end position, respectively, of the region that each gRNA targets. `construct_cis_pairs()` takes several arguments: `sceptre_object` (required), `distance_threshold` (optional), `positive_control_pairs` (optional), and `response_position_data_frame` (optional). By default, `construct_cis_pairs()` pairs each gRNA target to the set of responses on the same chromosome as that target and within `distance_threshold` bases of that target. (The default value of `distance_threshold` is 500,000 bases, or half a megabase.) The `positive_control_pairs` data frame can optionally be passed to `construct_cis_pairs()`, in which case the positive control targets (i.e., the entries within the `grna_target` column of `positive_control_pairs`) are excluded from the *cis* pairs. One may want to exclude these from the discovery analysis if these targets are intended for positive control purposes only. See \href{https://timothy-barry.github.io/sceptre-book/set-analysis-parameters.html#sec-set-analysis-parameters_construct_cis_pairs}{Section 2.2.2 of the manual} for more detailed information about this function.
 #'
 #' @param sceptre_object a `sceptre_object`
 #' @param distance_threshold (optional) target-response pairs located within `distance_threshold` bases of one another and on the same chromosome are included in the *cis* discovery set.
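
Note on the documentation above: the pairing rule it describes (same chromosome, within `distance_threshold` bases) can be illustrated with a small base-R sketch. This is not the `sceptre` implementation. The function `sketch_cis_pairs()`, the toy data, the use of the target region's midpoint as its location, and the inclusive comparison are all assumptions made for illustration only.

```r
# Hypothetical toy inputs; column names follow the documentation above.
grna_targets <- data.frame(
  grna_target = c("enh_1", "enh_2"),
  chr = c("chr1", "chr2"),
  start = c(1000000, 5000000),
  stop = c(1000500, 5000500)
)
gene_positions <- data.frame(
  response_id = c("gene_a", "gene_b", "gene_c"),
  chr = c("chr1", "chr1", "chr2"),
  position = c(1200000, 3000000, 5100000)
)

# Hypothetical helper, not part of sceptre.
sketch_cis_pairs <- function(targets, responses, distance_threshold = 5e5) {
  # Assumption: summarize each target region by its midpoint.
  targets$midpoint <- (targets$start + targets$stop) / 2
  # Joining on chromosome keeps only same-chromosome candidate pairs.
  candidates <- merge(targets, responses, by = "chr")
  # Assumption: "within" is treated as an inclusive distance comparison.
  keep <- abs(candidates$position - candidates$midpoint) <= distance_threshold
  out <- candidates[keep, c("grna_target", "response_id")]
  rownames(out) <- NULL
  out
}

pairs <- sketch_cis_pairs(grna_targets, gene_positions)
print(pairs)
```

In this toy example `gene_b` is dropped because it lies roughly two megabases from `enh_1`, while the other two genes fall within the default half-megabase window of a target on the same chromosome.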

diff --git a/data-raw/DATASET_gene_table.R b/data-raw/DATASET_gene_table.R
@@ -2,9 +2,9 @@ library(data.table)
 conflicts_prefer(dplyr::rename)
 conflicts_prefer(dplyr::filter)
 
-#############
-# hg 38 table
-#############
+###############
+# grch 38 table
+###############
 # CellRanger provides a human reference genome, which can be downloaded via the following command:
 # curl -O https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
 # The version of the reference is GRCh38. This script extracts the start position, end position,
@@ -31,17 +31,18 @@ gene_table <- cbind(dt_gene_chr[,c("chr", "start", "end", "strand")], gene_ids_a
   dplyr::mutate(chr = factor(chr)) |> dplyr::mutate(position = ifelse(strand == "+", start, end)) |>
   dplyr::select(-start, -end, -strand)
 data.table::setorderv(gene_table, c("chr", "position"))
+gene_table <- gene_table |> dplyr::select(response_id, response_name, chr, position)
 gene_position_data_frame_grch38 <- gene_table
 usethis::use_data(gene_position_data_frame_grch38, internal = FALSE, overwrite = TRUE)
 
-#############
-# hg 19 table
-#############
+###############
+# grch 37 table
+###############
 rm(list = ls())
-# We obtained the hg37 reference genome from cellranger
+# We obtained the grch 37 reference genome from cellranger
 # wget ftp://ftp.ensembl.org/pub/grch37/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh37.82.gtf.gz
 library(rtracklayer)
-dt <- readGFF("~/research_offsite/external/ref/Homo_sapiens.GRCh37.82.gtf.gz")
+dt <- readGFF("~/research_offsite/external/ref/Homo_sapiens.GRCh37.82.gtf.gz") |> as.data.table()
 # retain only genes
 dt <- dt |> dplyr::filter(type == "gene")
 # keep only those genes on a chromosome
@@ -52,6 +53,6 @@ dt <- dt |>
   dplyr::select(response_id = gene_id,
                 response_name = gene_name,
                 chr, position)
-gene_position_data_frame_grch19 <- dt
-gene_position_data_frame_grch19$chr <- factor(gene_position_data_frame_grch19$chr)
-usethis::use_data(gene_position_data_frame_grch19, internal = FALSE, overwrite = TRUE)
+gene_position_data_frame_grch37 <- dt
+gene_position_data_frame_grch37$chr <- factor(gene_position_data_frame_grch37$chr)
+usethis::use_data(gene_position_data_frame_grch37, internal = FALSE, overwrite = TRUE)
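
Note on the script above: the GRCh38 section derives each gene's `position` from its strand, taking `start` for genes on the `+` strand and `end` otherwise. A minimal base-R sketch of that step on made-up data may help when reviewing the change. The gene records below are hypothetical and do not come from either reference genome.

```r
# Hypothetical toy gene annotations (not taken from a real reference).
genes <- data.frame(
  response_id = c("gene_a", "gene_b"),
  chr = c("chr1", "chr1"),
  start = c(100, 500),
  end = c(200, 900),
  strand = c("+", "-")
)

# Same rule as the script: start on the "+" strand, end otherwise.
genes$position <- ifelse(genes$strand == "+", genes$start, genes$end)

# Order by chromosome and position, keep the final columns, and store chr as a factor.
gene_table <- genes[order(genes$chr, genes$position),
                    c("response_id", "chr", "position")]
gene_table$chr <- factor(gene_table$chr)
print(gene_table)
```

The minus-strand gene `gene_b` gets position 900, its `end` coordinate, because transcription on that strand begins at the higher coordinate.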
diff --git a/data/gene_position_data_frame_grch19.rda b/data/gene_position_data_frame_grch19.rda
diff --git a/data/gene_position_data_frame_grch37.rda b/data/gene_position_data_frame_grch37.rda
diff --git a/data/gene_position_data_frame_grch38.rda b/data/gene_position_data_frame_grch38.rda
diff --git a/man/construct_cis_pairs.Rd b/man/construct_cis_pairs.Rd
diff --git a/man/gene_position_data_frame_grch38.Rd b/man/gene_position_data_frame_grch38.Rd
diff --git a/vignettes/sceptre.Rmd b/vignettes/sceptre.Rmd
@@ -113,7 +113,7 @@ We describe each step of the pipeline in greater detail below.
 
 ## 1. Import data 
 
-The first step is to import the data. **Data can be imported into `sceptre` from 10X Cell Ranger or Parse outputs, as well as from R matrices.** The simplest way to import the data is to read the output of one or more calls to `cellranger_count` into `sceptre` via the function `import_data_from_cellranger()`. `import_data_from_cellranger()` requires three arguments: `directories`, `grna_target_data_frame`, and `moi`.
+The first step is to import the data. **Data can be imported into `sceptre` from 10x Cell Ranger or Parse outputs, as well as from R matrices.** The simplest way to import the data is to read the output of one or more calls to `cellranger_count` into `sceptre` via the function `import_data_from_cellranger()`. `import_data_from_cellranger()` requires three arguments: `directories`, `grna_target_data_frame`, and `moi`.
 
 1.  `directories` is a character vector specifying the locations of the directories outputted by one or more calls to `cellranger_count`. Below, we set the variable `directories` to the (machine-dependent) location of the example CRISPRi data on disk.