Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors in extracted intron sequences #13

Closed
csoneson opened this issue Sep 18, 2020 · 1 comment
Closed

Errors in extracted intron sequences #13

csoneson opened this issue Sep 18, 2020 · 1 comment

Comments

@csoneson
Copy link

Hi @lambdamoses - I'm still getting strange results for the extracted intron sequences with the current release version (1.2.2), when running get_velocity_files() with a gtf file from GENCODE and a DNAStringSet genome. More details below, but the short version is that I think the problematic line is

intron_seqs <- getSeq(Genome, introns)
and that the reason is this problem with getSeq(). Basically, the extracted introns are fine, but the names may be scrambled.

The following code would illustrate the problem (reference files here are downloaded from GENCODE vM24):

suppressPackageStartupMessages({
    library(Biostrings)
    library(BUSpaRse)
    library(GenomicFeatures)
})

## Get reference files
gtf <- "gencode.vM24.annotation.gtf"
genome <- "GRCm38.primary_assembly.genome.fa"
txomefa <- "gencode.vM24.transcripts.fa"
isoform_action <- "separate"
flanklength <- 50
outdir <- paste0("busparse_", isoform_action)
genome <- Biostrings::readDNAStringSet(genome)
names(genome) <- sapply(strsplit(names(genome), " "), .subset, 1)
txome <- Biostrings::readDNAStringSet(txomefa)
names(txome) <- sapply(strsplit(names(txome), "\\|"), .subset, 1)
txdb <- GenomicFeatures::makeTxDbFromGFF(gtf)
ebt <- GenomicFeatures::exonsBy(txdb, "tx", use.names = TRUE)
ibt <- GenomicFeatures::intronsByTranscript(txdb, use.names = TRUE)

## Extract features
BUSpaRse::get_velocity_files(
  X = gtf, L = flanklength + 1, Genome = genome, Transcriptome = NULL, 
  out_path = outdir, isoform_action = isoform_action, 
  exon_option = "full", compress_fa = TRUE, 
  transcript_id = "transcript_id", gene_id = "gene_id", 
  transcript_version = "transcript_version", 
  gene_version = "gene_version", version_sep = "."
)

## Read cDNA+intron fasta
busparse <- Biostrings::readDNAStringSet(
    file.path(paste0("busparse_", isoform_action), "cDNA_introns.fa.gz")
)
names(busparse) <- gsub("\\.-I", "-I", gsub("\\.$", "", names(busparse)))

## Compare transcripts
busparse_txome <- busparse[match(names(txome), names(busparse))]
table(busparse_txome == txome)   ## This is OK

## Compare introns for one example isoform
busparse_intrs <- busparse[grep("-I", names(busparse), value = TRUE)]
ebt$ENSMUST00000195885.1
width(ibt$ENSMUST00000195885.1)
busparse_intrs[grep("ENSMUST00000195885.1", names(busparse_intrs))]   ## This is not correct
Session info
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /tungstenfs/groups/gbioinfo/Appz/easybuild/software/OpenBLAS/0.3.7-GCC-8.3.0/lib/libopenblas_skylakex-r0.3.7.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] GenomicFeatures_1.40.1 AnnotationDbi_1.50.3   Biobase_2.48.0        
##  [4] GenomicRanges_1.40.0   GenomeInfoDb_1.24.2    BUSpaRse_1.2.2        
##  [7] Biostrings_2.56.0      XVector_0.28.0         IRanges_2.22.2        
## [10] S4Vectors_0.26.1       BiocGenerics_0.34.0   
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2                  tidyr_1.1.2                
##  [3] bit64_4.0.5                 assertthat_0.2.1           
##  [5] askpass_1.1                 BiocFileCache_1.12.1       
##  [7] blob_1.2.1                  BSgenome_1.56.0            
##  [9] GenomeInfoDbData_1.2.3      Rsamtools_2.4.0            
## [11] yaml_2.2.1                  progress_1.2.2             
## [13] pillar_1.4.6                RSQLite_2.2.0              
## [15] lattice_0.20-41             glue_1.4.2                 
## [17] digest_0.6.25               colorspace_1.4-1           
## [19] htmltools_0.5.0             Matrix_1.2-18              
## [21] XML_3.99-0.5                pkgconfig_2.0.3            
## [23] biomaRt_2.44.1              zlibbioc_1.34.0            
## [25] purrr_0.3.4                 scales_1.1.1               
## [27] BiocParallel_1.22.0         tibble_3.0.3               
## [29] openssl_1.4.2               generics_0.0.2             
## [31] AnnotationFilter_1.12.0     ggplot2_3.3.2              
## [33] ellipsis_0.3.1              SummarizedExperiment_1.18.2
## [35] lazyeval_0.2.2              magrittr_1.5               
## [37] crayon_1.3.4                memoise_1.1.0              
## [39] evaluate_0.14               tools_4.0.2                
## [41] prettyunits_1.1.1           hms_0.5.3                  
## [43] lifecycle_0.2.0             matrixStats_0.56.0         
## [45] stringr_1.4.0               plyranges_1.8.0            
## [47] munsell_0.5.0               DelayedArray_0.14.1        
## [49] ensembldb_2.12.1            compiler_4.0.2             
## [51] rlang_0.4.7                 grid_4.0.2                 
## [53] RCurl_1.98-1.2              rappdirs_0.3.1             
## [55] bitops_1.0-6                rmarkdown_2.3              
## [57] gtable_0.3.0                DBI_1.1.0                  
## [59] curl_4.3                    R6_2.4.1                   
## [61] GenomicAlignments_1.24.0    knitr_1.29                 
## [63] dplyr_1.0.2                 rtracklayer_1.48.0         
## [65] zeallot_0.1.0               bit_4.0.4                  
## [67] ProtGenerics_1.20.0         stringi_1.4.6              
## [69] Rcpp_1.0.5                  vctrs_0.3.4                
## [71] dbplyr_1.4.4                tidyselect_1.1.0           
## [73] xfun_0.16
lambdamoses added a commit that referenced this issue Dec 4, 2020
…roblem with BSgenome::getSeq when genome is DNAStringSet. Addressing #13
@lambdamoses
Copy link
Collaborator

Thank you for pointing this out. The package has been updated to bypass this problem with getSeq.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants