Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disagreement between sequences and names for getSeq,XStringSet #5

Closed
csoneson opened this issue Nov 24, 2019 · 3 comments · Fixed by Bioconductor/IRanges#39
Closed

Comments

@csoneson
Copy link

Hi,
apologies if I have misunderstood something crucial, but it seems to me that the matching between sequences and sequence names is off when using getSeq with a DNAStringSet object, if the ranges in names are not sorted by chromosome (in increasing order by levels(seqnames(names))). A small example:

> suppressPackageStartupMessages({
+   library(Biostrings)
+   library(BSgenome)
+   library(GenomicRanges)
+ })
> (genome <- DNAStringSet(c(
+   chr1 = paste(sample(c("C", "G", "A", "T"), 100, replace = TRUE), collapse = ""), 
+   chr2 = paste(sample(c("C", "G", "A", "T"), 100, replace = TRUE), collapse = "")
+ )))
  A DNAStringSet instance of length 2
    width seq                                              names               
[1]   100 GGCTCGAAAACATGGAATCTAGC...TAGCAACCCGCGGAGCCGCAAG chr1
[2]   100 TTGGGCTAACGAGAGACACGAAG...TAGAGCGAGGTTCTCTAATACG chr2
> gtf <- GRanges(
+   seqnames = c("chr1", "chr1", "chr2", "chr2"), 
+   ranges = IRanges(start = c(3, 10, 50, 70), 
+                    end = c(15, 20, 70, 80)), 
+   strand = c("+", "-", "+", "-")
+ )
> names(gtf) <- paste0("E", 1:4)
> gtf
GRanges object with 4 ranges and 0 metadata columns:
     seqnames    ranges strand
        <Rle> <IRanges>  <Rle>
  E1     chr1      3-15      +
  E2     chr1     10-20      -
  E3     chr2     50-70      +
  E4     chr2     70-80      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
> ## Extract sequences - all good
> (gs <- getSeq(x = genome, names = gtf))
  A DNAStringSet instance of length 4
    width seq                                              names               
[1]    13 CTCGAAAACATGG                                    E1
[2]    11 AGATTCCATGT                                      E2
[3]    21 ATCGCTGGCGTGAACGCTAGC                            E3
[4]    11 TAGCGGATGCG                                      E4
> names(gs)
[1] "E1" "E2" "E3" "E4"
> ## Reverse order and extract sequences - sequences fine, names not matching
> gtf2 <- gtf[4:1]
> (gs2 <- getSeq(x = genome, names = gtf2))
  A DNAStringSet instance of length 4
    width seq                                              names               
[1]    11 TAGCGGATGCG                                      E2
[2]    21 ATCGCTGGCGTGAACGCTAGC                            E1
[3]    11 AGATTCCATGT                                      E4
[4]    13 CTCGAAAACATGG                                    E3
> names(gs2)
[1] "E2" "E1" "E4" "E3"

I suspect it's happening here:

ans <- unsplit(extractAt(x[names(rl)], unname(rl)), seqnames(gr))

> rl <- as(gtf2, "IntegerRangesList")
> extractAt(genome[names(rl)], unname(rl))
DNAStringSetList of length 2
[["chr1"]] E2=ACATGGAATCT E1=CTCGAAAACATGG
[["chr2"]] E4=CGCATCCGCTA E3=ATCGCTGGCGTGAACGCTAGC
> unsplit(extractAt(genome[names(rl)], unname(rl)), seqnames(gtf2))
  A DNAStringSet instance of length 4
    width seq                                              names               
[1]    11 CGCATCCGCTA                                      E2
[2]    21 ATCGCTGGCGTGAACGCTAGC                            E1
[3]    11 ACATGGAATCT                                      E4
[4]    13 CTCGAAAACATGG                                    E3

Session info:

> sessionInfo()
R Under development (unstable) (2019-10-30 r77336)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] BSgenome_1.55.1      rtracklayer_1.47.0   GenomicRanges_1.39.1
[4] GenomeInfoDb_1.23.0  Biostrings_2.55.0    XVector_0.27.0      
[7] IRanges_2.21.1       S4Vectors_0.25.0     BiocGenerics_0.33.0 

loaded via a namespace (and not attached):
 [1] matrixStats_0.55.0          lattice_0.20-38            
 [3] XML_3.98-1.20               Rsamtools_2.3.2            
 [5] GenomicAlignments_1.23.1    bitops_1.0-6               
 [7] grid_4.0.0                  zlibbioc_1.33.0            
 [9] Matrix_1.2-17               BiocParallel_1.21.0        
[11] tools_4.0.0                 Biobase_2.47.0             
[13] RCurl_1.95-4.12             DelayedArray_0.13.0        
[15] compiler_4.0.0              SummarizedExperiment_1.17.0
[17] GenomeInfoDbData_1.2.2     

However, I tried also with Bioconductor releases 3.8, 3.9 and 3.10, with the same result.

@hpages
Copy link
Contributor

hpages commented Nov 26, 2019

Thanks Charlotte.

@lawremi Michael, can you please look into this? Thanks!

@hpages
Copy link
Contributor

hpages commented Mar 25, 2021

@lawremi Michael, this is your work. Do you think you can take care of it? Thanks

@lawremi
Copy link
Collaborator

lawremi commented Mar 25, 2021

Thanks, looks like I missed the initial ping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants