Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow returning after only first match of pattern #119

Open
BlueDrink9 opened this issue Oct 31, 2024 · 3 comments
Open

Allow returning after only first match of pattern #119

BlueDrink9 opened this issue Oct 31, 2024 · 3 comments

Comments

@BlueDrink9
Copy link

The *matchPattern* family of functions currently return as many matches as possible. However, sometimes you just want to know whether there is any match between two sequences, so it is inefficient to continue searching after a match is found.

When searching a long list of sequences that are slightly longer than the pattern, vmatchPattern is about 2x slower than grepl(..., fixed=True), despite grepl having to translate to characters.

I need to eke out as much performance as possible because of the volume I am working with. It would be good to have an option, similar to grepRaw, to return early as soon as a match is found.

@hpages
Copy link
Contributor

hpages commented Oct 31, 2024

Sounds like a reasonable request and a good feature to have IMO.

Putting some timings here for reference:

library(Biostrings)
library(hgu95av2probe)
probes <- DNAStringSet(hgu95av2probe)

system.time(ok1 <- grepl("CATATG", hgu95av2probe$sequence, fixed=TRUE))
#    user  system elapsed 
#   0.005   0.000   0.005 

system.time(ok2 <- vcountPattern("CATATG", probes) != 0L)
#    user  system elapsed 
#   0.094   0.000   0.094 

identical(ok1, ok2)
# [1] TRUE

length(which(ok1))
# [1] 1086

library(microbenchmark)
microbenchmark(grepl("CATATG", hgu95av2probe$sequence, fixed=TRUE), vcountPattern("CATATG", probes) != 0L)
# Unit: milliseconds
#                                                   expr       min        lq
#  grepl("CATATG", hgu95av2probe$sequence, fixed = TRUE)  4.380226  4.564893
#                  vcountPattern("CATATG", probes) != 0L 90.140803 92.999982
#       mean   median        uq      max neval
#   5.095201  4.91384  5.397511  12.7959   100
#  94.830566 93.89611 94.740594 116.4534   100

Could be controlled via a first.only argument, set to FALSE to preserve the current behavior. Could be added to matchPattern()/vmatchPattern() and countPattern()/vcountPattern().

Note that the *PDict() family already supports something similar to this via the whichPDict()/vwhichPDict() functions but with the important difference that these functions won't return any information about the exact location of the first match (or any other match for that matter) in the subject sequence(s).

Any thoughts @ahl27 ?

Finally, and FWIW, a word of caution about grepRaw(all=TRUE). It's important to realize that even with all=TRUE, grepRaw() does not return all the matches:

pattern <- as.raw(c(5, 1, 5, 1))
x <- as.raw(c(99, 99, 99, 5, 1, 5, 1, 5, 1, 199))
grepRaw(pattern, x, fixed=TRUE, all=TRUE)
# [1] 4

The fact that it discards matches that overlap with a previous match is not mentioned in the documentation, which is unfortunate because this makes it unfit for searching most biological sequences.

H.

sessionInfo()

R Under development (unstable) (2024-10-22 r87265)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 23.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.5.r87265/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] microbenchmark_1.5.0 hgu95av2probe_2.18.0 AnnotationDbi_1.69.0
 [4] Biobase_2.67.0       Biostrings_2.75.0    GenomeInfoDb_1.43.0 
 [7] XVector_0.47.0       IRanges_2.41.0       S4Vectors_0.45.0    
[10] BiocGenerics_0.53.1  generics_0.1.3      

loaded via a namespace (and not attached):
 [1] crayon_1.5.3            vctrs_0.6.5             httr_1.4.7             
 [4] cli_3.6.3               rlang_1.1.4             DBI_1.2.3              
 [7] png_0.1-8               UCSC.utils_1.3.0        jsonlite_1.8.9         
[10] bit_4.5.0               KEGGREST_1.47.0         fastmap_1.2.0          
[13] memoise_2.0.1           compiler_4.5.0          RSQLite_2.3.7          
[16] blob_1.2.4              R6_2.5.1                GenomeInfoDbData_1.2.13
[19] tools_4.5.0             bit64_4.5.2             zlibbioc_1.53.0        
[22] cachem_1.1.0 

@ahl27
Copy link
Collaborator

ahl27 commented Oct 31, 2024

Seems like a good idea to me. I'm thinking it might just be better to add a limit argument (like nrec in readXStringSet or nlines in read.table) to future proof, with the default set to 0 for unlimited returns. I could see a request in the future asking for the first 2/3/4/n entries as well, and that would also address that.

I can add this to my todos for this cycle, should be fairly easy to implement.

@hpages
Copy link
Contributor

hpages commented Oct 31, 2024

Yep, limit sounds good. Could also be used in the future to request the last 2/3/4/n entries by letting the user specify a negative value if there's interest.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants