-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
findPalindrome / palindromeLeftArm and mismatches - extends into loop #32
Comments
The problem is that in general there is more than one way to look at a palindrome when mismatches are allowed and occur at the beginning or end of the arms. For example, if a maximum of 1 mismatch is allowed between the 2 arms, then the following palindrome:
can be seen as:
or as:
Note that this 2nd form maximizes the arms and minimizes the loop. The thing is that Note that something similar happens if the mismatch is between the first letter of the left arm and the last letter of the right arm. For example, if a maximum of 1 mismatch is allowed between the 2 arms, then the following sequence:
can be seen as containing the following palindrome:
or containing the following palindrome:
Hope this helps, |
I can see the challenge! It could get to be more like an alignment problem, where you might want to have different scores/penalties for mismatches versus extending length. In fact, in my particular use case, the palindromes are long (>50bp), and I'm looking within individual long sequencing reads for palindromes (where most of sequence reads will have palindromes, perhaps with loops, and containing mismatches between arms). So I will probably be better off using pairwiseAlignment after I split each sequence in half and reverse-complementing the second half. I'll experiment with that instead. |
Just to clarify: That will only work if the center of the palindrome is at the center of the read sequence. But not if you want to find palindromes at arbitrary locations in the read sequences. |
after a bit more testing, I'm not sure it's true that findPalindromes() will always find and report the palindrome with longest arms:
whereas this alternative choice has longer arms (but also a shorter loop):
Is findPalindomes prioritizing minimizing loop length over maximizing arm length? I can see there are many ways to do this, and I'm not sure what's the best way. But whatever way you choose, perhaps it would be good to include a little detail on the algorithm and scoring system in the man page, so that the method is more transparent. Someone using this naively (me!) may assume that there's an unambiguous best choice for how the best palindromes will be chosen. Perhaps also worth including a pointer to pairwiseAlignment, as that function might behave better for some combinations of arm length/loop length/mismatch number. |
yes, in my (likely unusual) use case, the palindromes should be centered at the read center, so I think pairwiseAlignment on the halves should work. The molecular biology should have done the palindrome-finding for us with this data.
|
It's expected to do that. If it's not then it's a bug. I need to fix that. Also will clarify this in the man page. |
Is there any update on this bug? I am also seeing what appears to be prioritization of minimizing loop length instead of maximizing arm length. Could this be because when the C function get_find_palindromes_at() reports a match and resets Biostrings/src/find_palindromes.c Lines 41 to 43 in cb171e6
|
I think I will have a fix for this today or tomorrow. Could @hpages contact me about how you would like pull requests handled? |
Erik's patch, which includes Thomas's earlier proposed changes, addresses the problem of With Biostrings 2.59.2:
The palindrome with longest arm is here. I'm not sure about also reporting the shorter one, don't see a good reason for it. This can pollute the results quite a bit if we increase the value of
which could quickly become an annoyance. Something that should probably be addressed. But at least the modified algorithm no longer misses results. H. |
Thanks, all - this looks great. One idea to handle the multiple short matches is to add an option to choose the palindrome with longest arms, or just let the user filter as they choose after the fact. I've moved on (at least for now) from the project where I was using findPalindromes, so won't be testing this in any depth in the near future. |
hi there,
Another thought about findPalindromes. Something about the way it treats mismatches between palindrome arms is a little unintuitive, biologically, while it is probably correct in the formal sense. My biologist's intuition is that mismatches should be tolerated WITHIN the palindrome arms. However, the code also allows mismatches at the termini of palindromes, so that palindromes get extended into the loop region. I want to tolerate mismatches so I can recognize diverged palindrome arms, but I don't want to artificially extend the palindrome arm into the loop. See code below.
I also found some weird behavior with one of my examples - also illustrated below.
thanks again,
Janet
The text was updated successfully, but these errors were encountered: