Inconsistent residue pairs output #62
@sureshhewabi - is it still possible to use the API without paging? (i.e. set page size to unlimited) |
@colin-combe I think the SQL you wrote for this does not return unique pairs, although @aozalevsky is expecting unique residue pairs in each call. Even if we remove the pagination, it will not fix the issue. Could you please modify the SQL according to @aozalevsky's requirements? |
there is a circumstance in which the returned residue pairs will not be unique - that is, you might get ProteinA residue 1 to ProteinB residue 100 and ProteinB residue 100 to ProteinA residue 1. Is it duplicate IDs that @aozalevsky is concerned about? Duplicate IDs are deliberate, assuming that the residue pairs are unique (with the exception above). The duplicate IDs denote ambiguity in the position of the peptide and, thereby, in the position of the links. So let me know if either of these two things is not the case. It's now harder to debug because you can't switch off pagination.
this is the real problem, not duplication? |
yeah, i know that's confusing... i'm trying to say duplicate IDs should be associated with different residue pairs. |
...i'll check some of the claims i just made |
It's not an issue. I'm using the following line

```python
sxl = tuple(sorted(((eid1, rid1), (eid2, rid2))))
```

to convert all residue pairs to a unified representation so i can a) deduplicate and b) compare to residue pairs from the PDB-DEV entry. The problem is that I always get exactly 1052 residue pair records, but a) they are different every time, and b) the list is incomplete (i.e., i can't get all 611 unique residue pairs after deduplication). I thought the residue pair ID (e.g. `SII_177702_1`) would be unique. Apparently, it can be associated with multiple residue pairs (which is a bit confusing), but the real problems are the inconsistency and incompleteness. Check the example (printing all residue pairs with a given ID from 3 independent calls):

```python
for r in rps_list[0]:
    if r['id'] == 'SII_177702_1':
        print(r)
```

```
{'id': 'SII_177702_1', 'file': 'membrane_5pc_1200IDs.mzid', 'pass': True, 'prot1': 'dbseq_P0CE48_target', 'prot1_acc': 'P0CE48', 'pos1': 265, 'prot2': 'dbseq_P0CE47_target', 'prot2_acc': 'P0CE47', 'pos2': 260}
{'id': 'SII_177702_1', 'file': 'membrane_5pc_1200IDs.mzid', 'pass': True, 'prot1': 'dbseq_P0CE47_target', 'prot1_acc': 'P0CE47', 'pos1': 265, 'prot2': 'dbseq_P0CE47_target', 'prot2_acc': 'P0CE47', 'pos2': 260}
{'id': 'SII_177702_1', 'file': 'membrane_5pc_1200IDs.mzid', 'pass': True, 'prot1': 'dbseq_P0CE47_target', 'prot1_acc': 'P0CE47', 'pos1': 265, 'prot2': 'dbseq_P0CE47_target', 'prot2_acc': 'P0CE47', 'pos2': 260}
{'id': 'SII_177702_1', 'file': 'membrane_5pc_1200IDs.mzid', 'pass': True, 'prot1': 'dbseq_P0CE48_target', 'prot1_acc': 'P0CE48', 'pos1': 265, 'prot2': 'dbseq_P0CE48_target', 'prot2_acc': 'P0CE48', 'pos2': 260}
{'id': 'SII_177702_1', 'file': 'membrane_5pc_1200IDs.mzid', 'pass': True, 'prot1': 'dbseq_P0CE47_target', 'prot1_acc': 'P0CE47', 'pos1': 265, 'prot2': 'dbseq_P0CE48_target', 'prot2_acc': 'P0CE48', 'pos2': 260}
{'id': 'SII_177702_1', 'file': 'membrane_5pc_1200IDs.mzid', 'pass': True, 'prot1': 'dbseq_P0CE48_target', 'prot1_acc': 'P0CE48', 'pos1': 265, 'prot2': 'dbseq_P0CE47_target', 'prot2_acc': 'P0CE47', 'pos2': 260}
{'id': 'SII_177702_1', 'file': 'membrane_5pc_1200IDs.mzid', 'pass': True, 'prot1': 'dbseq_P0CE47_target', 'prot1_acc': 'P0CE47', 'pos1': 265, 'prot2': 'dbseq_P0CE47_target', 'prot2_acc': 'P0CE47', 'pos2': 260}
{'id': 'SII_177702_1', 'file': 'membrane_5pc_1200IDs.mzid', 'pass': True, 'prot1': 'dbseq_P0CE48_target', 'prot1_acc': 'P0CE48', 'pos1': 265, 'prot2': 'dbseq_P0CE48_target', 'prot2_acc': 'P0CE48', 'pos2': 260}
{'id': 'SII_177702_1', 'file': 'membrane_5pc_1200IDs.mzid', 'pass': True, 'prot1': 'dbseq_P0CE47_target', 'prot1_acc': 'P0CE47', 'pos1': 265, 'prot2': 'dbseq_P0CE48_target', 'prot2_acc': 'P0CE48', 'pos2': 260}
```

```python
for r in rps_list[2]:
    if r['id'] == 'SII_177702_1':
        print(r)
```

```
{'id': 'SII_177702_1', 'file': 'membrane_5pc_1200IDs.mzid', 'pass': True, 'prot1': 'dbseq_P0CE48_target', 'prot1_acc': 'P0CE48', 'pos1': 265, 'prot2': 'dbseq_P0CE47_target', 'prot2_acc': 'P0CE47', 'pos2': 260}
{'id': 'SII_177702_1', 'file': 'membrane_5pc_1200IDs.mzid', 'pass': True, 'prot1': 'dbseq_P0CE47_target', 'prot1_acc': 'P0CE47', 'pos1': 265, 'prot2': 'dbseq_P0CE47_target', 'prot2_acc': 'P0CE47', 'pos2': 260}
{'id': 'SII_177702_1', 'file': 'membrane_5pc_1200IDs.mzid', 'pass': True, 'prot1': 'dbseq_P0CE48_target', 'prot1_acc': 'P0CE48', 'pos1': 265, 'prot2': 'dbseq_P0CE48_target', 'prot2_acc': 'P0CE48', 'pos2': 260}
{'id': 'SII_177702_1', 'file': 'membrane_5pc_1200IDs.mzid', 'pass': True, 'prot1': 'dbseq_P0CE47_target', 'prot1_acc': 'P0CE47', 'pos1': 265, 'prot2': 'dbseq_P0CE48_target', 'prot2_acc': 'P0CE48', 'pos2': 260}
```

```python
for r in rps_list[3]:
    if r['id'] == 'SII_177702_1':
        print(r)
```

```
# empty
```
|
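For reference, the canonicalisation step described above (collapsing end-swapped residue pairs into one key) can be sketched like this. The field names follow the API records printed above; the two example records are made up for illustration:

```python
# Minimal sketch of the deduplication described above: each record is
# reduced to an order-independent key so that A->B and B->A collapse
# to a single residue pair.

def canonical_pair(r):
    # sort the two (protein, position) ends so end-swapped duplicates match
    return tuple(sorted(((r['prot1'], r['pos1']), (r['prot2'], r['pos2']))))

# two end-swapped records (hypothetical examples)
records = [
    {'prot1': 'dbseq_P0CE48_target', 'pos1': 265,
     'prot2': 'dbseq_P0CE47_target', 'pos2': 260},
    {'prot1': 'dbseq_P0CE47_target', 'pos1': 260,
     'prot2': 'dbseq_P0CE48_target', 'pos2': 265},
]

unique = {canonical_pair(r) for r in records}
print(len(unique))  # the two end-swapped records collapse to 1 pair
```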
(was downloading and parsing PXD036833 locally, hence the delay) I'll take this as an example query (this is for @sureshhewabi's info):
so that's the query from here: https://github.com/PRIDE-Archive/xiview-api/blob/python3/app/routes/pdbdev.py#L148-L161 SII_184564_1 is the match ID with the most duplicates; in xiVIEW it looks like this: The result of the SQL query is this:
so that match ID is duplicated 4 times, but each time it is a different pair of residues. I think it's correct but confusing? (The xiVIEW screenshot shows how two homologous proteins result in 4 possible links.) However, in Arthur's test code that ID came up 12 times? So there is additional duplication coming from somewhere that isn't just the SQL query? (EDIT - or rather, if it's the SQL query then it's to do with the LIMIT and OFFSET) |
@aozalevsky - i wrote the above before reading your message,
that is the id of the match (aka Spectrum Identification Item), and it can result in 4 different residue pairs.
the idea of duplicating the match ID in the output is that you can then know which links are possibly just alternative explanations of the same thing. (@grandrea) Do you think this makes sense or do we need to look at this again?
ok |
@sureshhewabi - i'll add a GROUP BY so there isn't duplication of residue pairs, basically, just like you first asked... |
@sureshhewabi - have a look at this PR - PRIDE-Archive/xiview-api#5. I quickly added GROUP BY to the PDB-Dev endpoint (i tested the SQL but not the actual API code, sorry). It entails some other changes (some things become arrays). |
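The effect of the GROUP BY change can be illustrated with an in-memory SQLite sketch (the table and column names here are hypothetical, not the actual xiview-api schema): grouping by the residue-pair columns turns the per-match rows into one row per pair, with the match IDs aggregated into a single field. Without an explicit ordering inside the aggregate, the order of those IDs is not guaranteed:

```python
import sqlite3

con = sqlite3.connect(':memory:')
# hypothetical table, one row per matched residue pair per match
con.execute('CREATE TABLE matched_residues '
            '(match_id TEXT, prot1 TEXT, pos1 INT, prot2 TEXT, pos2 INT)')
con.executemany('INSERT INTO matched_residues VALUES (?,?,?,?,?)', [
    ('SII_197232_1', 'P02931', 139, 'P02931', 42),
    ('SII_21729_1',  'P02931', 139, 'P02931', 42),
    ('SII_1335_1',   'P02931', 304, 'P02931', 227),
])

# one output row per unique residue pair; match IDs collapsed into one field
rows = con.execute(
    'SELECT prot1, pos1, prot2, pos2, group_concat(match_id) '
    'FROM matched_residues GROUP BY prot1, pos1, prot2, pos2 '
    'ORDER BY prot1, pos1, prot2, pos2'
).fetchall()
for row in rows:
    print(row)  # 2 rows; the aggregated ID order is not guaranteed
```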
@colin-combe deployed your changes. Let me know if you still want to remove pagination |
Hi @sureshhewabi, I was a bit slow realising what was going on yesterday - i sent some irrelevant messages before properly realising the structure of the data we were sending out at that time, i.e. that it had a shed load of residue-pair duplication because it was one item per match rather than one item per residue pair. Then i realised what you were asking for.
OK, thanks. The question is whether it has fixed the problem with inconsistencies. But I guess one data item per unique residue pair suits Arthur better. Also, some fields in the response might not be necessary? E.g. the 'pass' field, as it seems like consumers might only ever ask for passing matches. (In which case the API params and the code could also be simplified.) Also, the "match_ids" and "files" attributes are maybe not necessary (they are now yucky arrays of values to compensate for the GROUP BY). It depends on whether consumers are interested in following up more detail about the match. If we're keeping "match_ids", we should maybe rename it to "ident_ids" or something, because 'match' isn't mzIdentML terminology.
i don't have such a strong view on that. I feel it makes it more annoying to consume, but allowing an unlimited page size risks it timing out (at some point in the future - it should be OK now). Maybe increase the default and maximum page sizes? @aozalevsky - Arthur, do you have a view about paging? |
(if it still behaves inconsistently, i might take out the pagination just to check if it behaves as expected then) |
@sureshhewabi i'm getting 503 errors for all endpoints, including the swagger docs. also https://www.ebi.ac.uk/pride/archive/crosslinking seems to be empty. |
@colin-combe Well, as we've discussed earlier, we need to keep pagination. It makes things slightly more complicated, but I'm totally fine with or without it. Pros:
My only suggestion is to increase the page size to, let's say, 1000 or 10000 lines. Transferring up to 1MB in one request is, IMHO, totally fine with modern connection speeds. |
@aozalevsky Sorry, it was down for a few minutes. I would like to know how you are using this endpoint, because whenever it is down, it gets spotted immediately. My understanding was that you use this endpoint just to run your pipeline, or maybe periodically (as a cron job). So far my understanding was that a few minutes of downtime is not that critical, although it shouldn't happen. If it is critical, I need to implement a test environment. |
I believe we need a test environment at some point; however, as there are other priorities, we did not want to do that at this moment. Please let me know how often you use the API and how critical the downtime is for you. |
@sureshhewabi Indeed, I use it rarely to generate static reports like this (I showed an example during the call). Probably it's just a coincidence :) I'm in a different timezone (San Francisco, GMT-8) with a slightly weird schedule to accommodate occasional calls with collaborators from Europe and India, so I guess when you leave it for an "end of the day deployment," my work day just starts :) |
@aozalevsky Got it. Thanks for the clarification. Would you like to join our Slack channel, so that if you experience any issues you can communicate with us directly instead of writing here in issues? Because most of the time they are just temporary issues. |
On the other hand, just as a suggestion: if this could break your reports, maybe you can cache/store the data on your side and pull from the API periodically. |
```python
for i, (ri, rj) in enumerate(zip(rps_list[0], rps_list[1])):
    if ri != rj:
        print('#' * 80)
        print(i)
        print('-' * 40)
        print(ri)
        print('-' * 40)
        print(rj)
```

```
################################################################################
19
----------------------------------------
{'match_ids': ['SII_197232_1', 'SII_21729_1', 'SII_48226_1', 'SII_164195_1'], 'files': ['membrane_5pc_1200IDs.mzid', 'membrane_5pc_1200IDs.mzid', 'membrane_5pc_1200IDs.mzid', 'membrane_5pc_1200IDs.mzid'], 'pass': [True, True, True, True], 'prot1': 'dbseq_P02931_target', 'prot1_acc': 'P02931', 'pos1': 139, 'prot2': 'dbseq_P02931_target', 'prot2_acc': 'P02931', 'pos2': 42}
----------------------------------------
{'match_ids': ['SII_21729_1', 'SII_197232_1', 'SII_48226_1', 'SII_164195_1'], 'files': ['membrane_5pc_1200IDs.mzid', 'membrane_5pc_1200IDs.mzid', 'membrane_5pc_1200IDs.mzid', 'membrane_5pc_1200IDs.mzid'], 'pass': [True, True, True, True], 'prot1': 'dbseq_P02931_target', 'prot1_acc': 'P02931', 'pos1': 139, 'prot2': 'dbseq_P02931_target', 'prot2_acc': 'P02931', 'pos2': 42}
################################################################################
25
----------------------------------------
{'match_ids': ['SII_1335_1', 'SII_122897_1', 'SII_182749_1', 'SII_64018_1'], 'files': ['membrane_5pc_1200IDs.mzid', 'membrane_5pc_1200IDs.mzid', 'membrane_5pc_1200IDs.mzid', 'membrane_5pc_1200IDs.mzid'], 'pass': [True, True, True, True], 'prot1': 'dbseq_P02931_target', 'prot1_acc': 'P02931', 'pos1': 304, 'prot2': 'dbseq_P02931_target', 'prot2_acc': 'P02931', 'pos2': 227}
----------------------------------------
{'match_ids': ['SII_122897_1', 'SII_182749_1', 'SII_1335_1', 'SII_64018_1'], 'files': ['membrane_5pc_1200IDs.mzid', 'membrane_5pc_1200IDs.mzid', 'membrane_5pc_1200IDs.mzid', 'membrane_5pc_1200IDs.mzid'], 'pass': [True, True, True, True], 'prot1': 'dbseq_P02931_target', 'prot1_acc': 'P02931', 'pos1': 304, 'prot2': 'dbseq_P02931_target', 'prot2_acc': 'P02931', 'pos2': 227}
<...>
```
returns
This is clearly not true, because I can manually request all the other pages (up to 7) and eventually get complete data.

yep, slack is fine. can you add me using [email protected]? |
@sureshhewabi Yes, we have caching on our side and are already using it for PRIDE data. |
i think the remaining duplication is the end swapping (unimportant, as you say).
@sureshhewabi - this would make the paging unreliable, like we saw before? |
@sureshhewabi - the SQL queries need an ORDER BY clause if we have pagination? |
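For background on why the ORDER BY matters here, a minimal SQLite sketch (a hypothetical `pairs` table, not the real xiview-api schema): a total ordering over the selected columns makes LIMIT/OFFSET pages deterministic, so concatenating the pages yields the complete set exactly once.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE pairs (prot1 TEXT, pos1 INT, prot2 TEXT, pos2 INT)')
con.executemany('INSERT INTO pairs VALUES (?,?,?,?)',
                [('P02931', i, 'P02931', i + 7) for i in range(25)])

def fetch_page(page_no, page_size=10):
    # The ORDER BY over all grouped columns gives a stable total order;
    # without it the database may return rows in any order, so successive
    # LIMIT/OFFSET pages can overlap or miss rows.
    return con.execute(
        'SELECT * FROM pairs ORDER BY prot1, pos1, prot2, pos2 '
        'LIMIT ? OFFSET ?', (page_size, (page_no - 1) * page_size)
    ).fetchall()

all_rows = [r for p in (1, 2, 3) for r in fetch_page(p)]
print(len(all_rows), len(set(all_rows)))  # 25 25: complete, no duplicates
```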
@colin-combe Let me check this tomorrow. |
@colin-combe @sureshhewabi any updates? the biggest issue so far is the incorrect counts:

```json
"page": {
    "page_no": 1,
    "page_size": 99,
    "total_elements": 1,
    "total_pages": 1
}
```
|
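While the reported counts are unreliable, a defensive client can avoid depending on `total_pages` by paging until an empty page comes back. A sketch, with a hypothetical `fake_fetch` standing in for the real HTTP call:

```python
def fetch_all(fetch_page, page_size=99):
    """Collect every record without trusting the reported total_pages:
    keep requesting pages until one comes back empty."""
    records, page_no = [], 1
    while True:
        page = fetch_page(page_no, page_size)
        if not page:
            break
        records.extend(page)
        page_no += 1
    return records

# stand-in for the real HTTP call, serving 653 dummy records
data = [{'id': i} for i in range(653)]
def fake_fetch(page_no, page_size):
    start = (page_no - 1) * page_size
    return data[start:start + page_size]

print(len(fetch_all(fake_fetch)))  # 653: all records recovered
```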
I'm pretty confident that anything that uses pagination based on SQL queries needs an ORDER BY clause, so that was a mistake, which should now be fixed. But I think you know about this; it was mainly @sureshhewabi that worked on the pagination. I have looked at it and don't see a problem. I think I'm stating the obvious, but i guess https://github.com/PRIDE-Archive/xiview-api/blob/python3/app/routes/pdbdev.py#L215 isn't returning what we think... Could you look into it, @sureshhewabi? There was also a suggestion about page size -
|
@colin-combe Increasing the page size to 10,000. I am deploying this first. |
Fixed the pagination issue as well. Next is to check the inconsistency |
i think the inconsistency was fixed by adding the ORDER BY |
this is fixed and can be closed? |
yes. the ordering of the data (at least for |
to clarify: the ordering of the rows must not change between calls, or the total output will potentially be incomplete and inconsistent. So when you say the ordering of the data changes, is it only the order of the IDs in the aggregated 'match_ids' array column? |
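If only the order inside the aggregated arrays varies between calls, responses can be compared after normalising those arrays. A small sketch (the `normalise` helper and example records are hypothetical, using field names from the output above):

```python
def normalise(record):
    # sort the aggregated columns so only genuine content differences remain
    r = dict(record)
    for key in ('match_ids', 'files', 'pass'):
        if isinstance(r.get(key), list):
            r[key] = sorted(r[key])
    return r

# same residue pair, aggregated IDs returned in a different order
a = {'match_ids': ['SII_197232_1', 'SII_21729_1'], 'pos1': 139, 'pos2': 42}
b = {'match_ids': ['SII_21729_1', 'SII_197232_1'], 'pos1': 139, 'pos2': 42}
print(normalise(a) == normalise(b))  # True
```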
I rechecked again. Right, the current behaviour is:
it's ok for our purposes. |
The data returned by the residue-pairs endpoint is inconsistent. While the overall size of the output is constant, the content is different every time, and thus incomplete.
Here is my test code:
And the output:
Only by combining all 10 independent calls did I get the correct number of 611 unique residue pairs, as reported by xiVIEW (without TD and DD decoys).
[screenshot: xiVIEW results]
Most surprisingly, the number of duplicated IDs is also inconsistent. Thus, duplication seems to happen on a per-line rather than per-page basis.
Let me know if you can confirm the behavior or find an error in my test.