sort-on multiple indexes is broken#81 #82
…ve of the data. New test test_sort_on_two2 to show the problem.
Bad test with limit = 10
I assume that you have evidence that your change in `sort_nbest_reverse` is necessary. As commented earlier, a symmetric change should then be necessary for the symmetric `sort_nbest` as well. Maybe you can design a specific test for this?
I have my doubts that the current implementations of `sort_nbest` and `sort_nbest_reverse` already deliver correct results with multiple sort indexes. They keep only `limit` elements from the result, but to decide which elements to keep they look only at the first sort index (and ignore any follow-up sort indexes); this cannot be right. I recommend designing a test where the first sort index has the same value for all hits and the values for the second sort index are not properly sorted.
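The suspected failure mode can be reproduced outside ZCatalog with plain tuples. The following is only an illustrative sketch: `naive_nbest` and `full_sort` are hypothetical helpers, not the Catalog code. `naive_nbest` picks the survivors by looking at the first key only, which is the behaviour doubted above.

```python
# Hypothetical stand-ins for illustration; this is not the ZCatalog code.
# Each hit is a (first_key, second_key, doc_id) tuple.

def naive_nbest(hits, limit):
    # Decide which hits to keep by looking at the first key only ...
    survivors = sorted(hits, key=lambda hit: hit[0])[:limit]
    # ... and apply the remaining keys only to the survivors.
    return sorted(survivors)


def full_sort(hits, limit):
    # Reference behaviour: sort by all keys, then truncate.
    return sorted(hits)[:limit]


# The first key is identical for all hits; the second key is not presorted.
hits = [("ham", 3, "doc1"), ("ham", 2, "doc2"), ("ham", 1, "doc3")]

print(naive_nbest(hits, 2))
print(full_sort(hits, 2))
```

With this data the two functions disagree: the naive variant never sees `doc3`, although it has the smallest second key.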
src/Products/ZCatalog/Catalog.py
Outdated
@@ -907,8 +907,11 @@ def _sort_nbest_reverse(self, actual_result_count, result, rs,
        # This document is not in the sort key index, skip it.
        actual_result_count -= 1
    else:
        if n >= limit and key >= best:
            continue
        try:
Similar code is present in the companion `_sort_nbest`, too. If the change is necessary here, it will almost surely be necessary there as well.
Sorry! I agree with your doubts. The current use of `limit` in nbest/reverse seems broken to me. This is evident, since the new tests fail on this code. A good starting point could be `test_sort_on_two_reverse`. Cheers,
Volker Jaenisch (PhD) wrote at 2019-8-28 01:15 -0700:
The change in sort_nbest_reverse was not intended. It's just a temporary debugging hook for pdb. No changes to Catalog.py are necessary for this pull request. This pull request should only introduce better test results.
It is a good idea to add an entry in `CHANGES.rst`. This way, everybody
can easily see what was changed (and what was not).
Your title suggests that the PR wants to fix the broken sort implementation
(not just to improve the corresponding tests). Likely, the
review request came too early (I am aware that it did not come from you).
Please rerequest a review once the implementation is (as far as you
are aware) fixed.
As discussed earlier, sorting via index iteration and the use of "nbest"
can be employed even for a sort over several indexes *BUT*
it must be slightly different than the implementation for a single index.
E.g., for "nbest", a former sorting stage may need to produce more
than `limit` candidates to allow later stages to fine tune the sorting
(if necessary - i.e. if the former stages could not "separate" the latest
candidates).
Thank you for the hints. I am a bit confused about how to send in a pull request for Zope. Each community has its own unspoken rules about how to proceed.
I agree fully with your hints on the correct implementation of n-sort.
Data: Task: Sort after first, second column, limit = 2 First in index is result = Cheers, Volker |
Volker Jaenisch (PhD) wrote at 2019-8-28 02:38 -0700:
1. What version number shall I use in changes.rst?
The newest one (the section marked with "unreleased").
2. I think the best way may be to send in a new PR for just the advanced testing? There I could also include a test like the one you recommended.
It would be preferable if your PR not only added the tests
but also fixed any problems: in order to merge a PR,
all tests should pass -- otherwise other activities in the package would
become more difficult due to failing tests.
...
- Use the first index to find 'limit' sorted results.
- Then take the last result, get its value in the index. Add all the entries with the same index value.
Good!
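The staged approach quoted above could look roughly like the sketch below. It is an assumption-laden illustration, not the PR's code: `nbest_multi` is a hypothetical name, hits are plain `(first_key, second_key, doc_id)` tuples, and stage 1 uses a full sort on the first key purely for brevity.

```python
def nbest_multi(hits, limit):
    """Return the first `limit` hits sorted by all keys (illustrative sketch)."""
    if limit <= 0 or not hits:
        return []
    # Stage 1: use the first key to find `limit` sorted candidates.
    by_first = sorted(hits, key=lambda hit: hit[0])
    candidates = by_first[:limit]
    # Stage 2: take the last candidate's first-key value and add every
    # remaining hit that shares this value.
    cutoff = candidates[-1][0]
    candidates += [hit for hit in by_first[limit:] if hit[0] == cutoff]
    # Stage 3: sort the (possibly enlarged) candidate set by all keys
    # and truncate to `limit`.
    return sorted(candidates)[:limit]


hits = [("eggs", 1, "d1"), ("ham", 2, "d2"), ("ham", 1, "d3")]
print(nbest_multi(hits, 2))
```

Here stage 2 enlarges the candidate set to three hits, so the later truncation can still pick `("ham", 1, "d3")` over `("ham", 2, "d2")`.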
Thank you for your guidance @d-maurer ! I will come up with a "combined" PR with the new tests, and a solution with the 3-stage sorting. Cheers, |
…r batching applied.
OK. Found some time to dig into this. Please review Cheers, Volker |
result = multisort(result, sort_spec)
# we have multi index sorting
result = self._multi_index_nbest(
    actual_result_count,
Moving the sorting code into a separate method (which is in general a good idea) prevents `actual_result_count` from being updated. You must either wrap the (`int`) `actual_result_count` in a "mutable" object (i.e. updatable in place) or return the updated value as part of the return value (and reassign). `ZCatalog` filters out hits for which at least one of the sort indexes lacks a value; a test examining the correct `actual_result_count` would thus ensure that some hits lack a sort value and verify the correct result size.
if sort_index_length == 1:
    index_key_map = sort_index.documentToKeyMap()
Ideally, `sort_nbest` and `sort_nbest_reverse` would be symmetric. Your change above seems to have reduced the symmetry.
You have refactored the multi-index sorting into a single method. I would go a step further and merge `sort_nbest` and `sort_nbest_reverse` into a single function (in my view those methods are also named unintuitively; the `heapq` module contains similar functions named `nsmallest` and `nlargest`, respectively, which would be much better names). I would use `functools.partial` to instantiate the proper sorting function from it.
except KeyError:
    # This document is not in the sort key index, skip it.
    # ToDo: Is this the correct/intended behavior???
    actual_result_count -= 1
As `actual_result_count` is an `int` and thus immutable, this does not change the value in the caller. As mentioned before, you must do something special to get the updated value to the caller (e.g. return the updated value as part of the return value).
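A minimal sketch of the "return the updated value" option follows. The function `sort_hits` and its arguments are hypothetical and only mimic the pattern under discussion: hits without a sort key are skipped and the count is adjusted accordingly.

```python
def sort_hits(actual_result_count, doc_ids, key_for_doc):
    """Sort doc ids by key; return (sorted_ids, updated_count)."""
    keyed = []
    for did in doc_ids:
        try:
            keyed.append((key_for_doc[did], did))
        except KeyError:
            # Rebinding an int only changes the local name, so the
            # caller never sees this unless the value is returned.
            actual_result_count -= 1
    keyed.sort()
    return [did for _key, did in keyed], actual_result_count


count = 3
# Document 2 has no sort key and is therefore filtered out.
result, count = sort_hits(count, [1, 2, 3], {1: "b", 3: "a"})
print(result, count)
```

The caller reassigns `count` from the returned tuple, which is what keeps the reported result size in step with the delivered hits.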
src/Products/ZCatalog/Catalog.py
Outdated
# Sort the sort index_values
sorted_index_values = sorted(
    did_by_index_value.keys(),
    reverse=reverse)
While this is correct, you lose the former optimization (use `nbest` rather than full sorting to save sorting time). This may be relevant if the first sort index is large (e.g. a timestamp based index such as `modified`, `effective`, ...). I would avoid the full sorting and use `heapq.n[smallest|largest]` (depending on `reverse`) with `limit` as `n`. You might even use more elementary `heapq` functions to sort incrementally and stop as soon as you have found enough documents.
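The incremental variant mentioned last might be sketched as follows. This is an assumption-based illustration, not the PR's implementation: `collect_candidates` is a hypothetical name, `did_by_index_value` is assumed to map each first-index value to its list of document ids, and only the ascending direction is shown (a descending variant would need `heapq.nlargest` or inverted keys).

```python
import heapq


def collect_candidates(did_by_index_value, limit):
    """Pop distinct index values in ascending order until at least
    `limit` documents are collected (illustrative sketch)."""
    heap = list(did_by_index_value)  # the distinct first-index values
    heapq.heapify(heap)              # linear in the number of values
    candidates = []
    while heap and len(candidates) < limit:
        value = heapq.heappop(heap)  # next smallest value
        # All documents sharing this value come along, so ties at the
        # cutoff are never split before the later sort stages.
        candidates.extend((value, did) for did in did_by_index_value[value])
    return candidates


did_by_index_value = {"ham": [2, 3], "eggs": [1], "spam": [4]}
print(collect_candidates(did_by_index_value, 2))
```

Only as many values are popped as are needed, so the large tail of the first index is never fully sorted.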
I will go for a heapq implementation of the algorithm.
Cheers,
Volker
After digging into heapq: heapq assumes a fixed-size heap.
So if, say, limit=2 but in reality I need three elements:
Result-set:
eggs 1
ham 2
ham 1
Expected Result set:
eggs 1
ham 1
I would start with a heap of length limit=2. Then I would have to extend the heap iteratively and push all the index values in again. This would cost 2 x O(n) (one for heapq.heapify and one for the pushes) for each additional item I need.
I think it may be preferable to switch to a linear search here. Finding all results with the same index value as the largest/smallest value (not already in the sort body) is O(n), and this operation takes place only once.
Cheers,
Volker
Sorry for the delay. Will work on this at the weekend
Finally!
I pushed a new version using heapq.
Ideally, sort_nbest and sort_nbest_reverse would be symmetric. Your change above seems to have reduced the symmetry.
You have refactored the multi-index sorting into a single method. I would go a step further and merge sort_nbest and sort_nbest_reverse into a single function (in my view those methods are also named unintuitively; the heapq module contains similar functions named nsmallest and nlargest, respectively - which would be much better names). I would use functools.partial to instantiate the proper sorting function from it.
Currently these methods use a bisecting scheme that is probably as efficient as heapq if the bisecting is implemented in C code. One can replace this code with heapq so that the same sorting algorithm is used throughout.
I will start doing this now.
Volker
My heapq code currently sorts only the index values [primitive].
If we want to handle the former bisect code, we have to sort a list of tuples [(primitive, did)].
For a benchmark sorting 100 million values as [primitive] or as [(primitive, did)], I got:
import heapq
import random
import datetime

NUM = 100000000

a = []
b = []
for i in range(NUM):
    a.append(random.random())
    b.append((random.random(), random.random()))

# Time nlargest on plain floats.
start = datetime.datetime.now()
x = heapq.nlargest(100, a)
now = datetime.datetime.now()
diff = now - start
print(diff.total_seconds())

# Time nlargest on (float, float) tuples.
start = datetime.datetime.now()
y = heapq.nlargest(100, b)
now = datetime.datetime.now()
diff = now - start
print(diff.total_seconds())

8.796026
11.152541
One will need 27% more time for heapq-sort to sort a list of tuples than only primitives.
So I am uncertain which way to go.
Cheers,
Volker
"github" has special UI support for review requests. It is always present at the top of the right column of a pull request. In your case, there has already been a review. In this case, there are additional review-related UI actions to the right of the review, among them "rerequest review". Using the "github" support for review requests has the advantage that the reviewer can get a list of all his pending requests, reducing the risk that some review is forgotten.
@volkerjaenisch Thank you for working on this pull request. To be able to merge it – once it is ready – you need to sign the contributor agreement, see https://www.zope.org/developer/becoming-a-committer.html |
I just noticed that I have not investigated the potential consequences. I suspect that in some situations some batching functions (e.g. "go to last page", "next page") may fail because the batching sees a hit number including hits lacking a sort value (and which therefore will not be delivered). |
I agree on the problems with actual_result_count (see my ToDo statements). IMHO the handling of datasets with no index values has to be solved in the way you describe from Products.AdvancedQuery.
Volker Jaenisch (PhD) wrote at 2019-8-29 00:54 -0700:
I agree on the problems with actual_result_count (see my ToDo statements). IMHO the handling of datasets with no index values has to be solved in the way you describe from Products.AdvancedQuery.
But I would like to separate the current issue "sorting fails" from the next issue "actual_result_count is wrong/misleading".
I am okay with this approach. However, ensure that the
`actual_result_count` computed in the `multi_index_sort` is passed on
to the caller.
Ahhhh! Sorry I missed that! Fixed. Cheers, Volker |
I am already accepted as a committer.
Volker Jaenisch (PhD) wrote at 2019-8-29 12:18 -0700:
...
I do not think that we can avoid a full sort (on the first index).
Sure, you can: sorting over several indexes calls for a
lexicographic search where each index represents one lexicographic level.
With a lexicographic search, early search levels may already determine
the sort order (if they "separate" the elements).
Assume that *f1* to *fn* represents the sorted list of distinct hit
values for the first index; assume that *n1* to *nn* are the corresponding
numbers of hits with those index values. Assume that you want
the first *limit* hits correctly sorted and
you have an *m* with *sum(nx for 1 <= x <= m) >= limit*, then it is sufficient
to determine *f1* to *fm* to get the first *limit* hits correctly sorted --
because any hit with a first index value above *fm* will not be among the
first *limit* sorted hits.
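The counting argument above can be checked on small numbers. In this sketch, `cutoff_position` is a hypothetical helper and `counts` stands for *n1* to *nn*, the number of hits per distinct first-index value in sorted order.

```python
from itertools import accumulate


def cutoff_position(counts, limit):
    """Return the smallest m (1-based) whose cumulative hit count
    reaches `limit`, or len(counts) if the total stays below it."""
    for m, total in enumerate(accumulate(counts), start=1):
        if total >= limit:
            return m
    return len(counts)


# Three distinct first-index values with 1, 2 and 4 hits respectively.
counts = [1, 2, 4]
print(cutoff_position(counts, 2))
```

For `limit = 2` the cumulative counts are 1 and then 3, so only the first two distinct values need to be determined; hits carrying the third value cannot appear among the first two results.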
Volker Jaenisch (PhD) wrote at 2019-8-29 14:21 -0700:
...
After digging into heapq: heapq assumes a fixed-size heap.
So if say limit=2 but I need in reality three elements
You could use your *limit* as heap size (the parameter from your sort
function). It might be too large but it is surely large enough.
@volkerjaenisch wrote:
Hm, as a committer you should have write access to this repository and GitHub should mention you as "Member" instead of "First-time contributor". I did not find your name in the list of the members of the zopefoundation GitHub organization. Are you sure you signed a contributor agreement for Zope and not only for Plone? (Maybe you signed it very long ago, in the times of svn.zope.org, and your access rights have not been migrated.)
Better documentation
Volker Jaenisch (PhD) wrote at 2019-9-8 11:44 -0700:
volkerjaenisch commented on this pull request.
...
+ did_by_index_value[index_value].append(did)
...
One will need 27% more time for heapq-sort to sort a list of tuples than only primitives.
You have already preclassified the documents by index value
(--> `did_by_index_value`). Such a preclassification allows
you to sort the index values (and get the corresponding "did"s
from the preclassification map).
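Put together with the earlier hint to use *limit* as heap size, the preclassification idea might be sketched like this. It is only an illustration under stated assumptions: `nbest_by_first_index` and `first_key_of` are hypothetical names, and the real code works on index objects rather than a plain dict.

```python
import heapq
from collections import defaultdict


def nbest_by_first_index(first_key_of, doc_ids, limit, reverse=False):
    """Return candidate doc ids for the first `limit` hits (sketch)."""
    # Preclassify the documents by their first-index value.
    did_by_index_value = defaultdict(list)
    for did in doc_ids:
        did_by_index_value[first_key_of[did]].append(did)
    # Only the plain index values are ordered, never (value, did) tuples.
    # Every distinct value has at least one hit, so `limit` values are
    # possibly too many but always enough.
    select = heapq.nlargest if reverse else heapq.nsmallest
    best_values = select(limit, did_by_index_value)
    candidates = []
    for value in best_values:
        if len(candidates) >= limit:
            break
        # Look the documents up in the preclassification map.
        candidates.extend(did_by_index_value[value])
    return candidates


first_key_of = {1: "eggs", 2: "ham", 3: "ham", 4: "spam"}
print(nbest_by_first_index(first_key_of, [1, 2, 3, 4], 2))
```

The returned candidates may exceed `limit` when the last value is shared by several documents; later sort stages are expected to order and truncate them.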
@volkerjaenisch |
I am also interested in checking what is blocking this one |
For the record I think I fixed my issues changing:
with
which is kind of stupid but it works |
@volkerjaenisch Do you have time and energy to work on this PR again or should we close it? Or is there someone else who wants to complete this PR? |
For about a year I have been running something like this:
and the multiple sorting issue is not a problem anymore for me. Anyway @volkerjaenisch PR looks very advanced and interesting (especially because of the attempt to introduce the usage of the |
Dear @ALL!
Sorry, but I have neither the time nor the energy to contribute to this PR
in the next months.
IMHO the code is running, it solves the issue and the tests are complete.
So there should be only superficial stuff to be done.
Cheers,
Volker
…--
=========================================================
inqbus Scientific Computing Dr. Volker Jaenisch
Hungerbichlweg 3 +49 (8860) 9222 7 92
86977 Burggen https://inqbus.de
=========================================================
Does someone want to pick up this PR to get it mergeable or do we have to close it? |
The only blocker here is the missing Contributor License Agreement AFAIK. |
This pull request adds better testing for the multicolumn sort test cases.
The expected output is 16 errors:
bettertests.log
If line 1000 of catalog.py:
is patched to
all tests are working. This patch ensures that in any case with multiple indexes self._sort_iterate_resultset is used. Ergo the errors are located in self._sort_nbest, self._sort_nbest_reverse, only.
Please confirm that the new tests are OK. Then they may act as a new baseline for finding and fixing the bug in self._sort_nbest, self._sort_nbest_reverse.
Cheers,
Volker
fixes #81