Pull request #82: sort-on multiple indexes is broken (#81)

Open: wants to merge 13 commits into base: master

Conversation

@volkerjaenisch commented Aug 24, 2019

This pull request adds better tests for the multi-column sort test cases.

The expected output is 16 errors; see the attached bettertests.log.

If line 1000 of catalog.py:

        # Choose one of the sort algorithms.
        if iterate_sort_index:
            sort_func = self._sort_iterate_index
        elif limit is None or (limit * 4 > rlen):
            sort_func = self._sort_iterate_resultset
        elif first_reverse:
            sort_func = self._sort_nbest
        else:
            sort_func = self._sort_nbest_reverse

is patched to

        # Choose one of the sort algorithms.
        if iterate_sort_index:
            sort_func = self._sort_iterate_index
        elif limit is None or second_indexes or (limit * 4 > rlen):
            sort_func = self._sort_iterate_resultset
        elif first_reverse:
            sort_func = self._sort_nbest
        else:
            sort_func = self._sort_nbest_reverse

all tests pass. This patch ensures that self._sort_iterate_resultset is used in every case with multiple indexes. Hence the errors are located only in self._sort_nbest and self._sort_nbest_reverse.

Please confirm that the new tests are OK. They may then act as a new baseline for finding and fixing the bug in self._sort_nbest and self._sort_nbest_reverse.

Cheers,
Volker

fixes #81

@d-maurer (Contributor) left a comment

I assume that you have evidence that your change in sort_nbest_reverse is necessary. As commented earlier, a symmetric change should then be necessary for the symmetric sort_nbest as well. Maybe you can design a specific test for this?

I have my doubts that the current implementations of sort_nbest and sort_nbest_reverse deliver correct results with multiple sort indexes at all. They keep only limit elements from the result, but for the decision which elements to keep they look only at the first sort index (and ignore any follow-up sort indexes); this cannot be right. I recommend designing a test where the first sort index has the same value for all hits and the values for the second sort index are not properly sorted.
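
To make the suspected flaw concrete, here is a pure-Python illustration (toy data and names are mine, not ZCatalog code): with a constant first key, truncating to limit by the first key alone keeps essentially arbitrary rows, so the subsequent full sort cannot recover the correct head of the result.

    # (first_key, second_key) pairs; the first key is equal for all hits,
    # the second key is deliberately unsorted.
    rows = [(1, 'd'), (1, 'a'), (1, 'c'), (1, 'b')]
    limit = 2

    # Correct: order by both keys, then truncate.
    expected = sorted(rows)[:limit]                    # [(1, 'a'), (1, 'b')]

    # Suspected behaviour: keep `limit` rows judged by the first key only,
    # then sort fully. Ties keep input order, so (1, 'b') is already lost.
    kept = sorted(rows, key=lambda r: r[0])[:limit]    # [(1, 'd'), (1, 'a')]
    got = sorted(kept)[:limit]                         # [(1, 'a'), (1, 'd')]

    print(expected == got)                             # False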

    @@ -907,8 +907,11 @@ def _sort_nbest_reverse(self, actual_result_count, result, rs,
            # This document is not in the sort key index, skip it.
            actual_result_count -= 1
        else:
            if n >= limit and key >= best:
                continue
            try:

@d-maurer (Contributor):

Similar code is present in the companion _sort_nbest, too. If the change is necessary here, it will almost surely be necessary there as well.

@volkerjaenisch (Author) commented Aug 28, 2019

Sorry!
The change in sort_nbest_reverse was not intended; it is just a temporary debugging hook for pdb. No changes to Catalog.py are necessary for this pull request. This pull request should only introduce better tests.

I agree with your doubts. The current use of limit in nbest/reverse seems broken to me, and this is evident since the new tests fail on this code.

A good starting point could be test_sort_on_two_reverse.
This test fails in the first iteration with b_start=0, b_size=1.

Cheers,
Volker

@d-maurer (Contributor) commented Aug 28, 2019 via email

@volkerjaenisch (Author)

@d-maurer!

Thank you for the hints. I am a bit confused about how to submit a pull request for Zope; each community has its own unspoken rules on how to proceed.

  1. What version number shall I use in changes.rst?
  2. I think the best way may be to submit a new PR with just the advanced testing? There I could also include a test like the one you recommended.

I recommend to design a test where the first sort index has the same value for all hits and the values for the second sort index are not properly sorted.

I fully agree with your hints on the correct implementation of the n-sort.
The correct algorithm should IMHO work as follows (a code sketch follows the worked example below):
Pre-sorting stage:

  • Use the first index to find 'limit' sorted results.
  • Then take the last result and get its value in the index. Add all entries that share this index value.

Data:
eggs 1
ham 2
ham 1
spam 2

Task: sort by the first, then the second column, limit = 2.

First in the index is
eggs 1
Second in the index is
ham 2
The limit is fulfilled. Now add the datasets that have the same value "ham" in the index.

result =
eggs 1
ham 2
ham 1

This is the result of the pre-sorting stage.

Sorting stage:
Now sorting over all indexes can take place (I think without code changes).
eggs 1
ham 1
ham 2

Post-sorting stage:
In the final stage the result has to be limited to the length of the expected result set.
eggs 1
ham 1
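
A rough Python sketch of these three stages on the toy data above (the data structure and names are mine, not the actual catalog code):

    # Toy data: did -> (first, second).
    data = {1: ('eggs', 1), 2: ('ham', 2), 3: ('ham', 1), 4: ('spam', 2)}
    limit = 2

    # Pre-sorting stage: take `limit` documents by the first index only,
    # then add every further document sharing the boundary value.
    by_first = sorted(data, key=lambda did: data[did][0])
    candidates = by_first[:limit]
    boundary = data[candidates[-1]][0]                # 'ham'
    candidates += [did for did in by_first[limit:]
                   if data[did][0] == boundary]       # dids 1, 2, 3

    # Sorting stage: full sort over all sort indexes.
    candidates.sort(key=lambda did: data[did])

    # Post-sorting stage: cut back to the requested length.
    result = candidates[:limit]
    print([data[did] for did in result])              # [('eggs', 1), ('ham', 1)]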

Cheers,

Volker

@d-maurer (Contributor) commented Aug 28, 2019 via email

@volkerjaenisch (Author)

Thank you for your guidance, @d-maurer!

I will come up with a combined PR with the new tests and a solution based on the three-stage sorting. But first I have to kick-start a new project, so it may take a few days.

Cheers,
Volker

@volkerjaenisch (Author)

OK, I found some time to dig into this.
I did a minimally invasive fix. The algorithms for n-best/reverse with one sort_index were not changed. The cases with multiple sort_indexes are now encapsulated into a single function which handles both the n-best and the n-best-reverse case.

Please review

Cheers,

Volker

    result = multisort(result, sort_spec)
    # we have multi index sorting
    result = self._multi_index_nbest(
        actual_result_count,

@d-maurer (Contributor):

Moving the sorting code into a separate method (which is in general a good idea) prevents actual_result_count from being updated. You must either wrap the (int) actual_result_count in a "mutable" object (i.e. one that can be updated in place) or return the updated value as part of the return value (and reassign it). ZCatalog filters out hits for which at least one of the sort indexes lacks a value; a test examining the correct actual_result_count would therefore ensure that some hits lack a sort value and verify the correct result size.
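
A minimal sketch of the "return the updated value" option (the helper name and arguments are hypothetical, not the PR's actual signature):

    # Hypothetical, simplified helper: return the corrected count together
    # with the sorted result instead of mutating the caller's local int.
    def multi_index_nbest(actual_result_count, rs, key_maps):
        result = []
        for did in rs:
            try:
                keys = tuple(key_map[did] for key_map in key_maps)
            except KeyError:
                actual_result_count -= 1      # document lacks a sort value
                continue
            result.append((keys, did))
        result.sort()
        return actual_result_count, result

    # The caller has to reassign both values:
    key_maps = [{1: 'ham', 2: 'eggs'}, {1: 2}]    # did 2 lacks a second value
    actual_result_count, result = multi_index_nbest(2, [1, 2], key_maps)
    print(actual_result_count, result)            # 1 [(('ham', 2), 1)]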

    if sort_index_length == 1:
        index_key_map = sort_index.documentToKeyMap()

@d-maurer (Contributor):

Ideally, sort_nbest and sort_nbest_reverse would be symmetric. Your change above seems to have reduced the symmetry.

You have refactored the multi-index sorting into a single method. I would go a step further and merge sort_nbest and sort_nbest_reverse into a single function (in my view those methods are also named unintuitively; the heapq module contains similar functions named nsmallest and nlargest, respectively, which would be much better names). I would use functools.partial to instantiate the proper sorting function from it.
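
A sketch of such a merge (function names and parameters are mine, not the existing ZCatalog methods):

    import heapq
    from functools import partial

    def n_sorted(index_key_map, rs, limit, reverse=False):
        """Return the `limit` smallest (or largest) (key, did) pairs."""
        select = heapq.nlargest if reverse else heapq.nsmallest
        pairs = ((index_key_map[did], did) for did in rs
                 if did in index_key_map)
        return select(limit, pairs)

    # Two symmetric entry points derived from the one implementation.
    nbest = partial(n_sorted, reverse=False)
    nbest_reverse = partial(n_sorted, reverse=True)

    key_map = {1: 'ham', 2: 'eggs', 3: 'spam'}
    print(nbest(key_map, [1, 2, 3], 2))           # [('eggs', 2), ('ham', 1)]
    print(nbest_reverse(key_map, [1, 2, 3], 2))   # [('spam', 3), ('ham', 1)]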

    except KeyError:
        # This document is not in the sort key index, skip it.
        # ToDo: Is this the correct/intended behavior???
        actual_result_count -= 1

@d-maurer (Contributor):

As actual_result_count is an int and thus immutable, this does not change the value in the caller. As mentioned before, you must do something special to get the updated value back to the caller (e.g. return the updated value as part of the return value).

    # Sort the sort index_values
    sorted_index_values = sorted(
        did_by_index_value.keys(),
        reverse=reverse)

@d-maurer (Contributor) commented Aug 29, 2019:

While this is correct, you lose the former optimization (using nbest rather than full sorting to save sorting time). This may be relevant if the first sort index is large (e.g. a timestamp-based index such as modified, effective, ...). I would avoid the full sorting and use heapq.n[smallest|largest] (depending on reverse) with limit as n. You might even use more elementary heapq functions to sort incrementally and stop as soon as enough documents have been found.
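
A sketch of that substitution, with toy stand-ins for the PR's did_by_index_value, limit, and reverse:

    import heapq

    did_by_index_value = {'ham': [2, 3], 'eggs': [1], 'spam': [4]}
    limit, reverse = 2, False

    # Instead of sorting every index value:
    #   sorted_index_values = sorted(did_by_index_value.keys(), reverse=reverse)
    # select only the `limit` boundary values:
    select = heapq.nlargest if reverse else heapq.nsmallest
    sorted_index_values = select(limit, did_by_index_value.keys())
    print(sorted_index_values)                    # ['eggs', 'ham']

The tie handling on the boundary value, discussed further below, would still be needed on top of this.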

@volkerjaenisch (Author) commented Aug 29, 2019:

I will go for a heapq implementation of the algorithm.

Cheers,
Volker

@volkerjaenisch (Author) commented Aug 29, 2019:

After digging into heapq: heapq.nsmallest/nlargest return a fixed number of elements.
So suppose limit=2, but I actually need three elements.
Result-set:
eggs 1
ham 2
ham 1

Expected Result set:
eggs 1
ham 1

I would start with a heap of length limit=2. Then I would have to extend the heap iteratively and push all the index values in again. This would cost 2 x O(n) (one for heapq.heapify and one for the pushes) for each additional item I need.

I think it may be preferable to switch to a linear search here. Finding all results whose index value equals the largest/smallest value (and which are not already in the sort body) is O(n), and this operation takes place only once.
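
A sketch of that combination on toy (first_key, did) pairs (my own illustration, not the PR's actual code):

    import heapq

    pairs = [('eggs', 1), ('ham', 2), ('ham', 3), ('spam', 4)]
    limit = 2

    # n-best on the first key only ...
    best = heapq.nsmallest(limit, pairs)
    boundary = best[-1][0]                            # 'ham'
    # ... then a single linear pass adds every tie on the boundary value.
    best += [p for p in pairs if p[0] == boundary and p not in best]
    print(best)   # [('eggs', 1), ('ham', 2), ('ham', 3)]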

Cheers,
Volker

@volkerjaenisch (Author):

Sorry for the delay. I will work on this at the weekend.

@volkerjaenisch (Author):

Finally!

I pushed a new version using heapq.

Ideally, sort_nbest and sort_nbest_reverse would be symmetric. Your change above seems to have reduced the symmetry.

You have refactored the multi-index sorting into a single method. I would go a step further and merge sort_nbest and sort_nbest_reverse into a single function (in my view those methods are also named unintuitively; the heapq module contains similar functions named nsmallest and nlargest, respectively - which would be much better names). I would use functools.partial to instantiate the proper sorting function from it.

Currently these methods utilize a bisecting scheme that is probably as efficient as heapq if the bisecting is implemented in C code. One could replace this code with heapq so that the same sorting algorithm is used throughout.

I will start doing this now.

Volker

@volkerjaenisch (Author):

My heapq code currently sorts only the index values [primitive].
If we want to handle the former bisect code, we have to sort a list of tuples [(primitive, did)].

Benchmarking a sort of 100 million values, either [primitive] or [(primitive, did)], I got:

    import heapq
    import random
    import datetime

    NUM = 100000000

    a = []
    b = []

    for i in range(NUM):
        a.append(random.random())
        b.append((random.random(), random.random()))

    start = datetime.datetime.now()
    x = heapq.nlargest(100, a)
    print((datetime.datetime.now() - start).total_seconds())

    start = datetime.datetime.now()
    y = heapq.nlargest(100, b)
    print((datetime.datetime.now() - start).total_seconds())

Output:

    8.796026
    11.152541

heapq needs about 27% more time to sort a list of tuples than a list of plain primitives.

So I am uncertain which way to go.

Cheers,
Volker

@d-maurer (Contributor)

Please review

GitHub has special UI support for review requests. It is always present at the top of the right column of a pull request. In your case, there has already been a review; in that case, there are additional review-related UI actions to the right of the review, among them "re-request review".

Using the GitHub support for review requests has the advantage that the reviewer can get a list of all of their pending requests, reducing the risk that a review is forgotten.

@icemac (Member) commented Aug 29, 2019

@volkerjaenisch Thank you for working on this pull request. To be able to merge it – once it is ready – you need to sign the contributor agreement, see https://www.zope.org/developer/becoming-a-committer.html

@d-maurer (Contributor)

I just noticed that ZCatalog's handling of actual_result_count may be unreliable in general: apparently, it is used to reduce the result length by the number of documents lacking a sort value. To determine this number correctly, all sort values of all hits must be examined, but ZCatalog tries to avoid just that with some optimizations; in those cases, actual_result_count can be too large.

I have not investigated the potential consequences. I suspect that in some situations some batching functions (e.g. "go to last page", "next page") may fail because the batching sees a hit count that includes hits lacking a sort value (and which therefore will not be delivered). Products.AdvancedQuery (my querying replacement for ZCatalog) takes a different approach: rather than filtering out hits lacking a sort value, a replacement value is used in such a case, which is either infinitely large or infinitely small (depending on parameters); this way, it avoids the potential batching problems.
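
A sketch of that substitution idea (my own toy code, not AdvancedQuery's implementation): give hits without a sort value a sentinel key that sorts last instead of dropping them, so the hit count stays stable.

    # did 2 has no sort value; it is sorted to the end rather than dropped,
    # so the reported hit count stays len(rs).
    key_map = {1: 'ham', 3: 'eggs'}
    rs = [1, 2, 3]

    missing_last = (1, None)                 # flag 1 sorts after flag 0
    result = sorted(rs, key=lambda did: (0, key_map[did]) if did in key_map
                    else missing_last)
    print(result)                            # [3, 1, 2]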

@volkerjaenisch (Author)

I agree on the problems with actual_result_count (see my ToDo statements). IMHO the handling of datasets with no index values has to be solved in the way you describe for Products.AdvancedQuery.
But I would like to separate the current issue "sorting fails" from the next issue "actual_result_count is wrong/misleading".

@d-maurer (Contributor) commented Aug 29, 2019 via email

@volkerjaenisch (Author)

Ahhh! Sorry, I missed that!

Fixed.

Cheers,

Volker

@volkerjaenisch (Author)

@volkerjaenisch Thank you for working on this pull request. To be able to merge it – once it is ready – you need to sign the contributor agreement, see https://www.zope.org/developer/becoming-a-committer.html

I am already accepted as a committer.

@d-maurer (Contributor) commented Aug 30, 2019 via email

@d-maurer (Contributor) commented Aug 30, 2019 via email

@icemac (Member) commented Aug 30, 2019

@volkerjaenisch wrote:

I am already accepted as comitter.

Hm, as a committer you should have write access to this repository, and GitHub should mention you as "Member" instead of "First-time contributor". I did not find your name in the list of members of the zopefoundation GitHub organization. Are you sure you signed a contributor agreement for Zope, not only for Plone? (Maybe you signed it very long ago, in the times of svn.zope.org, and your access rights have not been migrated.)

@d-maurer (Contributor) commented Sep 8, 2019 via email

@jensens (Member) commented Feb 3, 2021

@volkerjaenisch
So, @agitator stumbled over this problem as well, and I started looking into it.
Did you resolve the CLA problem?
What is left to do here?

@ale-rt (Member) commented Mar 26, 2021

I am also interested in finding out what is blocking this one.

@ale-rt (Member) commented Mar 26, 2021

For the record, I think I fixed my issues by changing:

    elif limit is None or (limit * 4 > rlen):

to

    elif limit is None or (limit * 10000000 > rlen):

which is kind of stupid, but it works.

@icemac (Member) commented Jul 4, 2022

@volkerjaenisch Do you have time and energy to work on this PR again or should we close it? Or is there someone else who wants to complete this PR?

@jensens (Member) commented Jul 4, 2022

@ale-rt did your PR solve this too?
cc @agitator: IIRC you have some experience here; is this still a problem?

@ale-rt (Member) commented Jul 4, 2022

@ale-rt did your PR solve this too?

For about a year now I have been running something like this:

    $ rg -N '(CMFPlone|ZCatalog)' bin/instance
      '/home/ale/.buildout/eggs/cp38/Products.CMFPlone-5.2.8-py3.8.egg',
      '/home/ale/.buildout/eggs/cp38/Products.ZCatalog-6.1-py3.8.egg',

and the multiple sorting issue is not a problem anymore for me.

Anyway, @volkerjaenisch's PR looks very advanced and interesting (especially because of the attempt to introduce the heapq module into the catalog).

@volkerjaenisch (Author) commented Jul 4, 2022 via email

@icemac (Member) commented Apr 4, 2024

Does someone want to pick up this PR to get it mergeable or do we have to close it?

@jensens (Member) commented Apr 4, 2024

The only blocker here is the missing Contributor License Agreement AFAIK.

Successfully merging this pull request may close: sort-on multiple indexes is broken (#81). 5 participants.