sort-on multiple indexes is broken#81 #82

volkerjaenisch · 2019-08-24T15:21:29Z

This pull request adds a better testing for the multicolumn sort testcases.

The expected output is 16 errors:
bettertests.log

If line 1000 of catalog.py:

        # Choose one of the sort algorithms.
        if iterate_sort_index:
            sort_func = self._sort_iterate_index
        elif limit is None or (limit * 4 > rlen):
            sort_func = self._sort_iterate_resultset
        elif first_reverse:
            sort_func = self._sort_nbest
        else:
            sort_func = self._sort_nbest_reverse

is patched to

        # Choose one of the sort algorithms.
        if iterate_sort_index:
            sort_func = self._sort_iterate_index
        elif limit is None or second_indexes or (limit * 4 > rlen):
            sort_func = self._sort_iterate_resultset
        elif first_reverse:
            sort_func = self._sort_nbest
        else:
            sort_func = self._sort_nbest_reverse

all tests are working. This patch ensures that in any case with multiple indexes self._sort_iterate_resultset is used. Ergo the errors are located in self._sort_nbest, self._sort_nbest_reverse, only.

Please confirm that the new tests are OK. Then they may act as a new baseline for finding and fixing the bug in self._sort_nbest, self._sort_nbest_reverse.

Cheers,
Volker

fixes #81

…ve of the data. New test test_sort_on_two2 to show the problem.

Bad test with limit = 10

d-maurer

I assume that you have evidence that your change in sort_nbest_reverse is necessary. As commented earlier, a symmetric change should then be necessary for the symmetric sort_nbest as well. Maybe, you can design a specific test for this?

I have my doubts that the current implementations for sort_nbest and sort_nbest_reverse already deliver correct results with multiple sort indexes. They keep only limit elements from the result but for the decision which elements to keep they look only at the first sort index (and ignore any followup sort indexes); this cannot be right. I recommend to design a test where the first sort index has the same value for all hits and the values for the second sort index are not properly sorted.

d-maurer · 2019-08-28T05:04:03Z

src/Products/ZCatalog/Catalog.py

@@ -907,8 +907,11 @@ def _sort_nbest_reverse(self, actual_result_count, result, rs,
                    # This document is not in the sort key index, skip it.
                    actual_result_count -= 1
                else:
-                    if n >= limit and key >= best:
-                        continue
+                    try:


Similar code is present in the companion _sort_nbest, too. If the change is necessary here, almost surely, it will be necessary there as well.

volkerjaenisch · 2019-08-28T08:15:09Z

Sorry!
The change in sort_nbest_reverse was not intended. Its just a temporarily debugging hook for pdb. No changes at Catalog.py are necessary for this pull request. This pull request should introduce better test-results, only.

I agree with your doubts. The current use of limit in nbest/reverse seems broken to me. And this is evidently since the new tests fail on this code.

A good starting point could be test_sort_on_two_reverse
This test fails in the first iteration b_start =0, b_size=1 .

Cheers,
Volker

d-maurer · 2019-08-28T08:44:29Z

Volker Jaenisch (PhD) wrote at 2019-8-28 01:15 -0700:

The change in sort_nbest_reverse was not intended. Its just a temporarily debugging hook for pdb. No changes at Catalog.py are necessary for this pull request. This pull request should introduce better test-results, only.

It is a good idea to add an entry in `CHANGES.rst`. This way, everybody can easily see what was changed (and what was not). Your title suggests that the PR wants to fix the broken sort implementation (not just to improve the corresponding tests). Likely, the review request came too early (I am aware that it did not come from you). Please rerequest a review once the implementation is (as far as you are aware) fixed. As discussed earlier, sorting via index iteration and the use of "nbest" can be employed even for a sort over several indexes *BUT* it must be slightly different than the implementation for a single index. E.g., for "nbest", a former sorting stage may need to produce more than `limit` candidates to allow later stages to fine tune the sorting (if necessary - i.e. if the former stages could not "separate" the latest candidates).

volkerjaenisch · 2019-08-28T09:38:24Z

@d-maurer !

Thank you for the hints. I am a bit confused on how to send in a pull request for Zope. Each community has other unspoken rules how to proceed.

What version number shall I use in changes.rst?
I think the best way may be to send in a new PR for just only the advanced testing? There I could also include a test like you recommended.

I recommend to design a test where the first sort index has the same value for all hits and the values for the second sort index are not properly sorted.

I agree fully with your hints on the correct implementation of n-sort.
The correct algorithm should IMHO work as follows:
Pre sorting stage:

Use the first index to find 'limit' sorted results.
Then take the last result, get its value in the index. Add all the entries with the same index value.

Data:
eggs 1
ham 2
ham 1
spam 2

Task: Sort after first, second column, limit = 2

First in index is
eggs 1
Second in index is
ham 2
limit is fullfilled. Now add datasets that have the same value "ham" for the index value.

result =
eggs 1
ham 2
ham 1
This is the result of the presorting stage.
Sorting stage
Now sorting over all indexes can take place (I think without code changes).
eggs 1
ham 1
ham 2
Post sorting stage
In the finale stage the result has to limited to the length of the expected result set.
eggs 1
ham 1

Cheers,

Volker

d-maurer · 2019-08-28T09:46:05Z

Volker Jaenisch (PhD) wrote at 2019-8-28 02:38 -0700:

1. What version number shall I use in changes.rst?

The newest one (the section marked with "unreleased").

2. I think the best way may be to send in a new PR for just only the advanced testing? There I could also include a test like you recommended.

It would be preferable if your PR not only added the tests but also fixed any problems: in order to merge a PR, all tests should pass -- otherwise other activities in the package would become more difficult due to failing tests.

... - Use the first index to find 'limit' sorted results. - Then take the last result, get its value in the index. Add all the entries with the same index value.

Good!

volkerjaenisch · 2019-08-28T09:52:05Z

Thank you for your guidance @d-maurer !

I will come up with a "combined" PR with the new tests, and a solution with the 3-stage sorting.
But first I have to kickstart a new project, so it may take a few days.

Cheers,
Volker

…r batching applied.

volkerjaenisch · 2019-08-29T02:09:25Z

OK. Found some time to dig into this.
I did a minimal invasive fix. The algorithms for n-best/reverse with one sort_index were not changed. The cases with multiple sort_indexes are now encapsulated into a single function which handles the n-best as well as the n-best-reverse case.

Please review

Cheers,

Volker

d-maurer · 2019-08-29T05:11:31Z

src/Products/ZCatalog/Catalog.py

-            result = multisort(result, sort_spec)
+            # we have multi index sorting
+            result = self._multi_index_nbest(
+                actual_result_count,


Moving the sorting code into a separate method (which is in general a good idea) prevents actual_result_count to be updated. You must either wrap (int) actual_result_count in a "mutable" object (i.e. updatable in place) or return the updated value as part of the return value (and reassign). ZCatalog filters out hits for which at least one of the sort indexes lacks a value - a test examining the correct actual_result_count thus would ensure that some hits lack a sort value and verify the correct result size.

d-maurer · 2019-08-29T05:22:36Z

src/Products/ZCatalog/Catalog.py

        if sort_index_length == 1:
+            index_key_map = sort_index.documentToKeyMap()


Ideally, sort_nbest and sort_nbest_reverse would be symmetric. Your change above seems to have reduced the symmetry.

You have refactored the multi-index sorting into a single method. I would go a step further and merge sort_nbest and sort_nbest_reverse into a single function (in my view those methods are also named unintuitively; the heapq module contains similar functions named nsmallest and nlargest, respectively - which would be much better names). I would use functools.partial to instantiate the proper sorting function from it.

d-maurer · 2019-08-29T05:27:39Z

src/Products/ZCatalog/Catalog.py

+            except KeyError:
+                # This document is not in the sort key index, skip it.
+                # ToDo: Is this the correct/intended behavior???
+                actual_result_count -= 1


As actual_result_count is an int and thus immutable, this does not change the value in the caller. As mentioned before, you must do something special to get the updated value to the caller (e.g. return the updated value as part or the return value).

d-maurer · 2019-08-29T05:40:28Z

src/Products/ZCatalog/Catalog.py

+        # Sort the sort index_values
+        sorted_index_values = sorted(
+            did_by_index_value.keys(),
+            reverse=reverse)


While this is correct, you lose the former optimization (use nbest rather than full sorting to save sorting time). This may be relevant if the first sort index is large (e.g. a timestamp based index such as modified, effective, ...). I would avoid the full sorting and use heapq.n[smallest|largest] (depending on reverse) with limit as n. You might even use more elementary heapq functions to sort incrementally and stop as soon as you have enough documents found.

I will go for a heapq implementation of the algorithm.

Cheers,
Volker

After digging into heapq: Heapq asumes a fixed size heap.
So if say limit=2 but I need in reality three elements
Result-set:
eggs 1
ham 2
ham 1

Expected Result set:
eggs 1
ham 1

I would start with a heap of length limit=2. Then I would have to extend the heap iteratevely and push in all the index values in again. This would cost 2 x O(n) (One for heapq.heapify and on for the pushs) for each additional item I need.

I think it may be preferable to switch to a linear search, here. Finding all results with the same index value of the largest/smallest value (not already in the sort body) is of O(n) and this operation takes place only once.

Cheers,
Volker

Sorry for the delay. Will work on this at the weekend

Finally!

I pushed a new version using heapq.

Ideally, sort_nbest and sort_nbest_reverse would be symmetric. Your change above seems to have reduced the symmetry.

You have refactored the multi-index sorting into a single method. I would go a step further and merge sort_nbest and sort_nbest_reverse into a single function (in my view those methods are also named unintuitively; the heapq module contains similar functions named nsmallest and nlargest, respectively - which would be much better names). I would use functools.partial to instantiate the proper sorting function from it.

Currently these methods utilize a bisecting scheme that is probably as effective as heapq, if the bisecting is implemented in C-Code. One can replace this code by the using heapq to use the same sorting algorithm.

I start doing this now

Volker

My heapq Code is currently sorting the index_values [primitive], only.
If we like to handle the former bisec_code we have to sort a list of tuples [(primitive, did)].

A benchmark for 100Mio values to sort [primitive] or [(primitive, did)] I got

import heapq
import random
import datetime

NUM = 100000000

a = []
b =[]

for i in range(NUM):

a.append(random.random()) b.append((random.random(), random.random()))

start = datetime.datetime.now()

x = heapq.nlargest(100, a)

now = datetime.datetime.now()
diff = now - start
print( diff.total_seconds())

start = datetime.datetime.now()

y = heapq.nlargest(100, b)

now = datetime.datetime.now()
diff = now - start

print( diff.total_seconds())

8.796026
11.152541

One will need 27% more time for heapq-sort to sort a list of tuples than only primitives.

So I am uncertain which way to go.

Cheers,
Volker

d-maurer · 2019-08-29T05:50:34Z

Please review

"github" has special UI support for review requests. It is always present at the top of the right column of a pull request. In your case, there has already been a review. In this case, there are additional review related UI actions to the right or the review, among them "rerequest review".

Using the "gibhub" support for review requests has the advantage that the reviewer can get a list of all his pending requests, reducing the risk that some review is forgotten.

icemac · 2019-08-29T05:58:35Z

@volkerjaenisch Thank you for working on this pull request. To be able to merge it – once it is ready – you need to sign the contributor agreement, see https://www.zope.org/developer/becoming-a-committer.html

d-maurer · 2019-08-29T06:14:52Z

I just noticed that ZCatalog's handling of actual_result_count may be unreliable in general: apparently, it is used to reduce the result length by the number of documents lacking a sort value. To determine this number correctly, all sort values for all hits must be examined but ZCatalog tries to avoid just this with some optimizations; in those cases, actual_result_count can be too large.

I have not investigated the potential consequences. I suspect that in some situations some batching functions (e.g. "go to last page", "next page") may fail because the batching sees a hit number including hits lacking a sort value (and which therefore will not be delivered). Products.AdvancedQuery (my querying replacement for ZCatalog) takes a different approach: rather than filtering out hits lacking a sort value, a replacement value is used in such a case which is either infinitely large or small (dependent on parameters); this way, it can avoid potential batching problems.

volkerjaenisch · 2019-08-29T07:53:59Z

I agree on the problems with actual_result_count (see my ToDo statements). IMHO the handling of datasets with no index values have to be solved in the way you describe it from Products.AdvancedQuery.
But I would like to separate the current issue "sorting fails" from the next issue "actual_result_count is wrong/misleading".

d-maurer · 2019-08-29T08:26:43Z

Volker Jaenisch (PhD) wrote at 2019-8-29 00:54 -0700:

I agree on the problems with actual_result_count (see my ToDo statements). IMHO the handling of datasets with no index values have to be solved in the way you describe it from Products.AdvancedQuery. But I would like to separate the current issue "sorting fails" from the next issue "actual_result_count is wrong/misleading".

I am okay with this approach. However, ensure that the `actual_result_count` computed in the `multi_index_sort` is passed on to the caller.

volkerjaenisch · 2019-08-29T08:58:32Z

Ahhhh! Sorry I missed that!

Fixed.

Cheers,

Volker

volkerjaenisch · 2019-08-29T19:18:57Z

@volkerjaenisch Thank you for working on this pull request. To be able to merge it – once it is ready – you need to sign the contributor agreement, see https://www.zope.org/developer/becoming-a-committer.html

I am already accepted as comitter.

d-maurer · 2019-08-30T04:57:08Z

Volker Jaenisch (PhD) wrote at 2019-8-29 12:18 -0700:

... I do not think that we can avoid a full sort (on the first index).

Sure, you can: sorting over several indexes calls for a lexicographic search where each index represents one lexicographic level. With a lexicographic search, early search levels may already determine the sort order (if they "separate" the elements). Assume that *f1* to *fn* represents the sorted list of distinct hit values for the first index; assume that *n1* to *nn* are the corresponding numbers of hits with those index values. Assume that you want the first *limit* hits correctly sorted and you have an *m* with *sum(nx for 1 <= x <= m) >= limit*, then it is sufficient to determine *f1* to *fm* to get the first *limit* hits correctly sorted -- because any hit with a first index value above *fm* will not be among the first *limit* sorted hits.

d-maurer · 2019-08-30T05:02:32Z

Volker Jaenisch (PhD) wrote at 2019-8-29 14:21 -0700:

... After digging into heapq: Heapq asumes a fixed size heap. So if say limit=2 but I need in reality three elements

You could use your *limit* as heap size (the parameter from your sort function). It might be too large but it is surely large enough.

icemac · 2019-08-30T06:00:17Z

@volkerjaenisch wrote:

I am already accepted as comitter.

Hm, as a committer you should have write access to this repository and GitHub should mention you as "Member" instead of "First-time contributor". I did not find your name in the list of the members of the zopefoundation GitHub organization. Are you sure you signed a contributor agreement for Zope not only for Plone? (Maybe you signed it very long ago in times of svn.zope.org and your access rights do not have been migrated.)

Better documentation

d-maurer · 2019-09-08T20:37:19Z

Volker Jaenisch (PhD) wrote at 2019-9-8 11:44 -0700:

volkerjaenisch commented on this pull request. ... + did_by_index_value[index_value].append(did) ... One will need 27% more time for heapq-sort to sort a list of tuples than only primitives.

You have already preclassified the documents by index value (--> `did_by_index_value`). Such a preclassification allows you to sort the index values (and get the corresponding "did"s from the preclassification map).

jensens · 2021-02-03T08:54:49Z

@volkerjaenisch
So, @agitator stumbled over this problem as well and I started looking at it.
Did you resolve the CLA problem?
Whats left to do here?

ale-rt · 2021-03-26T13:43:36Z

I am also interested in checking what is blocking this one

ale-rt · 2021-03-26T14:24:39Z

For the record I think I fixed my issues changing:

elif limit is None or (limit * 4 > rlen):

with

elif limit is None or (limit * 10000000 > rlen):

which is kind of stupid but it works

icemac · 2022-07-04T07:01:01Z

@volkerjaenisch Do you have time and energy to work on this PR again or should we close it? Or is there someone else who wants to complete this PR?

jensens · 2022-07-04T07:25:58Z

@ale-rt did your PR solve this too?
cc @agitator - IIRC you have some experience here, is this still a problem?

ale-rt · 2022-07-04T07:56:23Z

@ale-rt did your PR solve this too?

Since ~1Year I am running something like this:

$ rg -N '(CMFPlone|ZCatalog)' bin/instance 
  '/home/ale/.buildout/eggs/cp38/Products.CMFPlone-5.2.8-py3.8.egg',
  '/home/ale/.buildout/eggs/cp38/Products.ZCatalog-6.1-py3.8.egg',

and the multiple sorting issue is not a problem anymore for me.

Anyway @volkerjaenisch PR looks very advanced and interesting (especially because of the attempt to introduce the usage of the heapq module in the catalog).

volkerjaenisch · 2022-07-04T11:54:24Z

Dear @ALL! Sorry, but I have nor the time nor the energy to contribute to this PR in the next months. IMHO the Code is running, it solves the issue and the test are complete. So there should be only superficial stuff to be done. Cheers, Volker

…

-- ========================================================= inqbus Scientific Computing Dr. Volker Jaenisch Hungerbichlweg 3 +49 (8860) 9222 7 92 86977 Burggenhttps://inqbus.de =========================================================

icemac · 2024-04-04T06:31:47Z

Does someone want to pick up this PR to get it mergeable or do we have to close it?

jensens · 2024-04-04T20:28:21Z

The only blocker here is the missing Contributor License Agreement AFAIK.

volkerjaenisch added 6 commits August 12, 2019 23:57

Fixed: Can no longer sort after more than one index zopefoundation#92

697719c

New testclass Dummy2. Segments att1 index on the first and second hal…

6eaef16

…ve of the data. New test test_sort_on_two2 to show the problem.

Good test with limit = 50

8a9608c

Bad test with limit = 10

Better test data

78fd575

explicit multi column tests

bab34d7

Better output of the test context

88de79d

icemac requested a review from d-maurer August 26, 2019 05:49

icemac added the bug label Aug 26, 2019

volkerjaenisch mentioned this pull request Aug 26, 2019

sort-on multiple indexes is broken #81

Open

d-maurer requested changes Aug 28, 2019

View reviewed changes

volkerjaenisch added 2 commits August 29, 2019 02:33

Fix for zopefoundation#81 sorting with multiple indexes while limit o…

85611a5

…r batching applied.

Flake8

5686c7a

d-maurer requested changes Aug 29, 2019

View reviewed changes

Bugfix: Do return the actual_result_count from _multi_index_sort.

1ff6b5f

Flake8

87f0149

Utilizing the index to get all documents belonging to an index value.

9141f94

Better documentation

agitator mentioned this pull request Feb 3, 2021

Reverse sorting on collections produces weird results with batching plone/Products.CMFPlone#3235

Open

Merge branch 'master' into bettertests

a17c1f7

icemac mentioned this pull request Feb 10, 2021

Fix Flake8 problems. Inqbus/Products.ZCatalog#1

Open

ale-rt mentioned this pull request Mar 26, 2021

Fix reversed sorting by multiple index + limit #120

Merged

Merge branch 'master' into bettertests

cedeca3

		if sort_index_length == 1:
		index_key_map = sort_index.documentToKeyMap()

sort-on multiple indexes is broken#81 #82

Are you sure you want to change the base?

sort-on multiple indexes is broken#81 #82

Conversation

volkerjaenisch commented Aug 24, 2019 • edited by icemac Loading

d-maurer left a comment

Choose a reason for hiding this comment

d-maurer Aug 28, 2019

Choose a reason for hiding this comment

volkerjaenisch commented Aug 28, 2019 • edited Loading

d-maurer commented Aug 28, 2019 via email

volkerjaenisch commented Aug 28, 2019

d-maurer commented Aug 28, 2019 via email

volkerjaenisch commented Aug 28, 2019

volkerjaenisch commented Aug 29, 2019

d-maurer Aug 29, 2019

Choose a reason for hiding this comment

d-maurer Aug 29, 2019

Choose a reason for hiding this comment

d-maurer Aug 29, 2019

Choose a reason for hiding this comment

d-maurer Aug 29, 2019 • edited Loading

Choose a reason for hiding this comment

volkerjaenisch Aug 29, 2019 • edited Loading

Choose a reason for hiding this comment

volkerjaenisch Aug 29, 2019 • edited Loading

Choose a reason for hiding this comment

volkerjaenisch Sep 5, 2019

Choose a reason for hiding this comment

volkerjaenisch Sep 8, 2019

Choose a reason for hiding this comment

volkerjaenisch Sep 8, 2019

Choose a reason for hiding this comment

d-maurer commented Aug 29, 2019

icemac commented Aug 29, 2019

d-maurer commented Aug 29, 2019

volkerjaenisch commented Aug 29, 2019

d-maurer commented Aug 29, 2019 via email

volkerjaenisch commented Aug 29, 2019

volkerjaenisch commented Aug 29, 2019

d-maurer commented Aug 30, 2019 via email

d-maurer commented Aug 30, 2019 via email

icemac commented Aug 30, 2019

d-maurer commented Sep 8, 2019 via email

jensens commented Feb 3, 2021

ale-rt commented Mar 26, 2021

ale-rt commented Mar 26, 2021

icemac commented Jul 4, 2022

jensens commented Jul 4, 2022

ale-rt commented Jul 4, 2022

volkerjaenisch commented Jul 4, 2022 via email

icemac commented Apr 4, 2024

jensens commented Apr 4, 2024

volkerjaenisch commented Aug 24, 2019 •

edited by icemac

Loading

volkerjaenisch commented Aug 28, 2019 •

edited

Loading

d-maurer Aug 29, 2019 •

edited

Loading

volkerjaenisch Aug 29, 2019 •

edited

Loading

volkerjaenisch Aug 29, 2019 •

edited

Loading