
sort-on multiple indexes is broken #81

Open
volkerjaenisch opened this issue Aug 12, 2019 · 12 comments · May be fixed by #82
@volkerjaenisch

When doing a sort-on over multiple indexes, the _sort_iterate_resultset sort algorithm has to be used (independent of the limit settings).

The limit calculation in Catalog.py (introduced in commit 939c8c4)

def _sort_limit_arguments(self, query, sort_index, reverse, limit)

does not respect the possibility that limit is set to None in order to choose _sort_iterate_resultset.

Meanwhile, in the algorithm selection code, the test for limit is None (now uselessly) persists:

        # Choose one of the sort algorithms.
        if iterate_sort_index:
            sort_func = self._sort_iterate_index
        elif limit is None or (limit * 4 > rlen):
            sort_func = self._sort_iterate_resultset
        elif first_reverse:
            sort_func = self._sort_nbest
        else:
            sort_func = self._sort_nbest_reverse

The _sort_nbest/_sort_nbest_reverse algorithms apply the limit on the first index only, and then sort over the additional indexes on the already limited result set, which leads to incorrect results.
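The failure mode described above can be illustrated with a standalone Python sketch (not ZCatalog code): limiting on the first sort key alone, then sorting the already limited slice on both keys, keeps different records than a full multi-key sort followed by the limit.

```python
import heapq

# (primary, secondary) pairs; the primary key is constant, as with an
# index that holds the same value for every document.
records = [('att1', 7), ('att1', 3), ('att1', 9), ('att1', 1), ('att1', 5)]
limit = 3

# Broken approach (mirroring the behavior described above): limit while
# looking only at the primary key, then sort the limited slice on both keys.
top_by_primary = heapq.nsmallest(limit, records, key=lambda r: r[0])
broken = sorted(top_by_primary, key=lambda r: (r[0], r[1]))

# Correct approach: sort on the full key tuple first, then limit.
correct = sorted(records, key=lambda r: (r[0], r[1]))[:limit]

print(broken)   # [('att1', 3), ('att1', 7), ('att1', 9)] - wrong records kept
print(correct)  # [('att1', 1), ('att1', 3), ('att1', 5)]
```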

The fix for the issue is quite simple (just check whether we have more than one sort_index):

        # Choose one of the sort algorithms.
        if iterate_sort_index:
            sort_func = self._sort_iterate_index
        elif limit is None or second_indexes or (limit * 4 > rlen):
            sort_func = self._sort_iterate_resultset
        elif first_reverse:
            sort_func = self._sort_nbest
        else:
            sort_func = self._sort_nbest_reverse

I also noticed that some Plone packages, in particular plone.app.querystring, no longer even mention (and wrongly document in the code) the possibility to search/sort on more than a single index.

plone/plone.app.querystring#92

@icemac
Member

icemac commented Aug 13, 2019

Thank you for documenting the issue. Could you please create a pull request with your suggested solution?

@volkerjaenisch
Author

volkerjaenisch commented Aug 16, 2019

@icemac Thank you for taking this problem seriously.

I would love to send in a simple pull request solving this issue. But ...
IMHO the tests for zcatalog, at least those in TestCatalogSortBatch, are complete nonsense. There is already a test case for this particular use case, and it passes.

From debugging the tests I learned the following:

  • The 100 test objects are of the class Dummy and contain as data ('num') an integer value in the range 0..100 (randomly shuffled).
  • Dummy instances have the attributes att1, att2, att3 and the member functions col1, col2, col3, which ALL return fixed values. E.g. the attribute att1 always returns the string 'att1'.
class Dummy(ExtensionClass.Base):

    att1 = 'att1'
    att2 = 'att2'
    att3 = ['att3']
    foo = 'foo'

    def __init__(self, num):
        self.num = num
        if isinstance(num, int) and (self.num % 10) == 0:
            self.ends_in_zero = True

    def col1(self):
        return 'col1'

    def col2(self):
        return 'col2'

    def col3(self):
        return ['col3']

So we have the following index test data (only att1 shown for simplicity), with a fixed value of 'att1':
Obj1 : num : 73, att1 : 'att1'
Obj3 : num : 1, att1 : 'att1'
Obj2 : num : 51, att1 : 'att1'
...
Obj100 : num : 17, att1 : 'att1'

So it is no wonder that a test routine like

    def testSortLimitViaBatchingArgsRightMiddleSortOnTwoSecond(self):
        catalog = self._make_one()
        query = dict(att1='att1', sort_on=('att1', 'num'),
                     sort_order=('', 'reverse'), b_start=48, b_size=8)
        result = catalog(query)
        self.assertEqual(result.actual_result_count, 100)
        self.assertEqual([r.num for r in result], list(range(51, 43, -1)))

always passes.

query = dict(att1='att1', sort_on=('att1', 'num'))

att1='att1' selects all entries. Sorting on 'att1' is a null operation, since the index has exactly one value. Sorting on 'num' then returns a quite predictable result: list(range(51, 43, -1)).
In fact there is no real testing of multiple indexes at all.
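The null-operation claim is easy to verify in plain Python (a standalone sketch, independent of the catalog): with a constant first key, the tuple sort collapses to a sort on num alone, so such a test cannot detect a multi-index bug.

```python
# Records as (num, att1); att1 is 'att1' for every object, as in Dummy.
rows = [(73, 'att1'), (1, 'att1'), (51, 'att1'), (17, 'att1')]

by_both = sorted(rows, key=lambda r: (r[1], r[0]))  # sort_on=('att1', 'num')
by_num = sorted(rows, key=lambda r: r[0])           # sort_on='num' only

print(by_both == by_num)  # True: the constant first key cannot affect order
```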

I modified the tests to make the problem more visible:
https://github.com/Inqbus/Products.ZCatalog/tree/bettertests

I added a Dummy2 class (extending Dummy):

class Dummy2(Dummy):

    def __init__(self, num):
        self.num = num
        if isinstance(num, int) and (self.num % 10) == 0:
            self.ends_in_zero = True
        if isinstance(num, int) and self.num >= 50:
            self.att1 = 'other'

which gives all objects with num >= 50 the index value att1 = 'other'.

Obj1 : num : 73, att1 : 'other'
Obj3 : num : 1, att1 : 'att1'
Obj2 : num : 51, att1 : 'other'
...
Obj100 : num : 17, att1 : 'att1'

I also introduced two new tests, which make use of this class:

    def test_sort_on_two_limit_10(self):
        catalog = self._make_one(cls=Dummy2)
        upper = self.upper
        a = catalog(sort_on=('att1', 'num'), att1='att1',  b_start=0, b_size=10)
        self.assertEqual(len(a), 10)
        for x in range(10):
            self.assertEqual(a[x].num, x)

    def test_sort_on_two_limit_50(self):
        catalog = self._make_one(cls=Dummy2)
        upper = self.upper
        a = catalog(sort_on=('att1', 'num'), att1='att1',  b_start=0, b_size=50)
        self.assertEqual(len(a), 50)
        for x in range(50):
            self.assertEqual(a[x].num, x)

These tests are identical in structure and differ only in the batch size 'b_size' and the expected results.
b_size=10 fails.
b_size=50 works.

In both test cases the data is filtered by att1='att1', which gives us the objects with num below 50. The data is then sorted first on index 'att1', which is a null operation. Next the data should be sorted on num, giving us the objects ordered num: 0..49. Finally, a slice should yield the appropriate number of objects.
In the b_size=10 case we would expect the objects with num = 0..9; in the b_size=50 case, the objects with num = 0..49. But this is only the case for b_size=50.
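The expected semantics can be sketched as a small model (hypothetical helper, plain Python, not ZCatalog code): filter first, sort the whole match set on the key tuple, and only then apply the batch slice. Under this model the batch size can never change which objects qualify.

```python
def batched_sort(data, b_start, b_size):
    """Filter on att1 == 'att1', sort the full match set, then slice."""
    matches = [d for d in data if d[1] == 'att1']
    ordered = sorted(matches, key=lambda d: (d[1], d[0]))
    return ordered[b_start:b_start + b_size]

# Dummy2-style data: att1 is 'other' for num >= 50, 'att1' below 50.
data = [(n, 'other' if n >= 50 else 'att1') for n in range(100)]

small = batched_sort(data, b_start=0, b_size=10)
large = batched_sort(data, b_start=0, b_size=50)

print([n for n, _ in small])  # the nums 0..9
# The smaller batch is always a prefix of the larger one.
assert [n for n, _ in small] == [n for n, _ in large][:10]
```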

I volunteer to fix these problems, but I need some help from the community.

Cheers,
Volker

@jugmac00
Member

"Never trust a test unless you have seen it fail" Tim Ottinger

Hi Volker,

great to see you want to contribute.

In order to contribute to any Zope or Plone project, you have to sign the contributor agreement:
https://www.zope.org/developer/becoming-a-committer.html

Once you have done this (maybe you have already), it is a good idea to create a pull request - especially as you already wrote some code.

Creating a pull request triggers TravisCI to run all tests - so the reviewers already know the tests pass :-)

You wrote you need the help of the community. What exactly do you need help with?

I started contributing to the Zope ecosystem not that long ago and wrote a blog post about my experience and took notes of the many unwritten rules - maybe it helps:
https://jugmac00.github.io/blog/face-to-face-with-earl-zope/

Best,
Jürgen

@volkerjaenisch
Author

volkerjaenisch commented Aug 18, 2019

Dear @jugmac00 !
I sent the agreement. Thanks for the starter link. I have contributed to Collective for a while, so I hope I can deal with Zope/Plone as well.
Concerning the "help of the community":
This code is one of the most crucial parts of Zope/Plone, and I do not want to break it more than it is already broken. I suppose the n-best/n-best-reverse algorithms were not designed to be broken for multiple indexes - they simply did not take them into account. So 2/3 of the sort algorithms in ZCatalog are broken (for multiple indexes).

  • I can start with pull requests for better testing.
  • But I have no complete understanding of the code itself and (more importantly) its history. I would appreciate it if Hanno and others who have recently worked with this code would discuss it with me to make things really better.
  • And in the end we will need some hands to fix the code. This will take a lot of brain work; I cannot do it on my own.

Cheers,
Volker

@jugmac00
Member

jugmac00 commented Aug 18, 2019

@hannosch is not very active lately, but others - especially @icemac - will have a look, given time and availability, and partly also personal interest.

I did not even know that you can sort on multiple indexes. From a quick scan, it does not seem to be mentioned in the documentation: https://zope.readthedocs.io/en/latest/zopebook/SearchingZCatalog.html

Updating the documentation is something I can step in for - once this is fixed and I have some spare time.

@volkerjaenisch
Author

@jugmac00

I even did not know that you can sort on multiple indexes. From a quick scan, it does not seem to be mentioned in the documentation https://zope.readthedocs.io/en/latest/zopebook/SearchingZCatalog.html

Yep. That is the core of the problem: information leakage. The original ZCatalog supported multiple indexes. This has been forgotten, and the developers maintaining the code have not taken this feature into account. At the end of the food chain, plone.app.querystring is broken in a way that it no longer supports multiple indexes at all.
Therefore I urged the originators of the code to have a discussion. Hanno's changes (which are great for the performance and readability of the code) did damage the multi-index features of ZCatalog.

If we like to fix this mess:

  • We have to change the documentation (of this package and of Plone).
  • We have to fix the tests and the code. This is work for several people. I am one of them.

Cheers,
Volker

@d-maurer
Contributor

d-maurer commented Aug 19, 2019 via email

@volkerjaenisch
Author

volkerjaenisch commented Aug 20, 2019

Dear Community!

I need some insight:
What is/was the aim of sort_limit exactly?

From the Zope book, sort_limit is a hint to the catalog about how many datasets are expected.
It may have acted as a shortcut to terminate the query once enough datasets were found. This optimization is correct if the result set is sorted on only one index (which, as @d-maurer stated, was the state of the Zope Book: there was only one index to sort on).

Currently we have a more complex case:

  1. There was a time when sort_limit could be None.
    There are still some now useless if-statements in the code checking for sort_limit is None.
  2. In this case, I suppose, it was expected that all sort optimizations are futile and a sort over the entire result set is needed.
  3. In general, if more than one index is used, ALL result datasets have to be taken into account for sorting.
  4. Currently sort_limit is calculated from the b_start and b_size parameters in any case. This breaks (1), (2) and (3).

I propose for discussion :

  • The meaning of sort_limit has to be defined anew and well documented.
  • The default sort_limit=None shall yield the correct search/sort behavior for any combination of search or sort indexes.
  • If sort_limit is given (or zero), some optimizations may occur.
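The proposal above could be sketched as a dispatch rule (hypothetical helper name and return values, not the actual Catalog code): the exhaustive algorithm whenever the result must be exact, n-best optimizations only when sort_limit is explicitly given and a single sort index is used.

```python
def choose_sort_algorithm(sort_limit, n_sort_indexes, first_reverse):
    """Sketch of the proposed semantics (names are illustrative only)."""
    if sort_limit is None or n_sort_indexes > 1:
        # Failsafe path: sort the full result set; always correct.
        return 'iterate_resultset'
    # Single sort index with an explicit limit: n-best is safe here.
    return 'nbest' if first_reverse else 'nbest_reverse'

print(choose_sort_algorithm(None, 1, False))  # iterate_resultset
print(choose_sort_algorithm(10, 2, True))     # iterate_resultset
print(choose_sort_algorithm(10, 1, True))     # nbest
```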

I am currently working on advanced testing for ZCatalog to find the uncovered cases.

These tests are IMHO more readable

    def test_sort_on_two_reverse(self):
        catalog = self._make_one()

        reference = [
            ('geralt', 19, 'smith'),
            ('geralt', 14, 'meier'),
            ('fran', 9, 'lopez'),
            ('ethel', 18, 'smith'),
            ('ethel', 13, 'smith'),
            ('ethel', 12, 'meier'),
            ('ethel', 6, 'meier'),
            ('ethel', 1, 'lopez'),
            ('dolores', 11, 'lopez'),
            ('dolores', 10, 'lopez'),
            ('dolores', 7, 'smith'),
            ('cinder', 17, 'lopez'),
            ('cinder', 5, 'lopez'),
            ('cinder', 3, 'meier'),
            ('bert', 15, 'smith'),
            ('bert', 8, 'smith'),
            ('bert', 0, 'meier'),
            ('abel', 16, 'lopez'),
            ('abel', 4, 'meier'),
            ('abel', 2, 'meier')
        ]

        a = catalog(sort_on=('first', 'num'), all='all',
                    sort_order='reverse')
        self.check_result_limit(a, reference, self.upper, ('first', 'num'))

        for N in range(1, self.upper):
            a = catalog(sort_on=('first', 'num'), all='all',
                        sort_order='reverse', b_start=0, b_size=N)
            self.check_result_limit(a, reference, N, ('first', 'num'))

since the reference is explicitly given.

I do not like tests involving randomness. So I followed the second principle of the Zen of Python: "Explicit is better than implicit."

Some of these new tests are failing. This gives us a clean start for the work on the code.

I am not done with the transformation of the old test code to the more explicit form, so please stay tuned.

Cheers,
Volker

@d-maurer
Contributor

d-maurer commented Aug 21, 2019 via email

@volkerjaenisch
Author

@d-maurer
Thank you for shedding some light on this dark piece of Zope and the intentions the code should have.

Its main purpose is to reduce sorting time: if "sort_limit" is small, "nsmallest" is significantly faster than full sorting.
The catalog has two optimization strategies for sorting:

1. using "sort_limit"; if "sort_limit" is not specified, full sorting is required

2. choosing between sorting via the index (index small compared to the result set) or via the document values (index large compared to the result set). This is independent of "sort_limit".
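The quoted single-index optimization can be checked against the standard library (a standalone sketch): for a small sort_limit, heapq.nsmallest returns the same leading slice as a full sort without sorting everything.

```python
import heapq

values = [5, 1, 9, 3, 7, 2, 8, 4, 6]
sort_limit = 3

# nsmallest keeps only a heap of sort_limit elements, so it avoids a full
# O(n log n) sort while agreeing with the sorted list's first entries.
best = heapq.nsmallest(sort_limit, values)
print(best)  # [1, 2, 3]
assert best == sorted(values)[:sort_limit]
```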

OK. Currently (2) is broken. It is broken since it depends on sort_limit (in contradiction to your statement). The choice of the algorithm is made solely by looking at a (precalculated) "limit":

Catalog.py: 644

        # Try to deduce the sort limit from batching arguments.
        b_start, b_size, limit, sort_report_name = self._sort_limit_arguments(
            query, sort_index, reverse, limit)

This is not a "try": in _sort_limit_arguments, limit is always set to b_start + b_size. sort_limit=None is not respected.

Catalog.py :1000

        # Special first condition, as it changes post-processing.
        iterate_sort_index = (
            merge and limit is None and (
                rlen > (len(sort_index) * (rlen / 100 + 1))))

        # Choose one of the sort algorithms.
        if iterate_sort_index:
            sort_func = self._sort_iterate_index
        elif limit is None or (limit * 4 > rlen):
            sort_func = self._sort_iterate_resultset
        elif first_reverse:
            sort_func = self._sort_nbest
        else:
            sort_func = self._sort_nbest_reverse

The decision depends crucially on (sort_)limit.

  1. In general if more than one index is used ALL the result datasets have to be taken into account for sorting.
    I do not think so. In fact sorting over several indexes is almost
    identical to sorting over a single index: your sort values are
    just tuples rather than individual values.
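The quoted point, sketched in standalone Python: sorting over several indexes is single-key sorting where the sort value is the tuple of per-index values.

```python
# Documents with two indexed values, as in the reference data above.
docs = [
    {'first': 'ethel', 'num': 13},
    {'first': 'bert', 'num': 8},
    {'first': 'ethel', 'num': 1},
    {'first': 'bert', 'num': 15},
]

# Multi-index sort: compare tuples instead of individual values.
ordered = sorted(docs, key=lambda d: (d['first'], d['num']))
print([(d['first'], d['num']) for d in ordered])
# [('bert', 8), ('bert', 15), ('ethel', 1), ('ethel', 13)]
```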

Sorry, but "almost" is IMHO not good enough for a core piece of software.

The second optimization approach becomes a bit more complex, but it, too, can be adapted to several indexes (I do so in Products.AdvancedQuery).

Fine. Please help me to do this for ZCatalog.

  1. Currently sort_limit is calculated from the b_start and b_size parameters in any case. This breaks (1), (2) and (3).

I do not doubt that you have evidence that something currently breaks.
But, in principle, it is correct to compute sort_limit as b_start + b_size
(plus some small number to account for a slightly imprecise batch).

As stated before: IMHO this is not correct in principle. I can easily construct cases where this kind of "algorithm" is off by an arbitrary number of datasets. The choice of the "small number" you mention depends crucially on the nature of your data and is not a constant like Pi.

The question we have to ask ourselves is: do we want to rely on software that is 99% correct, or do we want software that is 100% reliable? Nothing against optimizations, but IMHO they have to be a user's choice to turn on.

Note that sort_limit specifies that you want at least that number of hits
correctly sorted (and do not care about the sort order of further hits).
Thus, if your batch shows hits n through m, then m is an appropriate
value for sort_limit.

This optimization is absolutely correct IF sort_limit is specified, but absolutely wrong if sort_limit is None or not specified.
IMHO, when sort_limit is not specified or is None, a failsafe variant of the algorithm has to be used.

Ceterum censeo: the tests have to be improved and better documented. 100% test coverage says nothing if your test data is crap, and undocumented tests scare away users who are trying to help us find errors.
Then the code has to be fixed to comply with the tests.

@d-maurer
Contributor

d-maurer commented Aug 22, 2019 via email

@volkerjaenisch
Author

volkerjaenisch commented Aug 24, 2019

Added a pull request for the better testing on multiple index cases: #82
