sort-on multiple indexes is broken #81
Thank you for documenting the issue. Could you please create a pull request with your suggested solution? |
@icemac Thank you for taking this problem seriously. I would love to send in a simple pull request solving this issue. But ... From debugging the tests I learned the following:
So we have the following index test data (only att1 shown for simplicity), with a fixed value of 'att1'. So it is no wonder that a test routine like `query = dict(att1='att1', sort_on=('att1', 'num'))`
always returns true: att1='att1' selects all entries, the sort_on 'att1' is a null operation since the index has exactly one value, and the sort_on 'num' then returns a quite predictable result: list(range(51, 43, -1)). I modified the tests to enhance the visibility of the problem: I added a Dummy2 class,
which assigns all objects with num >= 50 an index value of att1 = 'other' (e.g. Obj1: num: 73, att1: 'other').
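(For illustration only: a minimal sketch of what such a class could look like, based on the description above; the actual fixture code in the pull request may differ.)
```
class Dummy2:
    """Hypothetical test fixture, sketched from the description above."""

    def __init__(self, num):
        self.num = num
        # Objects with num >= 50 fall into a second att1 bucket, so
        # sorting on ('att1', 'num') is no longer a null operation.
        self.att1 = 'other' if num >= 50 else 'att1'
```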
I also introduced two new tests, which make use of this class. These tests are identical in structure and only differ by the batch size 'b_size' and the expected results. In both test cases the data is filtered by att1='att1', which gives us the objects with num below 50. Then the data is first sorted by index 'att1', which is a null operation. Then the data should be sorted by 'num', which should give us the objects sorted on num: 0..49. Finally, there should be some kind of slice to give us the appropriate number of objects. I volunteer to fix these problems, but I need some help from the community. Cheers, |
"Never trust a test unless you have seen it fail" Tim Ottinger Hi Volker, great to see you want to contribute. In order to contribute to any Zope or Plone project, you have to sign the conributor agreement: Once you have done this (maybe you have already), it is a good idea to create a pull request - especially as you already wrote some code. Creating a pull request triggers TravisCI to run all tests - so the reviewers already know the tests pass :-) You wrote you need the help of the community. What exactly do you need help with? I started contributing to the Zope ecosystem not that long ago and wrote a blog post about my experience and took notes of the many unwritten rules - maybe it helps: Best, |
Dear @jugmac00 !
Cheers, |
@hannosch is not very active lately, but others - especially @icemac - will have a look, given time, availability, and also partly personal interest. I did not even know that you can sort on multiple indexes. From a quick scan, it does not seem to be mentioned in the documentation: https://zope.readthedocs.io/en/latest/zopebook/SearchingZCatalog.html Updating the documentation is something I can step in for - once this is fixed and I have some spare time. |
Yep. That is the core of the problem: information leakage. The original ZCatalog supported sorting on multiple indexes. This has been forgotten, and the developers advancing the code have not taken this feature into account. At the end of the food chain, plone.app.querystring is broken in a way that it no longer supports multiple indexes at all. If we want to fix this mess:
Cheers, |
Jürgen Gmach wrote at 2019-8-18 15:40 -0700:
...
I even did not know that you can sort on multiple indexes. From a quick scan, it does not seem to be mentioned in the documentation https://zope.readthedocs.io/en/latest/zopebook/SearchingZCatalog.html
`ZCatalog` has learned this only fairly recently (much more
recently than the Zope book).
|
Dear Community! I need some insight.
**What is/was the aim of sort_limit exactly?**
From the [Zope book](https://zope.readthedocs.io/en/latest/zopebook/SearchingZCatalog.html) the sort_limit is a hint to the Catalog about how many datasets are expected. This may have acted as a shortcut to terminate the query when enough datasets were found. This optimization is correct if the result set is only sorted on one index (which, as @d-maurer stated, was the state of the Zope Book: there was only one index to sort on). Currently we have a more complex case:
1. There was a time when sort_limit may have been None. There are still some now useless if-statements in the code checking whether sort_limit is None.
2. In this case, I suppose, it was expected that all sort optimization is futile and a sorting over all the result sets is needed.
3. In general if more than one index is used ALL the result datasets have to be taken into account for sorting.
4. Currently sort_limit is calculated from the b_start and b_size parameters in any case. This breaks (1, 2, 3).
I propose for discussion:
- The meaning of sort_limit has to be defined anew and well documented.
- The default for sort_limit=None shall return the correct search/sort behavior for any combination of search or sort indexes.
I am currently working on advanced testing for ZCatalog to find the uncovered cases. These tests are IMHO more readable
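(For illustration of the intended "explicit" style only: a hypothetical sketch with invented fixture data, not the actual tests from the pull request.)
```
import unittest


class TestSortOnTwoIndexes(unittest.TestCase):
    """Explicit test sketch: fixture and expected result are both
    spelled out literally, without any randomness."""

    def test_sort_on_att1_then_num(self):
        # Hypothetical fixture: objects below 50 share att1='att1',
        # objects from 50 upwards get att1='other'.
        docs = [dict(num=n, att1='att1' if n < 50 else 'other')
                for n in range(100)]
        hits = [d for d in docs if d['att1'] == 'att1']
        result = sorted(hits, key=lambda d: (d['att1'], d['num']))
        # The expected reference is given explicitly.
        self.assertEqual([d['num'] for d in result], list(range(50)))


if __name__ == '__main__':
    unittest.main()
```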
since the reference is explicitly given. I do not like tests involving randomness, so I followed the second Python principle: "Explicit is better than implicit." Some of these new tests are failing. This gives us a clean start for the work on the code. I am not done with the transformation of the old test code to the more explicit form, so please stay tuned. Cheers, |
Volker Jaenisch (PhD) wrote at 2019-8-20 16:45 -0700:
...
**What is/was the aim of sort_limit exactly?**
>From the [Zope book](https://zope.readthedocs.io/en/latest/zopebook/SearchingZCatalog.html) the sort_limit is a hint to the Catalog how many datasets are expected.
This may have act as a shortcut to terminate the query when enough datasets were found. This optimization is correct if the result-set is only sorted after one index (which was as @d-maurer stated the state of the Zope-Book: There was only one index to sort after.)
Its main purpose is to reduce sorting time: If "sort_limit" is small,
"nsmallest" is significantly faster than full sorting.
Currently we have a more complex case:
1. There was a time when sort_limit may have been None.
There are still some now useless if-statements in the code checking whether sort_limit is None.
2. In this case, I suppose, it was expected that all sort optimization is futile and a sorting over all the result sets is needed.
The catalog has two optimization strategies for sorting:
1 using "sort_limit"; if "sort_limit" is not specified, full sorting
is required
2 choose between sorting via the index (index small compared to result set)
or via the document values (index large compared to result set).
This is independent from "sort_limit".
3. In general if more than one index is used ALL the result datasets have to be taken into account for sorting.
I do not think so. In fact sorting over several indexes is almost
identical to sorting over a single index: your sort values are
just tuples rather than individual values.
The second optimization approach becomes a bit more complex but it, too,
can be adapted to several indexes (I do in `Products.AdvancedQuery`).
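A plain-Python illustration of that point (hypothetical data, not ZCatalog code): the comparison key simply becomes a tuple, and the sort algorithm itself is unchanged.
```
docs = [
    {'att1': 'b', 'num': 3},
    {'att1': 'a', 'num': 2},
    {'att1': 'a', 'num': 1},
]

# Single index: the sort value is one value per document.
by_num = sorted(docs, key=lambda d: d['num'])

# Several indexes: the sort value is a tuple, compared lexicographically.
by_att1_then_num = sorted(docs, key=lambda d: (d['att1'], d['num']))
```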
4. Currently sort_limit is calculated from the b_start and b_size parameters in any case. This breaks (1,2,3)
I do not doubt that you have evidence that something currently breaks.
But, in principle, it is correct to compute `sort_limit` as `b_start + b_size`
(plus some small number to account for a slightly imprecise batch).
Note that `sort_limit` specifies that you want at least that number of hits
correctly sorted (and do not care about the sort order of further hits).
Thus, if your batch shows hits *n* through *m*, then *m* is an appropriate
value for `sort_limit`.
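In other words (hypothetical numbers, just to spell out the arithmetic):
```
# Batch window: hits b_start .. b_start + b_size - 1 (0-based).
b_start, b_size = 40, 20
sort_limit = b_start + b_size   # 60: the last hit this batch can display
# Only the first 60 hits need to be correctly sorted; the order of
# everything beyond the batch window is irrelevant for this batch.
```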
I propose for discussion :
- The meaning of sort_limit has to be defined anew and well documented.
If `sort_limit` is specified, then the first `sort_limit` elements
of the result list must be the first elements of the search result
sorted according to the sort specification; later elements can be delivered
in any order.
This is used as a sort optimization. It allows using the
`nsmallest` sort algorithm, which is considerably faster than full sorting
for small *n*.
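A small stand-alone illustration of that optimization (plain Python with heapq, not the catalog's actual code):
```
import heapq
import random

values = [random.random() for _ in range(100000)]
sort_limit = 20

# Full sort orders the entire result set.
full = sorted(values)[:sort_limit]

# nsmallest only guarantees the first sort_limit elements are correctly
# ordered, which is much cheaper when sort_limit is small.
best = heapq.nsmallest(sort_limit, values)

assert best == full
```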
- The default for sort_limit=None shall return the correct search/sort behavior for any combination of search or sort indexes.
Yes.
|
@d-maurer
OK. Currently 2 is broken. It is broken since it depends on sort_limit (in contradiction to your statement). The choice of the algorithm is done solely by looking at (a precalculated) "limit": Catalog.py: 644
This is not a try!: In _sort_limit_arguments, limit is always set to b_start + b_size. sort_limit=None is not respected. Catalog.py: 1000
The decision depends crucially on (sort_)limit
Sorry but "almost" is IMHO not good enough for a core piece of software.
Fine. Please help me to do this for ZCatalog.
As stated before, IMHO this is in principle not correct. I can easily construct cases where this kind of "algorithm" is off by an arbitrary number of datasets. The choice of the "small number" you mention depends crucially on the nature of your data and is not a constant like Pi. The question we have to ask ourselves is: Do we want to rely on software that is 99% correct, or do we want software that is 100% reliable? Nothing against optimizations, but IMHO they have to be a user's choice to turn on.
This optimization is absolutely correct IF sort_limit is specified. But absolutely wrong if sort_limit is None or not specified. **IMHO when sort_limit is not specified or None, a failsafe variant of the algorithm has to take place.** Ceterum censeo: The tests have to be improved and better documented. A 100% test coverage says nothing if your test data is crap. And undocumented tests scare away users trying to help us find errors. |
Volker Jaenisch (PhD) wrote at 2019-8-21 02:17 -0700:
@d-maurer
Thank you for shedding some light on this dark piece of Zope and the intentions the code should have.
> Its main purpose is to reduce sorting time: If "sort_limit" is small,
> "nsmallest" is significantly faster than full sorting.
> The catalog has two optimization stragegies for sorting:
>
> 1 using "sort_limit"; if "sort_limit" is not specified, full sorting
> is required
>
> 2 choose between sorting via the index (index small compared to result set)
> or via the document values (index large compared to result set).
> This is independent from "sort_limit".
OK. Currently 2 is broken. It is broken since it depends on sort_limit (in contradiction to your statement). The choice of the algorithm is done solely by looking at (a precalculated) "limit":
Well, 2 can obviously make use of an additional limit after it has decided
about the sorting approach. If implemented correctly, both
approaches will give the same result. Thus, it is only a matter
of efficiency, not correctness.
Catalog.py: 644
```
# Try to deduce the sort limit from batching arguments.
b_start, b_size, limit, sort_report_name = self._sort_limit_arguments(
query, sort_index, reverse, limit)
```
This is not a try!: In _sort_limit_arguments limit is always set to b_start + b_size. Sort_limit=None is not respected.
This might be justified (apart from forgetting special "batching"
features which mean that the batch size can vary slightly): If you use
the catalog in a "batching context", then, likely, you are only
interested in the "batching window" of the result list. From this,
you can deduce a sort limit (even if not specified).
If you are interested in a fully sorted result list, you likely
do not have batching parameters.
Whether this argument is good highly depends on how the catalog
determines the batching parameters. If it looks at
explicitly provided arguments, this is okay; if it looks at
the request (and thus at implicit arguments), it is doubtful
because a batching request might employ a secondary search
not governed by the request parameters.
Catalog.py: 1000
```
# Special first condition, as it changes post-processing.
iterate_sort_index = (
merge and limit is None and (
rlen > (len(sort_index) * (rlen / 100 + 1))))
# Choose one of the sort algorithms.
if iterate_sort_index:
sort_func = self._sort_iterate_index
elif limit is None or (limit * 4 > rlen):
sort_func = self._sort_iterate_resultset
elif first_reverse:
sort_func = self._sort_nbest
else:
sort_func = self._sort_nbest_reverse
```
The decision depends crucially on (sort_)limit
It may not fully exploit the optimization potential - but nothing
bad should happen (as explained above).
> 3. In general if more than one index is used ALL the result datasets have to be taken into account for sorting.
Do you want to say, that for sorting over several indexes "iterate_sort_index"
becomes impossible? That would not be true (`Products.AdvancedQuery`
does precisely this -- correctly, as far as I know).
Of course, "iterate_sort_index", too, uses the complete result -- just
indirectly, without explicitly looking at the document values but
using index information to deduce this value for many documents at
the same time.
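A rough, much simplified sketch of that idea (hypothetical data structures, not the catalog's implementation): walk the index in key order and emit the matching documents bucket by bucket, instead of looking up the sort value document by document.
```
def sort_via_index(index, result_ids):
    """index: mapping of sort value -> set of document ids.
    result_ids: unsorted ids of the documents matching the query."""
    result_ids = set(result_ids)
    ordered = []
    for value in sorted(index):
        # Every document in this bucket shares the same sort value,
        # so one index lookup orders many documents at once.
        ordered.extend(doc_id for doc_id in index[value]
                       if doc_id in result_ids)
    return ordered
```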
> I do not think so. In fact sorting over several indexes is almost
> identical to sorting over a single index: your sort values are
> just tuples rather than individual values.
Sorry but "almost" is IMHO not good enough for a core piece of software.
I explained the "almost" in the following sentence: replace
the single value by the value tuple; even a "core piece of software"
can make use of this kind of similarity.
> The second optimization approach becomes a bit more complex but it, too,
> can be adapted to several indexes (I do in `Products.AdvancedQuery`).
Fine. Please help me to do this for ZCatalog.
Look at `Products.AdvancedQuery` (you find a source distribution on PyPI).
`Products.AdvancedQuery` has no "sort_limit" (as it does sorting
fully lazily) but it, too, uses the two approaches "iterate_index"
and "iterate_document_values".
As stated before, IMHO this is in principle not correct.
I argued above that you are likely wrong.
I can easily construct cases where this kind of "algorithm" is off by an arbitrary number of datasets. The choice of the "small number" you mention depends crucially on the nature of your data and is not a constant like Pi.
If you perform a "batched search", then you should only be interested
in the results in the "batch window" - and then the order of hits
above this window is not relevant (which means, you can set
`sort_limit` to the first index above the window).
Do not use batching, if you are interested in results outside
the "batch window".
The question we have to ask ourselves is: Do we want to rely on software that is 99% correct, or do we want software that is 100% reliable? Nothing against optimizations, but IMHO they have to be a user's choice to turn on.
An optimization should still produce a correct result.
If it does, then there is no need that users turn it on (most users
will not have sufficient knowledge to make a good decision about
most optimizations).
*IF* the catalog determines the batching parameters implicitly
(I fear, it does), then I consider this a bug (as a batching
request might perform secondary (not batched) searches).
...
This optimization is absolutely correct IF sort_limit is specified. But absolutely wrong if sort_limit is None or not specified.
**IMHO when sort_limit is not specified or None, a failsafe variant of the algorithm has to take place.**
At least, I have no objection to drop an implicit determination
of a sort limit from implicit batching parameters.
|
Added a pull request for better testing of the multiple-index cases: #82 |
…r batching applied.
When doing a sort-on on multiple indexes, the _sort_iterate_resultset sort algorithm has to be used (independent of the limit settings).
The limit calculation in Catalog.py (introduced in commit 939c8c4)
def _sort_limit_arguments(self, query, sort_index, reverse, limit)
does not respect the possibility that limit is set to None to choose _sort_iterate_resultset.
While in the algorithm-choice code the test for limit is None (uselessly) persists:
The n_best/n_best_reverse algorithms do the limiting on the first index only, and then sort over the additional indexes on the already limited result set, which leads to incorrect results.
The fix to the issue is quite simple (just check if we have more than one sort_index); a sketch of such a check is shown below.
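(A hypothetical sketch of that check, reusing the names from the algorithm-choice code quoted earlier in this thread; the actual change in the pull request may look different.)
```
# Choose one of the sort algorithms.
if len(sort_index) > 1:
    # The n-best algorithms limit on the first index only, so with
    # multiple sort indexes always sort the full result set.
    sort_func = self._sort_iterate_resultset
elif iterate_sort_index:
    sort_func = self._sort_iterate_index
elif limit is None or (limit * 4 > rlen):
    sort_func = self._sort_iterate_resultset
elif first_reverse:
    sort_func = self._sort_nbest
else:
    sort_func = self._sort_nbest_reverse
```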
I also noticed that some Plone packages, in particular plone.app.querystring, no longer even mention (and wrongly document in the code) the possibility to search/sort on more than a single index.
plone/plone.app.querystring#92