Rr/sc 60366 sparse global order reader merge #5417

Merged: 233 commits, Feb 7, 2025


rroelke
Contributor

@rroelke rroelke commented Jan 2, 2025

The story contains more details, but in brief this pull request adds an additional mode to the sparse global order reader in which we pre-process the minimum bounding rectangles of all tiles from all fragments to determine a single global order in which all of the tiles must be loaded.

This pre-processing step is implemented using a "parallel merge" algorithm which merges the tiles from each fragment (which are arranged in global order within the fragment).

Parallel Merge

The parallel merge code lives in tiledb/common/algorithm/parallel_merge.h. It is written generically to merge streams of a copyable type T using any type which can compare T (default is std::less<T> of course). An explanation of the algorithm is provided within the file.

The top-level function parallel_merge is asynchronous, i.e. it returns a future which can be polled to see how much of the merge has already completed. This enables callers to begin processing merged data from the head of the eventual output before the tail of the eventual output has finished.
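To make the shape of the problem concrete, here is a minimal sequential sketch of merging several already-sorted streams of `T` with a pluggable comparator. It is a stand-in for illustration only: the name `merge_sorted_streams` is hypothetical, the heap-based approach is not the algorithm in `parallel_merge.h`, and it is neither parallel nor asynchronous.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Sequential k-way merge of already-sorted streams (illustrative stand-in,
// not the parallel algorithm from parallel_merge.h).
template <typename T, typename Compare = std::less<T>>
std::vector<T> merge_sorted_streams(
    const std::vector<std::vector<T>>& streams, Compare cmp = Compare{}) {
  // A cursor is (stream index, position within that stream).
  using Cursor = std::pair<std::size_t, std::size_t>;

  // std::priority_queue is a max-heap, so the comparison is inverted
  // to pop the cursor with the smallest head element first.
  auto later = [&](const Cursor& a, const Cursor& b) {
    return cmp(streams[b.first][b.second], streams[a.first][a.second]);
  };
  std::priority_queue<Cursor, std::vector<Cursor>, decltype(later)> heads(
      later);

  std::size_t total = 0;
  for (std::size_t s = 0; s < streams.size(); ++s) {
    total += streams[s].size();
    if (!streams[s].empty()) {
      heads.emplace(s, std::size_t{0});
    }
  }

  std::vector<T> out;
  out.reserve(total);
  while (!heads.empty()) {
    const Cursor c = heads.top();
    heads.pop();
    out.push_back(streams[c.first][c.second]);
    // Advance the stream we just consumed from, if it has more items.
    if (c.second + 1 < streams[c.first].size()) {
      heads.emplace(c.first, c.second + 1);
    }
  }
  return out;
}
```

In the reader's use case each stream would correspond to one fragment's tiles, which are already in global order within the fragment.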

Sparse Global Order Reader

We extend the sparse global order reader with a new configuration sm.query.sparse_global_order.preprocess_tile_merge. If nonzero, the sparse global order reader will run a parallel merge on the fragments to find the unified tile order and then use that to populate result tiles.

  • preprocess_compute_result_tile_order kicks off the parallel merge.
  • create_result_tiles_using_preprocess advances along the global tile order to create result tiles.

The fields which are used for the old "per fragment result tiles" mode have been encapsulated into their own struct to emphasize that their use does not overlap with this new mode.

create_result_tiles_using_preprocess does not need a per-fragment memory budget; instead it pulls tiles off of the globally ordered tile list until it has saturated the memory budget as much as it can.
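The budget-filling idea can be sketched as follows. This is a simplified illustration, not the reader's code: `OrderedTile`, its fields, and `take_tiles_within_budget` are hypothetical names, and the sketch assumes each tile's memory cost is known before it is loaded.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical entry in the unified global tile order.
struct OrderedTile {
  std::uint64_t fragment_idx;
  std::uint64_t tile_idx;
  std::uint64_t size_bytes;  // assumed to be known before loading
};

// Pulls tiles off the globally ordered list, starting at `cursor`, until the
// next tile would exceed the budget. Tiles cannot be skipped because the
// order must be preserved, so we stop at the first tile that does not fit.
std::vector<OrderedTile> take_tiles_within_budget(
    const std::vector<OrderedTile>& order,
    std::size_t& cursor,
    std::uint64_t budget_bytes) {
  std::vector<OrderedTile> batch;
  std::uint64_t used = 0;
  while (cursor < order.size()) {
    const OrderedTile& tile = order[cursor];
    // `used <= budget_bytes` always holds, so the subtraction cannot wrap.
    if (tile.size_bytes > budget_bytes - used) {
      break;
    }
    batch.push_back(tile);
    used += tile.size_bytes;
    ++cursor;
  }
  return batch;
}
```

Note that this sketch returns an empty batch when the very next tile is larger than the whole budget; a real reader would have to treat that case separately.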

Tiles in the unified global order are sorted by their lower bound, so the upper bounds of the tiles in the list may be out of order. To prevent cells from tile A from being emitted out of order with cells from tile B, we augment add_next_cell_to_queue to check the lower bound of the tiles which have not yet populated result tiles.
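A 1D sketch of that check, under stated assumptions: `TileBounds`, `merge_bound`, and `emittable_prefix` are hypothetical names, coordinates are plain integers, and the comparison is deliberately the conservative strict one (a cell equal to the bound is held back).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical 1D minimum bounding rectangle of a tile.
struct TileBounds {
  std::int64_t lower;
  std::int64_t upper;
};

// Lower bound of the first tile in the global order which has not been
// loaded yet, or nullopt if every tile has been loaded.
std::optional<std::int64_t> merge_bound(
    const std::vector<TileBounds>& order, std::size_t num_loaded) {
  if (num_loaded >= order.size()) {
    return std::nullopt;
  }
  return order[num_loaded].lower;
}

// Given the sorted coordinates of all loaded cells, counts how many can be
// emitted without risking that an unloaded tile holds an earlier cell.
std::size_t emittable_prefix(
    const std::vector<std::int64_t>& loaded_cells,
    std::optional<std::int64_t> bound) {
  if (!bound) {
    return loaded_cells.size();
  }
  // Everything strictly below the bound is safe to emit.
  const auto it =
      std::lower_bound(loaded_cells.begin(), loaded_cells.end(), *bound);
  return static_cast<std::size_t>(it - loaded_cells.begin());
}
```

For example, with tiles `[0, 50]`, `[10, 20]`, `[30, 60]` ordered by lower bound and only the first two loaded, the cell at coordinate 50 must wait because the third tile may contain cells starting at 30.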

The value of sm.query.sparse_global_order.preprocess_tile_merge configures the minimum amount of work that each parallel unit of the merge will do. This is so we can benchmark with different values without re-compiling; we will either want to recommend a value to customers, or choose one and flip this to a boolean.
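One way to picture a "minimum amount of work per parallel unit" is the sketch below, which splits a two-way merge into independent units of at least `min_work` output items each. Everything here is an assumption made for illustration: the interpretation of the setting as a count of output items, the names `co_rank` and `merge_in_units`, the restriction to two `int` inputs, and the split strategy itself. The units are run in a plain loop; since they touch disjoint ranges, each could run on its own thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Finds (i, j) with i + j == k such that the first k outputs of a stable
// merge of `a` and `b` (ties taken from `a` first) are a[0..i) and b[0..j).
std::pair<std::size_t, std::size_t> co_rank(
    std::size_t k, const std::vector<int>& a, const std::vector<int>& b) {
  std::size_t lo = k > b.size() ? k - b.size() : 0;
  std::size_t hi = std::min(k, a.size());
  while (lo < hi) {
    const std::size_t i = lo + (hi - lo) / 2;
    const std::size_t j = k - i;
    // If a[i] sorts no later than b[j - 1], too few items come from `a`.
    if (j > 0 && a[i] <= b[j - 1]) {
      lo = i + 1;
    } else {
      hi = i;
    }
  }
  return {lo, k - lo};
}

// Merges sorted `a` and `b` in independent units, each producing at least
// `min_work` output items (except when the total is smaller than that).
std::vector<int> merge_in_units(
    const std::vector<int>& a,
    const std::vector<int>& b,
    std::size_t min_work,
    std::size_t max_units) {
  assert(min_work > 0);
  const std::size_t total = a.size() + b.size();
  std::size_t units = std::min(max_units, total / min_work);
  if (units == 0) {
    units = 1;
  }

  std::vector<int> out(total);
  for (std::size_t u = 0; u < units; ++u) {
    // Output range [k0, k1) of this unit and the matching input ranges.
    const std::size_t k0 = total * u / units;
    const std::size_t k1 = total * (u + 1) / units;
    const auto begin = co_rank(k0, a, b);
    const auto end = co_rank(k1, a, b);
    std::merge(
        a.begin() + begin.first, a.begin() + end.first,
        b.begin() + begin.second, b.begin() + end.second,
        out.begin() + k0);
  }
  return out;
}
```

A larger `min_work` yields fewer, bigger units; a value above the total input size collapses the merge into a single unit.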

Serialization

The unified global tile order is state which must be communicated back and forth between the client and REST server. We can either serialize this whole list (16 bytes per tile across all fragments) or we can re-compute the parallel merge each time we run a submit on the REST server side. The current implementation chooses the latter, assuming that smaller messages are preferred to the additional CPU overhead. It turns out that we must serialize the tile order. The parallel merge algorithm should be deterministic, but it turns out that some aspect of the REST server state sometimes causes the subarray qualifying tile ranges to vary from one iteration to the next, which means that we cannot recompute the tile order in the same way.

Testing

Testing of all changes is augmented using rapidcheck. With this library, rather than writing some test data examples, we write properties which contain generic claims about what the expected output must look like for a given input. The rapidcheck runtime generates arbitrary inputs to the property to test our claims.
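To illustrate the property-testing style without depending on the library, the sketch below hand-rolls the idea with `<random>`: generate arbitrary sorted inputs, then check a generic claim about the output of a merge. It uses `std::merge` as the function under test and is not one of the properties from this pull request; rapidcheck itself would supply the input generation and also shrink failing inputs, which this sketch does not attempt. All function names are hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <random>
#include <vector>

// Generates a sorted vector of arbitrary length and contents.
std::vector<int> arbitrary_sorted(std::mt19937& rng) {
  std::uniform_int_distribution<int> len(0, 32);
  std::uniform_int_distribution<int> val(-100, 100);
  std::vector<int> v(static_cast<std::size_t>(len(rng)));
  for (int& x : v) {
    x = val(rng);
  }
  std::sort(v.begin(), v.end());
  return v;
}

// Property: merging two sorted inputs yields a sorted output which is a
// permutation of the concatenation of the inputs.
bool merge_property_holds(
    const std::vector<int>& a, const std::vector<int>& b) {
  std::vector<int> merged;
  std::merge(
      a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(merged));

  std::vector<int> all(a);
  all.insert(all.end(), b.begin(), b.end());

  return std::is_sorted(merged.begin(), merged.end()) &&
         merged.size() == all.size() &&
         std::is_permutation(merged.begin(), merged.end(), all.begin());
}

// Runs the property against `trials` randomly generated input pairs.
bool check_merge_property(unsigned seed, int trials) {
  std::mt19937 rng(seed);
  for (int t = 0; t < trials; ++t) {
    const std::vector<int> a = arbitrary_sorted(rng);
    const std::vector<int> b = arbitrary_sorted(rng);
    if (!merge_property_holds(a, b)) {
      return false;
    }
  }
  return true;
}
```

The point of the style is that no expected output is written by hand; the property describes what every valid output must look like.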

The parallel merge algorithm is tested in unit_parallel_merge.cc and has rapidcheck properties implemented for each step of the algorithm.

The sparse global order reader tests are in unit-sparse-global-order-reader.cc. The gist is that we have a generic function CSparseGlobalOrderFx::run which writes a set of fragments and then reads the data back in global order, comparing against an expected result. There's a fair bit of refactoring to support this. For 1D arrays we have the tests Sparse global order reader: fragment skew, fragment interleave, and fragment many overlap, which set up inputs expected to exercise some of the edge cases in the global order reader. We also add rapidcheck 1D and rapidcheck 2D tests which generate arbitrary 1D and 2D inputs respectively.

Performance Results

I still have more to do here, but things are looking pretty good... will fill in more details here as I have them. Notes are here.


TYPE: FEATURE | BUG | IMPROVEMENT
DESC: sparse global order reader determine global order of result tiles

@rroelke
Contributor Author

rroelke commented Feb 4, 2025

@ypatia @teo-tsirpanis

This probably changed a bit since the last review as I was wrangling it together for tiledb:// URIs. As far as I can tell the wrangling has been successful, and the performance results vary from "only ever so slightly worse in a way that would not worsen the experience of human interaction" to 85% faster on some longer-running queries.

I'm collecting more performance results here.

Either way I'd like to click the merge button - this is ready for a final review.

I edited the above comment regarding the major change - in brief, the serialized query state does not contain any information about the preprocess tile order. Instead we recompute the tile order and the cursor into it for each message using the query read state.

Member

@ypatia ypatia left a comment


Two small comments, but approving anyway. Can't wait to see this merged!!

tiledb/api/c_api/config/config_api_external.h (resolved)
tiledb/sm/serialization/tiledb-rest.capnp (outdated, resolved)
.gitlab-ci.yml (outdated, resolved)
@ypatia
Member

ypatia commented Feb 6, 2025

CI looks great! The Windows GCS issue is a baseline one, Theodore is trying to fix it here.
Small reminder to remove the capnp changes.

@rroelke
Contributor Author

rroelke commented Feb 6, 2025

@ypatia

I did manage to find a bug today (inspired by SC-61471).

In brief, the recent round of changes is all about the merge bound. If duplicates are not allowed, then the merge bound comparison must be a strict inequality, not <= as it is now. If a loaded tile contains a coordinate which is equal to the merge bound, then it is incorrect to emit it when duplicates are off, since the next un-loaded tile might also contain that coordinate.
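The comparison described above can be sketched as a small predicate. This is an illustration with hypothetical names (`may_emit`, a 1D integer coordinate), not the reader's actual code.

```cpp
#include <cstdint>
#include <optional>

// Decides whether a loaded cell at `coord` may be emitted given the lower
// bound of the next tile which has not been loaded yet (if any).
bool may_emit(
    std::int64_t coord,
    std::optional<std::int64_t> merge_bound,
    bool allow_dups) {
  if (!merge_bound) {
    // Every tile is loaded, so nothing can precede this cell.
    return true;
  }
  // With duplicates allowed, a cell equal to the bound can be emitted now and
  // an equal cell from the next tile can follow it. Without duplicates, the
  // next tile might hold the same coordinate and the two must be
  // de-duplicated together, so the cell has to wait.
  return allow_dups ? coord <= *merge_bound : coord < *merge_bound;
}
```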

This change was pretty small in the reader but propagated to the tests as the addition of the new merge bound duplication test, as well as randomizing allow_dups in some of the existing rapidcheck tests.

@rroelke rroelke merged commit 3c617e3 into main Feb 7, 2025
59 checks passed
@rroelke rroelke deleted the rr/sc-60366-sparse-global-order-reader-merge branch February 7, 2025 02:15
4 participants