Improve the performance of CAGRA new vector addition with the default params #569
base: branch-25.02
Conversation
/ok to test
Thanks @enp1s0 for the PR! It looks good, but we should discuss whether we need a new parameter, or whether we could utilize the workspace memory resource for controlling the batch size.

The raft resource handle has a workspace memory resource, set up as an `rmm::mr::limiting_resource_adaptor`. The intended usage of the workspace is to query its size and adjust our internal batching parameter accordingly. Here is an example of how we can query the available workspace size using `get_workspace_free_bytes`. Afterwards, we need to use the workspace memory resource to allocate the temporary buffers:
cuvs/cpp/src/neighbors/ivf_pq/ivf_pq_search.cuh, lines 678 to 684 in bd603a9:

```cpp
auto mr = raft::resource::get_workspace_resource(handle);
// Maximum number of query vectors to search at the same time.
const auto max_queries = std::min<uint32_t>(std::max<uint32_t>(n_queries, 1), kMaxQueries);
auto max_batch_size = get_max_batch_size(handle, k, n_probes, max_queries, max_samples);
rmm::device_uvector<float> float_queries(max_queries * dim_ext, stream, mr);
```
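For illustration, here is a minimal sketch of deriving a batch size from the free workspace bytes rather than from a fixed parameter. The helper name `pick_batch_size` and the `data_size_per_vector` argument are hypothetical, mirroring the diff discussed below:

```cpp
#include <raft/core/resources.hpp>
#include <raft/core/resource/device_memory_resource.hpp>

#include <algorithm>
#include <cstddef>

// Pick how many new vectors to process per batch so that the temporary
// buffers fit into the available workspace memory.
inline std::size_t pick_batch_size(raft::resources const& handle,
                                   std::size_t data_size_per_vector,
                                   std::size_t num_new_vectors)
{
  const std::size_t free_bytes = raft::resource::get_workspace_free_bytes(handle);
  // Always process at least one vector, even if the workspace is very small.
  const std::size_t max_fit = std::max<std::size_t>(free_bytes / data_size_per_vector, 1);
  return std::min(max_fit, num_new_vectors);
}
```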
Users who need fine control over temporary allocations can set the workspace allocator explicitly:
cuvs/examples/cpp/src/cagra_example.cu, line 76 in bd603a9:

```cpp
// raft::resource::set_workspace_to_pool_resource(dev_resources, 2 * 1024 * 1024 * 1024ull);
```
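With that call uncommented, a minimal sketch of capping the workspace with a 2 GiB pool looks like this (assuming only the raft headers below):

```cpp
#include <raft/core/device_resources.hpp>
#include <raft/core/resource/device_memory_resource.hpp>

int main()
{
  raft::device_resources dev_resources;

  // Back the workspace with a 2 GiB pool; temporary buffers that the library
  // draws from the workspace resource are then capped at this limit.
  raft::resource::set_workspace_to_pool_resource(dev_resources, 2 * 1024 * 1024 * 1024ull);

  // ... build / extend / search the CAGRA index using dev_resources ...
  return 0;
}
```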
Note that by default the allocation limit is 1/4 of the total GPU memory, which is potentially much larger than the default for `max_working_device_memory_size_in_megabyte` introduced in this PR.
Thank you @tfeher for the review. I changed the code to use `raft::resource::get_workspace_free_bytes`.
Thanks Hiroyuki for the update! I have one more question below.
```cpp
@@ -68,7 +69,12 @@ void add_node_core(
    new_size,
    raft::resource::get_cuda_stream(handle));

  const std::size_t max_chunk_size = 1024;
  const std::size_t data_size_per_vector =
    sizeof(IdxT) * base_degree + sizeof(DistanceT) * base_degree + sizeof(T) * dim;
```
In case the `additional_dataset` is in host memory, the `batch_load_iterator` will need to create a temporary buffer in device memory for the batch. Should this be included here as an additional `sizeof(T) * dim` term?
Thank you @tfeher for the suggestion. Yes, the term for `batch_load_iterator` should be added. I fixed the code.
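For reference, a sketch of what the corrected footprint presumably looks like (variable names follow the diff above; the exact fix is in the PR):

```cpp
// Per-vector footprint of the temporary buffers used while extending:
// neighbor indices, neighbor distances, the vector itself, plus the
// device-side staging buffer that batch_load_iterator allocates when
// additional_dataset resides in host memory.
const std::size_t data_size_per_vector =
  sizeof(IdxT) * base_degree +       // neighbor index buffer
  sizeof(DistanceT) * base_degree +  // neighbor distance buffer
  sizeof(T) * dim +                  // the new vector itself
  sizeof(T) * dim;                   // batch_load_iterator staging buffer
```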
This PR updates the default chunk size of the CAGRA graph extension and adds a knob to control the batch size of the CAGRA searches run inside it, for better throughput.
The default chunk size is 1 in the current implementation because of a potential low-recall problem with large chunk sizes: no edges are created among nodes within the same chunk. However, my investigation shows that the low-recall problem rarely occurs even with large chunk sizes.
Search performance
The performance was measured after applying the bugfix from #565.
[Benchmark plots: degree = 32 and degree = 64.]
(I don't know why the performance is unstable on NYTimes.)
So in this PR I increase the default chunk size to the number of new dataset vectors for better throughput. I also expose a public knob in the `extend` function to control the search batch size, balancing throughput against memory consumption.
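As a usage sketch, extending an index with the new knob might look like the following. The field name `max_chunk_size` is an assumption for illustration; check `cagra::extend_params` for the actual parameter exposed by this PR:

```cpp
#include <cuvs/neighbors/cagra.hpp>

#include <raft/core/device_mdspan.hpp>
#include <raft/core/device_resources.hpp>

#include <cstdint>

// Extend a CAGRA index with new vectors; the batch-size knob trades
// throughput against temporary device memory consumption.
void add_vectors(raft::device_resources const& dev_resources,
                 cuvs::neighbors::cagra::index<float, uint32_t>& index,
                 raft::device_matrix_view<const float, int64_t> additional_dataset)
{
  cuvs::neighbors::cagra::extend_params params;
  params.max_chunk_size = 0;  // assumed knob: 0 lets the library pick the chunk size
  cuvs::neighbors::cagra::extend(dev_resources, params, additional_dataset, index);
}
```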