[QST]: DASK AND CUGRAPH #4831

Open
williamcolegithub opened this issue Dec 12, 2024 · 4 comments

Labels
question Further information is requested

Comments

@williamcolegithub

williamcolegithub commented Dec 12, 2024

What is your question?

Hello! For the life of me I cannot get SLURMCluster Dask and cuGraph to cooperate. I can get many configurations of SLURMCluster Dask and cuDF to work, but cuGraph gives me various errors: a generic cuFile error, modules that don't exist, code that runs indefinitely, etc. All existing documentation appears to use LocalCUDACluster, which does not work for my setup. Is LocalCUDACluster even truly multi-node + multi-GPU, or just multi-GPU?

I know my environments are consistent and up to date.

I'm looking for any better examples, or I'm happy to hop on a quick call.
Thank you!!!!!

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open issues and have found no duplicates for this question
williamcolegithub added the question (Further information is requested) label Dec 12, 2024
@jnke2016
Contributor

jnke2016 commented Dec 17, 2024

@williamcolegithub thank you for reaching out. Can you provide more information on how you set up your cluster, please?

All existing documentation appears to use LocalCUDACluster

LocalCUDACluster only supports single-node multi-GPU, so if you want to run across multiple nodes you will need to start each worker with a CLI command like dask-cuda-worker, and start the scheduler on one of your nodes with dask-scheduler.
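
For example, a minimal sketch (the scheduler file path below is a placeholder; it just needs to live on storage shared by all nodes):

# on one node: start the scheduler and write its connection info to a shared file
dask-scheduler --scheduler-file /shared/path/scheduler.json

# on every GPU node: start a GPU-aware worker pointed at the same file
dask-cuda-worker --scheduler-file /shared/path/scheduler.json

Your client code can then connect using that same scheduler file.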

@williamcolegithub
Author

williamcolegithub commented Dec 19, 2024

@jnke2016 I see! OK, I will reach out to my Slurm team. Yes, I have tried dask-cuda-worker and it resulted in a failure to connect with the nanny, so I have been using dask-worker.
From my perspective the documentation was not clear that dask-cuda-worker was essential; it appeared optional. Thank you for clarifying and for the fast reply. I will reach out if issues persist.

--- By any chance, is there a distributed notebook you all recommend? I only find examples using LocalCUDACluster, even in notebooks that claim to be multi-node.

@quasiben
Member

We have examples of how to deploy with SLURM on the HPC Deployment page:
https://docs.rapids.ai/deployment/stable/hpc/

If you run into trouble, please ping here.

cc @jacobtomlinson

As for cuGraph notebooks, that is a great question. @acostadon / @jnke2016 / @rlratzel, is that something you know about?

rlratzel self-assigned this Dec 20, 2024
rapids-bot bot pushed a commit that referenced this issue Jan 7, 2025
…sters for cuGraph (#4838)

This PR adds utility scripts and initial docs for managing multi-GPU Dask clusters for cuGraph, aimed at helping the situation described in [this issue](#4831).

These scripts are taken from internal tools used for MNMG testing and have been modified to be more generalized for use by the community.

Authors:
  - Rick Ratzel (https://github.com/rlratzel)
  - Don Acosta (https://github.com/acostadon)

Approvers:
  - Brad Rees (https://github.com/BradReesWork)
  - Don Acosta (https://github.com/acostadon)
  - Joseph Nke (https://github.com/jnke2016)

URL: #4838
@rlratzel
Contributor

rlratzel commented Jan 7, 2025

is there a distributed notebook you all recommend? I only find examples using LocalCUDACluster

I'm not aware of any notebook examples that are multi-node (and we should fix that in our notebooks that claim to be multi-node). I did, however, recently commit some scripts which can be used for multi-node cuGraph workflows that use Dask:

https://github.com/rapidsai/cugraph/tree/branch-25.02/scripts/dask

Those scripts will launch the Dask scheduler and worker processes. Once you have those running, you should be able to use the start_dask_client and stop_dask_client helper functions (linked here and here), with the SCHEDULER_FILE env var set to the scheduler file used by the scheduler and workers, to create and tear down a Dask client in your code.

Here's an example:

import cugraph
import dask_cudf
import cugraph.dask as dask_cugraph

from cugraph.testing.mg_utils import start_dask_client, stop_dask_client


if __name__ == "__main__":
    # Must have SCHEDULER_FILE env var set to path of generated scheduler file.
    (client, cluster) = start_dask_client()

    input_data_path = "/data/22/graph500-22.e"
    blocksize = dask_cugraph.get_chunksize(input_data_path)

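    # Read the edge list into a distributed dask_cudf DataFrame, partitioned by blocksize.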
    e_list = dask_cudf.read_csv(input_data_path,
                                blocksize=blocksize,
                                delimiter=" ",
                                names=["src", "dst"],
                                dtype=["int32", "int32"],
                                )

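    # Build a cuGraph Graph from the distributed edge list.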
    G = cugraph.Graph()
    G.from_dask_cudf_edgelist(e_list,
                              source="src",
                              destination="dst",
                              )

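    # Run multi-GPU PageRank on the distributed graph.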
    results = dask_cugraph.pagerank(G)

    stop_dask_client(client, cluster)
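
For reference, a run might look something like this (the script name and scheduler file path are placeholders; SCHEDULER_FILE must point to the scheduler file generated when the cluster was started):

SCHEDULER_FILE=/shared/path/scheduler.json python mg_pagerank_example.py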

Hope this helps.
