POC: Enable Proton for XPU #2635

Draft: wants to merge 19 commits into main

@anmyachev (Contributor) commented Nov 5, 2024

No need to review yet.

Closes #1145

On LTS, Proton doesn't work: zelTracerCreate returns 2013265921 (0x78000001, ZE_RESULT_ERROR_UNINITIALIZED). This looks related to #1953.
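The decimal code from the log is easier to match against <level_zero/ze_api.h> in hex. The sketch below redeclares the two ze_result_t values it needs (copied from ze_api.h) so it compiles without the Level Zero headers; describeZeResult is a hypothetical helper, not part of Level Zero.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Values mirror ze_result_t in <level_zero/ze_api.h>; redeclared locally so
// this sketch builds without the Level Zero SDK.
constexpr std::uint32_t kZeResultSuccess = 0x0;
constexpr std::uint32_t kZeResultErrorUninitialized = 0x78000001;

// The decimal value reported by zelTracerCreate on LTS is exactly this code.
static_assert(kZeResultErrorUninitialized == 2013265921u,
              "2013265921 == 0x78000001");

// Hypothetical helper: render a raw ze_result_t as hex plus a symbolic name.
inline std::string describeZeResult(std::uint32_t code) {
  char hex[16];
  std::snprintf(hex, sizeof(hex), "0x%08X", static_cast<unsigned>(code));
  const char *name = "unknown";
  if (code == kZeResultSuccess) {
    name = "ZE_RESULT_SUCCESS";
  } else if (code == kZeResultErrorUninitialized) {
    name = "ZE_RESULT_ERROR_UNINITIALIZED";
  }
  return std::string(hex) + " (" + name + ")";
}
```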

12/12 status update:

The most important problem is the lack of an interface for registering callback functions that are invoked on entry to and exit from kernel-launch calls. Proton's CUPTI and Roctracer backends use this interface to register kernel calls (via callbackData->correlationId) in order to build a call tree and, when the profiler shuts down, to check that every record created for a launched kernel has been moved from the backend profiler's storage into Proton's own storage structures. The current workaround uses the interface <level_zero/layers/zel_tracing_api.h>, which lets user functions be registered as prologue_callbacks/epilogue_callbacks pairs for various events. However, obtaining, inside these callbacks, the identifier of the record that PTI will create for the registered kernel is still unsolved (tests currently use a manually selected identifier).
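A minimal sketch of the bookkeeping described above, assuming a backend that gives every kernel launch a numeric correlation ID: the prologue links that ID to Proton's scope ID, the epilogue marks the launch as submitted, and the activity-record handler marks it as completed. CorrelationTracker and its methods are hypothetical names, not Proton's or PTI's API. The gap on XPU is visible here: a Level Zero prologue does not know the ID that PTI will later put on the record, so correlate() cannot be called with the right key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical correlation bookkeeping (illustrative, not Proton's code).
class CorrelationTracker {
 public:
  // Prologue: remember which Proton scope this launch belongs to.
  void correlate(std::uint64_t correlationId, std::uint64_t externId) {
    std::lock_guard<std::mutex> lock(mu_);
    corrToExtern_[correlationId] = externId;
  }

  // Epilogue: the launch has been submitted to the device.
  void submit(std::uint64_t correlationId) {
    std::lock_guard<std::mutex> lock(mu_);
    maxSubmitted_ = std::max(maxSubmitted_, correlationId);
  }

  // Activity record arrived: return the linked scope ID (0 if unknown).
  std::uint64_t complete(std::uint64_t correlationId) {
    std::lock_guard<std::mutex> lock(mu_);
    maxCompleted_ = std::max(maxCompleted_, correlationId);
    auto it = corrToExtern_.find(correlationId);
    if (it == corrToExtern_.end()) return 0;
    std::uint64_t externId = it->second;
    corrToExtern_.erase(it);
    return externId;
  }

  // Shutdown check: has every submitted launch produced a record?
  // Assumes correlation IDs increase monotonically.
  bool allFlushed() const {
    std::lock_guard<std::mutex> lock(mu_);
    return maxCompleted_ >= maxSubmitted_;
  }

 private:
  mutable std::mutex mu_;
  std::unordered_map<std::uint64_t, std::uint64_t> corrToExtern_;
  std::uint64_t maxSubmitted_ = 0;
  std::uint64_t maxCompleted_ = 0;
};
```

Comparing only the maximum IDs is a simplification; tracking the set of outstanding IDs would avoid the monotonicity assumption at the cost of more state.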

How does it work in CUPTI? (simplified)

void CuptiProfiler::CuptiProfilerPimpl::callbackFn(void *userData,
                                                   CUpti_CallbackDomain domain,
                                                   CUpti_CallbackId cbId,
                                                   const void *cbData) {
  // ...
  // (runtime/driver API domains)
  auto *callbackData = static_cast<const CUpti_CallbackData *>(cbData);
  if (callbackData->callbackSite == CUPTI_API_ENTER) {
    // open a scope for this launch
    auto scopeId = threadState.record();
    threadState.enterOp(scopeId);
    // ...
    // link the backend's correlation ID to the external ID that Proton uses
    profiler.correlation.correlate(callbackData->correlationId, numInstances);
  } else if (callbackData->callbackSite == CUPTI_API_EXIT) {
    // ...
    // close the scope
    threadState.exitOp();
    // mark the launch as submitted so the final flush waits for its record
    profiler.correlation.submit(callbackData->correlationId);
  }
  // ...
}

void CuptiProfiler::CuptiProfilerPimpl::doStart() {
  // `subscriber` is a `CUpti_SubscriberHandle`
  cupti::subscribe<true>(&subscriber, callbackFn, nullptr);            // no PTI equivalent
  cupti::activityEnable<true>(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);  // PTI equivalent exists
  cupti::activityRegisterCallbacks<true>(allocBuffer, completeBuffer); // PTI equivalent exists
  // `setGraphCallbacks` does something like:
  //   CALLBACK_ENABLE(CUPTI_CBID_RESOURCE_GRAPHNODE_CREATED);
  //   CALLBACK_ENABLE(CUPTI_CBID_RESOURCE_GRAPHNODE_CLONED);
  //   ...
  setGraphCallbacks(subscriber, /*enable=*/true);                      // no PTI equivalent
  // `setRuntimeCallbacks` does something like:
  //   CALLBACK_ENABLE(CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020);
  //   CALLBACK_ENABLE(CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000);
  //   ...
  setRuntimeCallbacks(subscriber, /*enable=*/true);                    // no PTI equivalent
  // `setDriverCallbacks` does something like:
  //   CALLBACK_ENABLE(CUPTI_DRIVER_TRACE_CBID_cuLaunch);
  //   CALLBACK_ENABLE(CUPTI_DRIVER_TRACE_CBID_cuLaunchGrid);
  //   ...
  setDriverCallbacks(subscriber, /*enable=*/true);                     // no PTI equivalent
}
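The per-thread scope bookkeeping in callbackFn (threadState.record(), enterOp(), exitOp()) is what attaches each launch to the user scope that is open at the time, producing a call tree. Below is a minimal single-threaded sketch of that idea; ThreadScopeState is a hypothetical class that only mirrors the method names in the snippet and is not Proton's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical per-thread scope state: every entered scope becomes a child
// of the innermost scope that is currently open.
class ThreadScopeState {
 public:
  static constexpr std::uint64_t kRoot = 0;  // implicit root of the tree

  // Allocate a fresh scope ID (IDs start at 1; 0 is reserved for the root).
  std::uint64_t record() { return ++nextId_; }

  // Open a scope under the innermost open scope and remember its parent.
  void enterOp(std::uint64_t scopeId) {
    std::uint64_t parent = stack_.empty() ? kRoot : stack_.back();
    parentOf_[scopeId] = parent;
    stack_.push_back(scopeId);
  }

  // Close the innermost open scope.
  void exitOp() {
    assert(!stack_.empty() && "exitOp without matching enterOp");
    stack_.pop_back();
  }

  // Parent edge of the call tree; throws std::out_of_range if never entered.
  std::uint64_t parentOf(std::uint64_t scopeId) const {
    return parentOf_.at(scopeId);
  }

  std::size_t depth() const { return stack_.size(); }

 private:
  std::uint64_t nextId_ = 0;
  std::vector<std::uint64_t> stack_;
  std::unordered_map<std::uint64_t, std::uint64_t> parentOf_;
};
```

With Level Zero tracing, the same calls would sit in the prologue/epilogue pair of the launch event instead of in the CUPTI_API_ENTER/CUPTI_API_EXIT branches.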

Still to be done:

  • Figure out how to obtain the PTI record identifiers inside the Level Zero callbacks, or find an approach that does not need them.
  • Synchronize the device so that all records are flushed correctly.
  • Decide how to handle the concept of CUDA graph kernels.
  • Obtain the device architecture.
  • Enable unit tests.
    • test_api.py
    • test_lib.py
    • test_profile.py (partially)
    • test_viewer.py
  • Enable tutorials.
  • Final code cleanup.
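For the device-synchronization item, one possible shutdown sequence is: synchronize the device so every kernel has finished, then repeatedly ask the tracing backend to deliver buffered records until the bookkeeping reports that every submitted launch has a record. The sketch keeps all three steps as caller-supplied callbacks; flushWithRetries is a hypothetical name, and which PTI/Level Zero calls would back syncDevice and flushBackend (for example, PTI's ptiFlushAllViews for the latter) is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>

// Hypothetical shutdown flush. Returns true once allRecordsWritten() holds,
// false if it still fails after maxRetries + 1 flush attempts.
inline bool flushWithRetries(const std::function<void()> &syncDevice,
                             const std::function<void()> &flushBackend,
                             const std::function<bool()> &allRecordsWritten,
                             std::size_t maxRetries,
                             std::chrono::milliseconds pollInterval) {
  syncDevice();  // all kernels must finish before their records can exist
  for (std::size_t attempt = 0;; ++attempt) {
    flushBackend();  // ask the backend to hand over buffered records
    if (allRecordsWritten()) return true;
    if (attempt == maxRetries) return false;
    std::this_thread::sleep_for(pollInterval);  // give delivery time to finish
  }
}
```

Synchronizing first matters because a backend flush can only hand over records for kernels that have already completed.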

Successfully merging this pull request may close these issues.

[Profiler] Enable Triton Proton for the Intel GPU's