Add MI300X Runners to AMD workflow. #149
base: main
Conversation
@@ -13,10 +13,18 @@ on:

 jobs:
   run:
-    runs-on: [amdgpu-mi250-x86-64]
+    runs-on: ${{ matrix.runs-on }}
+    strategy:
I don't think this is the way we want to handle multiple runners; afaics, this will always run the given code sample on both types of AMD cards. Instead, the card should be selectable by an input argument that is given to the job.
Both runners should be registered in src/discord-cluster-manager/consts.py, and
https://github.com/gpu-mode/discord-cluster-manager/blob/main/src/discord-cluster-manager/cogs/github_cog.py#L35-L40
needs to be updated
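One way to make the card selectable, as suggested above, is a `workflow_dispatch` input that maps to the runner label. This is only a sketch; the input name `gpu`, the option values, and the `amdgpu-mi300-x86-64` label are assumptions and would need to match whatever labels end up registered in consts.py:

```yaml
on:
  workflow_dispatch:
    inputs:
      gpu:
        description: "Which AMD card to run on"
        required: true
        type: choice
        options: [mi250, mi300]

jobs:
  run:
    # Hypothetical label mapping using the GitHub Actions ternary idiom
    # (condition && value-if-true || value-if-false).
    runs-on: ${{ inputs.gpu == 'mi300' && 'amdgpu-mi300-x86-64' || 'amdgpu-mi250-x86-64' }}
```

With this shape the bot dispatches one job on exactly one card per request, rather than fanning out over a matrix.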
           python3 .github/workflows/runner.py
           cat result.json # Debug: show output

       - name: Upload training artifacts
         uses: actions/upload-artifact@v4
         if: always()
         with:
-          name: run-result
+          name: run-result-${{ matrix.name }}
This name needs to stay unchanged so the bot knows how to read the results.
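To illustrate the point of the comment above: a consumer that looks up the artifact by an exact, hard-coded name will not find a matrix-suffixed one. This is a hypothetical sketch, not the bot's actual lookup code; the function name and artifact dicts are invented for illustration:

```python
def find_artifact(artifacts, expected_name="run-result"):
    """Return the first artifact whose name matches exactly, or None.

    Mimics a consumer that reads results via a hard-coded artifact name.
    """
    for artifact in artifacts:
        if artifact["name"] == expected_name:
            return artifact
    return None


# With the original fixed name, the lookup succeeds.
assert find_artifact([{"name": "run-result"}]) is not None

# With a matrix-suffixed name (e.g. "run-result-mi300"), the exact
# match fails, so the bot can no longer read the results.
assert find_artifact([{"name": "run-result-mi300"}]) is None
```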
Description
This commit adds MI300 nodes to the current AMD workflow matrix.
What dependencies do we expect to be installed on the runners? Right now, it is a fresh Ubuntu + ROCm + Python image on the runner. From the current setup, it looks like we expect dependencies like torch and pytorch-triton-rocm to be requested by the user. Let me know if that is not the case.
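As an example of what "requested by the user" could look like in practice, a submission might pin its ROCm wheels explicitly. The index URL, ROCm version, and package set below are assumptions for illustration, not something the workflow currently enforces:

```
# Hypothetical requirements.txt for a user submission targeting ROCm.
--index-url https://download.pytorch.org/whl/rocm6.2
torch
pytorch-triton-rocm
```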
I tested a couple of times on the AMD workflow here: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/12902691574
Checklist
Before submitting this PR, ensure the following steps have been completed:
- Run /verifyruns on your own server. (Runs may take a little longer. The Modal run is typically quick.)
For more information on running a cluster bot on your own server, see README.md.