Add MI300X Runners to AMD workflow. #149

saienduri · 2025-01-22T06:50:37Z

Description

This commit adds MI300 nodes to the current AMD workflow matrix.
What dependencies do we expect to be installed on the runners?
Right now, the runner uses a fresh Ubuntu + ROCm + Python image.
From the current setup, it looks like dependencies such as torch and pytorch-triton-rocm are expected to be requested by the user.
Let me know if that is not the case.

I tested this a couple of times on the AMD workflow here (https://github.com/gpu-mode/discord-cluster-manager/actions/runs/12902691574).

Checklist

Before submitting this PR, ensure the following steps have been completed:

Run the slash command /verifyruns on your own server.
- Run the cluster bot on your server:
```
python discord-bot.py
```
- Start training runs with the slash command /verifyruns.
- Verify that the bot eventually responds with:
```
✅ All runs completed successfully!
```
  (It may take a few minutes for all runs to finish. In particular, the GitHub
  runs may take a little longer. The Modal run is typically quick.)
  For more information on running a cluster bot on your own server, see
  README.md.

ngc92 · 2025-01-22T09:52:37Z

.github/workflows/amd_workflow.yml

@@ -13,10 +13,18 @@ on:

 jobs:
  run:
-    runs-on: [amdgpu-mi250-x86-64]
+    runs-on: ${{ matrix.runs-on }}
+    strategy:


I don't think this is the way we want to handle multiple runners; afaics, this will always run the given code sample on both types of AMD cards. Instead, the card should be selectable by an input argument that is given to the job.

Both runners should be registered in src/discord-cluster-manager/consts.py, and
https://github.com/gpu-mode/discord-cluster-manager/blob/main/src/discord-cluster-manager/cogs/github_cog.py#L35-L40
needs to be updated accordingly.
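
A minimal sketch of the selection idea above, under stated assumptions: the `AMDGPU` enum, the `build_workflow_inputs` helper, the `runner` input key, and the `amdgpu-mi300-x86-64` label are all hypothetical and not taken from the repository (only `amdgpu-mi250-x86-64` appears in the diff). It shows how a registry of runner labels could map a user's choice to a single runner rather than running on every card.

```python
from enum import Enum


class AMDGPU(Enum):
    """Hypothetical registry of AMD runner labels (illustrative only)."""

    MI250 = "amdgpu-mi250-x86-64"  # label from the existing workflow
    MI300 = "amdgpu-mi300-x86-64"  # assumed label; the real one may differ


def build_workflow_inputs(gpu_name: str) -> dict[str, str]:
    """Map a user-chosen GPU name to the inputs sent with a workflow dispatch."""
    try:
        gpu = AMDGPU[gpu_name.upper()]
    except KeyError:
        valid = ", ".join(g.name for g in AMDGPU)
        raise ValueError(
            f"Unknown AMD GPU {gpu_name!r}; expected one of: {valid}"
        ) from None
    # "runner" is an assumed input name, not one defined in the repository.
    return {"runner": gpu.value}
```

On the workflow side, the job could then read the value with something like `runs-on: ${{ inputs.runner }}` in place of the matrix, so each submission runs on exactly one card. That input name is again an assumption.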

ngc92 · 2025-01-22T09:55:11Z

.github/workflows/amd_workflow.yml

        python3 .github/workflows/runner.py
        cat result.json  # Debug: show output

    - name: Upload training artifacts
      uses: actions/upload-artifact@v4
      if: always()
      with:
-        name: run-result
+        name: run-result-${{ matrix.name }}


This name needs to stay unchanged so the bot knows how to read the results.
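
To illustrate why the rename matters, here is a small sketch that assumes the bot finds its results by an exact artifact name. The actual lookup code is not shown in this thread, so `find_result_artifact` and the `mi250`/`mi300` suffixes are hypothetical.

```python
from typing import Optional

EXPECTED_ARTIFACT = "run-result"  # name used by the workflow before this change


def find_result_artifact(
    artifacts: list[dict], name: str = EXPECTED_ARTIFACT
) -> Optional[dict]:
    """Return the first artifact whose name matches exactly, else None."""
    return next((a for a in artifacts if a.get("name") == name), None)


# Assumed matrix names; with suffixed artifacts an exact-match lookup finds nothing.
renamed = [{"name": "run-result-mi250"}, {"name": "run-result-mi300"}]
unchanged = [{"name": "run-result"}]
```

Under that assumption, `find_result_artifact(renamed)` returns `None` while `find_result_artifact(unchanged)` returns the artifact, which is the failure mode the comment warns about.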

saienduri added 5 commits January 21, 2025 19:58

mi300 changes

d750d27

use preset env git variable

afa5aba

add venv activate to script step

c625f7d

temp pip install amd torch

af733a7

remove temp pip install

53369a5

saienduri requested review from ngc92 and msaroufim January 22, 2025 06:50

ngc92 reviewed Jan 22, 2025
