Migrate workflows to OSSCI #793

Open
yamiyysu opened this issue Jan 8, 2025 · 3 comments

yamiyysu commented Jan 8, 2025

Currently we have workflows that are running on our lab and we want to migrate them to our OSSCI clusters.

There are two types of workflows:

  1. Without storage affinity: a PR is in progress to migrate these to Conductor. See the sketch after this list.
  2. With storage affinity: these need a solution for moving the storage over before the workflows can be migrated, and we still need to decide where to migrate them to.
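
For the first type, the per-workflow change is mostly a matter of swapping the `runs-on` runner label. Below is a minimal sketch of what that looks like; the job name is made up, and treating `linux-mi300-gpu-1` as the OSSCI label (it is the label used by the workflows already migrated later in this thread) is an assumption, not a confirmed convention.

```yaml
# Hypothetical workflow sketch: moving a GPU job from a lab runner to an OSSCI runner.
# The workflow name, job name, and label choice are illustrative, not taken from a
# specific workflow in this issue.
name: example-gpu-tests
on: pull_request
jobs:
  test_mi300:
    # Before (lab runner label):
    # runs-on: nodai-amdgpu-mi300-x86-64
    # After (OSSCI cluster runner label):
    runs-on: linux-mi300-gpu-1
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: echo "the workflow's existing test steps stay unchanged"
```
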
@yamiyysu yamiyysu self-assigned this Jan 8, 2025
Member

ScottTodd commented Jan 13, 2025

| Repository | Workflow | Migrated? | Current runner labels | Special considerations |
| --- | --- | --- | --- | --- |
| iree-org/iree | pkgci_test_amd_mi300.yml | ✔️ Yes (iree-org/iree#19517) | linux-mi300-gpu-1 | Builds the IREE runtime from source, needs clang+cmake+ninja |
| iree-org/iree | pkgci_test_amd_mi250.yml | ❌ No | nodai-amdgpu-mi250-x86-64 | Needs an MI250 accelerator; builds the IREE runtime from source, needs clang+cmake+ninja |
| iree-org/iree | pkgci_regression_test.yml | ❌ No | nodai-amdgpu-mi250-x86-64 and nodai-amdgpu-mi300-x86-64 | Jobs for CPU, MI250, and MI300; local cache of SDXL and SD3 model weights (sample sources here), around 20GB |
| iree-org/iree-turbine | ci-tk.yaml | ❌ No | nodai-amdgpu-mi250-x86-64 and nodai-amdgpu-mi300-x86-64 | Jobs for CPU, MI250, and MI300 |
| iree-org/iree-turbine | perf.yaml | ❌ No | nodai-amdgpu-mi300-x86-64 | None? Workflow is also currently disabled |
| nod-ai/shark-ai | ci-llama-large-tests.yaml | ❌ No | llama-mi300x-1 | |
| nod-ai/shark-ai | ci-llama-quick-tests.yaml | ❌ No | llama-mi300x-1 | |
| nod-ai/shark-ai | ci-sdxl.yaml | ❌ No | mi300x-4 | |
| nod-ai/shark-ai | ci-sglang-benchmark.yml | ❌ No | mi300x-4 | |
| nod-ai/shark-ai | ci-sglang-integration-tests.yml | ❌ No | mi300x-4 | |
| nod-ai/shark-ai | ci_eval.yaml | ❌ No | llama-mi300x-3 | Needs local files for llama, specified with `--llama3-8b-f16-model-path=/data/...` |
| nod-ai/shark-ai | ci_eval_short.yaml | ❌ No | llama-mi300x-3 | Needs local files for llama, specified with `--llama3-8b-f16-model-path=/data/...` |
| nod-ai/shark-ai | pkgci_shark_ai.yml | ✔️ Yes (#890) | linux-mi300-gpu-1 | Does not depend on local files (it downloads its own model weights if not already available), but having a large, fast storage space to cache things would make this MUCH faster |
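
For the rows above that rely on local model weights or would benefit from a large cache, one possible direction (a sketch only, not a decided approach) is to point downloads at persistent storage on the new runners so weights are fetched once and reused. The `/shared/cache` path and the `SHARK_CACHE_DIR` variable below are hypothetical names used for illustration, not existing configuration.

```yaml
# Hypothetical workflow step: reuse persistent runner storage for model weights.
# /shared/cache and SHARK_CACHE_DIR are illustrative names, not real config.
- name: Point model downloads at persistent runner storage
  run: |
    mkdir -p /shared/cache/model-weights
    echo "SHARK_CACHE_DIR=/shared/cache/model-weights" >> "$GITHUB_ENV"
    # Later steps that fetch SDXL/SD3 or llama weights would check this
    # directory first and only download on a cache miss.
```
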

Author

yamiyysu commented Jan 27, 2025

pkgci_regression_test.yml has a local cache reference, as seen from the PR build.

Contributor

renxida commented Feb 4, 2025

I'd suggest prioritizing moving pkgci_shark_ai.yml to the new cluster soon, if possible.

Update: This one is done! #890

Old content:

We haven't run the llama shortfin integration tests on GPU for a while now. I moved them to our CPU machines due to saturation of mi300x-3 (mi300x-3 is tri-plexed for CI and, as of now, is also used by 10+ users for dev tasks, so it was taking hours to queue).

I'm trying to move them back to GPU, but the tests are failing, possibly due to over-saturation of mi300x-3 or to IREE version bumps.

Here's why it's important:

  1. I don't have other alternatives. MI300x-4 is reserved for MLPerf, so everybody is rushing to it, and I don't have enough CI time on MI300x-3 to even test my fixes.
  2. With every passing day, we drift further from the last working shortfin-llama-on-GPU test as IREE and other dependencies get bumped.
