Migrate workflows to OSSCI #793

Open
yamiyysu opened this issue Jan 8, 2025 · 3 comments

yamiyysu commented Jan 8, 2025

Currently we have workflows that are running on our lab and we want to migrate them to our OSSCI clusters.

There are two types of workflows:

  1. Without storage affinity: a PR is in progress to migrate these to Conductor. See the sketch after this list.
  2. With storage affinity: these need a solution for moving the storage over before the workflows can be migrated, and we still need to decide where to migrate them to.
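
For the first type, the per-workflow change is mostly a matter of swapping the `runs-on` runner label. Below is a minimal sketch of what that looks like; the job name is made up, and treating `linux-mi300-gpu-1` as the OSSCI label (it is the label used by the workflows already migrated later in this thread) is an assumption, not a confirmed convention.

```yaml
# Hypothetical workflow sketch: moving a GPU job from a lab runner to an OSSCI runner.
# The workflow name, job name, and label choice are illustrative, not taken from a
# specific workflow in this issue.
name: example-gpu-tests
on: pull_request
jobs:
  test_mi300:
    # Before (lab runner label):
    # runs-on: nodai-amdgpu-mi300-x86-64
    # After (OSSCI cluster runner label):
    runs-on: linux-mi300-gpu-1
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: echo "the workflow's existing test steps stay unchanged"
```
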
@yamiyysu yamiyysu self-assigned this Jan 8, 2025
Member

ScottTodd commented Jan 13, 2025

| Repository | Workflow | Migrated? | Current runner labels | Special considerations |
| --- | --- | --- | --- | --- |
| iree-org/iree | pkgci_test_amd_mi300.yml | ✔️ Yes (iree-org/iree#19517) | linux-mi300-gpu-1 | Builds the IREE runtime from source, needs clang+cmake+ninja |
| iree-org/iree | pkgci_test_amd_mi250.yml | ❌ No | nodai-amdgpu-mi250-x86-64 | Needs an MI250 accelerator; builds the IREE runtime from source, needs clang+cmake+ninja |
| iree-org/iree | pkgci_regression_test.yml | ❌ No | nodai-amdgpu-mi250-x86-64 and nodai-amdgpu-mi300-x86-64 | Jobs for CPU, MI250, and MI300; local cache of SDXL and SD3 model weights (sample sources here), around 20GB |
| iree-org/iree-turbine | ci-tk.yaml | ❌ No | nodai-amdgpu-mi250-x86-64 and nodai-amdgpu-mi300-x86-64 | Jobs for CPU, MI250, and MI300 |
| iree-org/iree-turbine | perf.yaml | ❌ No | nodai-amdgpu-mi300-x86-64 | None? Workflow is also currently disabled |
| nod-ai/shark-ai | ci-llama-large-tests.yaml | ❌ No | llama-mi300x-1 | |
| nod-ai/shark-ai | ci-llama-quick-tests.yaml | ❌ No | llama-mi300x-1 | |
| nod-ai/shark-ai | ci-sdxl.yaml | ❌ No | mi300x-4 | |
| nod-ai/shark-ai | ci-sglang-benchmark.yml | ❌ No | mi300x-4 | |
| nod-ai/shark-ai | ci-sglang-integration-tests.yml | ❌ No | mi300x-4 | |
| nod-ai/shark-ai | ci_eval.yaml | ❌ No | llama-mi300x-3 | Needs local files for llama, specified with `--llama3-8b-f16-model-path=/data/...` |
| nod-ai/shark-ai | ci_eval_short.yaml | ❌ No | llama-mi300x-3 | Needs local files for llama, specified with `--llama3-8b-f16-model-path=/data/...` |
| nod-ai/shark-ai | pkgci_shark_ai.yml | ✔️ Yes (#890) | linux-mi300-gpu-1 | Does not depend on local files (it downloads its own model weights if not already available), but having a large, fast storage space to cache things would make this MUCH faster |
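
For the rows above that rely on local model weights or would benefit from a large cache, one possible direction (a sketch only, not a decided approach) is to point downloads at persistent storage on the new runners so weights are fetched once and reused. The `/shared/cache` path and the `SHARK_CACHE_DIR` variable below are hypothetical names used for illustration, not existing configuration.

```yaml
# Hypothetical workflow step: reuse persistent runner storage for model weights.
# /shared/cache and SHARK_CACHE_DIR are illustrative names, not real config.
- name: Point model downloads at persistent runner storage
  run: |
    mkdir -p /shared/cache/model-weights
    echo "SHARK_CACHE_DIR=/shared/cache/model-weights" >> "$GITHUB_ENV"
    # Later steps that fetch SDXL/SD3 or llama weights would check this
    # directory first and only download on a cache miss.
```
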

Author

yamiyysu commented Jan 27, 2025

pkgci_regression_test.yml has a local cache reference, as seen from the PR build.

Contributor

renxida commented Feb 4, 2025

I'd suggest prioritizing moving pkgci_shark_ai.yml to the new cluster soon, if possible.

Update: This one is done! #890

Old content:

We haven't run the llama shortfin integration tests on GPU for a while now. I moved them to our CPU machines due to saturation of mi300x-3 (mi300x-3 is tri-plexed for CI and, as of now, is also used by 10+ users for dev tasks, so it was taking hours to queue).

I'm trying to move them back to GPU, but the tests are failing, possibly due to over-saturation of mi300x-3 or to IREE version bumps.

Here's why it's important:

  1. I don't have other alternatives. MI300x-4 is reserved for MLPerf, so everybody is rushing to it, and I don't have enough CI time on MI300x-3 to even test my fixes.
  2. With every passing day, we drift further from the last working shortfin-llama-on-GPU test as IREE and other dependencies get bumped.
