Add ray multiple host support #63
Conversation
jetstream_pt/ray_engine_master.py
Outdated
prefix=prefix, decode_state=decode_state, slot=slot
)
all_outputs.append(output)
_ = ray.get(all_outputs)
Do we need to return anything here? And for prefill?
In interleaved serving with Ray multi-host, prefill doesn't return anything.
got it, can we return None with that comment?
Added.
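For reference, a minimal sketch of what that could look like on the master side (the prefill_ray method name and argument names here are illustrative, following the pattern used elsewhere in the PR, not the exact code):

def prefill(self, *, params, padded_tokens, true_length):
  all_outputs = []
  for worker in self.engine_workers:
    output = worker.prefill_ray.remote(
        params=params, padded_tokens=padded_tokens, true_length=true_length
    )
    all_outputs.append(output)
  # Block until every worker has finished its prefill step.
  _ = ray.get(all_outputs)
  # In interleaved multi-host serving the prefill result stays on the workers,
  # so there is nothing meaningful to return to the caller.
  return None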
jetstream_pt/ray_engine_master.py
Outdated
pod_name = tpu.get_current_pod_name()
num_hosts = tpu.get_current_pod_worker_count()
Maybe we can add an assertion check here that pod_name is not None and num_hosts >= 1? (None, 0) would be the situation if we're not running on a TPU VM at all.
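Something along these lines, for example (a sketch; the assertion messages are just suggestions):

pod_name = tpu.get_current_pod_name()
num_hosts = tpu.get_current_pod_worker_count()
# These return (None, 0) when we are not on a TPU VM, so fail fast with a
# clear error instead of failing later during sharding.
assert pod_name is not None, "Not running on a TPU VM: no TPU pod detected"
assert num_hosts >= 1, f"Expected at least one TPU host, got {num_hosts}"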
Good point! Added asserts for pod_name not being None and for num_hosts.
jetstream_pt/ray_engine_master.py
Outdated
engine_worker_with_tpu_resource = PyTorchEngineRayWorker.options(
    resources={"TPU": 4}
)
Nit: we could remove this line altogether if we did @ray.remote(resources={"TPU": 4}) in the PyTorchEngineRayWorker decorator.
I intentionally keep the worker without assigning any resources, and moved resource allocation into the master so it can dynamically assign a different number of chips for different TPU types. Right now I still use a fixed number, but the value is either passed in as a parameter or calculated from (pod_worker_count / num_hosts) to assign resources for the worker.
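Roughly like this, for illustration (a sketch; tpu_chips_per_worker and pod_worker_count are illustrative names, not the exact PR variables):

# Assign TPU chips per worker dynamically in the master instead of hard-coding
# the value in the @ray.remote decorator on the worker class. The count either
# comes in as a parameter or is derived from the pod topology, as described above.
chips_per_worker = tpu_chips_per_worker or (pod_worker_count // num_hosts)
engine_worker_with_tpu_resource = PyTorchEngineRayWorker.options(
    resources={"TPU": chips_per_worker}
)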
This looks great!! Mostly nits, but IIUC this same approach would work for JAX/JetStream right?
jetstream_pt/ray_engine_worker.py
Outdated
f"memory using {fmt_size(used)} / {fmt_size(limit)} ({used/limit:%}) on {d}" | ||
) | ||
|
||
def init_decode_state_ray( |
Nit: maybe this can be init_decode_state, and the init_decode_state from above can become _init_decode_state?
To make it easy to debug and reproduce the code on a single host (a lot of JAX debug info is lost in master-worker mode compared with single-host mode), I let the Ray worker run on a single host without Ray with only a one-line change (commenting out the @ray.remote decorator). So I keep the init_decode_state function name the same as in the engine without Ray.
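In other words, the pattern is roughly this (a simplified sketch, not the exact file):

import ray

# Comment out the decorator below to run the worker directly on a single host,
# which keeps JAX diagnostics visible and makes debugging easier.
@ray.remote
# pylint: disable-next=all
class PyTorchEngineRayWorker:
  """Wraps functions to the Jet Engine API format."""

  def init_decode_state(self):
    # Kept with the same name as the non-Ray engine so the worker can be
    # debugged and reproduced on a single host without Ray.
    ...

  def init_decode_state_ray(self):
    # Thin Ray entry point called by the master via .remote().
    ...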
Ah okay that makes a lot of sense!
Thanks, Allen, for reviewing it! Yes, JAX/JetStream should use the same approach in general. We could extract the common parts of the code (for example, the master code could be exactly the same for both JAX and PyTorch), so the JAX and PyTorch sides can share the same code base.
Could we also add a basic unit test for Ray? One way is to take advantage of Ray's "fake" cluster, if you look at this file for instance.
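One lightweight way to do that, as a sketch (this uses a plain local Ray instance advertising a fake "TPU" resource rather than the full fake-cluster utility; the test name and FakeWorker actor are illustrative):

import ray


def test_actors_schedule_on_fake_tpu_resource():
  # Start a local Ray instance that pretends to have 4 TPU chips, so actors
  # requesting {"TPU": ...} can be placed without real TPU hardware.
  ray.init(num_cpus=4, resources={"TPU": 4})
  try:

    @ray.remote(resources={"TPU": 1})
    class FakeWorker:
      def ping(self):
        return "ok"

    workers = [FakeWorker.remote() for _ in range(4)]
    assert ray.get([w.ping.remote() for w in workers]) == ["ok"] * 4
  finally:
    ray.shutdown()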
jetstream_pt/ray_engine_master.py
Outdated
output = worker.init_decode_state_ray.remote()
all_outputs.append(output)
_ = ray.get(all_outputs)
return None
Instead of returning None, is it meaningful to return some sort of "handle" here?
Basically, the reason we don't need to return anything is that the worker holds on to the state in a local variable. However, we could also have the worker store the state in a list and return its index to the master; the master would then keep a list of ints as the "handle".
One nice thing about this is that you don't need to worry about overwriting state if prefill / insert is called multiple times, and it helps with a disaggregated setup.
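A rough sketch of that idea, with hypothetical method names, in case it helps later:

# Worker side: keep decode states in a list and hand back an integer handle.
class Worker:
  def __init__(self):
    self._decode_states = []

  def init_decode_state(self) -> int:
    state = self._build_decode_state()  # hypothetical helper
    self._decode_states.append(state)
    return len(self._decode_states) - 1  # the "handle" the master keeps

# Master side: remember per-worker handles instead of discarding the results.
# handles = ray.get([w.init_decode_state.remote() for w in workers])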
Good point! It would be very important for disaggregated serving. The API might change substantially once we add disaggregated serving, so can we hold off on this for now? We will have a clearer API when we do disaggregated serving.
Added a few more style-focused comments
install_everything.sh
Outdated
@@ -22,7 +22,8 @@ pip3 show libtpu-nightly && pip3 uninstall -y libtpu-nightly
 pip3 install pip install jax[tpu] -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
 # torch cpu
 pip3 install torch --index-url https://download.pytorch.org/whl/cpu
-pip3 install tensorflow flatbuffers absl-py flax sentencepiece seqio google-cloud-storage safetensors colorama coverage
+pip3 install tensorflow flatbuffers absl-py flax sentencepiece seqio google-cloud-storage
+pip3 install safetensors colorama coverage ray[serve] humanize
The ray[serve] dependency can be simplified to just ray[default], which should lower the total package size.
Good catch, updated.
jetstream_pt/ray_engine_master.py
Outdated
params=params, decode_state=decode_state
)
all_outputs.append(output)
state, result_tokens = ray.get(all_outputs)[0]
I would add a comment here saying that we're assuming the worker does an all_gather; otherwise it could be confusing for future readers why we toss the other results.
Makes sense. Right now all the workers do an all_gather.
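For example, the comment could read roughly like this (the generate_ray method name is an assumption, mirroring init_decode_state_ray):

all_outputs = []
for worker in self.engine_workers:
  output = worker.generate_ray.remote(
      params=params, decode_state=decode_state
  )
  all_outputs.append(output)
# Every worker runs an all_gather, so each entry in all_outputs holds the same
# (state, result_tokens); taking the first one is sufficient.
state, result_tokens = ray.get(all_outputs)[0]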
jetstream_pt/ray_engine_master.py
Outdated
DecodeState = Any


class PyTorchEngineRayMaster(engine_api.Engine):
I'm wondering if PyTorchRayEngine could be a more descriptive name here (although we should still mention that it is the leader of the PyTorchInterleavedRayWorkers).
Updated.
jetstream_pt/ray_engine_master.py
Outdated
"""Ray engine master to orchestrate requests and collect token response""" | ||
|
||
def __init__( | ||
self, engine_workers, tokenizer_path, context_length, batch_size |
Could we add typing here in the init? That way readers can tell that engine_workers should be Iterable[PyTorchEngineRayWorker].
Agree, it would be clearer; updated.
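For instance, something like this (a sketch; the types of context_length and batch_size are assumptions):

from typing import Iterable

class PyTorchEngineRayMaster(engine_api.Engine):
  """Ray engine master to orchestrate requests and collect token response"""

  def __init__(
      self,
      engine_workers: Iterable["PyTorchEngineRayWorker"],
      tokenizer_path: str,
      context_length: int,
      batch_size: int,
  ):
    self.engine_workers = engine_workers
    ...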
jetstream_pt/ray_engine_worker.py
Outdated
@ray.remote
# pylint: disable-next=all
class PyTorchEngineRayWorker:
  """Wraps functions to the Jet Engine API format."""
It could be helpful to add more documentation here, for instance:

class PyTorchEngineRayWorker:
  """Ray actor representation for a PyTorch engine worker.

  PyTorchEngineRayWorker enables multi-host serving with Ray for models that
  exceed the memory capabilities of single-host TPU VMs. PyTorchEngineRayWorker
  should be used with a "leader" engine, e.g. `PyTorchRayEngine`.

  Note: For `PyTorchRayEngine` to return consistent results, it's important
  that `PyTorchEngineRayWorker` is able to maintain its state within the
  process and only return results once they have been transferred to the CPU
  device.
  """

(As I write this out, if we're renaming the above to PyTorchRayEngine, then this should be PyTorchRayWorker.)
Agree on renaming, updated.
Looks great with Ray's "fake" cluster test. Can I add the test in another PR (I tested the current PR end-to-end manually)? I also plan to add unit tests for the RayWorkers in the next PR.
Sounds good to me! Thanks for patiently addressing all of the comments, LGTM!
This PR enables the PyTorch engine to run on multiple hosts on TPU Pod slices.
MVP Goal:
The current PR is the MVP version of multi-host support, with two goals:
Load weights and apply sharding across multiple hosts
Compute meaningful decode results
Result validation:
Weight and sharding
With this PR, the weights and sharding on v5e-16 (4 hosts, 16 chips total) use 50% of the memory compared with the 8-chip setup. It worked as expected.
The weights and sharding on v5e-8 (4 hosts, 8 chips total):
Meaningful Results
With this PR, here are the first two lines of the output on v5e-16 (4 hosts, 16 chips total). The results are meaningful and read well to a human.
First two lines of the output on v5e-8 (4 hosts, 8 chips total):
Caveats
As this is the MVP, note that there may still be limitations in performance and accuracy.