[Bug]: SPU cannot aggregate when returning results; fails with KeyError: ObjRef(c2d95221-44ca-49f4-9297-3baa13058179 at node:4) #460

Closed
Zhaoxinxinzi opened this issue Jan 5, 2024 · 6 comments · Fixed by #461

Comments

@Zhaoxinxinzi

Issue Type

Support

Modules Involved

SPU compiler

Have you reproduced the bug with SPU HEAD?

Yes

Have you searched existing issues?

Yes

SPU Version

spu 0.6.0b0

OS Platform and Distribution

Linux Ubuntu 22.04

Python Version

3.10

Compiler Version

GCC11.2

Current Behavior?

A bug happened!

Standalone code to reproduce the issue

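import jax.numpy as jnp
import spu.utils.distributed as ppd

# `copts` (compiler options) and the ppd.init(...) call are assumed to be set up elsewhere in the script.
def run_on_spu():
    def calcu(array1, array2):
        # Dot product of the two input vectors.
        return jnp.dot(array1, array2)

    array1 = jnp.array([1, 2, 3])
    array2 = jnp.array([4, 5, 6])
    # Place each input on the party that owns it.
    array1_secret = ppd.device("P2")(lambda x: x)(array1)
    array2_secret = ppd.device("P1")(lambda x: x)(array2)
    # Run the computation on the SPU device.
    out = ppd.device("SPU")(calcu, copts=copts)(array1=array1_secret, array2=array2_secret)
    print("out", out)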
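For reference, the value the SPU call should produce can be checked without any SPU or JAX setup. This is only a plain-Python sanity check of the dot product for the two vectors in the snippet; the name `expected` is introduced here for illustration and is not part of the original script.

```python
# Plain-Python reference for the dot product computed in calcu().
array1 = [1, 2, 3]
array2 = [4, 5, 6]

# 1*4 + 2*5 + 3*6
expected = sum(a * b for a, b in zip(array1, array2))
print("expected", expected)
```

If the distributed run succeeds, the revealed output should match this value.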

Relevant log output

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_stable_diffusion/flax_stable_diffusion.runfiles/spulib/examples/python/ml/flax_stable_diffusion/flax_stable_diffusion.py", line 245, in <module>
    benchmark_stablediffusion()
  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_stable_diffusion/flax_stable_diffusion.runfiles/spulib/examples/python/ml/flax_stable_diffusion/flax_stable_diffusion.py", line 241, in benchmark_stablediffusion
    run_on_spu()
  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_stable_diffusion/flax_stable_diffusion.runfiles/spulib/examples/python/ml/flax_stable_diffusion/flax_stable_diffusion.py", line 63, in wrapper
    result = func(*args, **kwargs)  # 执行函数
  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_stable_diffusion/flax_stable_diffusion.runfiles/spulib/examples/python/ml/flax_stable_diffusion/flax_stable_diffusion.py", line 224, in run_on_spu
    out = ppd.device("SPU")(calcu, copts=copts)(array1=array1_secret, array2=array2_secret)
  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_stable_diffusion/flax_stable_diffusion.runfiles/spulib/spu/utils/distributed.py", line 680, in __call__
    results = [future.result() for future in futures]
  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_stable_diffusion/flax_stable_diffusion.runfiles/spulib/spu/utils/distributed.py", line 680, in <listcomp>
    results = [future.result() for future in futures]
  File "/root/miniconda3/envs/puma/lib/python3.10/concurrent/futures/_base.py", line 438, in result
    return self.__get_result()
  File "/root/miniconda3/envs/puma/lib/python3.10/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/root/miniconda3/envs/puma/lib/python3.10/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_stable_diffusion/flax_stable_diffusion.runfiles/spulib/spu/utils/distributed.py", line 249, in run
    return self._call(self._stub.Run, fn, *args, **kwargs)
  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/ml/flax_stable_diffusion/flax_stable_diffusion.runfiles/spulib/spu/utils/distributed.py", line 242, in _call
    raise Exception("remote exception", result)
Exception: ('remote exception', Exception('Traceback (most recent call last):\n  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/utils/distributed.py", line 324, in Run\n    args, kwargs = tree_map(lambda obj: self._get_object(obj), (args, kwargs))\n  File "/root/miniconda3/envs/puma/lib/python3.10/site-packages/jax/_src/tree_util.py", line 244, in tree_map\n    return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))\n  File "/root/miniconda3/envs/puma/lib/python3.10/site-packages/jax/_src/tree_util.py", line 244, in <genexpr>\n    return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))\n  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/utils/distributed.py", line 324, in <lambda>\n    args, kwargs = tree_map(lambda obj: self._get_object(obj), (args, kwargs))\n  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/utils/distributed.py", line 344, in _get_object\n    obj = self._node_clients[ref.origin_nodeid].get(ref)\n  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/utils/distributed.py", line 262, in get\n    return self._call(self._stub.RunReturn, builtin_fetch_object, ref.uuid)\n  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/utils/distributed.py", line 242, in _call\n    raise Exception("remote exception", result)\nException: (\'remote exception\', Exception(\'Traceback (most recent call last):\\n  File 
"/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/utils/distributed.py", line 309, in RunReturn\\n    result = fn(self, *args, **kwargs)\\n  File "/root/.cache/bazel/_bazel_root/949b3252a7459bc91b00b31cb7f4016a/execroot/spulib/bazel-out/k8-opt/bin/examples/python/utils/nodectl.runfiles/spulib/spu/utils/distributed.py", line 259, in builtin_fetch_object\\n    return server._globals[ObjectRef(refid, server.node_id)]\\nKeyError: ObjRef(c2d95221-44ca-49f4-9297-3baa13058179 at node:4)\\n\'))\n'))
@tpppppub
Collaborator

tpppppub commented Jan 5, 2024

Could you please post the JSON config file you use for nodectl?

@Zhaoxinxinzi
Author

image

@tpppppub
Collaborator

tpppppub commented Jan 8, 2024

I could not reproduce your problem. Please describe the exact steps to reproduce it, and also post the nodectl logs.

@Zhaoxinxinzi
Author

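After setting up the environment, I opened five ports to bring up the nodes. The config file and the Bazel files are the ones from example/ml. I then opened one more to `bazel run` the code, and a KeyError is raised when the result is returned. A simpler test case fails with the same error.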
image
image
image
image
image

These are the logs from the 5 nodectl instances.
image
image

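This is the log from running the code. I also printed intermediate information: everything up to computing the output works, and the error is raised when the result is returned.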

@Zhaoxinxinzi
Author

By the way, I am using a single machine with multiple ports, not multiple machines.

@tpppppub
Collaborator

tpppppub commented Jan 8, 2024

`nodectl up` only needs to be run once. A single run starts all 5 nodes, so there is no need to run it 5 times.
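One plausible reading of the KeyError, sketched as a toy model: the traceback shows each node keeping fetched objects in a per-process `_globals` dict, so if several processes end up answering for the same node ID, an object stored in one process cannot be found in another. The sketch assumes that duplicate launches let requests for one node reach different processes, which the thread does not confirm. `ToyNode`, `put`, and `fetch` are hypothetical names, not the real `spu.utils.distributed` API.

```python
import uuid


class ToyNode:
    """Toy stand-in for one node process (hypothetical, not the SPU server)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self._globals = {}  # per-process object store

    def put(self, value):
        # Store a value and hand back a reference (uuid, node_id).
        ref = (str(uuid.uuid4()), self.node_id)
        self._globals[ref] = value
        return ref

    def fetch(self, ref):
        # Raises KeyError if this process never stored the reference.
        return self._globals[ref]


# Two processes that both claim to be node 4 (assumed duplicate launch).
first = ToyNode(node_id=4)
duplicate = ToyNode(node_id=4)

ref = first.put([32])
same_process = first.fetch(ref)  # works: same process stored it

try:
    duplicate.fetch(ref)
    lookup_failed = False
except KeyError:
    lookup_failed = True  # the other process has an empty store

print(same_process, lookup_failed)
```

With a single set of nodes, every reference is stored and fetched in the same process, so the lookup succeeds.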

anakinxc added a commit that referenced this issue Jan 8, 2024
# Pull Request

## What problem does this PR solve?

Issue Number: Fixed #460 

## Possible side effects?

- Performance: No

- Backward compatibility: No