
[Bug]: stuck when running puma #443

Closed
hkz103 opened this issue Dec 18, 2023 · 7 comments

Comments


hkz103 commented Dec 18, 2023

Issue Type

Usability

Modules Involved

Documentation/Tutorial/Example

Have you reproduced the bug with SPU HEAD?

Yes

Have you searched existing issues?

Yes

SPU Version

SPU 0.7.0, SecretFlow 1.3.0

OS Platform and Distribution

Linux Ubuntu 20.04

Python Version

3.8

Compiler Version

gcc 10.5

Current Behavior?

When I run PUMA, the example gets stuck at:

"Run on CPU
Q: What is the largest animal?
A: The (no further output)
Run on SPU"
(stuck)

Standalone code to reproduce the issue

Running command line: bazel-bin/examples/python/ml/flax_llama7b/flax_llama7b --config /opt/spu/examples/python/ml/flax_llama7b/3pc.json
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Some of the weights of FlaxLLaMAForCausalLM were initialized in float16 precision from the model checkpoint at /opt/open_llama:
[('lm_head', 'kernel'), ('transformer', 'h', '0', 'attention', 'wk', 'kernel'), ('transformer', 'h', '0', 'attention', 'wo', 'kernel'), ('transformer', 'h', '0', 'attention', 'wq', 'kernel'), ('transformer', 'h', '0', 'attention', 'wv', 'kernel'), ('transformer', 'h', '0', 'attention_norm', 'kernel'), ('transformer', 'h', '0', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '0', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '0', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '0', 'ffn_norm', 'kernel'), ('transformer', 'h', '1', 'attention', 'wk', 'kernel'), ('transformer', 'h', '1', 'attention', 'wo', 'kernel'), ('transformer', 'h', '1', 'attention', 'wq', 'kernel'), ('transformer', 'h', '1', 'attention', 'wv', 'kernel'), ('transformer', 'h', '1', 'attention_norm', 'kernel'), ('transformer', 'h', '1', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '1', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '1', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '1', 'ffn_norm', 'kernel'), ('transformer', 'h', '10', 'attention', 'wk', 'kernel'), ('transformer', 'h', '10', 'attention', 'wo', 'kernel'), ('transformer', 'h', '10', 'attention', 'wq', 'kernel'), ('transformer', 'h', '10', 'attention', 'wv', 'kernel'), ('transformer', 'h', '10', 'attention_norm', 'kernel'), ('transformer', 'h', '10', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '10', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '10', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '10', 'ffn_norm', 'kernel'), ('transformer', 'h', '11', 'attention', 'wk', 'kernel'), ('transformer', 'h', '11', 'attention', 'wo', 'kernel'), ('transformer', 'h', '11', 'attention', 'wq', 'kernel'), ('transformer', 'h', '11', 'attention', 'wv', 'kernel'), ('transformer', 'h', '11', 'attention_norm', 'kernel'), ('transformer', 'h', '11', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '11', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '11', 'feed_forward', 
'w3', 'kernel'), ('transformer', 'h', '11', 'ffn_norm', 'kernel'), ('transformer', 'h', '12', 'attention', 'wk', 'kernel'), ('transformer', 'h', '12', 'attention', 'wo', 'kernel'), ('transformer', 'h', '12', 'attention', 'wq', 'kernel'), ('transformer', 'h', '12', 'attention', 'wv', 'kernel'), ('transformer', 'h', '12', 'attention_norm', 'kernel'), ('transformer', 'h', '12', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '12', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '12', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '12', 'ffn_norm', 'kernel'), ('transformer', 'h', '13', 'attention', 'wk', 'kernel'), ('transformer', 'h', '13', 'attention', 'wo', 'kernel'), ('transformer', 'h', '13', 'attention', 'wq', 'kernel'), ('transformer', 'h', '13', 'attention', 'wv', 'kernel'), ('transformer', 'h', '13', 'attention_norm', 'kernel'), ('transformer', 'h', '13', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '13', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '13', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '13', 'ffn_norm', 'kernel'), ('transformer', 'h', '14', 'attention', 'wk', 'kernel'), ('transformer', 'h', '14', 'attention', 'wo', 'kernel'), ('transformer', 'h', '14', 'attention', 'wq', 'kernel'), ('transformer', 'h', '14', 'attention', 'wv', 'kernel'), ('transformer', 'h', '14', 'attention_norm', 'kernel'), ('transformer', 'h', '14', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '14', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '14', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '14', 'ffn_norm', 'kernel'), ('transformer', 'h', '15', 'attention', 'wk', 'kernel'), ('transformer', 'h', '15', 'attention', 'wo', 'kernel'), ('transformer', 'h', '15', 'attention', 'wq', 'kernel'), ('transformer', 'h', '15', 'attention', 'wv', 'kernel'), ('transformer', 'h', '15', 'attention_norm', 'kernel'), ('transformer', 'h', '15', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '15', 'feed_forward', 
'w2', 'kernel'), ('transformer', 'h', '15', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '15', 'ffn_norm', 'kernel'), ('transformer', 'h', '16', 'attention', 'wk', 'kernel'), ('transformer', 'h', '16', 'attention', 'wo', 'kernel'), ('transformer', 'h', '16', 'attention', 'wq', 'kernel'), ('transformer', 'h', '16', 'attention', 'wv', 'kernel'), ('transformer', 'h', '16', 'attention_norm', 'kernel'), ('transformer', 'h', '16', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '16', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '16', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '16', 'ffn_norm', 'kernel'), ('transformer', 'h', '17', 'attention', 'wk', 'kernel'), ('transformer', 'h', '17', 'attention', 'wo', 'kernel'), ('transformer', 'h', '17', 'attention', 'wq', 'kernel'), ('transformer', 'h', '17', 'attention', 'wv', 'kernel'), ('transformer', 'h', '17', 'attention_norm', 'kernel'), ('transformer', 'h', '17', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '17', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '17', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '17', 'ffn_norm', 'kernel'), ('transformer', 'h', '18', 'attention', 'wk', 'kernel'), ('transformer', 'h', '18', 'attention', 'wo', 'kernel'), ('transformer', 'h', '18', 'attention', 'wq', 'kernel'), ('transformer', 'h', '18', 'attention', 'wv', 'kernel'), ('transformer', 'h', '18', 'attention_norm', 'kernel'), ('transformer', 'h', '18', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '18', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '18', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '18', 'ffn_norm', 'kernel'), ('transformer', 'h', '19', 'attention', 'wk', 'kernel'), ('transformer', 'h', '19', 'attention', 'wo', 'kernel'), ('transformer', 'h', '19', 'attention', 'wq', 'kernel'), ('transformer', 'h', '19', 'attention', 'wv', 'kernel'), ('transformer', 'h', '19', 'attention_norm', 'kernel'), ('transformer', 'h', '19', 'feed_forward', 
'w1', 'kernel'), ('transformer', 'h', '19', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '19', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '19', 'ffn_norm', 'kernel'), ('transformer', 'h', '2', 'attention', 'wk', 'kernel'), ('transformer', 'h', '2', 'attention', 'wo', 'kernel'), ('transformer', 'h', '2', 'attention', 'wq', 'kernel'), ('transformer', 'h', '2', 'attention', 'wv', 'kernel'), ('transformer', 'h', '2', 'attention_norm', 'kernel'), ('transformer', 'h', '2', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '2', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '2', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '2', 'ffn_norm', 'kernel'), ('transformer', 'h', '20', 'attention', 'wk', 'kernel'), ('transformer', 'h', '20', 'attention', 'wo', 'kernel'), ('transformer', 'h', '20', 'attention', 'wq', 'kernel'), ('transformer', 'h', '20', 'attention', 'wv', 'kernel'), ('transformer', 'h', '20', 'attention_norm', 'kernel'), ('transformer', 'h', '20', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '20', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '20', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '20', 'ffn_norm', 'kernel'), ('transformer', 'h', '21', 'attention', 'wk', 'kernel'), ('transformer', 'h', '21', 'attention', 'wo', 'kernel'), ('transformer', 'h', '21', 'attention', 'wq', 'kernel'), ('transformer', 'h', '21', 'attention', 'wv', 'kernel'), ('transformer', 'h', '21', 'attention_norm', 'kernel'), ('transformer', 'h', '21', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '21', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '21', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '21', 'ffn_norm', 'kernel'), ('transformer', 'h', '22', 'attention', 'wk', 'kernel'), ('transformer', 'h', '22', 'attention', 'wo', 'kernel'), ('transformer', 'h', '22', 'attention', 'wq', 'kernel'), ('transformer', 'h', '22', 'attention', 'wv', 'kernel'), ('transformer', 'h', '22', 'attention_norm', 
'kernel'), ('transformer', 'h', '22', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '22', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '22', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '22', 'ffn_norm', 'kernel'), ('transformer', 'h', '23', 'attention', 'wk', 'kernel'), ('transformer', 'h', '23', 'attention', 'wo', 'kernel'), ('transformer', 'h', '23', 'attention', 'wq', 'kernel'), ('transformer', 'h', '23', 'attention', 'wv', 'kernel'), ('transformer', 'h', '23', 'attention_norm', 'kernel'), ('transformer', 'h', '23', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '23', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '23', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '23', 'ffn_norm', 'kernel'), ('transformer', 'h', '24', 'attention', 'wk', 'kernel'), ('transformer', 'h', '24', 'attention', 'wo', 'kernel'), ('transformer', 'h', '24', 'attention', 'wq', 'kernel'), ('transformer', 'h', '24', 'attention', 'wv', 'kernel'), ('transformer', 'h', '24', 'attention_norm', 'kernel'), ('transformer', 'h', '24', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '24', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '24', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '24', 'ffn_norm', 'kernel'), ('transformer', 'h', '25', 'attention', 'wk', 'kernel'), ('transformer', 'h', '25', 'attention', 'wo', 'kernel'), ('transformer', 'h', '25', 'attention', 'wq', 'kernel'), ('transformer', 'h', '25', 'attention', 'wv', 'kernel'), ('transformer', 'h', '25', 'attention_norm', 'kernel'), ('transformer', 'h', '25', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '25', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '25', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '25', 'ffn_norm', 'kernel'), ('transformer', 'h', '26', 'attention', 'wk', 'kernel'), ('transformer', 'h', '26', 'attention', 'wo', 'kernel'), ('transformer', 'h', '26', 'attention', 'wq', 'kernel'), ('transformer', 'h', '26', 'attention', 'wv', 
'kernel'), ('transformer', 'h', '26', 'attention_norm', 'kernel'), ('transformer', 'h', '26', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '26', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '26', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '26', 'ffn_norm', 'kernel'), ('transformer', 'h', '27', 'attention', 'wk', 'kernel'), ('transformer', 'h', '27', 'attention', 'wo', 'kernel'), ('transformer', 'h', '27', 'attention', 'wq', 'kernel'), ('transformer', 'h', '27', 'attention', 'wv', 'kernel'), ('transformer', 'h', '27', 'attention_norm', 'kernel'), ('transformer', 'h', '27', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '27', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '27', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '27', 'ffn_norm', 'kernel'), ('transformer', 'h', '28', 'attention', 'wk', 'kernel'), ('transformer', 'h', '28', 'attention', 'wo', 'kernel'), ('transformer', 'h', '28', 'attention', 'wq', 'kernel'), ('transformer', 'h', '28', 'attention', 'wv', 'kernel'), ('transformer', 'h', '28', 'attention_norm', 'kernel'), ('transformer', 'h', '28', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '28', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '28', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '28', 'ffn_norm', 'kernel'), ('transformer', 'h', '29', 'attention', 'wk', 'kernel'), ('transformer', 'h', '29', 'attention', 'wo', 'kernel'), ('transformer', 'h', '29', 'attention', 'wq', 'kernel'), ('transformer', 'h', '29', 'attention', 'wv', 'kernel'), ('transformer', 'h', '29', 'attention_norm', 'kernel'), ('transformer', 'h', '29', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '29', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '29', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '29', 'ffn_norm', 'kernel'), ('transformer', 'h', '3', 'attention', 'wk', 'kernel'), ('transformer', 'h', '3', 'attention', 'wo', 'kernel'), ('transformer', 'h', '3', 'attention', 'wq', 
'kernel'), ('transformer', 'h', '3', 'attention', 'wv', 'kernel'), ('transformer', 'h', '3', 'attention_norm', 'kernel'), ('transformer', 'h', '3', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '3', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '3', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '3', 'ffn_norm', 'kernel'), ('transformer', 'h', '30', 'attention', 'wk', 'kernel'), ('transformer', 'h', '30', 'attention', 'wo', 'kernel'), ('transformer', 'h', '30', 'attention', 'wq', 'kernel'), ('transformer', 'h', '30', 'attention', 'wv', 'kernel'), ('transformer', 'h', '30', 'attention_norm', 'kernel'), ('transformer', 'h', '30', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '30', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '30', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '30', 'ffn_norm', 'kernel'), ('transformer', 'h', '31', 'attention', 'wk', 'kernel'), ('transformer', 'h', '31', 'attention', 'wo', 'kernel'), ('transformer', 'h', '31', 'attention', 'wq', 'kernel'), ('transformer', 'h', '31', 'attention', 'wv', 'kernel'), ('transformer', 'h', '31', 'attention_norm', 'kernel'), ('transformer', 'h', '31', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '31', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '31', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '31', 'ffn_norm', 'kernel'), ('transformer', 'h', '4', 'attention', 'wk', 'kernel'), ('transformer', 'h', '4', 'attention', 'wo', 'kernel'), ('transformer', 'h', '4', 'attention', 'wq', 'kernel'), ('transformer', 'h', '4', 'attention', 'wv', 'kernel'), ('transformer', 'h', '4', 'attention_norm', 'kernel'), ('transformer', 'h', '4', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '4', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '4', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '4', 'ffn_norm', 'kernel'), ('transformer', 'h', '5', 'attention', 'wk', 'kernel'), ('transformer', 'h', '5', 'attention', 'wo', 'kernel'), 
('transformer', 'h', '5', 'attention', 'wq', 'kernel'), ('transformer', 'h', '5', 'attention', 'wv', 'kernel'), ('transformer', 'h', '5', 'attention_norm', 'kernel'), ('transformer', 'h', '5', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '5', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '5', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '5', 'ffn_norm', 'kernel'), ('transformer', 'h', '6', 'attention', 'wk', 'kernel'), ('transformer', 'h', '6', 'attention', 'wo', 'kernel'), ('transformer', 'h', '6', 'attention', 'wq', 'kernel'), ('transformer', 'h', '6', 'attention', 'wv', 'kernel'), ('transformer', 'h', '6', 'attention_norm', 'kernel'), ('transformer', 'h', '6', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '6', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '6', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '6', 'ffn_norm', 'kernel'), ('transformer', 'h', '7', 'attention', 'wk', 'kernel'), ('transformer', 'h', '7', 'attention', 'wo', 'kernel'), ('transformer', 'h', '7', 'attention', 'wq', 'kernel'), ('transformer', 'h', '7', 'attention', 'wv', 'kernel'), ('transformer', 'h', '7', 'attention_norm', 'kernel'), ('transformer', 'h', '7', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '7', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '7', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '7', 'ffn_norm', 'kernel'), ('transformer', 'h', '8', 'attention', 'wk', 'kernel'), ('transformer', 'h', '8', 'attention', 'wo', 'kernel'), ('transformer', 'h', '8', 'attention', 'wq', 'kernel'), ('transformer', 'h', '8', 'attention', 'wv', 'kernel'), ('transformer', 'h', '8', 'attention_norm', 'kernel'), ('transformer', 'h', '8', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '8', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '8', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '8', 'ffn_norm', 'kernel'), ('transformer', 'h', '9', 'attention', 'wk', 'kernel'), ('transformer', 'h', '9', 
'attention', 'wo', 'kernel'), ('transformer', 'h', '9', 'attention', 'wq', 'kernel'), ('transformer', 'h', '9', 'attention', 'wv', 'kernel'), ('transformer', 'h', '9', 'attention_norm', 'kernel'), ('transformer', 'h', '9', 'feed_forward', 'w1', 'kernel'), ('transformer', 'h', '9', 'feed_forward', 'w2', 'kernel'), ('transformer', 'h', '9', 'feed_forward', 'w3', 'kernel'), ('transformer', 'h', '9', 'ffn_norm', 'kernel'), ('transformer', 'ln_f', 'kernel'), ('transformer', 'wte', 'embedding')]
You should probably UPCAST the model weights to float32 if this was not intended. See [`~FlaxPreTrainedModel.to_fp32`] for further information on how to do this.
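The warning above points at `FlaxPreTrainedModel.to_fp32`; with transformers' Flax models the upcast is typically one call (`model.params = model.to_fp32(model.params)`). As a rough sketch of what that does, here is the same idea on a plain nested dict of NumPy arrays (the `to_fp32` helper and the parameter names below are made up for illustration, not the transformers implementation):

```python
# Sketch: upcast float16 checkpoint weights to float32 before inference.
# With transformers' Flax models this is usually:
#     model.params = model.to_fp32(model.params)
# The helper below only illustrates the idea on a nested dict of NumPy arrays.
import numpy as np

def to_fp32(params):
    """Recursively upcast every float16 array leaf to float32."""
    if isinstance(params, dict):
        return {k: to_fp32(v) for k, v in params.items()}
    if isinstance(params, np.ndarray) and params.dtype == np.float16:
        return params.astype(np.float32)
    return params

params = {"lm_head": {"kernel": np.ones((2, 2), dtype=np.float16)}}
params = to_fp32(params)
print(params["lm_head"]["kernel"].dtype)  # float32
```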

------
Run on CPU
Q: What is the largest animal?
A: The

------
Run on SPU

Relevant log output

No response

@anakinxc
Collaborator

Hi @hkz103

Can you share your machine spec?

@hkz103
Author

hkz103 commented Dec 18, 2023

Linux Ubuntu 20.04; the GPU is an A100; SecretFlow is built with Docker.

@anakinxc
Collaborator

> Linux Ubuntu 20.04; the GPU is an A100; SecretFlow is built with Docker.

What about RAM? PUMA needs around 1 TB of RAM per node.

@tpppppub
Collaborator

In your case, since you are running three nodes on one machine, you would need approximately 3 TB of RAM.
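To verify that each node actually has that much memory, a quick check on Linux (assuming a standard /proc filesystem and the procps `free` tool are available):

```shell
# Sketch: check per-node memory before launching PUMA (Linux only).
# Values in /proc/meminfo are in kB; PUMA reportedly wants ~1 TB per node.
grep -E 'MemTotal|MemAvailable' /proc/meminfo

# Human-readable summary (requires procps):
free -h
```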

@hkz103
Author

hkz103 commented Dec 19, 2023

If I run PUMA on GPT-2 or other smaller models, do I need less RAM?

@anakinxc
Collaborator

> If I run PUMA on GPT-2 or other smaller models, do I need less RAM?

Yes. There is a GPT-2 example; you can try that.


Stale issue message. Please comment to remove the stale tag; otherwise this issue will be closed soon.
