Hi, I'm running MoBA on RunPod using a high-performance GPU instance (1x H100 SXM, 26 vCPU, 251 GB RAM), but the model is generating nonsensical, repetitive outputs. I am also seeing warnings related to attention masks and sampling settings. The issue persists across different prompts.
I followed the install instructions in the README. No torch version is listed in requirements.txt, so I installed it manually:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Uname: Linux ffe21a18248e 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Python: 3.10.16
Torch Version: 2.6.0+cu118
CUDA Available: True
CUDA Version: 11.8
CUDNN Version: 90100
Pip freeze output:
accelerate==1.4.0
certifi==2025.1.31
charset-normalizer==3.4.1
einops==0.8.1
exceptiongroup==1.2.2
filelock==3.13.1
flash_attn==2.6.3
fsspec==2024.6.1
huggingface-hub==0.29.1
idna==3.10
iniconfig==2.0.0
Jinja2==3.1.4
MarkupSafe==2.1.5
moba @ file:///app/MoBA
mpmath==1.3.0
networkx==3.3
numpy==2.1.2
nvidia-cublas-cu11==11.11.3.6
nvidia-cuda-cupti-cu11==11.8.87
nvidia-cuda-nvrtc-cu11==11.8.89
nvidia-cuda-runtime-cu11==11.8.89
nvidia-cudnn-cu11==9.1.0.70
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.3.0.86
nvidia-cusolver-cu11==11.4.1.48
nvidia-cusparse-cu11==11.7.5.86
nvidia-nccl-cu11==2.21.5
nvidia-nvtx-cu11==11.8.86
packaging==24.2
pillow==11.0.0
pluggy==1.5.0
psutil==7.0.0
pytest==8.3.4
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
safetensors==0.5.2
sympy==1.13.1
tokenizers==0.21.0
tomli==2.2.1
torch==2.6.0+cu118
torchaudio==2.6.0+cu118
torchvision==0.21.0+cu118
tqdm==4.67.1
transformers==4.49.0
triton==3.2.0
typing_extensions==4.12.2
urllib3==2.3.0
Command Used to Run Model:
python3 examples/llama.py --model meta-llama/Llama-3.1-8B --attn moba
Output
[00:14<00:00, 3.67s/it]
/root/miniconda/envs/moba/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:629: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/root/miniconda/envs/moba/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:634: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
tensor([[128000, 5269, 527, 499, 30, 358, 1097, 7060, 13, 358, 1097, 264, 5575, 13, 358, 1097, 21630, 304, 264, 12374, 13, 358, 1097, 21630, 304, 264, 12374, 13, 358, 1097, 21630, 304]], device='cuda:0')
<|begin_of_text|>how are you? I am fine. I am a student. I am studying in a university. I am studying in a university. I am studying in
I suspect the missing attention mask flagged in the warnings may be causing the repetitive output.
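For reference, this is roughly what I understand the warnings to be asking for: pass the tokenizer's attention_mask to generate, set an explicit pad_token_id, and enable do_sample so temperature/top_p actually apply. This is only a minimal plain-transformers sketch (the prompt and max_new_tokens are placeholders), and it does not include whatever MoBA attention registration examples/llama.py performs, so it may not exercise MoBA at all:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

inputs = tokenizer("how are you?", return_tensors="pt").to("cuda")
output = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # addresses the attention-mask warning
    pad_token_id=tokenizer.eos_token_id,      # explicit pad id for open-ended generation
    do_sample=True,                           # so temperature / top_p are actually used
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=64,
)
print(tokenizer.decode(output[0]))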
Are there specific MoBA settings that should be adjusted? Also, could this be related to Torch or CUDA version mismatches? What are the recommended versions for best performance?
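On the version question, this is what I am checking on my side (standard PyTorch introspection calls, nothing MoBA-specific), mainly to confirm the cu118 wheel ships kernels for the H100's compute capability 9.0:

import torch

print(torch.__version__, torch.version.cuda)   # 2.6.0+cu118, 11.8
print(torch.cuda.get_device_name(0))           # should report an H100
print(torch.cuda.get_device_capability(0))     # (9, 0) for H100
print(torch.cuda.get_arch_list())              # should include 'sm_90'
print(torch.backends.cudnn.version())          # 90100 in my environment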
Any guidance on resolving these issues would be appreciated.