[bug] build is verrrrrrrrrrrrrrrrrrrry slow #945
Comments
Did you install ninja? That sped things up for me considerably |
I have ninja 1.11.1.1, so I don't think that is the cause: compilation speed was fine at the previous commit. |
Following your suggestion, I tried installing flash-attention 2.5.7. However, the build is still very slow, with CPU usage staying below 1%. What could be causing this? 😭 pip install flash-attn==2.5.7 --no-build-isolation
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://pypi.ngc.nvidia.com
Collecting flash-attn==2.5.7
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/21/cb/33a1f833ac4742c8adba063715bf769831f96d99dbbbb4be1b197b637872/flash_attn-2.5.7.tar.gz (2.5 MB)
━━━━━━━━━━━━━━━━━━ 2.5/2.5 MB 54.0 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: torch in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from flash-attn==2.5.7) (2.3.0+cu118)
Collecting einops (from flash-attn==2.5.7)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/44/5a/f0b9ad6c0a9017e62d4735daaeb11ba3b6c009d69a26141b258cd37b5588/einops-0.8.0-py3-none-any.whl (43 kB)
━━━━━━━━━━━━━━━━ 43.2/43.2 kB 82.2 MB/s eta 0:00:00
Requirement already satisfied: packaging in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from flash-attn==2.5.7) (24.0)
Requirement already satisfied: ninja in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from flash-attn==2.5.7) (1.11.1.1)
Requirement already satisfied: filelock in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (3.14.0)
Requirement already satisfied: typing-extensions>=4.8.0 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (4.12.0)
Requirement already satisfied: sympy in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (1.12)
Requirement already satisfied: networkx in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (3.2.1)
Requirement already satisfied: jinja2 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (3.1.3)
Requirement already satisfied: fsspec in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (2024.5.0)
Requirement already satisfied: nvidia-cuda-nvrtc-cu11==11.8.89 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (11.8.89)
Requirement already satisfied: nvidia-cuda-runtime-cu11==11.8.89 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (11.8.89)
Requirement already satisfied: nvidia-cuda-cupti-cu11==11.8.87 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (11.8.87)
Requirement already satisfied: nvidia-cudnn-cu11==8.7.0.84 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (8.7.0.84)
Requirement already satisfied: nvidia-cublas-cu11==11.11.3.6 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (11.11.3.6)
Requirement already satisfied: nvidia-cufft-cu11==10.9.0.58 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (10.9.0.58)
Requirement already satisfied: nvidia-curand-cu11==10.3.0.86 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (10.3.0.86)
Requirement already satisfied: nvidia-cusolver-cu11==11.4.1.48 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (11.4.1.48)
Requirement already satisfied: nvidia-cusparse-cu11==11.7.5.86 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (11.7.5.86)
Requirement already satisfied: nvidia-nccl-cu11==2.20.5 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (2.20.5)
Requirement already satisfied: nvidia-nvtx-cu11==11.8.86 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (11.8.86)
Requirement already satisfied: triton==2.3.0 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from torch->flash-attn==2.5.7) (2.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from jinja2->torch->flash-attn==2.5.7) (2.1.5)
Requirement already satisfied: mpmath>=0.19 in /zhuzixuan/conda/envs/bunny/lib/python3.10/site-packages (from sympy->torch->flash-attn==2.5.7) (1.3.0)
Building wheels for collected packages: flash-attn
Building wheel for flash-attn (setup.py) ...
Meanwhile top shows the build process nearly idle:
14750 root 20 0 19496 10636 8960 S |
try to clone the repo and run |
Have you solved this problem yet? I have encountered the same problem. |
My XPS 15 (Windows) is taking hours to build... |
I had the same issue: building flash-attn was slow and the CPU load was very low. Only two instances of the "NVIDIA cicc" process were running at a time. |
Same issue. I couldn't build it all night; after following the suggestion to revert to 2.5.8, the CPU was fully utilized and the build succeeded. |
Same issue here (32 GB RAM and 13th-gen i7) |
Same issue here (Jetson AGX Orin) |
Super slow build. |
Same issue |
same |
same |
same |
same |
same |
same |
Same |
Same for me; only 4 of 20 cores are occupied. |
Win10, with ninja installed from conda. CPU load reaches 100% across 44 cores, with ~100 GB RAM usage. The build took about 1 hour. |
Same. |
Actually, there is no need to compile every file in setup.py.
I keep only the necessary files.
My GPU is an L20, so I don't need to compile for sm90, and I removed that code.
Apply these patches as well, or the build will fail with an undefined-symbol error.
Then compile the source.
The build takes less than a minute and it runs.
This approach only works for research, not production. |
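The pruning idea above can be sketched as a small helper for setup.py. This is a minimal sketch, not the commenter's actual patch: the file-name pattern (e.g. flash_fwd_hdim64_fp16_sm80.cu) is an assumption based on how flash-attn typically names its per-head-dim/per-dtype kernel files, and prune_sources is a hypothetical helper name.

```python
import re

def prune_sources(sources, keep_hdims=(64,), keep_dtypes=("fp16",)):
    """Keep only the .cu kernel files matching the wanted head dims/dtypes.

    Assumes flash-attn-style names like flash_fwd_hdim64_fp16_sm80.cu.
    Files whose names don't encode hdim/dtype (C++ glue such as the API
    bindings) are always kept, since the extension can't link without them.
    """
    kept = []
    for src in sources:
        m = re.search(r"hdim(\d+)_(fp16|bf16)", src)
        if m is None:
            kept.append(src)  # glue code: always compile
        elif int(m.group(1)) in keep_hdims and m.group(2) in keep_dtypes:
            kept.append(src)  # a kernel variant we actually need
    return kept
```

In setup.py one would call this on the extension's sources list before passing it to CUDAExtension; since nvcc time scales roughly with the number of kernel variants, dropping unneeded head dims and dtypes cuts the build proportionally.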
@tridao Could the flash-attn team support a dedicated compilation feature? For instance, we could specify HDIM=64 and DTYPE=float16 during installation (pip install -e . -v) to build a version only for head dimension 64 and torch.float16. This would greatly facilitate the development of flash-attn. |
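The proposed interface could look something like the sketch below. HDIM and DTYPE are not existing flash-attn options, just the hypothetical environment variables named in the proposal, parsed the way a setup.py might read them.

```python
import os

def wanted_variants(env=None):
    """Parse hypothetical HDIM/DTYPE env vars (comma-separated) into the
    (head_dim, dtype) kernel variants setup.py would compile.

    The defaults here build a single variant; a real implementation would
    more likely treat unset vars as 'compile everything' to stay
    backward compatible.
    """
    env = os.environ if env is None else env
    hdims = [int(h) for h in env.get("HDIM", "64").split(",")]
    dtypes = env.get("DTYPE", "float16").split(",")
    return [(h, d) for h in hdims for d in dtypes]
```

With this, `HDIM=64 DTYPE=float16 pip install -e . -v` would compile one variant instead of the full matrix of head dims and dtypes.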
Sure, happy to review a PR! |
ninja+ MAX_JOBS=256 = Done! |
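For anyone trying this, MAX_JOBS is the environment variable PyTorch's cpp_extension build reads to cap parallel compile jobs; a typical invocation, assuming ninja is already installed, would be:

```shell
# MAX_JOBS controls how many nvcc jobs torch.utils.cpp_extension runs in
# parallel; raise it only if you have the RAM, since each nvcc job can use
# several GB while compiling flash-attn kernels.
MAX_JOBS=256 pip install flash-attn --no-build-isolation
```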
I compiled the latest source code, and the build was so slow that I had to fall back to commit 2.5.8. The previous version took about 3-5 minutes to complete (70% CPU and 230 GB memory usage), but this version barely uses the CPU at all. What happened?
Setting MAX_JOBS doesn't get the CPU working either.
CentOS: 7.9.2009
Python: 3.10.14
GCC: 12.3.0
cmake: 3.27.9
nvcc: 12.2.140
wheel is OK
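Pinning the version mentioned above is the straightforward workaround while the slowdown is investigated; assuming a pip-based setup like the log earlier in this thread:

```shell
# Pin the last version reported to build quickly in this thread.
pip install flash-attn==2.5.8 --no-build-isolation
```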