Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: Build succeeds and features succeed without lean server crash #106

Merged
merged 1 commit into from
Aug 14, 2024

Conversation

Adarsh321123
Copy link
Contributor

Previously on Ubuntu, lake build failed. Now, lake build succeeds, and features like premise selection and tactic suggestion also succeed rather than cause the Lean server to crash.

@yangky11 yangky11 requested a review from Peiyang-Song August 7, 2024 13:29
@Peiyang-Song
Copy link
Member

Thanks @Adarsh321123 for the PR! Just adding a note in this PR: I think the Ubuntu we are talking about is one with CUDA-enabled GPU.

In the PR two groups of flags are deleted. I gave it a try and indeed it worked on Ubuntu with GPU, despite being fairly slow.

I also found that the first line of change seems not necessary -- the build succeeds with or without cutting two of the four flags on that line, i.e. from let mut flags := #["-DBUILD_CLI=OFF", "-DOPENMP_RUNTIME=NONE", "-DWITH_DNNL=OFF", "-DWITH_MKL=OFF"] to let mut flags := #["-DOPENMP_RUNTIME=NONE", "-DWITH_MKL=OFF"]. Any other reasons why deleting these two flags are needed?

So then the part of change that really functions here is the second, which is to delete the entire

  if ← useCUDA then
    flags := flags ++ #["-DWITH_CUDA=ON", "-DWITH_CUDNN=ON"]
  else
    flags := flags ++ #["-DWITH_CUDA=OFF", "-DWITH_CUDNN=OFF"]

It seems to me that this completely disables the use of CUDA on a machine with a CUDA-enabled GPU, instead of choosing whether to use CUDA based on the machine's specs. The fact that this works indicates that there might be issues with how the program is being built when WITH_CUDA is turned on. Yet at the same time, it seems a bit brutal that we just do not use CUDA anyways.

@Peiyang-Song
Copy link
Member

The first change seems functionally unnecessary so far, i will add it back if there is a reason for that.

The second change seems suboptimal but does work, despite being slow. I think it indicates that we should look into how CUDA in used during the build. For now, I will incorporate the second part of change (deleting the CUDA-related flags) while we look more into actual reasons behind that.

@Peiyang-Song Peiyang-Song changed the base branch from main to peiyang August 14, 2024 04:00
@Adarsh321123
Copy link
Contributor Author

Hi Peiyang! Thank you for your comments. I think the "Ubuntu" I mentioned may not necessarily be one with a CUDA-enabled GPU. This is because the same build and run errors exist on my M1 Mac, which does not have a CUDA-enabled GPU. As such, the bug may be related to CUDA rather than being because of CUDA.

I agree that the first set of flags is unnecessary on Ubuntu. However, it is necessary for a successful build on my M1 Mac. Please let me know if you notice the same on your end.

I agree that the second change is suboptimal. However, given that we have found an alternative for a while, I agree that it is acceptable to incorporate this change. Also, although this removes CUDA capabilities, I noticed that the 3 features are very fast on an Ubuntu desktop with a 4090, but very slow on a DGX server with 4 A100s. This again suggests that CUDA may not be the direct issue, but may be related.

@Peiyang-Song
Copy link
Member

I see, I said it's probably related to CUDA because I was able to build everything on a Ubuntu machine without CUDA-enabled GPUs. Anyways, I think we can incorporate the temporary suboptimal fix for now.

@Peiyang-Song
Copy link
Member

For the M1 Mac, that's interesting. So you were unable to successfully lake build on your M1 Mac without changing the first line?

@Adarsh321123
Copy link
Contributor Author

It's interesting that you were able to build on an Ubuntu machine without CUDA-enabled GPUs.

@Peiyang-Song
Copy link
Member

I actually don't have a M1 Mac myself, so if you could confirm that the first line is necessary for M1 Mac, and if I validate that the change in the first line does not make any other architecture not work (I'll handle that testing), we can just incorporate the change.

@Adarsh321123
Copy link
Contributor Author

Yes, I was unable to successfully run lake build on my M1 Mac without changing the first line. I will check this one more time!

@Peiyang-Song
Copy link
Member

Sounds great! Let me know of the result after you double check that. Then I'll merge. Thanks!

@Adarsh321123
Copy link
Contributor Author

I can confirm that I cannot lake build on my M1 Mac without the first change.

@Peiyang-Song
Copy link
Member

Sounds great, thanks for the contribution!

@Peiyang-Song Peiyang-Song merged commit 063e036 into lean-dojo:peiyang Aug 14, 2024
1 of 2 checks passed
@Peiyang-Song Peiyang-Song mentioned this pull request Aug 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants