[Issue]: Cannot compile Fortran do concurrent for AMD GPU with new AMDFLANG compiler #214
Comments
Hi @sumseq. Internal ticket has been created to assist with your issue. Thanks!
Hi @sumseq, can you try building again with the latest version of the new flang (available here)? I was able to build successfully with the following steps:
Getting the new flang
Building MPI:
Building HDF5:
Building hipft:
Same as described in the original issue:
but with
Hi, I was not able to follow your mpi and hdf5 build instructions exactly, but I was able to install them with the new flang using our external library builder. I was able to compile hipft with those flags but see:
The warnings indicate a performance issue, but the last output seems to say it is ignoring the do concurrent flag (even though it seems to have tried to parallelize the loops anyway). I tried to run the code but got:
This is an issue I encountered before and it seems to be due to Rocky 9.5 using a newer version of libffi. The code now runs, but when I run the testsuite, one of the tests failed with a small error:
But another test failed much worse:
The first test of the testsuite did pass, however, so I think there must be a specific part of the code that is the problem, but I am not sure how to identify it. Just to test, I removed the GPU architecture flag; the code then compiles, but when running I get:
I also tried to compile with the original flags but with "device" instead of "host" and got:
Problem Description
I am trying to use the new "AMD Modern Fortran Compiler" described here:
https://github.com/amd/InfinityHub-CI/tree/main/fortran
on my code that uses "do concurrent" for GPU-offload with optional OpenMP Target data movement (for GPUs/compiler that do not support unified memory).
The code is "HipFT" located publicly here:
github.com/predsci/hipft
The code works on NVIDIA GPUs with nvfortran and HPE, and on Intel GPUs with ifx.
It also compiles and runs on AMD server GPUs with HPE's CCE compiler (see https://arxiv.org/pdf/2408.07843)
I have compiled HDF5 1.14.3 (with a configure fix) and OpenMPI 5.0.6 with the amdflang and amdclang compiler to link to the code.
When I try to compile with:
-O3 -fopenmp -fdo-concurrent-parallel=device --offload-arch=gfx906
I get:
I am using 'mpif90' to compile the code which is using the amdflang:
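One way to confirm which compiler the wrapper actually invokes is Open MPI's --showme:command wrapper option. The sketch below assumes the mpif90 on the PATH comes from an Open MPI install (as in this issue); the variable name wrapped is only illustrative.

```shell
#!/bin/sh
# Report the underlying compiler behind the Open MPI Fortran wrapper.
# Assumption: mpif90 is Open MPI's wrapper; other MPI distributions
# use different options, in which case this falls back to "unknown".
if command -v mpif90 >/dev/null 2>&1; then
  wrapped=$(mpif90 --showme:command 2>/dev/null) || wrapped="unknown (not an Open MPI wrapper?)"
else
  wrapped="mpif90 not found"
fi
echo "mpif90 wraps: $wrapped"
```

If the wrapper was built against the new compiler, the reported command should be amdflang rather than gfortran or another Fortran compiler.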
If I try to compile without any OpenMP or Do Concurrent flags, the code compiles fine and runs correctly on 1 CPU core.
If I try to compile with just OpenMP turned on and "do concurrent" set to host, I get a lot of serialization warnings:
warning: loc("/home/caplanr/hipft/git_amd/src/hipft.f90":7683:7): Some do concurrent loops are not perfectly-nested. These will be serialzied.
These concern me: if I cannot use DC with index ranges like "2:N-1", then I doubt the code will parallelize at all on either the GPU or CPU, since a LOT of the loops are like that.
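To make the loop shape in question concrete, the following sketch writes a small stand-alone Fortran file containing a two-index do concurrent over the interior range 2:n-1 and, if amdflang is available, compiles it with the host setting mentioned above. The file name dc_interior.f90 and the stencil are hypothetical and not taken from HipFT; whether this reduced case triggers the same warning is not known and is exactly what it is meant to check.

```shell
#!/bin/sh
# Write a minimal reproducer (hypothetical file, not part of HipFT).
cat > dc_interior.f90 <<'EOF'
program dc_interior
  implicit none
  integer, parameter :: n = 8
  integer :: i, j
  real :: a(n, n), b(n, n)
  a = 1.0
  b = 0.0
  ! Interior-only update: both indices run over 2:n-1.
  do concurrent (j = 2:n-1, i = 2:n-1)
    b(i, j) = 0.25 * (a(i-1, j) + a(i+1, j) + a(i, j-1) + a(i, j+1))
  end do
  print *, sum(b)
end program dc_interior
EOF

# Compile with the host flags from this issue when the compiler is present.
if command -v amdflang >/dev/null 2>&1; then
  if amdflang -O3 -fopenmp -fdo-concurrent-parallel=host dc_interior.f90 -o dc_interior; then
    ./dc_interior
  else
    echo "compile failed; see compiler output above"
  fi
else
  echo "amdflang not found; wrote dc_interior.f90 only"
fi
```

A reduced case like this can help separate warnings caused by the index ranges from warnings caused by other statements inside or between the loops.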
Note I also had to use:
-L/opt/amdfort/llvm/lib -lomptarget
in this case; otherwise it cannot find the OpenMP target data movement symbols (although they should not be used in this case).
Any help would be appreciated, as I plan to present the code at SIAM's CSE meeting in a few months and would really like to have some AMD results.
-- Ron
Operating System
Rocky Linux 9.5 (Blue Onyx)
CPU
Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
GPU
AMD Radeon VII, gfx906, amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-, amdgcn-amd-amdhsa--gfx9-generic:sramecc+:xnack-
ROCm Version
ROCm 6.2.3
ROCm Component
flang
Steps to Reproduce
My ROCm version is actually 6.2.4, but that is not on the list.
My linux kernel is:
edge 6.10.6-1.el9.elrepo.x86_64
To reproduce, install the new AMD flang compiler from:
https://github.com/amd/InfinityHub-CI/tree/main/fortran
Next, clone the repo:
git clone https://github.com/predsci/hipft
Then, copy one of the build scripts from the build_examples folder and edit the top portion to resemble this:
But:
Now, try to run the build script in the top level directory of the repo.
You should see:
You can go into the src folder and try to edit the Makefile and recompile as needed.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
$ /opt/rocm/bin/rocminfo --support
ROCk module is loaded
HSA System Attributes
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
Agent 1
Name: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
Uuid: CPU-XX
Marketing Name: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4000
BDFID: 0
Internal Node ID: 0
Compute Unit: 6
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32508640(0x1f00ae0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: gfx906
Uuid: GPU-b86490a172da5ee9
Marketing Name: AMD Radeon VII
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 26287(0x66af)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1801
BDFID: 3584
Internal Node ID: 1
Compute Unit: 60
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 472
SDMA engine uCode:: 145
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 1676083(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 1676083(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx9-generic:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
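The GPU target passed to --offload-arch can be read from the agent Name field in the listing above. A minimal sketch, assuming rocminfo is installed at /opt/rocm/bin/rocminfo as in this report; gfx_targets is a hypothetical helper name, not a ROCm tool.

```shell
#!/bin/sh
# Print GPU agent names (e.g. gfx906) from rocminfo-style text on stdin.
# Only lines whose first field is "Name:" and whose value starts with
# "gfx" are kept, so "Marketing Name:" and the ISA names are skipped.
gfx_targets() {
  awk '$1 == "Name:" && $2 ~ /^gfx/ { print $2 }' | sort -u
}

# Use the real tool when it is present.
if [ -x /opt/rocm/bin/rocminfo ]; then
  /opt/rocm/bin/rocminfo | gfx_targets
fi
```

For the listing above this yields gfx906, which matches the value used with --offload-arch in the compile flags.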
Additional Information
Here are my installed amd and rocm packages: