Kokkos Example
This example introduces APEX support for the Kokkos performance portability programming model.
APEX integrates with Kokkos through the Kokkos profiling callbacks.
The example program is a dated but still useful LULESH 2.0 implementation written with Kokkos.
It uses the Kokkos API for all computational kernels.
The apex_exec wrapper script has several options for supporting Kokkos programs:
--apex:kokkos enable Kokkos support
--apex:kokkos_tuning enable Kokkos runtime autotuning support
--apex:kokkos_fence enable Kokkos fences for async kernels
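As a minimal sketch of how the options above combine on one command line, the following Python helper assembles an apex_exec invocation from the flags in this list. The helper name build_apex_command, its parameters, and the KOKKOS_FLAGS mapping are hypothetical and not part of APEX. The sketch only builds and prints the command; it does not run it.

```python
import shlex

# Flags copied from the option list above; the dictionary keys are illustrative.
KOKKOS_FLAGS = {
    "kokkos": "--apex:kokkos",          # enable Kokkos support
    "tuning": "--apex:kokkos_tuning",   # enable Kokkos runtime autotuning support
    "fence": "--apex:kokkos_fence",     # enable Kokkos fences for async kernels
}


def build_apex_command(program, tuning=False, fence=False, extra_flags=()):
    """Return the argument list for an apex_exec run of a Kokkos program.

    Hypothetical helper: it assembles arguments and never launches anything.
    """
    cmd = ["apex_exec", KOKKOS_FLAGS["kokkos"]]
    if tuning:
        cmd.append(KOKKOS_FLAGS["tuning"])
    if fence:
        cmd.append(KOKKOS_FLAGS["fence"])
    cmd.extend(extra_flags)  # any other apex_exec options, passed through as-is
    cmd.append(program)
    return cmd


if __name__ == "__main__":
    cmd = build_apex_command(
        "./build/bin/kokkos_lulesh_2.0",
        fence=True,
        extra_flags=["--apex:tasktree"],
    )
    # Print a shell-ready command line instead of executing it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments ready for subprocess.run without any shell quoting concerns.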
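To enable basic Kokkos support, use the --apex:kokkos flag. The run below also passes --apex:tasktree so that the task tree is written out for apex-treesummary.py: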
[khuck@gilgamesh apex-tutorial]$ apex_exec --apex:kokkos --apex:tasktree ./build/bin/kokkos_lulesh_2.0
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: v2.6.1-da0e52e-develop
Built on: 17:54:27 Feb 25 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Warning: environment variable 'KOKKOS_PROFILE_LIBRARY' is deprecated. Use 'KOKKOS_TOOLS_LIBS' instead. Raised by Kokkos::initialize().
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
Reading cache of Kokkos tuning results from: './apex_converged_tuning.yaml'
Running problem size 30^3 per domain until completion
Num processors: 1
Total number of elements: 27000
To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options
BuildList
Run completed:
Problem size = 30
MPI tasks = 1
Iteration count = 35
Final Origin Energy = 1.169999e+07
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 0.000000e+00
TotalAbsDiff = 0.000000e+00
MaxRelDiff = 0.000000e+00
Elapsed time = 0.23 (s)
Grind time (us/z/c) = 0.24540317 (per dom) (0.24540317 overall)
FOM = 4074.9269 (z/s)
Start Date/Time: 26/02/2023 14:15:39
Elapsed time: 0.48768 seconds
Total processes detected: 1
HW Threads detected on rank 0: 96
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.48768 seconds
Available CPU time on all ranks: 0.48768 seconds
Counter : #samp | mean | max
--------------------------------------------------------------------------------
status:Threads : 1 2.00 2.00
status:VmData kB : 1 4.16e+05 4.16e+05
status:VmExe kB : 1 1540.00 1540.00
status:VmHWM kB : 1 4.25e+04 4.25e+04
status:VmLck kB : 1 0.00 0.00
status:VmLib kB : 1 1.36e+05 1.36e+05
status:VmPTE kB : 1 376.00 376.00
status:VmPeak kB : 1 7.51e+05 7.51e+05
status:VmPin kB : 1 0.00 0.00
status:VmRSS kB : 1 4.25e+04 4.25e+04
status:VmSize kB : 1 6.86e+05 6.86e+05
status:VmStk kB : 1 136.00 136.00
status:VmSwap kB : 1 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 27.00 27.00
status:voluntary_ctxt_switches : 1 28.00 28.00
--------------------------------------------------------------------------------
CPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
APEX MAIN : 1 0.49 0.49
int apex_preload_main(int, char **, char **) : 1 0.49 0.49
Kokkos::parallel_for [HIP, Dev:0] CalcEnergyForElems : 1225 0.00 0.04
Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems A : 1225 0.00 0.04
Kokkos::parallel_reduce [HIP, Dev:0] ZL29CalcCouran… : 385 0.00 0.02
Kokkos::parallel_reduce [HIP, Dev:0] ZL27CalcHydroC… : 385 0.00 0.02
Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQReg… : 385 0.00 0.02
Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems F : 385 0.00 0.01
Kokkos::parallel_for [HIP, Dev:0] CalcSoundSpeedFor… : 385 0.00 0.01
Kokkos::parallel_reduce [HIP, Dev:0] ZL28CalcHourgl… : 35 0.00 0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL20CalcLagran… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassFo… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-… : 351 0.00 0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL31ApplyMater… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] IntegrateStressFo… : 35 0.00 0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL23CalcVolume… : 35 0.00 0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL13CalcQForEl… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcPositionForNo… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyAcceleration… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] IntegrateStressFo… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyAcceleration… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyAcceleration… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcVelocityForNo… : 35 0.00 0.00
Kokkos deep copy: Host _mirror -> HIP : 23 0.00 0.00
Kokkos deep copy: HIP -> Host _mirror : 12 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 352 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcForceForNodes : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQGra… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] InitStressTermsFo… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcKinematicsFor… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] UpdateVolumesForE… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialProp… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcAccelerationF… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassFo… : 35 0.00 0.00
Kokkos deep copy: Host Scalar -> HIP : 13 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewFill-… : 2 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 29 0.00 0.00
Kokkos deep copy: Host nodeElemCornerList_mirror ->… : 1 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialProp… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialProp… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 35 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 1 0.00 0.00
Kokkos deep copy: Host m_nodeElemStart_mirror -> HI… : 1 0.00 0.00
Kokkos deep copy: Host regElemlist::entries_mirror … : 1 0.00 0.00
Kokkos deep copy: Host regElemlist::row_map_mirror … : 1 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 2 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 1 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 1 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 1 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 1 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 1 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 1 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 1 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-… : 1 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 1 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 1 0.00 0.00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Total timers : 6155
Writing: .//apex_tasktree.csv
[khuck@gilgamesh apex-tutorial]$ apex-treesummary.py --ascii --dot
Reading tasktree...
Read 60 rows
Found 0 ranks, with max graph node index of 59 and depth of 3
building common tree...
Rank 0 ...
1-> 0.488 - 100.000% [1] {min=0.488, max=0.488, mean=0.488, threads=1} APEX MAIN
1 |-> 0.488 - 99.993% [1] {min=0.488, max=0.488, mean=0.488, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.043 - 8.834% [1225] {min=0.043, max=0.043, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcEnergyForElems
1 | |-> 0.040 - 8.237% [1225] {min=0.040, max=0.040, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems A
1 | |-> 0.024 - 4.846% [385] {min=0.024, max=0.024, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL29CalcCourantConstraintForElemsR6DomainiidRdEUliR9MinFinderE_
1 | |-> 0.023 - 4.718% [385] {min=0.023, max=0.023, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL27CalcHydroConstraintForElemsR6DomainiidRdEUliR9MinFinderE_
1 | |-> 0.018 - 3.662% [385] {min=0.018, max=0.018, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQRegionForElems
1 | |-> 0.013 - 2.726% [385] {min=0.013, max=0.013, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems F
1 | |-> 0.012 - 2.448% [385] {min=0.012, max=0.012, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcSoundSpeedForElems
1 | |-> 0.003 - 0.587% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL28CalcHourglassControlForElemsR6DomainPddEUliRiE_
1 | |-> 0.003 - 0.562% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL20CalcLagrangeElementsR6DomainEUliRiE_
1 | |-> 0.003 - 0.540% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassForceForElems B
1 | |-> 0.003 - 0.533% [351] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-1D
1 | |-> 0.003 - 0.528% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL31ApplyMaterialPropertiesForElemsR6DomainEUliRiE_
1 | |-> 0.002 - 0.504% [35] {min=0.002, max=0.002, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] IntegrateStressForElems B
1 | |-> 0.002 - 0.401% [35] {min=0.002, max=0.002, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL23CalcVolumeForceForElemsR6DomainEUliRiE_
1 | |-> 0.002 - 0.366% [35] {min=0.002, max=0.002, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL13CalcQForElemsR6DomainEUlRKiRiE_
1 | |-> 0.001 - 0.236% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcPositionForNodes
1 | |-> 0.001 - 0.230% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyAccelerationBoundaryConditionsForNodes A
1 | |-> 0.001 - 0.209% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] IntegrateStressForElems A
1 | |-> 0.001 - 0.200% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyAccelerationBoundaryConditionsForNodes B
1 | |-> 0.001 - 0.196% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyAccelerationBoundaryConditionsForNodes C
1 | |-> 0.001 - 0.194% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcVelocityForNodes
1 | |-> 0.001 - 0.187% [23] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: Host _mirror -> HIP
1 | |-> 0.001 - 0.154% [12] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: HIP -> Host _mirror
1 | |-> 0.001 - 0.142% [352] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [] via memset
1 | |-> 0.001 - 0.129% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcForceForNodes
1 | |-> 0.001 - 0.129% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQGradientsForElems
1 | |-> 0.001 - 0.120% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] InitStressTermsForElems
1 | |-> 0.001 - 0.119% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcKinematicsForElems
1 | |-> 0.001 - 0.118% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] UpdateVolumesForElems
1 | |-> 0.001 - 0.117% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialPropertiesForElems A
1 | |-> 0.001 - 0.116% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcAccelerationForNodes
1 | |-> 0.001 - 0.115% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassForceForElems A
1 | |-> 0.001 - 0.111% [13] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: Host Scalar -> HIP
1 | | |-> 0.000 - 0.062% [2] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewFill-1D
1 | |-> 0.000 - 0.032% [29] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [_mirror] via memset
1 | |-> 0.000 - 0.021% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host nodeElemCornerList_mirror -> HIP nodeElemCornerList
1 | |-> 0.000 - 0.019% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialPropertiesForElems B
1 | |-> 0.000 - 0.019% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [sigxx] via memset
1 | |-> 0.000 - 0.019% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [vnewc] via memset
1 | |-> 0.000 - 0.015% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialPropertiesForElems C
1 | |-> 0.000 - 0.014% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [sigzz] via memset
1 | |-> 0.000 - 0.014% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [sigyy] via memset
1 | |-> 0.000 - 0.012% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [determ] via memset
1 | |-> 0.000 - 0.006% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [nodeElemCornerList_mirror] via memset
1 | |-> 0.000 - 0.004% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host m_nodeElemStart_mirror -> HIP m_nodeElemStart
1 | |-> 0.000 - 0.004% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host regElemlist::entries_mirror -> HIP regElemlist::entries
1 | |-> 0.000 - 0.003% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host regElemlist::row_map_mirror -> HIP regElemlist::row_map
1 | |-> 0.000 - 0.001% [2] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [Buffer] via memset
1 | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [m_nodeElemStart] via memset
1 | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [nodeElemCount] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [regElemlist::row_map] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [m_nodeElemStart_mirror] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [nodeElemCornerList] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [regElemlist::entries] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [regElemlist::entries_mirror] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-2D
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [regElemlist::row_map_mirror] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [regBinEnd] via memset
61 total graph nodes
Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.
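The summary script reports that the tree is also written to tasktree.txt, so it can be post-processed. The sketch below parses rows in the layout shown in the listing above and ranks the entries at a given depth by total time. It assumes the file has the same fields as this listing (the exact spacing in the real file may differ), and the names ROW, parse_tree, and top_timers are illustrative rather than part of APEX.

```python
import re

# One tree row as it appears in the listing above, for example:
#   1 | |-> 0.043 - 8.834% [1225] {min=..., threads=1} Kokkos::parallel_for ...
# Assumption: tasktree.txt uses the same fields; whitespace is matched loosely.
ROW = re.compile(
    r"^\s*\d+\s*(?P<bars>(?:\|\s*)*)->\s*"
    r"(?P<total>[\d.]+)\s*-\s*(?P<percent>[\d.]+)%\s*"
    r"\[(?P<calls>\d+)\]\s*"
    r"\{(?P<stats>[^}]*)\}\s*"
    r"(?P<name>.*)$"
)

SAMPLE = """\
Rank 0 ...
1-> 0.488 - 100.000% [1] {min=0.488, max=0.488, mean=0.488, threads=1} APEX MAIN
1 |-> 0.488 - 99.993% [1] {min=0.488, max=0.488, mean=0.488, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.043 - 8.834% [1225] {min=0.043, max=0.043, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcEnergyForElems
1 | |-> 0.040 - 8.237% [1225] {min=0.040, max=0.040, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems A
1 | |-> 0.001 - 0.111% [13] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: Host Scalar -> HIP
1 | | |-> 0.000 - 0.062% [2] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewFill-1D
61 total graph nodes
"""


def parse_tree(text):
    """Parse tree rows into dictionaries; lines that are not rows are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # status lines such as "Rank 0 ..." or "61 total graph nodes"
        rows.append({
            "depth": match.group("bars").count("|"),  # number of '|' before '->'
            "total": float(match.group("total")),     # total time in seconds
            "percent": float(match.group("percent")),
            "calls": int(match.group("calls")),
            "name": match.group("name").strip(),
        })
    return rows


def top_timers(rows, n=5, depth=2):
    """Return the n rows at the given depth with the largest total time."""
    picked = [row for row in rows if row["depth"] == depth]
    return sorted(picked, key=lambda row: row["total"], reverse=True)[:n]


if __name__ == "__main__":
    for row in top_timers(parse_tree(SAMPLE), n=3):
        print(f'{row["total"]:.3f}s  {row["calls"]:>5} calls  {row["name"]}')
```

In the first run above, depth 2 corresponds to the Kokkos kernels and deep copies directly under apex_preload_main, which is why it is the default here.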
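The --apex:kokkos_fence flag tells the Kokkos runtime not to return from any call until all asynchronous requests are complete. This can help in some profiling situations, for example when a host-side timer would otherwise cover only the launch of an asynchronous kernel rather than its execution.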
If the Kokkos runtime is configured to target GPU devices through HIP, the --apex:hip and related options also capture the device activity and the HIP API calls:
[khuck@gilgamesh apex-tutorial]$ apex_exec --apex:kokkos --apex:hip --apex:tasktree ./build/bin/kokkos_lulesh_2.0
___ ______ _______ __
/ _ \ | ___ \ ___\ \ / /
/ /_\ \| |_/ / |__ \ V /
| _ || __/| __| / \
| | | || | | |___/ /^\ \
\_| |_/\_| \____/\/ \/
APEX Version: v2.6.1-da0e52e-develop
Built on: 17:54:27 Feb 25 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Warning: environment variable 'KOKKOS_PROFILE_LIBRARY' is deprecated. Use 'KOKKOS_TOOLS_LIBS' instead. Raised by Kokkos::initialize().
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
Reading cache of Kokkos tuning results from: './apex_converged_tuning.yaml'
Running problem size 30^3 per domain until completion
Num processors: 1
Total number of elements: 27000
To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options
BuildList
Run completed:
Problem size = 30
MPI tasks = 1
Iteration count = 35
Final Origin Energy = 1.169999e+07
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 0.000000e+00
TotalAbsDiff = 0.000000e+00
MaxRelDiff = 0.000000e+00
Elapsed time = 0.31 (s)
Grind time (us/z/c) = 0.32937249 (per dom) (0.32937249 overall)
FOM = 3036.0763 (z/s)
Start Date/Time: 26/02/2023 14:19:30
Elapsed time: 0.573162 seconds
Total processes detected: 1
HW Threads detected on rank 0: 96
Worker Threads observed on rank 0: 2
Available CPU time on rank 0: 1.14632 seconds
Available CPU time on all ranks: 1.14632 seconds
Counter : #samp | mean | max
--------------------------------------------------------------------------------
GPU: Bytes Allocated: hipMalloc : 539 2.63e+05 1.56e+07
GPU: Bytes Freed: hipFree : 533 2.63e+05 1.56e+07
GPU: CopyDeviceToHost Bytes : 957 0.00 0.00
GPU: CopyHostToDevice Bytes : 5649 0.00 0.00
GPU: FillBuffer Bytes : 546 0.00 0.00
GPU: Host Bytes Allocated: hipHostMalloc : 3 5.27e+05 1.02e+06
GPU: Total Bytes Occupied on Device : 1072 2.49e+07 3.23e+07
GPU: Total Bytes Occupied on Host : 3 8.87e+05 1.58e+06
Memory: Bytes Allocated : 3 5.27e+05 1.02e+06
Memory: Total Bytes Occupied : 3 4.66e+06 1.11e+07
status:Threads : 1 2.00 2.00
status:VmData kB : 1 4.16e+05 4.16e+05
status:VmExe kB : 1 1540.00 1540.00
status:VmHWM kB : 1 4.18e+04 4.18e+04
status:VmLck kB : 1 0.00 0.00
status:VmLib kB : 1 1.36e+05 1.36e+05
status:VmPTE kB : 1 360.00 360.00
status:VmPeak kB : 1 7.51e+05 7.51e+05
status:VmPin kB : 1 0.00 0.00
status:VmRSS kB : 1 4.18e+04 4.18e+04
status:VmSize kB : 1 6.86e+05 6.86e+05
status:VmStk kB : 1 136.00 136.00
status:VmSwap kB : 1 0.00 0.00
status:nonvoluntary_ctxt_switches : 1 2.00 2.00
status:voluntary_ctxt_switches : 1 31.00 31.00
--------------------------------------------------------------------------------
GPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
GPU: CopyDeviceToHost : 957 0.00 0.03
GPU: CopyHostToDevice : 5649 0.00 0.03
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 1225 0.00 0.01
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 1225 0.00 0.01
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 385 0.00 0.01
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 385 0.00 0.01
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 385 0.00 0.01
GPU: FillBuffer : 546 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 385 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 385 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 35 0.00 0.00
GPU: desul::(anonymous namespace)::init_lock_arrays… : 1 0.00 0.00
GPU: Kokkos::(anonymous namespace)::init_lock_array… : 1 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 1 0.00 0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… : 1 0.00 0.00
--------------------------------------------------------------------------------
CPU Timers : #calls| mean | total
--------------------------------------------------------------------------------
APEX MAIN : 1 0.57 0.57
int apex_preload_main(int, char **, char **) : 1 0.52 0.52
hipMemcpyAsync : 6595 0.00 0.26
hipEventSynchronize : 5075 0.00 0.07
Kokkos::parallel_for [HIP, Dev:0] CalcEnergyForElems : 1225 0.00 0.06
Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems A : 1225 0.00 0.05
hipLaunchKernel : 5184 0.00 0.04
hipMemcpyToSymbolAsync : 5075 0.00 0.04
Kokkos::parallel_reduce [HIP, Dev:0] ZL29CalcCouran… : 385 0.00 0.03
Kokkos::parallel_reduce [HIP, Dev:0] ZL27CalcHydroC… : 385 0.00 0.03
Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQReg… : 385 0.00 0.02
hipEventRecord : 5075 0.00 0.02
Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems F : 385 0.00 0.02
Kokkos::parallel_for [HIP, Dev:0] CalcSoundSpeedFor… : 385 0.00 0.02
hipDeviceSynchronize : 493 0.00 0.01
hipStreamSynchronize : 1310 0.00 0.00
hipFree : 533 0.00 0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL28CalcHourgl… : 35 0.00 0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL20CalcLagran… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassFo… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] IntegrateStressFo… : 35 0.00 0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL31ApplyMater… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-… : 351 0.00 0.00
hipFuncGetAttributes : 34 0.00 0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL23CalcVolume… : 35 0.00 0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL13CalcQForEl… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 352 0.00 0.00
hipMalloc : 539 0.00 0.00
hipMemsetAsync : 533 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcPositionForNo… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyAcceleration… : 35 0.00 0.00
Kokkos deep copy: Host _mirror -> HIP : 23 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] IntegrateStressFo… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyAcceleration… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcVelocityForNo… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyAcceleration… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcKinematicsFor… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] InitStressTermsFo… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQGra… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcForceForNodes : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] UpdateVolumesForE… : 35 0.00 0.00
hipMemcpyToSymbol : 11 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassFo… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialProp… : 35 0.00 0.00
Kokkos deep copy: HIP -> Host _mirror : 12 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] CalcAccelerationF… : 35 0.00 0.00
Kokkos deep copy: Host Scalar -> HIP : 13 0.00 0.00
hipHostMalloc : 3 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewFill-… : 2 0.00 0.00
hipMemset : 13 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialProp… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialProp… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 35 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 35 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 29 0.00 0.00
Kokkos deep copy: Host nodeElemCornerList_mirror ->… : 1 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 1 0.00 0.00
Kokkos deep copy: Host regElemlist::entries_mirror … : 1 0.00 0.00
Kokkos deep copy: Host m_nodeElemStart_mirror -> HI… : 1 0.00 0.00
Kokkos deep copy: Host regElemlist::row_map_mirror … : 1 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 2 0.00 0.00
hipGetDeviceCount : 1 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 1 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 1 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 1 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… : 1 0.00 0.00
hipGetDeviceProperties : 2 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 1 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 1 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 1 0.00 0.00
hipEventCreate : 1 0.00 0.00
hipSetDevice : 1 0.00 0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-… : 1 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 1 0.00 0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… : 1 0.00 0.00
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Total timers : 48969
Writing: .//apex_tasktree.csv
[khuck@gilgamesh apex-tutorial]$ apex-treesummary.py --ascii --dot
Reading tasktree...
Read 370 rows
Found 0 ranks, with max graph node index of 369 and depth of 5
building common tree...
Rank 0 ...
1-> 0.573 - 100.000% [1] {min=0.573, max=0.573, mean=0.573, threads=1} APEX MAIN
1 |-> 0.524 - 91.403% [1] {min=0.524, max=0.524, mean=0.524, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.200 - 34.955% [536] {min=0.200, max=0.200, mean=0.000, threads=1} hipMemcpyAsync
1 | | |-> 0.001 - 0.159% [536] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | |-> 0.056 - 9.818% [1225] {min=0.056, max=0.056, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcEnergyForElems
1 | | |-> 0.021 - 3.699% [1225] {min=0.021, max=0.021, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.009 - 1.522% [1225] {min=0.009, max=0.009, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.005 - 0.824% [1225] {min=0.005, max=0.005, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.006 - 1.130% [1225] {min=0.006, max=0.006, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.008 - 1.480% [1225] {min=0.008, max=0.008, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.013 - 2.204% [1225] {min=0.013, max=0.013, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcEnergyForElems(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double, double, double, double, double, double*, double*, double, double, int, Domain&, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.004 - 0.744% [1225] {min=0.004, max=0.004, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.053 - 9.208% [1225] {min=0.053, max=0.053, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems A
1 | | |-> 0.018 - 3.113% [1225] {min=0.018, max=0.018, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.009 - 1.531% [1225] {min=0.009, max=0.009, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.005 - 0.802% [1225] {min=0.005, max=0.005, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.006 - 1.133% [1225] {min=0.006, max=0.006, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.008 - 1.464% [1225] {min=0.008, max=0.008, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.014 - 2.378% [1225] {min=0.014, max=0.014, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<EvalEOSForElems(Domain&, double*, int, int, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.004 - 0.739% [1225] {min=0.004, max=0.004, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.030 - 5.185% [385] {min=0.030, max=0.030, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL29CalcCourantConstraintForElemsR6DomainiidRdEUliR9MinFinderE_
1 | | |-> 0.016 - 2.877% [385] {min=0.016, max=0.016, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.012 - 2.154% [385] {min=0.012, max=0.012, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.003 - 0.483% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.009 - 1.501% [385] {min=0.009, max=0.009, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelReduce<CalcCourantConstraintForElems(Domain&, int, int, double, double&)::{lambda(int, MinFinder&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> >()
1 | | |-> 0.002 - 0.413% [385] {min=0.002, max=0.002, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.001 - 0.224% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.002 - 0.358% [385] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.001 - 0.257% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.001 - 0.218% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.029 - 5.054% [385] {min=0.029, max=0.029, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL27CalcHydroConstraintForElemsR6DomainiidRdEUliR9MinFinderE_
1 | | |-> 0.017 - 2.973% [385] {min=0.017, max=0.017, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.013 - 2.258% [385] {min=0.013, max=0.013, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.003 - 0.480% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.009 - 1.590% [385] {min=0.009, max=0.009, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelReduce<CalcHydroConstraintForElems(Domain&, int, int, double, double&)::{lambda(int, MinFinder&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> >()
1 | | |-> 0.002 - 0.384% [385] {min=0.002, max=0.002, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.001 - 0.199% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.002 - 0.358% [385] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.001 - 0.213% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.001 - 0.107% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.022 - 3.806% [385] {min=0.022, max=0.022, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQRegionForElems
1 | | |-> 0.009 - 1.620% [385] {min=0.009, max=0.009, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.004 - 0.770% [385] {min=0.004, max=0.004, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.006 - 1.070% [385] {min=0.006, max=0.006, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcMonotonicQRegionForElems(Domain&, int, double)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.003 - 0.464% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.001 - 0.242% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.002 - 0.356% [385] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.001 - 0.232% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.017 - 2.998% [385] {min=0.017, max=0.017, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems F
1 | | |-> 0.006 - 1.067% [385] {min=0.006, max=0.006, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.003 - 0.488% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.001 - 0.255% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.002 - 0.355% [385] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.003 - 0.460% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.003 - 0.474% [385] {min=0.003, max=0.003, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<EvalEOSForElems(Domain&, double*, int, int, int)::{lambda(int)#2}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.001 - 0.225% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.016 - 2.798% [385] {min=0.016, max=0.016, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcSoundSpeedForElems
1 | | |-> 0.005 - 0.881% [385] {min=0.005, max=0.005, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.003 - 0.465% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.001 - 0.254% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.002 - 0.357% [385] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.003 - 0.462% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.003 - 0.485% [385] {min=0.003, max=0.003, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcSoundSpeedForElems(Domain&, double*, double, double*, double*, double*, double*, double, int, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.001 - 0.234% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.006 - 0.964% [389] {min=0.006, max=0.006, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.004 - 0.775% [1310] {min=0.004, max=0.004, mean=0.000, threads=1} hipStreamSynchronize
1 | |-> 0.004 - 0.767% [533] {min=0.004, max=0.004, mean=0.000, threads=1} hipFree
1 | |-> 0.004 - 0.615% [35] {min=0.004, max=0.004, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL28CalcHourglassControlForElemsR6DomainPddEUliRiE_
1 | | |-> 0.002 - 0.429% [35] {min=0.002, max=0.002, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.002 - 0.364% [35] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.000 - 0.045% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.002 - 0.265% [35] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelReduce<CalcHourglassControlForElems(Domain&, double*, double)::{lambda(int, int&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.036% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.019% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.007% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.003 - 0.554% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL20CalcLagrangeElementsR6DomainEUliRiE_
1 | | |-> 0.001 - 0.221% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.001 - 0.155% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.001 - 0.136% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.048% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.028% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.041% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.094% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelReduce<CalcLagrangeElements(Domain&)::{lambda(int, int&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.003 - 0.493% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassForceForElems B
1 | | |-> 0.002 - 0.336% [35] {min=0.002, max=0.002, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.040% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.117% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, double, int, int)::{lambda(Kokkos::Impl::HIPTeamMember const&)#1}, Kokkos::TeamPolicy<>, Kokkos::Experimental::HIP> >()
1 | | |-> 0.000 - 0.036% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.019% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.017% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.003 - 0.489% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] IntegrateStressForElems B
1 | | |-> 0.002 - 0.313% [35] {min=0.002, max=0.002, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.046% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.025% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.039% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.110% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, double*, double*, double*, int, int)::{lambda(Kokkos::Impl::HIPTeamMember const&)#1}, Kokkos::TeamPolicy<>, Kokkos::Experimental::HIP> >()
1 | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.003 - 0.459% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL31ApplyMaterialPropertiesForElemsR6DomainEUliRiE_
1 | | |-> 0.001 - 0.200% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.001 - 0.134% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.001 - 0.101% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.083% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.001 - 0.099% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.050% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.078% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelReduce<ApplyMaterialPropertiesForElems(Domain&)::{lambda(int, int&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.008% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.003 - 0.452% [351] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-1D
1 | | |-> 0.002 - 0.433% [2] {min=0.002, max=0.002, mean=0.001, threads=1} hipFuncGetAttributes
1 | |-> 0.002 - 0.415% [35] {min=0.002, max=0.002, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL23CalcVolumeForceForElemsR6DomainEUliRiE_
1 | | |-> 0.002 - 0.347% [35] {min=0.002, max=0.002, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.019% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.000 - 0.025% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.110% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<CalcVolumeForceForElems(Domain&)::{lambda(int, int&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy>, 1024u, 1u>(Kokkos::Impl::ParallelReduce<CalcVolumeForceForElems(Domain&)::{lambda(int, int&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> const*)
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.002 - 0.399% [35] {min=0.002, max=0.002, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL13CalcQForElemsR6DomainEUlRKiRiE_
1 | | |-> 0.001 - 0.210% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.001 - 0.143% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.000 - 0.043% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.082% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelReduce<CalcQForElems(Domain&)::{lambda(int const&, int&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.040% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.023% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.023% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.004% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.002 - 0.328% [352] {min=0.002, max=0.002, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [] via memset
1 | | |-> 0.001 - 0.178% [352] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.003 - 0.453% [352] {min=0.003, max=0.003, mean=0.000, threads=4} GPU: FillBuffer
1 | |-> 0.002 - 0.282% [539] {min=0.002, max=0.002, mean=0.000, threads=1} hipMalloc
1 | |-> 0.001 - 0.261% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcPositionForNodes
1 | | |-> 0.001 - 0.087% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.045% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.047% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcPositionForNodes(Domain&, double, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.254% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyAccelerationBoundaryConditionsForNodes A
1 | | |-> 0.001 - 0.088% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.043% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.034% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<ApplyAccelerationBoundaryConditionsForNodes(Domain&)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.018% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.252% [23] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: Host _mirror -> HIP
1 | | |-> 0.001 - 0.217% [23] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.053% [23] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.007% [46] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.001 - 0.240% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] IntegrateStressForElems A
1 | | |-> 0.000 - 0.063% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.050% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.029% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.002 - 0.268% [35] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, double*, double*, double*, int, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.234% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyAccelerationBoundaryConditionsForNodes B
1 | | |-> 0.000 - 0.067% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.041% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.029% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<ApplyAccelerationBoundaryConditionsForNodes(Domain&)::{lambda(int)#2}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.018% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.232% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcVelocityForNodes
1 | | |-> 0.000 - 0.058% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.043% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.024% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.053% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcVelocityForNodes(Domain&, double, double, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.231% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyAccelerationBoundaryConditionsForNodes C
1 | | |-> 0.000 - 0.063% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.028% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<ApplyAccelerationBoundaryConditionsForNodes(Domain&)::{lambda(int)#3}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.041% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.200% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcKinematicsForElems
1 | | |-> 0.000 - 0.048% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.109% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcKinematicsForElems(Domain&, double, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.044% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.024% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.006% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.180% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] InitStressTermsForElems
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.040% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<InitStressTermsForElems(Domain&, double*, double*, double*, int)::{lambda(int const&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.038% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.026% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.005% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.179% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQGradientsForElems
1 | | |-> 0.000 - 0.044% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.117% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcMonotonicQGradientsForElems(Domain&)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.017% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.026% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.006% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.178% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcForceForNodes
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.029% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcForceForNodes(Domain&)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.034% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.018% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.010% [3] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbol
1 | | | |-> 0.000 - 0.001% [3] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.005% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.177% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] UpdateVolumesForElems
1 | | |-> 0.000 - 0.041% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.037% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<UpdateVolumesForElems(Domain&, double, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.041% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.024% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.024% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.005% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.170% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassForceForElems A
1 | | |-> 0.000 - 0.049% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.232% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, double, int, int)::{lambda(int const&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.034% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.018% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.006% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.167% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialPropertiesForElems A
1 | | |-> 0.000 - 0.044% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.034% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<ApplyMaterialPropertiesForElems(Domain&)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.018% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.005% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.162% [12] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: HIP -> Host _mirror
1 | | |-> 0.001 - 0.138% [12] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.031% [12] {min=0.000, max=0.000, mean=0.000, threads=2} GPU: CopyDeviceToHost
1 | | |-> 0.000 - 0.004% [24] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.001 - 0.162% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcAccelerationForNodes
1 | | |-> 0.000 - 0.039% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.051% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcAccelerationForNodes(Domain&, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.035% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.018% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.006% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.152% [5] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyToSymbol
1 | | |-> 0.000 - 0.002% [5] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | |-> 0.001 - 0.134% [13] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: Host Scalar -> HIP
1 | | |-> 0.000 - 0.065% [2] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewFill-1D
1 | | | |-> 0.000 - 0.041% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipHostMalloc
1 | | | |-> 0.000 - 0.010% [3] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbol
1 | | | | |-> 0.000 - 0.001% [3] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | | |-> 0.000 - 0.006% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Experimental::HIP, Kokkos::Experimental::HIPSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::LayoutRight, Kokkos::Experimental::HIP, 1, int>, Kokkos::RangePolicy<Kokkos::Experimental::HIP, Kokkos::IndexType<int> >, Kokkos::Experimental::HIP>, 1024u, 1u>(Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Experimental::HIP, Kokkos::Experimental::HIPSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::LayoutRight, Kokkos::Experimental::HIP, 1, int>, Kokkos::RangePolicy<Kokkos::Experimental::HIP, Kokkos::IndexType<int> >, Kokkos::Experimental::HIP> const*)
1 | | | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Experimental::HIP, Kokkos::AnonymousSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::LayoutRight, Kokkos::Experimental::HIP, 1, int>, Kokkos::RangePolicy<Kokkos::Experimental::HIP, Kokkos::IndexType<int> >, Kokkos::Experimental::HIP>, 1024u, 1u>(Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Experimental::HIP, Kokkos::AnonymousSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::LayoutRight, Kokkos::Experimental::HIP, 1, int>, Kokkos::RangePolicy<Kokkos::Experimental::HIP, Kokkos::IndexType<int> >, Kokkos::Experimental::HIP> const*)
1 | | | |-> 0.000 - 0.001% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | | |-> 0.000 - 0.047% [11] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemset
1 | | | |-> 0.000 - 0.015% [11] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | | |-> 0.000 - 0.009% [26] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.000 - 0.052% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialPropertiesForElems B
1 | | |-> 0.000 - 0.027% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.039% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<ApplyMaterialPropertiesForElems(Domain&)::{lambda(int)#2}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy>, 1024u, 1u>(Kokkos::Impl::ParallelFor<ApplyMaterialPropertiesForElems(Domain&)::{lambda(int)#2}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> const*)
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.000 - 0.047% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialPropertiesForElems C
1 | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.026% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<ApplyMaterialPropertiesForElems(Domain&)::{lambda(int)#3}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy>, 1024u, 1u>(Kokkos::Impl::ParallelFor<ApplyMaterialPropertiesForElems(Domain&)::{lambda(int)#3}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> const*)
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.000 - 0.037% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [vnewc] via memset
1 | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.045% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: FillBuffer
1 | |-> 0.000 - 0.036% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [sigxx] via memset
1 | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.046% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: FillBuffer
1 | |-> 0.000 - 0.031% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [sigyy] via memset
1 | | |-> 0.000 - 0.016% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.045% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: FillBuffer
1 | |-> 0.000 - 0.031% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [sigzz] via memset
1 | | |-> 0.000 - 0.016% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.046% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: FillBuffer
1 | |-> 0.000 - 0.030% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipHostMalloc
1 | |-> 0.000 - 0.030% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [determ] via memset
1 | | |-> 0.000 - 0.014% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.045% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: FillBuffer
1 | |-> 0.000 - 0.026% [29] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [_mirror] via memset
1 | |-> 0.000 - 0.018% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host nodeElemCornerList_mirror -> HIP nodeElemCornerList
1 | | |-> 0.000 - 0.016% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.009% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.000% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.000 - 0.012% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemset
1 | | |-> 0.000 - 0.002% [2] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | |-> 0.000 - 0.011% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [nodeElemCornerList_mirror] via memset
1 | |-> 0.000 - 0.008% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | |-> 0.000 - 0.015% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: desul::(anonymous namespace)::init_lock_arrays_hip_kernel()
1 | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: Kokkos::(anonymous namespace)::init_lock_array_kernel_atomic()
1 | |-> 0.000 - 0.005% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host regElemlist::entries_mirror -> HIP regElemlist::entries
1 | | |-> 0.000 - 0.004% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.000% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.000 - 0.005% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host m_nodeElemStart_mirror -> HIP m_nodeElemStart
1 | | |-> 0.000 - 0.004% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.000% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.000 - 0.003% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host regElemlist::row_map_mirror -> HIP regElemlist::row_map
1 | | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.000% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.000 - 0.002% [2] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [Buffer] via memset
1 | | |-> 0.000 - 0.001% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.005% [2] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipGetDeviceCount
1 | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [m_nodeElemStart] via memset
1 | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [regElemlist::row_map] via memset
1 | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [regElemlist::entries] via memset
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [nodeElemCornerList] via memset
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | |-> 0.000 - 0.001% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipGetDeviceProperties
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [m_nodeElemStart_mirror] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [nodeElemCount] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [regElemlist::entries_mirror] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventCreate
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipSetDevice
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-2D
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [regElemlist::row_map_mirror] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [regBinEnd] via memset
371 total graph nodes
Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.
[khuck@gilgamesh apex-tutorial]$ dot -Tsvg -O tasktree.dot
The full graph clearly contains many noisy nodes, so we can prune it with some of the apex-treesummary.py options. Here, --tlimit 0.01 ignores any tree node with less than 0.01 accumulated time:
[khuck@gilgamesh apex-tutorial]$ apex-treesummary.py --ascii --dot --tlimit 0.01
Reading tasktree...
Read 370 rows
Found 0 ranks, with max graph node index of 369 and depth of 5
Ignoring any tree nodes with less than 0.01 accumulated time...
Kept 18 rows
building common tree...
Rank 0 ...
1-> 0.573 - 100.000% [1] {min=0.573, max=0.573, mean=0.573, threads=1} APEX MAIN
1 |-> 0.524 - 91.403% [1] {min=0.524, max=0.524, mean=0.524, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.200 - 34.955% [536] {min=0.200, max=0.200, mean=0.000, threads=1} hipMemcpyAsync
1 | |-> 0.056 - 9.818% [1225] {min=0.056, max=0.056, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcEnergyForElems
1 | | |-> 0.021 - 3.699% [1225] {min=0.021, max=0.021, mean=0.000, threads=1} hipEventSynchronize
1 | |-> 0.053 - 9.208% [1225] {min=0.053, max=0.053, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems A
1 | | |-> 0.018 - 3.113% [1225] {min=0.018, max=0.018, mean=0.000, threads=1} hipEventSynchronize
1 | |-> 0.030 - 5.185% [385] {min=0.030, max=0.030, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL29CalcCourantConstraintForElemsR6DomainiidRdEUliR9MinFinderE_
1 | | |-> 0.016 - 2.877% [385] {min=0.016, max=0.016, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.012 - 2.154% [385] {min=0.012, max=0.012, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | |-> 0.029 - 5.054% [385] {min=0.029, max=0.029, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL27CalcHydroConstraintForElemsR6DomainiidRdEUliR9MinFinderE_
1 | | |-> 0.017 - 2.973% [385] {min=0.017, max=0.017, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.013 - 2.258% [385] {min=0.013, max=0.013, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | |-> 0.022 - 3.806% [385] {min=0.022, max=0.022, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQRegionForElems
1 | |-> 0.017 - 2.998% [385] {min=0.017, max=0.017, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems F
1 | |-> 0.016 - 2.798% [385] {min=0.016, max=0.016, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcSoundSpeedForElems
17 total graph nodes
Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.
[khuck@gilgamesh apex-tutorial]$ dot -Tsvg -O tasktree.dot
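To make the effect of the time limit concrete, the following is a minimal Python sketch of the same idea applied to the ASCII tree lines shown above. It is not the apex-treesummary.py implementation: the names `NODE_RE`, `parse_node`, and `prune` are hypothetical, the line format is inferred only from the output printed on this page, and the leading column before the arrow is used only to count nesting depth. The sketch filters each row on its own accumulated time and does not try to reproduce how the real script handles subtrees.

```python
import re

# Assumed line shape, inferred from the ASCII output on this page:
#   "1 | |-> 0.200 - 34.955% [536] {min=..., threads=1} hipMemcpyAsync"
NODE_RE = re.compile(
    r"^(?P<prefix>.*?)->\s+(?P<time>\d+\.\d+)\s+-\s+(?P<pct>\d+\.\d+)%"
    r"\s+\[(?P<calls>\d+)\]\s+\{[^}]*\}\s+(?P<name>.*)$"
)


def parse_node(line):
    """Parse one tree line into a dict, or return None if it is not a node."""
    m = NODE_RE.match(line)
    if m is None:
        return None
    return {
        # Each '|' before the arrow marks one level of nesting.
        "depth": m.group("prefix").count("|"),
        "time": float(m.group("time")),
        "percent": float(m.group("pct")),
        "calls": int(m.group("calls")),
        "name": m.group("name").strip(),
    }


def prune(lines, tlimit):
    """Keep only nodes whose accumulated time is at least tlimit."""
    nodes = (parse_node(line) for line in lines)
    return [n for n in nodes if n is not None and n["time"] >= tlimit]


# Three lines copied from the output above.
sample = [
    "1-> 0.573 - 100.000% [1] {min=0.573, max=0.573, mean=0.573, threads=1} APEX MAIN",
    "1 | |-> 0.200 - 34.955% [536] {min=0.200, max=0.200, mean=0.000, threads=1} hipMemcpyAsync",
    "1 | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipGetDeviceCount",
]

kept = prune(sample, tlimit=0.01)
for node in kept:
    # Indent by depth to mimic the tree layout.
    print("  " * node["depth"] + node["name"], node["time"])
```

With a limit of 0.01, the hipGetDeviceCount row is dropped while APEX MAIN and hipMemcpyAsync are kept, which mirrors the way the pruned tree above retains only the heavier nodes.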
APEX tutorial, © Copyright 2023, University of Oregon. For more information on APEX, see https://github.com/UO-OACISS/apex