Kokkos Example

Kevin Huck edited this page Feb 26, 2023 · 4 revisions

This example introduces APEX using the Kokkos performance portability model.

APEX supports Kokkos through the Kokkos profiling callbacks.

Source Code

The example is a dated but still useful Kokkos implementation of LULESH 2.0.

This example uses the Kokkos API for all computational kernels.

Running the Kokkos example

The apex_exec wrapper script has several options for supporting Kokkos programs:

    --apex:kokkos                 enable Kokkos support
    --apex:kokkos_tuning          enable Kokkos runtime autotuning support
    --apex:kokkos_fence           enable Kokkos fences for async kernels

To enable basic Kokkos support, use the --apex:kokkos flag:

[khuck@gilgamesh apex-tutorial]$ apex_exec --apex:kokkos --apex:tasktree ./build/bin/kokkos_lulesh_2.0
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-da0e52e-develop
Built on: 17:54:27 Feb 25 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Warning: environment variable 'KOKKOS_PROFILE_LIBRARY' is deprecated. Use 'KOKKOS_TOOLS_LIBS' instead. Raised by Kokkos::initialize().
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Reading cache of Kokkos tuning results from: './apex_converged_tuning.yaml'
Running problem size 30^3 per domain until completion
Num processors: 1
Total number of elements: 27000

To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options

BuildList
Run completed:
   Problem size        =  30
   MPI tasks           =  1
   Iteration count     =  35
   Final Origin Energy = 1.169999e+07
   Testing Plane 0 of Energy Array on rank 0:
        MaxAbsDiff   = 0.000000e+00
        TotalAbsDiff = 0.000000e+00
        MaxRelDiff   = 0.000000e+00


Elapsed time         =       0.23 (s)
Grind time (us/z/c)  = 0.24540317 (per dom)  (0.24540317 overall)
FOM                  =  4074.9269 (z/s)


Start Date/Time: 26/02/2023 14:15:39
Elapsed time: 0.48768 seconds
Total processes detected: 1
HW Threads detected on rank 0: 96
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.48768 seconds
Available CPU time on all ranks: 0.48768 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                                      status:Threads :      1     2.00     2.00
                                    status:VmData kB :      1 4.16e+05 4.16e+05
                                     status:VmExe kB :      1  1540.00  1540.00
                                     status:VmHWM kB :      1 4.25e+04 4.25e+04
                                     status:VmLck kB :      1     0.00     0.00
                                     status:VmLib kB :      1 1.36e+05 1.36e+05
                                     status:VmPTE kB :      1   376.00   376.00
                                    status:VmPeak kB :      1 7.51e+05 7.51e+05
                                     status:VmPin kB :      1     0.00     0.00
                                     status:VmRSS kB :      1 4.25e+04 4.25e+04
                                    status:VmSize kB :      1 6.86e+05 6.86e+05
                                     status:VmStk kB :      1   136.00   136.00
                                    status:VmSwap kB :      1     0.00     0.00
                   status:nonvoluntary_ctxt_switches :      1    27.00    27.00
                      status:voluntary_ctxt_switches :      1    28.00    28.00
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     0.49     0.49
        int apex_preload_main(int, char **, char **) :      1     0.49     0.49
Kokkos::parallel_for [HIP, Dev:0] CalcEnergyForElems :   1225     0.00     0.04
 Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems A :   1225     0.00     0.04
Kokkos::parallel_reduce [HIP, Dev:0] ZL29CalcCouran… :    385     0.00     0.02
Kokkos::parallel_reduce [HIP, Dev:0] ZL27CalcHydroC… :    385     0.00     0.02
Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQReg… :    385     0.00     0.02
 Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems F :    385     0.00     0.01
Kokkos::parallel_for [HIP, Dev:0] CalcSoundSpeedFor… :    385     0.00     0.01
Kokkos::parallel_reduce [HIP, Dev:0] ZL28CalcHourgl… :     35     0.00     0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL20CalcLagran… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassFo… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-… :    351     0.00     0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL31ApplyMater… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] IntegrateStressFo… :     35     0.00     0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL23CalcVolume… :     35     0.00     0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL13CalcQForEl… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcPositionForNo… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyAcceleration… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] IntegrateStressFo… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyAcceleration… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyAcceleration… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcVelocityForNo… :     35     0.00     0.00
              Kokkos deep copy: Host _mirror -> HIP  :     23     0.00     0.00
              Kokkos deep copy: HIP  -> Host _mirror :     12     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :    352     0.00     0.00
 Kokkos::parallel_for [HIP, Dev:0] CalcForceForNodes :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQGra… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] InitStressTermsFo… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcKinematicsFor… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] UpdateVolumesForE… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialProp… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcAccelerationF… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassFo… :     35     0.00     0.00
               Kokkos deep copy: Host Scalar -> HIP  :     13     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewFill-… :      2     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :     29     0.00     0.00
Kokkos deep copy: Host nodeElemCornerList_mirror ->… :      1     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialProp… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialProp… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :     35     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :      1     0.00     0.00
Kokkos deep copy: Host m_nodeElemStart_mirror -> HI… :      1     0.00     0.00
Kokkos deep copy: Host regElemlist::entries_mirror … :      1     0.00     0.00
Kokkos deep copy: Host regElemlist::row_map_mirror … :      1     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :      2     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :      1     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :      1     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :      1     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :      1     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :      1     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :      1     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :      1     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-… :      1     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :      1     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :      1     0.00     0.00
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 6155
Writing: .//apex_tasktree.csv
[khuck@gilgamesh apex-tutorial]$ apex-treesummary.py --ascii --dot
Reading tasktree...
Read 60 rows
Found 0 ranks, with max graph node index of 59 and depth of 3
building common tree...
Rank 0 ...
1-> 0.488 - 100.000% [1] {min=0.488, max=0.488, mean=0.488, threads=1} APEX MAIN
1 |-> 0.488 - 99.993% [1] {min=0.488, max=0.488, mean=0.488, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.043 - 8.834% [1225] {min=0.043, max=0.043, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcEnergyForElems
1 | |-> 0.040 - 8.237% [1225] {min=0.040, max=0.040, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems A
1 | |-> 0.024 - 4.846% [385] {min=0.024, max=0.024, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL29CalcCourantConstraintForElemsR6DomainiidRdEUliR9MinFinderE_
1 | |-> 0.023 - 4.718% [385] {min=0.023, max=0.023, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL27CalcHydroConstraintForElemsR6DomainiidRdEUliR9MinFinderE_
1 | |-> 0.018 - 3.662% [385] {min=0.018, max=0.018, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQRegionForElems
1 | |-> 0.013 - 2.726% [385] {min=0.013, max=0.013, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems F
1 | |-> 0.012 - 2.448% [385] {min=0.012, max=0.012, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcSoundSpeedForElems
1 | |-> 0.003 - 0.587% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL28CalcHourglassControlForElemsR6DomainPddEUliRiE_
1 | |-> 0.003 - 0.562% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL20CalcLagrangeElementsR6DomainEUliRiE_
1 | |-> 0.003 - 0.540% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassForceForElems B
1 | |-> 0.003 - 0.533% [351] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-1D
1 | |-> 0.003 - 0.528% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL31ApplyMaterialPropertiesForElemsR6DomainEUliRiE_
1 | |-> 0.002 - 0.504% [35] {min=0.002, max=0.002, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] IntegrateStressForElems B
1 | |-> 0.002 - 0.401% [35] {min=0.002, max=0.002, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL23CalcVolumeForceForElemsR6DomainEUliRiE_
1 | |-> 0.002 - 0.366% [35] {min=0.002, max=0.002, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL13CalcQForElemsR6DomainEUlRKiRiE_
1 | |-> 0.001 - 0.236% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcPositionForNodes
1 | |-> 0.001 - 0.230% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyAccelerationBoundaryConditionsForNodes A
1 | |-> 0.001 - 0.209% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] IntegrateStressForElems A
1 | |-> 0.001 - 0.200% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyAccelerationBoundaryConditionsForNodes B
1 | |-> 0.001 - 0.196% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyAccelerationBoundaryConditionsForNodes C
1 | |-> 0.001 - 0.194% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcVelocityForNodes
1 | |-> 0.001 - 0.187% [23] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: Host _mirror -> HIP
1 | |-> 0.001 - 0.154% [12] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: HIP  -> Host _mirror
1 | |-> 0.001 - 0.142% [352] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [] via memset
1 | |-> 0.001 - 0.129% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcForceForNodes
1 | |-> 0.001 - 0.129% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQGradientsForElems
1 | |-> 0.001 - 0.120% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] InitStressTermsForElems
1 | |-> 0.001 - 0.119% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcKinematicsForElems
1 | |-> 0.001 - 0.118% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] UpdateVolumesForElems
1 | |-> 0.001 - 0.117% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialPropertiesForElems A
1 | |-> 0.001 - 0.116% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcAccelerationForNodes
1 | |-> 0.001 - 0.115% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassForceForElems A
1 | |-> 0.001 - 0.111% [13] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: Host Scalar -> HIP
1 | | |-> 0.000 - 0.062% [2] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewFill-1D
1 | |-> 0.000 - 0.032% [29] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [_mirror] via memset
1 | |-> 0.000 - 0.021% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host nodeElemCornerList_mirror -> HIP nodeElemCornerList
1 | |-> 0.000 - 0.019% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialPropertiesForElems B
1 | |-> 0.000 - 0.019% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [sigxx] via memset
1 | |-> 0.000 - 0.019% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [vnewc] via memset
1 | |-> 0.000 - 0.015% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialPropertiesForElems C
1 | |-> 0.000 - 0.014% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [sigzz] via memset
1 | |-> 0.000 - 0.014% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [sigyy] via memset
1 | |-> 0.000 - 0.012% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [determ] via memset
1 | |-> 0.000 - 0.006% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [nodeElemCornerList_mirror] via memset
1 | |-> 0.000 - 0.004% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host m_nodeElemStart_mirror -> HIP m_nodeElemStart
1 | |-> 0.000 - 0.004% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host regElemlist::entries_mirror -> HIP regElemlist::entries
1 | |-> 0.000 - 0.003% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host regElemlist::row_map_mirror -> HIP regElemlist::row_map
1 | |-> 0.000 - 0.001% [2] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [Buffer] via memset
1 | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [m_nodeElemStart] via memset
1 | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [nodeElemCount] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [regElemlist::row_map] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [m_nodeElemStart_mirror] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [nodeElemCornerList] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [regElemlist::entries] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [regElemlist::entries_mirror] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-2D
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [regElemlist::row_map_mirror] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [regBinEnd] via memset
61 total graph nodes

Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.

DOT task graph of Kokkos example
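
Each line of the ASCII tree printed by apex-treesummary.py above follows a regular layout: a nesting prefix, the accumulated time, its percentage of the root, the call count in brackets, a statistics group in braces, and the timer name. As a minimal sketch, assuming that layout holds for every line, the lines can be turned into records for sorting or filtering. The regular expression and the helper name `parse_tree_line` are illustrative and not part of APEX; the layout is inferred from the output shown here, not from a format specification.

```python
import re

# One line of the ASCII tree, e.g.
#   1 | |-> 0.043 - 8.834% [1225] {min=0.043, ...} Kokkos::parallel_for ...
# The layout is inferred from the output above, not from an APEX specification.
LINE_RE = re.compile(
    r"^(?P<prefix>[\d\s|]*)-> "
    r"(?P<total>\d+\.\d+) - (?P<pct>\d+\.\d+)% "
    r"\[(?P<calls>\d+)\] "
    r"\{[^}]*\} "
    r"(?P<name>.+)$"
)


def parse_tree_line(line):
    """Return a dict for one tree line, or None if the line does not match."""
    match = LINE_RE.match(line.rstrip())
    if match is None:
        return None
    return {
        "depth": match.group("prefix").count("|"),  # number of nesting bars
        "total": float(match.group("total")),
        "percent": float(match.group("pct")),
        "calls": int(match.group("calls")),
        "name": match.group("name"),
    }


# Three tree lines and one trailer line copied from the output above.
SAMPLE = """\
1-> 0.488 - 100.000% [1] {min=0.488, max=0.488, mean=0.488, threads=1} APEX MAIN
1 | |-> 0.043 - 8.834% [1225] {min=0.043, max=0.043, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcEnergyForElems
1 | |-> 0.001 - 0.187% [23] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: Host _mirror -> HIP
61 total graph nodes
"""

# Non-matching lines (such as the trailer) are skipped.
records = [r for r in map(parse_tree_line, SAMPLE.splitlines()) if r is not None]
for rec in records:
    print(rec["depth"], rec["calls"], rec["percent"], rec["name"])
```

Because timer names can themselves contain `->` (the deep copy entries do), the prefix group is restricted to digits, spaces, and bars so that only the first arrow is treated as the tree marker.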

The --apex:kokkos_fence flag tells the Kokkos runtime not to return from a call until all asynchronous requests are complete. This can help in profiling situations where asynchronous kernels would otherwise return before their work has finished.
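
Why a fence changes what a host-side timer sees can be illustrated without a GPU. The sketch below is an analogy only and assumes nothing about how APEX or Kokkos implement fencing: a Python thread stands in for an asynchronous kernel, and joining the thread stands in for the fence. The names `launch_async` and `timed_launch` are hypothetical.

```python
import threading
import time


def launch_async(duration):
    """Start a stand-in 'kernel' in the background and return immediately."""
    worker = threading.Thread(target=time.sleep, args=(duration,))
    worker.start()
    return worker


def timed_launch(duration, fence):
    """Time a launch as seen from the host, with or without waiting for completion."""
    start = time.perf_counter()
    worker = launch_async(duration)
    if fence:
        worker.join()  # the 'fence': block until the background work is done
    elapsed = time.perf_counter() - start
    worker.join()  # always clean up, but outside the timed region
    return elapsed


unfenced = timed_launch(0.05, fence=False)
fenced = timed_launch(0.05, fence=True)
print(f"unfenced: {unfenced:.4f} s, fenced: {fenced:.4f} s")
```

Without the join, the timed region covers only the cost of starting the work; with it, the timed region also covers the work itself.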

If the Kokkos runtime is configured to target GPU devices, the --apex:hip and related options will capture the device activity as well as the HIP API calls.
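
When runs with different flag combinations need to be scripted, the command line can be assembled programmatically. The following is a minimal sketch that uses only flags appearing on this page; the helper name `build_apex_command` and its keyword arguments are hypothetical, and the sketch only builds the argument list without running anything.

```python
import shlex


def build_apex_command(program, kokkos=True, fence=False, hip=False, tasktree=False):
    """Assemble an apex_exec argument list from the flags shown on this page."""
    args = ["apex_exec"]
    if kokkos:
        args.append("--apex:kokkos")
    if fence:
        args.append("--apex:kokkos_fence")
    if hip:
        args.append("--apex:hip")
    if tasktree:
        args.append("--apex:tasktree")
    args.append(program)
    return args


cmd = build_apex_command("./build/bin/kokkos_lulesh_2.0", hip=True, tasktree=True)
print(shlex.join(cmd))
```

On a machine where apex_exec is on the PATH, the resulting list can be passed to `subprocess.run(cmd, check=True)`.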

[khuck@gilgamesh apex-tutorial]$ apex_exec --apex:kokkos --apex:hip --apex:tasktree ./build/bin/kokkos_lulesh_2.0
  ___  ______ _______   __
 / _ \ | ___ \  ___\ \ / /
/ /_\ \| |_/ / |__  \ V /
|  _  ||  __/|  __| /   \
| | | || |   | |___/ /^\ \
\_| |_/\_|   \____/\/   \/
APEX Version: v2.6.1-da0e52e-develop
Built on: 17:54:27 Feb 25 2023 (RelWithDebInfo)
C++ Language Standard version : 201402
Clang Compiler version : AMD Clang 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Warning: environment variable 'KOKKOS_PROFILE_LIBRARY' is deprecated. Use 'KOKKOS_TOOLS_LIBS' instead. Raised by Kokkos::initialize().
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Reading cache of Kokkos tuning results from: './apex_converged_tuning.yaml'
Running problem size 30^3 per domain until completion
Num processors: 1
Total number of elements: 27000

To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options

BuildList
Run completed:
   Problem size        =  30
   MPI tasks           =  1
   Iteration count     =  35
   Final Origin Energy = 1.169999e+07
   Testing Plane 0 of Energy Array on rank 0:
        MaxAbsDiff   = 0.000000e+00
        TotalAbsDiff = 0.000000e+00
        MaxRelDiff   = 0.000000e+00


Elapsed time         =       0.31 (s)
Grind time (us/z/c)  = 0.32937249 (per dom)  (0.32937249 overall)
FOM                  =  3036.0763 (z/s)


Start Date/Time: 26/02/2023 14:19:30
Elapsed time: 0.573162 seconds
Total processes detected: 1
HW Threads detected on rank 0: 96
Worker Threads observed on rank 0: 2
Available CPU time on rank 0: 1.14632 seconds
Available CPU time on all ranks: 1.14632 seconds

Counter                                              :  #samp |   mean  |  max
--------------------------------------------------------------------------------
                     GPU: Bytes Allocated: hipMalloc :    539 2.63e+05 1.56e+07
                           GPU: Bytes Freed: hipFree :    533 2.63e+05 1.56e+07
                         GPU: CopyDeviceToHost Bytes :    957     0.00     0.00
                         GPU: CopyHostToDevice Bytes :   5649     0.00     0.00
                               GPU: FillBuffer Bytes :    546     0.00     0.00
            GPU: Host Bytes Allocated: hipHostMalloc :      3 5.27e+05 1.02e+06
                 GPU: Total Bytes Occupied on Device :   1072 2.49e+07 3.23e+07
                   GPU: Total Bytes Occupied on Host :      3 8.87e+05 1.58e+06
                             Memory: Bytes Allocated :      3 5.27e+05 1.02e+06
                        Memory: Total Bytes Occupied :      3 4.66e+06 1.11e+07
                                      status:Threads :      1     2.00     2.00
                                    status:VmData kB :      1 4.16e+05 4.16e+05
                                     status:VmExe kB :      1  1540.00  1540.00
                                     status:VmHWM kB :      1 4.18e+04 4.18e+04
                                     status:VmLck kB :      1     0.00     0.00
                                     status:VmLib kB :      1 1.36e+05 1.36e+05
                                     status:VmPTE kB :      1   360.00   360.00
                                    status:VmPeak kB :      1 7.51e+05 7.51e+05
                                     status:VmPin kB :      1     0.00     0.00
                                     status:VmRSS kB :      1 4.18e+04 4.18e+04
                                    status:VmSize kB :      1 6.86e+05 6.86e+05
                                     status:VmStk kB :      1   136.00   136.00
                                    status:VmSwap kB :      1     0.00     0.00
                   status:nonvoluntary_ctxt_switches :      1     2.00     2.00
                      status:voluntary_ctxt_switches :      1    31.00    31.00
--------------------------------------------------------------------------------

GPU Timers                                           : #calls|   mean |  total
--------------------------------------------------------------------------------
                               GPU: CopyDeviceToHost :    957     0.00     0.03
                               GPU: CopyHostToDevice :   5649     0.00     0.03
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :   1225     0.00     0.01
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :   1225     0.00     0.01
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :    385     0.00     0.01
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :    385     0.00     0.01
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :    385     0.00     0.01
                                     GPU: FillBuffer :    546     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :    385     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :    385     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :     35     0.00     0.00
GPU: desul::(anonymous namespace)::init_lock_arrays… :      1     0.00     0.00
GPU: Kokkos::(anonymous namespace)::init_lock_array… :      1     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :      1     0.00     0.00
GPU: void Kokkos::Experimental::Impl::hip_parallel_… :      1     0.00     0.00
--------------------------------------------------------------------------------

CPU Timers                                           : #calls|   mean |   total
--------------------------------------------------------------------------------
                                           APEX MAIN :      1     0.57     0.57
        int apex_preload_main(int, char **, char **) :      1     0.52     0.52
                                      hipMemcpyAsync :   6595     0.00     0.26
                                 hipEventSynchronize :   5075     0.00     0.07
Kokkos::parallel_for [HIP, Dev:0] CalcEnergyForElems :   1225     0.00     0.06
 Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems A :   1225     0.00     0.05
                                     hipLaunchKernel :   5184     0.00     0.04
                              hipMemcpyToSymbolAsync :   5075     0.00     0.04
Kokkos::parallel_reduce [HIP, Dev:0] ZL29CalcCouran… :    385     0.00     0.03
Kokkos::parallel_reduce [HIP, Dev:0] ZL27CalcHydroC… :    385     0.00     0.03
Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQReg… :    385     0.00     0.02
                                      hipEventRecord :   5075     0.00     0.02
 Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems F :    385     0.00     0.02
Kokkos::parallel_for [HIP, Dev:0] CalcSoundSpeedFor… :    385     0.00     0.02
                                hipDeviceSynchronize :    493     0.00     0.01
                                hipStreamSynchronize :   1310     0.00     0.00
                                             hipFree :    533     0.00     0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL28CalcHourgl… :     35     0.00     0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL20CalcLagran… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassFo… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] IntegrateStressFo… :     35     0.00     0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL31ApplyMater… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-… :    351     0.00     0.00
                                hipFuncGetAttributes :     34     0.00     0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL23CalcVolume… :     35     0.00     0.00
Kokkos::parallel_reduce [HIP, Dev:0] ZL13CalcQForEl… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :    352     0.00     0.00
                                           hipMalloc :    539     0.00     0.00
                                      hipMemsetAsync :    533     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcPositionForNo… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyAcceleration… :     35     0.00     0.00
              Kokkos deep copy: Host _mirror -> HIP  :     23     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] IntegrateStressFo… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyAcceleration… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcVelocityForNo… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyAcceleration… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcKinematicsFor… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] InitStressTermsFo… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQGra… :     35     0.00     0.00
 Kokkos::parallel_for [HIP, Dev:0] CalcForceForNodes :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] UpdateVolumesForE… :     35     0.00     0.00
                                   hipMemcpyToSymbol :     11     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassFo… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialProp… :     35     0.00     0.00
              Kokkos deep copy: HIP  -> Host _mirror :     12     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] CalcAccelerationF… :     35     0.00     0.00
               Kokkos deep copy: Host Scalar -> HIP  :     13     0.00     0.00
                                       hipHostMalloc :      3     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewFill-… :      2     0.00     0.00
                                           hipMemset :     13     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialProp… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialProp… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :     35     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :     35     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :     29     0.00     0.00
Kokkos deep copy: Host nodeElemCornerList_mirror ->… :      1     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :      1     0.00     0.00
Kokkos deep copy: Host regElemlist::entries_mirror … :      1     0.00     0.00
Kokkos deep copy: Host m_nodeElemStart_mirror -> HI… :      1     0.00     0.00
Kokkos deep copy: Host regElemlist::row_map_mirror … :      1     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :      2     0.00     0.00
                                   hipGetDeviceCount :      1     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :      1     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :      1     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :      1     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::ini… :      1     0.00     0.00
                              hipGetDeviceProperties :      2     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :      1     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :      1     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :      1     0.00     0.00
                                      hipEventCreate :      1     0.00     0.00
                                        hipSetDevice :      1     0.00     0.00
Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-… :      1     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :      1     0.00     0.00
Kokkos::parallel_for [OpenMP] Kokkos::View::initial… :      1     0.00     0.00
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
                                        Total timers : 48969
Writing: .//apex_tasktree.csv
[khuck@gilgamesh apex-tutorial]$ apex-treesummary.py --ascii --dot
Reading tasktree...
Read 370 rows
Found 0 ranks, with max graph node index of 369 and depth of 5
building common tree...
Rank 0 ...
1-> 0.573 - 100.000% [1] {min=0.573, max=0.573, mean=0.573, threads=1} APEX MAIN
1 |-> 0.524 - 91.403% [1] {min=0.524, max=0.524, mean=0.524, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.200 - 34.955% [536] {min=0.200, max=0.200, mean=0.000, threads=1} hipMemcpyAsync
1 | | |-> 0.001 - 0.159% [536] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | |-> 0.056 - 9.818% [1225] {min=0.056, max=0.056, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcEnergyForElems
1 | | |-> 0.021 - 3.699% [1225] {min=0.021, max=0.021, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.009 - 1.522% [1225] {min=0.009, max=0.009, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.005 - 0.824% [1225] {min=0.005, max=0.005, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.006 - 1.130% [1225] {min=0.006, max=0.006, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.008 - 1.480% [1225] {min=0.008, max=0.008, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.013 - 2.204% [1225] {min=0.013, max=0.013, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcEnergyForElems(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double, double, double, double, double, double*, double*, double, double, int, Domain&, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.004 - 0.744% [1225] {min=0.004, max=0.004, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.053 - 9.208% [1225] {min=0.053, max=0.053, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems A
1 | | |-> 0.018 - 3.113% [1225] {min=0.018, max=0.018, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.009 - 1.531% [1225] {min=0.009, max=0.009, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.005 - 0.802% [1225] {min=0.005, max=0.005, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.006 - 1.133% [1225] {min=0.006, max=0.006, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.008 - 1.464% [1225] {min=0.008, max=0.008, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.014 - 2.378% [1225] {min=0.014, max=0.014, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<EvalEOSForElems(Domain&, double*, int, int, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.004 - 0.739% [1225] {min=0.004, max=0.004, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.030 - 5.185% [385] {min=0.030, max=0.030, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL29CalcCourantConstraintForElemsR6DomainiidRdEUliR9MinFinderE_
1 | | |-> 0.016 - 2.877% [385] {min=0.016, max=0.016, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.012 - 2.154% [385] {min=0.012, max=0.012, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.003 - 0.483% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.009 - 1.501% [385] {min=0.009, max=0.009, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelReduce<CalcCourantConstraintForElems(Domain&, int, int, double, double&)::{lambda(int, MinFinder&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> >()
1 | | |-> 0.002 - 0.413% [385] {min=0.002, max=0.002, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.001 - 0.224% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.002 - 0.358% [385] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.001 - 0.257% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.001 - 0.218% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.029 - 5.054% [385] {min=0.029, max=0.029, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL27CalcHydroConstraintForElemsR6DomainiidRdEUliR9MinFinderE_
1 | | |-> 0.017 - 2.973% [385] {min=0.017, max=0.017, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.013 - 2.258% [385] {min=0.013, max=0.013, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.003 - 0.480% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.009 - 1.590% [385] {min=0.009, max=0.009, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelReduce<CalcHydroConstraintForElems(Domain&, int, int, double, double&)::{lambda(int, MinFinder&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> >()
1 | | |-> 0.002 - 0.384% [385] {min=0.002, max=0.002, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.001 - 0.199% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.002 - 0.358% [385] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.001 - 0.213% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.001 - 0.107% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.022 - 3.806% [385] {min=0.022, max=0.022, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQRegionForElems
1 | | |-> 0.009 - 1.620% [385] {min=0.009, max=0.009, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.004 - 0.770% [385] {min=0.004, max=0.004, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.006 - 1.070% [385] {min=0.006, max=0.006, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcMonotonicQRegionForElems(Domain&, int, double)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.003 - 0.464% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.001 - 0.242% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.002 - 0.356% [385] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.001 - 0.232% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.017 - 2.998% [385] {min=0.017, max=0.017, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems F
1 | | |-> 0.006 - 1.067% [385] {min=0.006, max=0.006, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.003 - 0.488% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.001 - 0.255% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.002 - 0.355% [385] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.003 - 0.460% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.003 - 0.474% [385] {min=0.003, max=0.003, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<EvalEOSForElems(Domain&, double*, int, int, int)::{lambda(int)#2}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.001 - 0.225% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.016 - 2.798% [385] {min=0.016, max=0.016, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcSoundSpeedForElems
1 | | |-> 0.005 - 0.881% [385] {min=0.005, max=0.005, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.003 - 0.465% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.001 - 0.254% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.002 - 0.357% [385] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.003 - 0.462% [385] {min=0.003, max=0.003, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.003 - 0.485% [385] {min=0.003, max=0.003, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcSoundSpeedForElems(Domain&, double*, double, double*, double*, double*, double*, double, int, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.001 - 0.234% [385] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.006 - 0.964% [389] {min=0.006, max=0.006, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.004 - 0.775% [1310] {min=0.004, max=0.004, mean=0.000, threads=1} hipStreamSynchronize
1 | |-> 0.004 - 0.767% [533] {min=0.004, max=0.004, mean=0.000, threads=1} hipFree
1 | |-> 0.004 - 0.615% [35] {min=0.004, max=0.004, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL28CalcHourglassControlForElemsR6DomainPddEUliRiE_
1 | | |-> 0.002 - 0.429% [35] {min=0.002, max=0.002, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.002 - 0.364% [35] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.000 - 0.045% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.002 - 0.265% [35] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelReduce<CalcHourglassControlForElems(Domain&, double*, double)::{lambda(int, int&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.036% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.019% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.007% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.003 - 0.554% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL20CalcLagrangeElementsR6DomainEUliRiE_
1 | | |-> 0.001 - 0.221% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.001 - 0.155% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.001 - 0.136% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.048% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.028% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.041% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.094% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelReduce<CalcLagrangeElements(Domain&)::{lambda(int, int&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.003 - 0.493% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassForceForElems B
1 | | |-> 0.002 - 0.336% [35] {min=0.002, max=0.002, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.040% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.117% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, double, int, int)::{lambda(Kokkos::Impl::HIPTeamMember const&)#1}, Kokkos::TeamPolicy<>, Kokkos::Experimental::HIP> >()
1 | | |-> 0.000 - 0.036% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.019% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.017% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.003 - 0.489% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] IntegrateStressForElems B
1 | | |-> 0.002 - 0.313% [35] {min=0.002, max=0.002, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.046% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.025% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.039% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.110% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, double*, double*, double*, int, int)::{lambda(Kokkos::Impl::HIPTeamMember const&)#1}, Kokkos::TeamPolicy<>, Kokkos::Experimental::HIP> >()
1 | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.003 - 0.459% [35] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL31ApplyMaterialPropertiesForElemsR6DomainEUliRiE_
1 | | |-> 0.001 - 0.200% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.001 - 0.134% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.001 - 0.101% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.083% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.001 - 0.099% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.050% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.078% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelReduce<ApplyMaterialPropertiesForElems(Domain&)::{lambda(int, int&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.008% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.003 - 0.452% [351] {min=0.003, max=0.003, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-1D
1 | | |-> 0.002 - 0.433% [2] {min=0.002, max=0.002, mean=0.001, threads=1} hipFuncGetAttributes
1 | |-> 0.002 - 0.415% [35] {min=0.002, max=0.002, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL23CalcVolumeForceForElemsR6DomainEUliRiE_
1 | | |-> 0.002 - 0.347% [35] {min=0.002, max=0.002, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.019% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.000 - 0.025% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.110% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<CalcVolumeForceForElems(Domain&)::{lambda(int, int&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy>, 1024u, 1u>(Kokkos::Impl::ParallelReduce<CalcVolumeForceForElems(Domain&)::{lambda(int, int&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> const*)
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.002 - 0.399% [35] {min=0.002, max=0.002, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL13CalcQForElemsR6DomainEUlRKiRiE_
1 | | |-> 0.001 - 0.210% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.001 - 0.143% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | | |-> 0.000 - 0.043% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.082% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelReduce<CalcQForElems(Domain&)::{lambda(int const&, int&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::InvalidType, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.040% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.023% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.023% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.004% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.002 - 0.328% [352] {min=0.002, max=0.002, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [] via memset
1 | | |-> 0.001 - 0.178% [352] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.003 - 0.453% [352] {min=0.003, max=0.003, mean=0.000, threads=4} GPU: FillBuffer
1 | |-> 0.002 - 0.282% [539] {min=0.002, max=0.002, mean=0.000, threads=1} hipMalloc
1 | |-> 0.001 - 0.261% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcPositionForNodes
1 | | |-> 0.001 - 0.087% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.045% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.047% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcPositionForNodes(Domain&, double, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.254% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyAccelerationBoundaryConditionsForNodes A
1 | | |-> 0.001 - 0.088% [35] {min=0.001, max=0.001, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.043% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.034% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<ApplyAccelerationBoundaryConditionsForNodes(Domain&)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.018% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.252% [23] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: Host _mirror -> HIP
1 | | |-> 0.001 - 0.217% [23] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.053% [23] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.007% [46] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.001 - 0.240% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] IntegrateStressForElems A
1 | | |-> 0.000 - 0.063% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.050% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.029% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.002 - 0.268% [35] {min=0.002, max=0.002, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<IntegrateStressForElems(Domain&, double*, double*, double*, double*, int, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.234% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyAccelerationBoundaryConditionsForNodes B
1 | | |-> 0.000 - 0.067% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.041% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.029% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<ApplyAccelerationBoundaryConditionsForNodes(Domain&)::{lambda(int)#2}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.018% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.232% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcVelocityForNodes
1 | | |-> 0.000 - 0.058% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.043% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.024% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.053% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcVelocityForNodes(Domain&, double, double, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.231% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyAccelerationBoundaryConditionsForNodes C
1 | | |-> 0.000 - 0.063% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.028% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<ApplyAccelerationBoundaryConditionsForNodes(Domain&)::{lambda(int)#3}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.041% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.200% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcKinematicsForElems
1 | | |-> 0.000 - 0.048% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.109% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcKinematicsForElems(Domain&, double, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.044% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.024% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.006% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.180% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] InitStressTermsForElems
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.040% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<InitStressTermsForElems(Domain&, double*, double*, double*, int)::{lambda(int const&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.038% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.026% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.005% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.179% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQGradientsForElems
1 | | |-> 0.000 - 0.044% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.117% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcMonotonicQGradientsForElems(Domain&)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.017% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.026% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.006% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.178% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcForceForNodes
1 | | |-> 0.000 - 0.042% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.029% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcForceForNodes(Domain&)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.034% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.018% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.010% [3] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbol
1 | | | |-> 0.000 - 0.001% [3] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.005% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.177% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] UpdateVolumesForElems
1 | | |-> 0.000 - 0.041% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.037% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<UpdateVolumesForElems(Domain&, double, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.041% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.024% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.024% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.005% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.170% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcFBHourglassForceForElems A
1 | | |-> 0.000 - 0.049% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.001 - 0.232% [35] {min=0.001, max=0.001, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcFBHourglassForceForElems(Domain&, double*, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::MemoryTraits<1u> >, double, int, int)::{lambda(int const&)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.034% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.018% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.006% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.167% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialPropertiesForElems A
1 | | |-> 0.000 - 0.044% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.034% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<ApplyMaterialPropertiesForElems(Domain&)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.018% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.033% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.005% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.162% [12] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: HIP  -> Host _mirror
1 | | |-> 0.001 - 0.138% [12] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.031% [12] {min=0.000, max=0.000, mean=0.000, threads=2} GPU: CopyDeviceToHost
1 | | |-> 0.000 - 0.004% [24] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.001 - 0.162% [35] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcAccelerationForNodes
1 | | |-> 0.000 - 0.039% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.051% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_constant_memory<Kokkos::Impl::ParallelFor<CalcAccelerationForNodes(Domain&, int)::{lambda(int)#1}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> >()
1 | | |-> 0.000 - 0.035% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbolAsync
1 | | | |-> 0.000 - 0.020% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | | |-> 0.000 - 0.032% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.018% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventRecord
1 | | |-> 0.000 - 0.006% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventSynchronize
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.001 - 0.152% [5] {min=0.001, max=0.001, mean=0.000, threads=1} hipMemcpyToSymbol
1 | | |-> 0.000 - 0.002% [5] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | |-> 0.001 - 0.134% [13] {min=0.001, max=0.001, mean=0.000, threads=1} Kokkos deep copy: Host Scalar -> HIP
1 | | |-> 0.000 - 0.065% [2] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewFill-1D
1 | | | |-> 0.000 - 0.041% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipHostMalloc
1 | | | |-> 0.000 - 0.010% [3] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyToSymbol
1 | | | | |-> 0.000 - 0.001% [3] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | | |-> 0.000 - 0.006% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Experimental::HIP, Kokkos::Experimental::HIPSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::LayoutRight, Kokkos::Experimental::HIP, 1, int>, Kokkos::RangePolicy<Kokkos::Experimental::HIP, Kokkos::IndexType<int> >, Kokkos::Experimental::HIP>, 1024u, 1u>(Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Experimental::HIP, Kokkos::Experimental::HIPSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::LayoutRight, Kokkos::Experimental::HIP, 1, int>, Kokkos::RangePolicy<Kokkos::Experimental::HIP, Kokkos::IndexType<int> >, Kokkos::Experimental::HIP> const*)
1 | | | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Experimental::HIP, Kokkos::AnonymousSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::LayoutRight, Kokkos::Experimental::HIP, 1, int>, Kokkos::RangePolicy<Kokkos::Experimental::HIP, Kokkos::IndexType<int> >, Kokkos::Experimental::HIP>, 1024u, 1u>(Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewFill<Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Experimental::HIP, Kokkos::AnonymousSpace>, Kokkos::MemoryTraits<0u> >, Kokkos::LayoutRight, Kokkos::Experimental::HIP, 1, int>, Kokkos::RangePolicy<Kokkos::Experimental::HIP, Kokkos::IndexType<int> >, Kokkos::Experimental::HIP> const*)
1 | | | |-> 0.000 - 0.001% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | | |-> 0.000 - 0.047% [11] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemset
1 | | | |-> 0.000 - 0.015% [11] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | | |-> 0.000 - 0.009% [26] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.000 - 0.052% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialPropertiesForElems B
1 | | |-> 0.000 - 0.027% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.039% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<ApplyMaterialPropertiesForElems(Domain&)::{lambda(int)#2}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy>, 1024u, 1u>(Kokkos::Impl::ParallelFor<ApplyMaterialPropertiesForElems(Domain&)::{lambda(int)#2}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> const*)
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.000 - 0.047% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] ApplyMaterialPropertiesForElems C
1 | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | | |-> 0.000 - 0.026% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: void Kokkos::Experimental::Impl::hip_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<ApplyMaterialPropertiesForElems(Domain&)::{lambda(int)#3}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy>, 1024u, 1u>(Kokkos::Impl::ParallelFor<ApplyMaterialPropertiesForElems(Domain&)::{lambda(int)#3}, Kokkos::RangePolicy<Kokkos::Experimental::HIP>, Kokkos::RangePolicy> const*)
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipFuncGetAttributes
1 | |-> 0.000 - 0.037% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [vnewc] via memset
1 | | |-> 0.000 - 0.022% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.045% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: FillBuffer
1 | |-> 0.000 - 0.036% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [sigxx] via memset
1 | | |-> 0.000 - 0.021% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.046% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: FillBuffer
1 | |-> 0.000 - 0.031% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [sigyy] via memset
1 | | |-> 0.000 - 0.016% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.045% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: FillBuffer
1 | |-> 0.000 - 0.031% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [sigzz] via memset
1 | | |-> 0.000 - 0.016% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.046% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: FillBuffer
1 | |-> 0.000 - 0.030% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipHostMalloc
1 | |-> 0.000 - 0.030% [35] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [determ] via memset
1 | | |-> 0.000 - 0.014% [35] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.045% [35] {min=0.000, max=0.000, mean=0.000, threads=4} GPU: FillBuffer
1 | |-> 0.000 - 0.026% [29] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [_mirror] via memset
1 | |-> 0.000 - 0.018% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host nodeElemCornerList_mirror -> HIP nodeElemCornerList
1 | | |-> 0.000 - 0.016% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.009% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.000% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.000 - 0.012% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemset
1 | | |-> 0.000 - 0.002% [2] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | |-> 0.000 - 0.011% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [nodeElemCornerList_mirror] via memset
1 | |-> 0.000 - 0.008% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipLaunchKernel
1 | | |-> 0.000 - 0.015% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: desul::(anonymous namespace)::init_lock_arrays_hip_kernel()
1 | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: Kokkos::(anonymous namespace)::init_lock_array_kernel_atomic()
1 | |-> 0.000 - 0.005% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host regElemlist::entries_mirror -> HIP regElemlist::entries
1 | | |-> 0.000 - 0.004% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.000% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.000 - 0.005% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host m_nodeElemStart_mirror -> HIP m_nodeElemStart
1 | | |-> 0.000 - 0.004% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.000% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.000 - 0.003% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos deep copy: Host regElemlist::row_map_mirror -> HIP regElemlist::row_map
1 | | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: CopyHostToDevice
1 | | |-> 0.000 - 0.000% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipDeviceSynchronize
1 | |-> 0.000 - 0.002% [2] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [Buffer] via memset
1 | | |-> 0.000 - 0.001% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.005% [2] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipGetDeviceCount
1 | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [m_nodeElemStart] via memset
1 | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [regElemlist::row_map] via memset
1 | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [regElemlist::entries] via memset
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::View::initialization [nodeElemCornerList] via memset
1 | | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipMemsetAsync
1 | | | |-> 0.000 - 0.001% [1] {min=0.000, max=0.000, mean=0.000, threads=1} GPU: FillBuffer
1 | |-> 0.000 - 0.001% [2] {min=0.000, max=0.000, mean=0.000, threads=1} hipGetDeviceProperties
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [m_nodeElemStart_mirror] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [nodeElemCount] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [regElemlist::entries_mirror] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipEventCreate
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipSetDevice
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] Kokkos::ViewCopy-2D
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [regElemlist::row_map_mirror] via memset
1 | |-> 0.000 - 0.000% [1] {min=0.000, max=0.000, mean=0.000, threads=1} Kokkos::parallel_for [OpenMP] Kokkos::View::initialization [regBinEnd] via memset
371 total graph nodes

Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.
[khuck@gilgamesh apex-tutorial]$ dot -Tsvg -O tasktree.dot

DOT task graph of Kokkos and HIP example

The full graph contains many nodes that account for almost no time, so we can prune it with apex-treesummary.py options. Here, --tlimit 0.01 drops every tree node with less than 0.01 accumulated time:

[khuck@gilgamesh apex-tutorial]$ apex-treesummary.py --ascii --dot --tlimit 0.01
Reading tasktree...
Read 370 rows
Found 0 ranks, with max graph node index of 369 and depth of 5
Ignoring any tree nodes with less than 0.01 accumulated time...
Kept 18 rows
building common tree...
Rank 0 ...
1-> 0.573 - 100.000% [1] {min=0.573, max=0.573, mean=0.573, threads=1} APEX MAIN
1 |-> 0.524 - 91.403% [1] {min=0.524, max=0.524, mean=0.524, threads=1} int apex_preload_main(int, char **, char **)
1 | |-> 0.200 - 34.955% [536] {min=0.200, max=0.200, mean=0.000, threads=1} hipMemcpyAsync
1 | |-> 0.056 - 9.818% [1225] {min=0.056, max=0.056, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcEnergyForElems
1 | | |-> 0.021 - 3.699% [1225] {min=0.021, max=0.021, mean=0.000, threads=1} hipEventSynchronize
1 | |-> 0.053 - 9.208% [1225] {min=0.053, max=0.053, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems A
1 | | |-> 0.018 - 3.113% [1225] {min=0.018, max=0.018, mean=0.000, threads=1} hipEventSynchronize
1 | |-> 0.030 - 5.185% [385] {min=0.030, max=0.030, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL29CalcCourantConstraintForElemsR6DomainiidRdEUliR9MinFinderE_
1 | | |-> 0.016 - 2.877% [385] {min=0.016, max=0.016, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.012 - 2.154% [385] {min=0.012, max=0.012, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | |-> 0.029 - 5.054% [385] {min=0.029, max=0.029, mean=0.000, threads=1} Kokkos::parallel_reduce [HIP, Dev:0] ZL27CalcHydroConstraintForElemsR6DomainiidRdEUliR9MinFinderE_
1 | | |-> 0.017 - 2.973% [385] {min=0.017, max=0.017, mean=0.000, threads=1} hipMemcpyAsync
1 | | | |-> 0.013 - 2.258% [385] {min=0.013, max=0.013, mean=0.000, threads=4} GPU: CopyDeviceToHost
1 | |-> 0.022 - 3.806% [385] {min=0.022, max=0.022, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcMonotonicQRegionForElems
1 | |-> 0.017 - 2.998% [385] {min=0.017, max=0.017, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] EvalEOSForElems F
1 | |-> 0.016 - 2.798% [385] {min=0.016, max=0.016, mean=0.000, threads=1} Kokkos::parallel_for [HIP, Dev:0] CalcSoundSpeedForElems
17 total graph nodes

Task tree also written to tasktree.txt.
Computing new stats...
Building dot file
done.
[khuck@gilgamesh apex-tutorial]$ dot -Tsvg -O tasktree.dot

DOT task graph of reduced Kokkos HIP example
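
To make the effect of the time limit concrete, here is a minimal Python sketch of threshold-based filtering applied to lines of the ASCII tree shown above. It is not the apex-treesummary.py implementation. The line layout is inferred from the printed output, and the names `LINE_RE`, `parse_line`, and `keep_above` are hypothetical. The bracketed number is assumed to be a call count. The leading number on each line is not explained in this page, so the sketch ignores it. How the real tool treats children of a dropped node is also not covered here; this sketch simply tests each line on its own.

```python
import re

# Assumed layout of one ASCII tree line, inferred from the output above:
#   <prefix>-> <time> - <percent>% [<count>] {min=..., max=..., mean=..., threads=N} <name>
LINE_RE = re.compile(
    r"^(?P<prefix>[\d |]*)-> "
    r"(?P<time>\d+\.\d+) - (?P<percent>\d+\.\d+)% "
    r"\[(?P<count>\d+)\] "
    r"\{[^}]*\} "
    r"(?P<name>.*)$"
)


def parse_line(line):
    """Return a dict for a tree line, or None for any other line."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        # One '|' per nesting level in the printed prefix.
        "depth": m.group("prefix").count("|"),
        "time": float(m.group("time")),
        "percent": float(m.group("percent")),
        "count": int(m.group("count")),  # assumed to be a call count
        "name": m.group("name"),
    }


def keep_above(lines, tlimit):
    """Keep parsed tree lines whose accumulated time is at least tlimit."""
    nodes = (parse_line(line) for line in lines)
    return [n for n in nodes if n is not None and n["time"] >= tlimit]


# Three lines copied from the output above.
sample = [
    "1-> 0.573 - 100.000% [1] {min=0.573, max=0.573, mean=0.573, threads=1} APEX MAIN",
    "1 | |-> 0.200 - 34.955% [536] {min=0.200, max=0.200, mean=0.000, threads=1} hipMemcpyAsync",
    "1 | |-> 0.000 - 0.002% [1] {min=0.000, max=0.000, mean=0.000, threads=1} hipGetDeviceCount",
]

for node in keep_above(sample, 0.01):
    print(node["depth"], node["time"], node["name"])
```

With a limit of 0.01, the `hipGetDeviceCount` line is dropped and the other two are kept, which mirrors how the reduced tree above retains only the nodes that carry measurable time.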