
coreneuron_modtests::test_pointer_py_cpu incompatible with vectorisation (Intel compiler with -O2) #2191

Closed · olupton opened this issue Jan 25, 2023 · 6 comments · Fixed by #2195

olupton (Collaborator) commented Jan 25, 2023

Context

In #2186 (in conjunction with BlueBrain/spack#1814 for the definition of build_type=FastDebug) I tried to enable some more compiler optimisation flags.

This led to test failures (#2186 (comment), specifically in the test:neuron:nmodl:intel:legacy and test:neuron:nmodl:intel:shared tests) in the coreneuron_modtests::test_pointer_py_cpu test.

Overview of the issue

The error message is:

148/189 Test  #91: coreneuron_modtests::test_pointer_py_cpu ...........................***Failed    5.90 sec
NEURON -- VERSION 9.0.dev-1270-g544bbbc9f+ HEAD (544bbbc9f+) 2023-01-24
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2022
See http://neuron.yale.edu/neuron/credits

Additional mechanisms from files
 "./axial.mod" "./axial_pp.mod" "./bacur.mod" "./banocur.mod" "./fornetcon.mod" "./invlfire.mod" "./natrans.mod" "./netmove.mod" "./sample.mod" "./unitstest.mod" "./version_macros.mod" "./watchrange.mod"
numprocs=1
Traceback (most recent call last):
  File "test/coreneuron/test_pointer.py", line 320, in <module>
    test_axial()
  File "test/coreneuron/test_pointer.py", line 212, in test_axial
    chk(std, run(tstop))
  File "test/coreneuron/test_pointer.py", line 200, in run
    return result(m)
  File "test/coreneuron/test_pointer.py", line 184, in result
    assert x / mx < 1e-9
AssertionError

which comes from the test using coreneuron with cell_permute = 0:

coreneuron.enable = True
coreneuron.cell_permute = 0
chk(std, run(tstop))
coreneuron.cell_permute = 1
chk(std, run(tstop))

(This is a bit fragile: for example, reducing the number of cells from 5 to 1 in the first argument of

m = Model(5, 5)

causes cell_permute = 0 to pass, but cell_permute = 1 still fails.)

The issue comes from the BEFORE STEP block:

BEFORE STEP {
    if (ri > 0) {
        pim = pim - ia : child contributions
    }
}

which NMODL translates into a loop with

#pragma ivdep
#pragma omp simd

annotations.
If I understand correctly, pim is a POINTER variable that refers to the RANGE variable im on a different instance of a mechanism derived from axial.inc, and multiple instances of the mechanism may have pim values referring to the same other instance. This means that multiple iterations of the loop update the same value, which only gives correct results if the loop is executed serially.

Removing #pragma omp simd from this loop is sufficient to get the correct result.
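
To make the hazard concrete, here is a minimal, runnable C++ sketch of the access pattern (the target/pim/ia arrays and the indexing are purely illustrative, not the NMODL-generated code): several iterations perform a read-modify-write on the same location, so letting the compiler vectorise the loop can lose contributions.

    #include <cstdio>
    #include <vector>

    int main() {
        // Illustrative data: all three "children" point at parent 0.
        std::vector<int> target{0, 0, 0};
        std::vector<double> pim{10.0, 0.0, 0.0};
        std::vector<double> ia{1.0, 2.0, 3.0};

        // Conceptually what the generated BEFORE STEP loop does.  Marking this
        // loop "#pragma omp simd" is unsafe: every iteration updates pim[0], so
        // vector lanes can read the same old value and overwrite each other.
        for (std::size_t i = 0; i < ia.size(); ++i) {
            pim[target[i]] = pim[target[i]] - ia[i];  // scatter with conflicts
        }

        std::printf("pim[0] = %g\n", pim[0]);  // 4 when executed serially
    }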

@nrnhines: what is your perspective here? Should this test pass as-is?

Expected result/behavior

Tests should pass when reasonable compiler optimisations are enabled, so this needs to be fixed somehow.

One opinion is that the mechanisms/test are buggy: they declare that they are THREADSAFE and then assume non-existent atomic magic when updating pim.

The documentation is not entirely clear on whether these mechanisms are breaking the THREADSAFE contract.

The targets of POINTER variables are generally only set at runtime, so in the general case we must assume that all instances of a mechanism may have POINTER variables pointing to the same places, which would imply extra care is needed around atomicity etc...

NEURON setup

  • Version: master
  • Installation method: CMake build with Intel Classic compiler
  • OS + Version: BB5
  • Compiler + Version: Intel Classic compiler (icpc 2021.4.0, see below)

Minimal working example - MWE

flags="-g -fno-omit-frame-pointer -O2 -nolib-inline -fno-builtin -fp-model consistent -no-ftz -qsimd-honor-fp-model -qsimd-serialize-fp-reduction -march=skylake -mtune=skylake -diag-disable=10121"
cmake -G Ninja .. \
  -DCMAKE_BUILD_TYPE=Custom \
  -DCMAKE_C_COMPILER=icc \
  -DCMAKE_CXX_COMPILER=icpc \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
  -DCMAKE_C_FLAGS="${flags}" \
  -DCMAKE_CXX_FLAGS="${flags}" \
  -DNRN_ENABLE_CORENEURON=ON \
  -DNRN_ENABLE_TESTS=ON \
  -DCORENRN_ENABLE_NMODL=ON \
  -DCORENRN_ENABLE_DEBUG_CODE=OFF \
  -DNRN_ENABLE_INTERVIEWS=OFF \
  -DNRN_ENABLE_RX3D=OFF

were the options I used, with:

$ icpc --version
icpc (ICC) 2021.4.0 20210910
Copyright (C) 1985-2021 Intel Corporation.  All rights reserved.

followed by

$ ctest -R test_pointer -V
olupton added the bug, question, testing, coreneuron, parallel (Parallelisation related issues including MPI), nmodl and intel labels Jan 25, 2023
olupton changed the title from "coreneuron_modtests::test_pointer_py_cpu incompatible with autovectorisation (Intel compiler with -O2)" to "coreneuron_modtests::test_pointer_py_cpu incompatible with vectorisation (Intel compiler with -O2)" Jan 25, 2023
nrnhines (Member) commented:

not super clear on whether these mechanisms are breaking the THREADSAFE contract.

They are breaking it. (But the documentation should be made clearer about that.)
By adding the THREADSAFE keyword (and it is best that it be added before any variable declarations in the NEURON block), the author is asserting that the mod file is thread-safe. (As it is in terms of POINTER and VERBATIM; and GLOBAL variables that are written to are thread-safe if promoted to thread instance variables.) This was pretty straightforward when pthreads were the only intra-process parallelization, but things have gotten more ambiguous with OpenMP and GPU parallelization.

assume non-existent atomic magic when updating pim

I regret not adding a comment to this effect for those pointer update statements that increment from multiple other value locations. I remember looking for pragmas that would hint that the pointer update needed to be atomic. And I also remember thinking that the original pthread strategy of writing

LOCK
  pim = pim - ia : child contributions
UNLOCK

was overkill since, in principle, there should be a separate lock for each distinct pim, as the problem was multiple ia contributing to a single pim. It would be great if a hint like

ATOMIC pim = pim - ia

would be sufficient to have the translator output what is needed to get the compiler to do the right thing. Although, I suppose the hint is already there in the form of the pointer increment statement.

pramodk (Member) commented Jan 25, 2023

There are a few possible scenarios here:

  • presence of THREADSAFE
  • presence of POINTER
  • presence of BBCOREPOINTER

Until now, THREADSAFE was treated as a guarantee of parallelisability not only for pthreads but also for SIMD as well as GPU execution.

Let's say we don't want to introduce a new keyword like ATOMIC. So with the current language features, one should write this as:

LOCK
  pim = pim - ia : child contributions
UNLOCK

In this case:

  • on CPU: this could be interpreted as some kind of atomic update or atomic execution of a given statement block. We can translate these statements to:

        ...
        #pragma omp simd
        for (...) {
            ...
            #pragma omp critical
            {
                pim = pim - ia : child contributions
            }
        }

Essentially, this will avoid the SIMD parallelisation of the loop due to the critical section. This won't be performant but at least will produce correct code. This could also be an atomic instead of a critical section (see below).

  • on GPU: it is difficult to map everything ("collection of statements" or "a complex statement") as an atomic operation because the expression needs to be of a form like:
x++; 
x--; 
++x; 
--x; 
x binop= expr; 
x = x binop expr; 
x = expr binop x;

So we need to impose restrictions like "there can be only one statement in a LOCK-UNLOCK block and it needs to be one of the forms mentioned above".

Does this look "reasonable" ?
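
For illustration, the statement in question already matches the "x = x binop expr" shape, so a minimal sketch of lowering it to a single OpenMP atomic update could look like this (the names are made up for the example, not actual NMODL output):

    // "pim = pim - ia" has the "x = x binop expr" shape, so OpenMP can express
    // it as one atomic update; pim here is an illustrative pointer to the
    // shared target value, not the real generated data layout.
    inline void update_pim(double* pim, double ia) {
        #pragma omp atomic update
        *pim = *pim - ia;
    }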

pramodk (Member) commented Jan 25, 2023

A bit late, but I just realised there is the PROTECT keyword:

NEURON {
    GLOBAL var
}

BREAKPOINT {
    PROTECT var = var + 1
}

I wonder if PROTECT should be used as "ATOMIC" in our example:

PROTECT pim = pim - ia : child contributions

nrnhines (Member) commented:

I forgot about that. At present it just surrounds the statement using

src/nmodl/parsact.cpp:    replacstr(q1, "/* PROTECT */_NMODLMUTEXLOCK\n");
src/nmodl/parsact.cpp:    q = insertstr(q2->next, "\n _NMODLMUTEXUNLOCK /* end PROTECT */\n");

and has that annoying property of one (pthread mutex?) lock for many non-interacting, independent statements.
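
A rough sketch of the single-shared-mutex pattern described above, using std::mutex purely for illustration (this is not the actual _NMODLMUTEXLOCK/_NMODLMUTEXUNLOCK expansion):

    #include <mutex>

    // One mutex shared by every PROTECT statement in the translated file,
    // so unrelated updates contend on the same lock.
    static std::mutex nmodl_mutex;  // illustrative stand-in for the generated mutex

    void protected_update(double* pim, double ia) {
        std::lock_guard<std::mutex> guard(nmodl_mutex);  // /* PROTECT */ lock
        *pim = *pim - ia;                                 // the protected statement
    }                                                     // unlock, /* end PROTECT */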

pramodk (Member) commented Jan 26, 2023

and has that annoying property of one (pthread mutex?) lock for many non-interacting, independent statements.

I am thinking the following:

  • If there is a MUTEX LOCK - UNLOCK then that block can be executed on CPU (and not on GPU). It's true that it won't be performant, but I guess we can just prioritise correctness here.
  • If there are PROTECT statements then those will generate code like below:
#pragma omp atomic update
statement1 = ...
#pragma omp atomic update
statement2 = ...

These atomic statements are compatible with CPU as well as GPU execution and will be executed "as fast as possible".
So users should preferably use PROTECT statements.
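
Putting this together, a self-contained sketch of the PROTECT-as-atomic idea (illustrative data and indexing, not generated code): the same conflicting updates as in the earlier example, but guarded by an OpenMP atomic so the result stays correct whether the loop runs serially, vectorised or threaded.

    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> target{0, 0, 0};          // all three children share parent 0
        std::vector<double> pim{10.0, 0.0, 0.0};
        std::vector<double> ia{1.0, 2.0, 3.0};

        #pragma omp simd
        for (int i = 0; i < static_cast<int>(ia.size()); ++i) {
            // PROTECT pim = pim - ia  ->  per-location atomic update
            #pragma omp atomic update
            pim[target[i]] = pim[target[i]] - ia[i];
        }

        std::printf("pim[0] = %g\n", pim[0]);  // 4, independent of execution order
    }

Whether the compiler still vectorises the rest of the loop around the atomic is implementation-dependent, but correctness no longer depends on it.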

nrnhines (Member) commented:

That seems very good to me. I read that "The atomic construct ensures that a specific storage location is accessed atomically, ..." and mentally underline "specific storage location" to mean it's ideal for our typical case of thousands of storage locations, each accessed by only a couple of instances of the statement. I.e. for our neuron trees, parents are accessed most often by only 1 or 2 children and extremely rarely by 3-5 children.
I don't know to what extent this is new to "Version 5.0 - November 2018". Is all the functionality now available? If so, then great. And we should update the translators to use the pragma if OpenMP is available, otherwise the mutex.

pramodk added a commit that referenced this issue Jan 26, 2023
* axial.inc used in test_pointer.py is not "really thread-safe"
* update it to use PROTECT based on discussion in #2191
* fixes one of the tests mentioned in #1792

- [ ] merge nmodl PR BlueBrain/nmodl/pull/994 and then update submodule
- [ ] update mod2c support

fixes #2191
pramodk added a commit that referenced this issue Feb 2, 2023
* Update pointer test with PROTECT for SIMD/SIMT execution
  - axial.inc used in test_pointer.py is not "really thread-safe"
  - update it to use PROTECT based on discussion in #2191
  - fixes one of the tests mentioned in #1792
* Update mod2c and NMODL with fixes for PROTECT and MUTEX constructs 
* Update docs

fixes #2191