
coreneuron_modtests::test_pointer_py_cpu incompatible with vectorisation (Intel compiler with -O2) #2191

Closed · olupton opened this issue Jan 25, 2023 · 6 comments · Fixed by #2195

olupton (Collaborator) commented Jan 25, 2023

Context

In #2186 (in conjunction with BlueBrain/spack#1814 for the definition of build_type=FastDebug) I tried to enable some more compiler optimisation flags.

This led to test failures (#2186 (comment), specifically in the test:neuron:nmodl:intel:legacy and test:neuron:nmodl:intel:shared tests) in the coreneuron_modtests::test_pointer_py_cpu test.

Overview of the issue

The error message is:

148/189 Test  #91: coreneuron_modtests::test_pointer_py_cpu ...........................***Failed    5.90 sec
NEURON -- VERSION 9.0.dev-1270-g544bbbc9f+ HEAD (544bbbc9f+) 2023-01-24
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2022
See http://neuron.yale.edu/neuron/credits

Additional mechanisms from files
 "./axial.mod" "./axial_pp.mod" "./bacur.mod" "./banocur.mod" "./fornetcon.mod" "./invlfire.mod" "./natrans.mod" "./netmove.mod" "./sample.mod" "./unitstest.mod" "./version_macros.mod" "./watchrange.mod"
numprocs=1
Traceback (most recent call last):
  File "test/coreneuron/test_pointer.py", line 320, in <module>
    test_axial()
  File "test/coreneuron/test_pointer.py", line 212, in test_axial
    chk(std, run(tstop))
  File "test/coreneuron/test_pointer.py", line 200, in run
    return result(m)
  File "test/coreneuron/test_pointer.py", line 184, in result
    assert x / mx < 1e-9
AssertionError

which comes from the test using coreneuron with cell_permute = 0:

coreneuron.enable = True
coreneuron.cell_permute = 0
chk(std, run(tstop))
coreneuron.cell_permute = 1
chk(std, run(tstop))

(This is a bit fragile: for example, reducing the number of cells from 5 to 1 in the first argument of

m = Model(5, 5)

causes cell_permute = 0 to pass, but cell_permute = 1 still fails.)

The issue comes from the BEFORE STEP block:

BEFORE STEP {
    if (ri > 0) {
        pim = pim - ia : child contributions
    }
}

which NMODL translates into a loop with

#pragma ivdep
#pragma omp simd

annotations.
If I understand correctly, pim is a POINTER variable that refers to the RANGE variable im on a different instance of a mechanism derived from axial.inc, and multiple instances of the mechanism may have pim values referring to the same other instance. This means that multiple iterations of the loop update the same value, which only gives correct results if the loop is executed serially.

Removing #pragma omp simd from this loop is sufficient to get the correct result.
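
To make the hazard concrete, here is a minimal, runnable C++ sketch of the access pattern (the target/pim/ia arrays and the indexing are purely illustrative, not the NMODL-generated code): several iterations perform a read-modify-write on the same location, so letting the compiler vectorise the loop can lose contributions.

    #include <cstdio>
    #include <vector>

    int main() {
        // Illustrative data: all three "children" point at parent 0.
        std::vector<int> target{0, 0, 0};
        std::vector<double> pim{10.0, 0.0, 0.0};
        std::vector<double> ia{1.0, 2.0, 3.0};

        // Conceptually what the generated BEFORE STEP loop does.  Marking this
        // loop "#pragma omp simd" is unsafe: every iteration updates pim[0], so
        // vector lanes can read the same old value and overwrite each other.
        for (std::size_t i = 0; i < ia.size(); ++i) {
            pim[target[i]] = pim[target[i]] - ia[i];  // scatter with conflicts
        }

        std::printf("pim[0] = %g\n", pim[0]);  // 4 when executed serially
    }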

@nrnhines: what is your perspective here? Should this test pass as-is?

Expected result/behavior

Tests should pass when reasonable compiler optimisations are enabled, so this needs to be fixed somehow.

One opinion is that the mechanisms/test are buggy: they declare that they are THREADSAFE and then assume non-existent atomic magic when updating pim.

The documentation is not entirely clear on whether these mechanisms are breaking the THREADSAFE contract.

The targets of POINTER variables are generally only set at runtime, so in the general case we must assume that all instances of a mechanism may have POINTER variables pointing to the same places, which would imply extra care is needed around atomicity etc...

NEURON setup

  • Version: master
  • Installation method: CMake build with Intel Classic compiler
  • OS + Version: BB5
  • Compiler + Version: Intel Classic compiler (icpc 2021.4.0, see below)

Minimal working example - MWE

flags="-g -fno-omit-frame-pointer -O2 -nolib-inline -fno-builtin -fp-model consistent -no-ftz -qsimd-honor-fp-model -qsimd-serialize-fp-reduction -march=skylake -mtune=skylake -diag-disable=10121"
cmake -G Ninja .. \
  -DCMAKE_BUILD_TYPE=Custom \
  -DCMAKE_C_COMPILER=icc \
  -DCMAKE_CXX_COMPILER=icpc \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
  -DCMAKE_C_FLAGS="${flags}" \
  -DCMAKE_CXX_FLAGS="${flags}" \
  -DNRN_ENABLE_CORENEURON=ON \
  -DNRN_ENABLE_TESTS=ON \
  -DCORENRN_ENABLE_NMODL=ON \
  -DCORENRN_ENABLE_DEBUG_CODE=OFF \
  -DNRN_ENABLE_INTERVIEWS=OFF \
  -DNRN_ENABLE_RX3D=OFF

were the options I used, with:

$ icpc --version
icpc (ICC) 2021.4.0 20210910
Copyright (C) 1985-2021 Intel Corporation.  All rights reserved.

followed by

$ ctest -R test_pointer -V
olupton added the bug, question, testing, coreneuron, parallel (Parallelisation related issues including MPI), nmodl and intel labels Jan 25, 2023
olupton changed the title from "coreneuron_modtests::test_pointer_py_cpu incompatible with autovectorisation (Intel compiler with -O2)" to "coreneuron_modtests::test_pointer_py_cpu incompatible with vectorisation (Intel compiler with -O2)" Jan 25, 2023
nrnhines (Member) commented:

not super clear on whether these mechanisms are breaking the THREADSAFE contract.

They are breaking it. (But the documentation should be made clearer about that.)
By adding the THREADSAFE keyword (and it is best that it be added before any variable declarations in the NEURON block), the author is asserting that the mod file is thread-safe. (As it is in terms of POINTER and VERBATIM; and GLOBAL variables that are written to are thread-safe if promoted to thread instance variables.) This was pretty straightforward when pthreads were the only intra-process parallelization, but things have gotten more ambiguous with OpenMP and GPU parallelization.

assume non-existent atomic magic when updating pim

I regret not adding a comment to this effect for those pointer update statements that increment from multiple other value locations. I remember looking for pragmas that would hint that the pointer update needed to be atomic. And I also remember thinking that the original pthread strategy of writing

LOCK
  pim = pim - ia : child contributions
UNLOCK

was overkill since, in principle, there should be a separate lock for each distinct pim, as the problem was multiple ia contributing to a single pim. It would be great if a hint like

ATOMIC pim = pim - ia

would be sufficient to have the translator output what is needed to get the compiler to do the right thing. Although, I suppose the hint is already there in the form of the pointer increment statement.

pramodk (Member) commented Jan 25, 2023

There are a few possible scenarios here:

  • presence of THREADSAFE
  • presence of POINTER
  • presence of BBCOREPOINTER

Until now, THREADSAFE was treated as a guarantee of parallelisability not only for pthreads but also for SIMD as well as GPU execution.

Let's say we don't want to introduce a new keyword like ATOMIC. So with the current language features, one should write this as:

LOCK
  pim = pim - ia : child contributions
UNLOCK

In this case:

  • on CPU: this could be interpreted as some kind of atomic update or atomic execution of a given statement block. We can translate these statements to:

        ...
        #pragma omp simd
        for (...) {
            ...
            #pragma omp critical
            {
                pim = pim - ia : child contributions
            }
        }

Essentially, this will avoid the SIMD parallelisation of the loop due to the critical section. This won't be performant but at least will produce correct code. This could also be an atomic instead of a critical section (see below).

  • on GPU: it is difficult to map everything ("collection of statements" or "a complex statement") as an atomic operation because the expression needs to be of a form like:
x++; 
x--; 
++x; 
--x; 
x binop= expr; 
x = x binop expr; 
x = expr binop x;

So we need to impose restrictions like "there can be only one statement in a LOCK-UNLOCK block and it needs to be one of the forms mentioned above".

Does this look "reasonable" ?
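
For illustration, the statement in question already matches the "x = x binop expr" shape, so a minimal sketch of lowering it to a single OpenMP atomic update could look like this (the names are made up for the example, not actual NMODL output):

    // "pim = pim - ia" has the "x = x binop expr" shape, so OpenMP can express
    // it as one atomic update; pim here is an illustrative pointer to the
    // shared target value, not the real generated data layout.
    inline void update_pim(double* pim, double ia) {
        #pragma omp atomic update
        *pim = *pim - ia;
    }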

pramodk (Member) commented Jan 25, 2023

A bit late, but I just realised there is the PROTECT keyword:

NEURON {
    GLOBAL var
}

BREAKPOINT {
    PROTECT var = var + 1
}

I wonder if PROTECT should be used as "ATOMIC" in our example:

PROTECT pim = pim - ia : child contributions

nrnhines (Member) commented:

I forgot about that. At present it just surrounds the statement using

src/nmodl/parsact.cpp:    replacstr(q1, "/* PROTECT */_NMODLMUTEXLOCK\n");
src/nmodl/parsact.cpp:    q = insertstr(q2->next, "\n _NMODLMUTEXUNLOCK /* end PROTECT */\n");

and has that annoying property of one (pthread mutex?) lock for many non-interacting, independent statements.
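
A rough sketch of the single-shared-mutex pattern described above, using std::mutex purely for illustration (this is not the actual _NMODLMUTEXLOCK/_NMODLMUTEXUNLOCK expansion):

    #include <mutex>

    // One mutex shared by every PROTECT statement in the translated file,
    // so unrelated updates contend on the same lock.
    static std::mutex nmodl_mutex;  // illustrative stand-in for the generated mutex

    void protected_update(double* pim, double ia) {
        std::lock_guard<std::mutex> guard(nmodl_mutex);  // /* PROTECT */ lock
        *pim = *pim - ia;                                 // the protected statement
    }                                                     // unlock, /* end PROTECT */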

pramodk (Member) commented Jan 26, 2023

and has that annoying property of one (pthread mutex?) lock for many non-interacting, independent statements.

I am thinking the following:

  • If there is a MUTEX LOCK - UNLOCK then that block can be executed on CPU (and not on GPU). It's true that it won't be performant, but I guess we can just prioritise correctness here.
  • If there are PROTECT statements then those will generate code like below:
#pragma omp atomic update
statement1 = ...
#pragma omp atomic update
statement2 = ...

These atomic statements are compatible with CPU as well as GPU execution and will be executed "as fast as possible".
So users should preferably use PROTECT statements.
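
Putting this together, a self-contained sketch of the PROTECT-as-atomic idea (illustrative data and indexing, not generated code): the same conflicting updates as in the earlier example, but guarded by an OpenMP atomic so the result stays correct whether the loop runs serially, vectorised or threaded.

    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> target{0, 0, 0};          // all three children share parent 0
        std::vector<double> pim{10.0, 0.0, 0.0};
        std::vector<double> ia{1.0, 2.0, 3.0};

        #pragma omp simd
        for (int i = 0; i < static_cast<int>(ia.size()); ++i) {
            // PROTECT pim = pim - ia  ->  per-location atomic update
            #pragma omp atomic update
            pim[target[i]] = pim[target[i]] - ia[i];
        }

        std::printf("pim[0] = %g\n", pim[0]);  // 4, independent of execution order
    }

Whether the compiler still vectorises the rest of the loop around the atomic is implementation-dependent, but correctness no longer depends on it.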

nrnhines (Member) commented:

That seems very good to me. I read that "The atomic construct ensures that a specific storage location is accessed atomically, ..." and mentally underline "specific storage location" to mean it's ideal for our typical case of thousands of storage locations, each accessed by only a couple of instances of the statement. I.e. for our neuron trees, parents are accessed most often by only 1 or 2 children and extremely rarely by 3-5 children.
I don't know to what extent this is new to "Version 5.0 - November 2018". Is all the functionality now available? If so, then great. And we should update the translators to use the pragma if OpenMP is available, otherwise the mutex.

pramodk added a commit that referenced this issue Jan 26, 2023
* axial.inc used in test_pointer.py is not "really thread-safe"
* update it to use PROTECT based on discussion in #2191
* fixes one of the tests mentioned in #1792

- [ ] merge nmodl PR BlueBrain/nmodl/pull/994 and then update submodule
- [ ] update mod2c support

fixes #2191
pramodk added a commit that referenced this issue Feb 2, 2023
* Update pointer test with PROTECT for SIMD/SIMT execution
  - axial.inc used in test_pointer.py is not "really thread-safe"
  - update it to use PROTECT based on discussion in #2191
  - fixes one of the tests mentioned in #1792
* Update mod2c and NMODL with fixes for PROTECT and MUTEX constructs 
* Update docs

fixes #2191