Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_module_functions: separate gc pushes #177

Merged
merged 3 commits into from
Dec 9, 2024

Conversation

benlorenz
Copy link
Contributor

@benlorenz benlorenz commented Dec 1, 2024

https://github.com/JuliaInterop/CxxWrap.jl#stl-updates

Creating a new jlcxx::Array<...>() calls jl_alloc_array_1d which can trigger garbage collection. Thus we need to protect each array before creating the next one, instead of doing one JL_GC_PUSH6 call here:

module.for_each_function([&](FunctionWrapperBase& f)
{
Array<jl_datatype_t*> arg_types_array;
jl_value_t* boxed_f = nullptr;
jl_value_t* boxed_thunk = nullptr;
Array<jl_value_t*> arg_names_array;
Array<jl_value_t*> arg_default_values_array;
jl_value_t* boxed_n_kwargs;
JL_GC_PUSH6(arg_types_array.gc_pointer(), &boxed_f, &boxed_thunk, arg_names_array.gc_pointer(), arg_default_values_array.gc_pointer(), &boxed_n_kwargs);

This might help with some rare crashes during initialization or precompilation we have seen in the Oscar CI for quite a while (see oscar-system/Oscar.jl#3296).

The backtrace I got locally was:

Precompiling Oscar...                                              
  5 dependencies successfully precompiled in 242 seconds. 113 already precompiled.
                                                                   
[16014] signal 11 (1): Segmentation fault                          
in expression starting at none:1                                   
fill_types_vec at /tmp/rrprecomp/artifacts/dee31c8c121f2be2dd7c7b2ec3d41b2be90cc8c7/lib/libcxxwrap_julia.so (unknown line)
_ZZ20get_module_functionsENKUlRN5jlcxx19FunctionWrapperBaseEE_clES1_ at /tmp/rrprecomp/artifacts/dee31c8c121f2be2dd7c7b2ec3d41b2be90cc8c7/lib/libcxxwrap_julia.so (unknown line)                                                          
get_module_functions at /tmp/rrprecomp/artifacts/dee31c8c121f2be2dd7c7b2ec3d41b2be90cc8c7/lib/libcxxwrap_julia.so (unknown line)
get_module_functions at /tmp/rrprecomp/packages/CxxWrap/eWADG/src/CxxWrap.jl:444 [inlined]
initialize_julia_module at /tmp/rrprecomp/packages/CxxWrap/eWADG/src/CxxWrap.jl:433
__init__ at /tmp/rrprecomp/packages/CxxWrap/eWADG/src/StdLib.jl:25
jfptr___init___45622 at /tmp/rrprecomp/compiled/v1.11/CxxWrap/WGIJU_QbeCS.so (unknown line)
jl_apply at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:2157 [inlined]
jl_module_run_initializer at /cache/build/builder-demeter6-6/julialang/julia-master/src/toplevel.c:76
run_module_init at ./loading.jl:1336
register_restored_modules at ./loading.jl:1324
#_include_from_serialized#1066 at ./loading.jl:1213
_include_from_serialized at ./loading.jl:1169 [inlined]
_include_from_serialized at ./loading.jl:1169 [inlined]
#_require_search_from_serialized#1077 at ./loading.jl:1969
_require_search_from_serialized at ./loading.jl:1908
jfptr__require_search_from_serialized_44533.1 at /net/site-local.linux64/julia/julia-1.11.1/lib/julia/sys.so (unknown line)
_require at ./loading.jl:2450
__require_prelocked at ./loading.jl:2315
jfptr___require_prelocked_69877.1 at /net/site-local.linux64/julia/julia-1.11.1/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:2157 [inlined]
jl_f__call_in_world at /cache/build/builder-demeter6-6/julialang/julia-master/src/builtins.c:894
#invoke_in_world#3 at ./essentials.jl:1089 [inlined]
invoke_in_world at ./essentials.jl:1086 [inlined]
_require_prelocked at ./loading.jl:2302
macro expansion at ./loading.jl:2241 [inlined]
macro expansion at ./lock.jl:273 [inlined]
__require at ./loading.jl:2198
jfptr___require_69814.1 at /net/site-local.linux64/julia/julia-1.11.1/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:2157 [inlined]
jl_f__call_in_world at /cache/build/builder-demeter6-6/julialang/julia-master/src/builtins.c:894
#invoke_in_world#3 at ./essentials.jl:1089 [inlined]
invoke_in_world at ./essentials.jl:1086 [inlined]
require at ./loading.jl:2191
jfptr_require_69803.1 at /net/site-local.linux64/julia/julia-1.11.1/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:2157 [inlined]
call_require at /cache/build/builder-demeter6-6/julialang/julia-master/src/toplevel.c:486 [inlined]
eval_import_path at /cache/build/builder-demeter6-6/julialang/julia-master/src/toplevel.c:523
jl_toplevel_eval_flex at /cache/build/builder-demeter6-6/julialang/julia-master/src/toplevel.c:759
jl_toplevel_eval_flex at /cache/build/builder-demeter6-6/julialang/julia-master/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/builder-demeter6-6/julialang/julia-master/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]

Unfortunately this is slightly different from the backtraces in the Oscar CI but it might still be related and I could not reproduce the other backtrace yet.

I will keep it a draft for now as I dig into another backtrace I just got, but happy to receive some feedback. Maybe there is another good way to protect these Arrays?

cc: @fingolfin

Note: the diff is best viewed with ignore whitespace enabled: https://github.com/JuliaInterop/libcxxwrap-julia/pull/177/files?w=1

Edit: running against stl-updates branch because otherwise the version doesn't fit to the main branch here. I did my debugging with libcxxwrap-julia 0.13.2 though.

Edit2:
The second crash (which is closer to the backtrace from the Oscar CI) appeared after fixing the first one and should be fixed with the second commit by protecting the CppFunctionInfo struct during the push_back (which grows the array and thus allocates).
Another option to fix this would be to add a special version of push_back for jl_value_t* which also adds the argument to the JL_GC_PUSH macro. This would help for other users of Array<jl_value_t*> where the argument might be unprotected.

backtrace for the second crash:
[17115] signal 11 (1): Segmentation fault                           
in expression starting at /tmp/rrprecomp/packages/Hecke/Pu8AF/ext/PolymakeExt.jl:3
jl_is_layout_opaque at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:1330 [inlined]
jl_datatype_layout at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:1338 [inlined]
jl_f_tuple at /cache/build/builder-demeter6-6/julialang/julia-master/src/builtins.c:936 [inlined]
jl_f_tuple at /cache/build/builder-demeter6-6/julialang/julia-master/src/builtins.c:926
jl_method_error at /cache/build/builder-demeter6-6/julialang/julia-master/src/gf.c:2270
jl_lookup_generic_ at /cache/build/builder-demeter6-6/julialang/julia-master/src/gf.c:3106 [inlined]
ijl_apply_generic at /cache/build/builder-demeter6-6/julialang/julia-master/src/gf.c:3121
#408 at ./compiler/typeutils.jl:57
anymap at ./compiler/utilities.jl:43 [inlined]
argtypes_to_type at ./compiler/typeutils.jl:57
abstract_call_known at ./compiler/abstractinterpretation.jl:2199
abstract_call at ./compiler/abstractinterpretation.jl:2282
abstract_call at ./compiler/abstractinterpretation.jl:2275
abstract_call at ./compiler/abstractinterpretation.jl:2420
abstract_eval_call at ./compiler/abstractinterpretation.jl:2435
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2451
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2749
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:3041
typeinf_local at ./compiler/abstractinterpretation.jl:3319
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3401
_typeinf at ./compiler/typeinfer.jl:244
typeinf at ./compiler/typeinfer.jl:215
typeinf_ext at ./compiler/typeinfer.jl:1101
typeinf_ext_toplevel at ./compiler/typeinfer.jl:1139
typeinf_ext_toplevel at ./compiler/typeinfer.jl:1135
jfptr_typeinf_ext_toplevel_39647.1 at /net/site-local.linux64/julia/julia-1.11.1/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:2157 [inlined]
jl_type_infer at /cache/build/builder-demeter6-6/julialang/julia-master/src/gf.c:390
jl_generate_fptr_impl at /cache/build/builder-demeter6-6/julialang/julia-master/src/jitlayers.cpp:511
jl_compile_method_internal at /cache/build/builder-demeter6-6/julialang/julia-master/src/gf.c:2536 [inlined]
jl_compile_method_internal at /cache/build/builder-demeter6-6/julialang/julia-master/src/gf.c:2423
_jl_invoke at /cache/build/builder-demeter6-6/julialang/julia-master/src/gf.c:2940 [inlined]
ijl_apply_generic at /cache/build/builder-demeter6-6/julialang/julia-master/src/gf.c:3125
initialize_julia_module at /tmp/rrprecomp/packages/CxxWrap/eWADG/src/CxxWrap.jl:436
__init__ at /tmp/rrprecomp/packages/Polymake/nU3NN/src/Polymake.jl:108
unknown function (ip: 0x14b3a38073bf)
jl_apply at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:2157 [inlined]
jl_module_run_initializer at /cache/build/builder-demeter6-6/julialang/julia-master/src/toplevel.c:76
run_module_init at ./loading.jl:1336

creating a new Array<...> calls jl_alloc_array_1d which can trigger garbage collection
thus we need to protect each array before creating the next one instead of one JL_GC_PUSH6 call
to do this we need some extra scope blocks
@benlorenz benlorenz closed this Dec 1, 2024
@benlorenz benlorenz reopened this Dec 1, 2024
@benlorenz benlorenz closed this Dec 1, 2024
@benlorenz benlorenz reopened this Dec 1, 2024
@benlorenz benlorenz marked this pull request as ready for review December 3, 2024 13:33
@benlorenz
Copy link
Contributor Author

The windows failures seem unrelated, and I don't know how to properly tests these changes. Locally, I have been rerunning Oscar pre-compilation many times (with rr and without), so far no crashes after applying the second commit to 0.13.2, but I will keep the loop running.

@barche
Copy link
Contributor

barche commented Dec 3, 2024

Thanks for looking into this, it looks like the Windows failures are related to ranges, so that would be unrelated to this.

I still have to look in detail what these changes here mean, one idea to make testing simpler could be to try to manually trigger the GC at the worst possible times, e.g. by adding a compilation flag that activates this for test builds.

@benlorenz
Copy link
Contributor Author

A more detailed explanation:

  • The first commit splits up the one JL_GC_PUSH6 into three separate pushes (which forces us to add these extra scope blocks). The extra two pushes are placed before the next Array constructions, since these allocate and may collect the other previously created Arrays:

    Array<...> x; // allocates julia objects
    ...
    Array<...> y; // this allocates again and might trigger GC which then frees some metadata objects from x
    JL_GC_PUSH2(x.gc_pointer(), y.gc_pointer());
    // both arrays will be protected from now on but this might be too late for x

    so I changed that pattern to

    Array<...> x;
    JL_GC_PUSH1(x.gc_pointer();
    {
      Array<...> y;
      JL_GC_PUSH1(y.gc_pointer());

    and now x is protected when y is created.
    The extra scopes and indents are needed because any scope must have only one call to JL_GC_PUSH*.

  • The second commit is somewhat similar:
    jl_new_struct creates a new julia object and returns a jl_value_t*, which is passed to push_back.
    Then push_back needs to grow the array to make space for the object and calls jl_array_grow_end, but this allocates and might trigger the GC which then corrupts the unrooted new struct we just created.
    To protect the struct I assigned it to a new variable cppfuncinfo which I added to one of the JL_GC_PUSH3 macros at the top of that scope. And then we can pass it to push_back.

    The alternative I suggested in the Edit2 is to add an extra specialization for jl_value_t* arguments to push_back like this:

    /// Append an element to the end of the list
    // special version for jl_value_t* to protect the argument during resize
    void push_back(jl_value_t* val)
    {
      JL_GC_PUSH2(&m_array, &val);
      const size_t pos = jl_array_len(m_array);
      jl_array_grow_end(m_array, 1);
      jl_value_t* jval = box<ValueT>(val);
      jl_array_ptr_set(m_array, pos, jval);
      JL_GC_POP();
    }

    The difference is the extra &val in the JL_GC_PUSH.

@benlorenz
Copy link
Contributor Author

So I ran some tests with extra GC calls:

diff --git a/src/c_interface.cpp b/src/c_interface.cpp
index c8b4c47..c1e8d0c 100644
--- a/src/c_interface.cpp
+++ b/src/c_interface.cpp
@@ -137,19 +137,23 @@ JLCXX_API jl_array_t* get_module_functions(jl_module_t* jlmod)
   const jlcxx::Module& module = registry().get_module(jlmod);
   module.for_each_function([&](FunctionWrapperBase& f)
   {
+    jl_gc_collect(JL_GC_AUTO);
     Array<jl_datatype_t*> arg_types_array;
     jl_value_t* boxed_f = nullptr;
     jl_value_t* boxed_thunk = nullptr;
     JL_GC_PUSH3(arg_types_array.gc_pointer(), &boxed_f, &boxed_thunk);
     {
+       jl_gc_collect(JL_GC_AUTO);
        Array<jl_value_t*> arg_names_array;
        JL_GC_PUSH1(arg_names_array.gc_pointer());
        {
+          jl_gc_collect(JL_GC_AUTO);
           Array<jl_value_t*> arg_default_values_array;
           jl_value_t* boxed_n_kwargs = nullptr;
           jl_value_t* cppfuncinfo = nullptr;
           JL_GC_PUSH3(arg_default_values_array.gc_pointer(), &boxed_n_kwargs, &cppfuncinfo);
 
+          jl_gc_collect(JL_GC_AUTO);
           fill_types_vec(arg_types_array, f.argument_types());
 
           boxed_f = jlcxx::box<void*>(f.pointer());
@@ -183,6 +187,7 @@ JLCXX_API jl_array_t* get_module_functions(jl_module_t* jlmod)
             arg_default_values_array.wrapped(),
             boxed_n_kwargs
             );
+          jl_gc_collect(JL_GC_AUTO);
           function_array.push_back(cppfuncinfo);
           JL_GC_POP();

This seems to work fine. I ran this for about 500 iterations re-precompiling CxxWrap, Oscar and its dependencies.

Adding similar explicit GC calls before the Array calls in the original code quickly triggers segmentation faults.

Copy link
Contributor

@fingolfin fingolfin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Great detective work and explanation, thanks @benlorenz

@barche
Copy link
Contributor

barche commented Dec 8, 2024

Sorry for the delay on this, I wanted to investigate the MSVC problem, which turned out to be due to a newer MSVC version on github CI. Should build now.

@barche
Copy link
Contributor

barche commented Dec 9, 2024

The failure on windows nightly is due to a problem in Julia, the one on Mac nightly is either another intermittent failure or due to a change in Julia between now and the last test run. I think we can merge this.

@benlorenz
Copy link
Contributor Author

Thanks! The failure on macos was probably caused by references to StdString objects which were already deleted. This only happens when the GC triggers in a bad spot and should get fixed by JuliaInterop/CxxWrap.jl#461.

@barche barche merged commit dd11c21 into JuliaInterop:main Dec 9, 2024
9 of 11 checks passed
@barche
Copy link
Contributor

barche commented Dec 9, 2024

Many thanks for this work, I had seen very rare crashes for a long time and always had a nagging feeling there was some error, without knowing what exactly. Hopefully this was the only one :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants