Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround the lifecycle management issue of IImportRegistration. #657

Merged
merged 8 commits into from
Sep 26, 2023

Conversation

PengZheng
Copy link
Contributor

@PengZheng PengZheng commented Sep 25, 2023

It avoids a testing crash.

Considering we just make bundle uninstall product-ready in 2.4.0 and OSGi permits uninstalling a bundle independent of other bundles, we can delay #653 until the 3.0.0 release.

@codecov-commenter
Copy link

codecov-commenter commented Sep 25, 2023

Codecov Report

Merging #657 (9d1f39d) into master (e7aee12) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 9d1f39d differs from pull request most recent head de76411. Consider uploading reports for the commit de76411 to get more accurate results

@@           Coverage Diff           @@
##           master     #657   +/-   ##
=======================================
  Coverage   81.60%   81.61%           
=======================================
  Files         260      260           
  Lines       34675    34677    +2     
=======================================
+ Hits        28297    28301    +4     
+ Misses       6378     6376    -2     
Files Coverage Δ
...xx_remote_services/admin/src/RemoteServiceAdmin.cc 78.70% <ø> (ø)
...s/remote_services/topology_manager/tms_tst/main.cc 100.00% <100.00%> (ø)

... and 4 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

It helps catch issues like #653.
@PengZheng
Copy link
Contributor Author

PengZheng commented Sep 25, 2023

The last commit enabling RelWithDebInfo manifests additional memory leaks:

	 25 - run_test_tm_scoped (Failed)
	 26 - run_test_rsa_dfi (Failed)
	 41 - pubsub_websocket_v2_tests (Failed)

I can verify that loading and unloading remote_service_admin_dfi will lead to ASAN report and that adding CELIX_LOAD_BUNDLES_WITH_NODELETE=true will make the issue disappear.

I suspect they relate to some global shared object resource.

@PengZheng
Copy link
Contributor Author

PengZheng commented Sep 25, 2023

One leak turns out to be caused by mg_init_library:

This function is new in version 1.9 (as dummy implementation) and effective only from version 1.10. For compatibility reasons, other functions (such as mg_start();) will initialize the required features as well, but they will no longer do a de-initialization, leaving a memory leak when the library is unloaded.

https://github.com/civetweb/civetweb/blob/master/docs/api/mg_init_library.md

This shows that doing global initialization (for CURL/civetweb) inside a customized launcher library is a must. @xuzhenbao
It also shows that moving away from do global initialization inside the framework is correct, and that we shall emphasize this in the documentation to help our users avoid the same mistakes. @pnoltes

@PengZheng
Copy link
Contributor Author

PengZheng commented Sep 25, 2023

Another leak in run_test_tm_scoped is caused by tramp_table_alloc in libffi.
The memory allocated is attached to global tramp_globals in libffi, and libffi provides no explicit way to deallocate it.
If we keep libffi in address space during the lifetime of the PROGRAM, then there is no risk of memory leak.
Otherwise if we keep loading/unloading libffi, then the memory leaks are real.

The solution is simple, link libffi to the executable:

target_link_libraries(test_tm_scoped PRIVATE
        Celix::framework
        GTest::gtest
        jansson::jansson
        calculator_api
        Celix::rsa_common
        CURL::libcurl
        civetweb::civetweb
        libffi::libffi
)

Again, we should remind our users to link libffi to their custom launcher if they need Celix components depend on libffi.

@PengZheng
Copy link
Contributor Author

PengZheng commented Sep 25, 2023

The failure of run_test_rsa_dfi brings libxml2 into attention.
As noted in its documentation of xmlCleanupParser:

It's sometimes very hard to guess if libxml2 is in use in the application, some libraries or plugins may use it without notice. In case of doubt abstain from calling this function or do it just before calling exit() to avoid leak reports from valgrind !

https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-parser.html#xmlCleanupParser
And we are a plugin framework!
Therefore, we should warn our user of this kind of pitfall.

@PengZheng
Copy link
Contributor Author

Whew! I have fixed these leaks lurking in our tests for a very long time.
Note that to debug such issue, linux ASLR should be turned off by echo 0 | sudo tee /proc/sys/kernel/randomize_va_space so that we can ask debugger which shared object these leakers reported by ASAN belong to at a appropriate time when the offending shared object is still in address space.

IMHO, we should warn our users very loudly in our documentation of these shared object initialization/de-initialization pitfalls.
Also, it should be pointed out to our users that the standard place to deal with these is a customized launcher.
Last but not the lease, libffi is very special in that there is no way to clean its global allocation (correct me if I were wrong), thus it should be kept in address space if possible (by linking it to the executable). Keep loading and unloading libffi will wreak a havoc.

WDYT? @pnoltes @rlenferink

Copy link
Contributor

@pnoltes pnoltes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

My compliments to finding these issues. These issues are clear if you know what the problem and solution is, but hard to find given cryptic memory errors.
That they are hard to find is of course also the reason why these issues are present in our code base for quite some time.

@pnoltes
Copy link
Contributor

pnoltes commented Sep 25, 2023

Whew! I have fixed these leaks lurking in our tests for a very long time. Note that to debug such issue, linux ASLR should be turned off by echo 0 | sudo tee /proc/sys/kernel/randomize_va_space so that we can ask debugger which shared object these leakers reported by ASAN belong to at a appropriate time when the offending shared object is still in address space.

IMHO, we should warn our users very loudly in our documentation of these shared object initialization/de-initialization pitfalls. Also, it should be pointed out to our users that the standard place to deal with these is a customized launcher. Last but not the lease, libffi is very special in that there is no way to clean its global allocation (correct me if I were wrong), thus it should be kept in address space if possible (by linking it to the executable). Keep loading and unloading libffi will wreak a havoc.

WDYT? @pnoltes @rlenferink

I agree, both in warning for init/deinit library requirements in the documentation and that this should be solved in a custom launcher.

An alternative to consider is to provided init/deinit libs that use the __attribute__((constructor)) / __attribute__((destructor)) to initialize / deinitialize libraries. These library then needs to be linked against the generated launcher.

But IMO it is clearer to do this in the launcher. Maybe some effort can be made to make it possible that a custom launcher can easily use the PROPERTIES provided in the add_celix_container (as is currently done in the generated main.c/main.cc file).

@pnoltes
Copy link
Contributor

pnoltes commented Sep 25, 2023

IMO this PR is also needed for the 2.4.0 release, so I will wait for a merge before starting a new vote.

@PengZheng
Copy link
Contributor Author

An alternative to consider is to provided init/deinit libs that use the attribute((constructor)) / attribute((destructor)) to initialize / deinitialize libraries. These library then needs to be linked against the generated launcher.

But IMO it is clearer to do this in the launcher. Maybe some effort can be made to make it possible that a custom launcher can easily use the PROPERTIES provided in the add_celix_container (as is currently done in the generated main.c/main.cc file).

Indeed, manual initialization in launcher is less than ideal.
The biggest trouble is these initialization functions have options, e.g. mg_init_library. Different bundles may have different requirements, and thus set different options.

@PengZheng
Copy link
Contributor Author

PengZheng commented Sep 26, 2023

It's interesting to observe that libxml2 does have its destructor defined:

https://github.com/GNOME/libxml2/blob/c7ff438b830d8ebeac684f3f8ad3229587c88373/threads.c#L665-L677

#if defined(HAVE_ATTRIBUTE_DESTRUCTOR) && !defined(LIBXML_STATIC) && \
    !defined(_WIN32)
static void
ATTRIBUTE_DESTRUCTOR
xmlDestructor(void) {
    /*
     * Calling custom deallocation functions in a destructor can cause
     * problems, for example with Nokogiri.
     */
    if (xmlFree == free)
        xmlCleanupParser();
}
#endif

Thus there is no need to call xmlCleanupParser in our application code, unless we are using static libxml2.

It will be automatically called by libxml2 destructor xmlDestructor.
@PengZheng PengZheng merged commit dcd8157 into master Sep 26, 2023
@PengZheng PengZheng deleted the hotfix/653-cxx_remote_services_test_crash_stopgap branch September 26, 2023 04:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants