Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SEGFAULT and/or SEGABRT with GtkCanvas #87

Open
nHackel opened this issue Mar 7, 2025 · 7 comments
Open

SEGFAULT and/or SEGABRT with GtkCanvas #87

nHackel opened this issue Mar 7, 2025 · 7 comments

Comments

@nHackel
Copy link
Contributor

nHackel commented Mar 7, 2025

Hello,

for I think a few years now, we've had sporadic segmenation faults in our Gtk4.jl GUI. With recent versions we've been able to somewhat reliably produce the error and it looked like the CairoSurfaces of the GtkCanvas were somtimes garbage collected.

Usually we didnt't see any Julia-level errors or assertions about that. Today we tried to change our GtkCanvas printing from Cairo.jl to CairoMakie.jl and there we got some more Julia stacktraces and log messages.

I've been able to produce this MWE:

using Gtk4, CairoMakie, Cairo

function Base.setproperty!(surface::CairoSurface, fieldname::Symbol, val)
    if fieldname == :ptr && val == C_NULL
        @show stracktrace()
    end
    setfield!(surface, fieldname, val)
end

config = CairoMakie.ScreenConfig(1.0, 1.0, :good, true, false, nothing)
CairoMakie.activate!()

canvas = GtkCanvas()
w = GtkWindow(canvas,"CairoMakie example")

@guarded draw(canvas) do widget
    # global f, ax, p = lines(1:10) # no stacktrace
    f, ax, p = lines(1:10) # stacktrace

    CairoMakie.autolimits!(ax)
    screen = CairoMakie.Screen(f.scene, config, Gtk4.cairo_surface(canvas))
    CairoMakie.resize!(f.scene, Gtk4.width(widget), Gtk4.height(widget))
    CairoMakie.cairo_draw(screen, f.scene)
end

which prints a stacktrace if the surface access is illegal. So on the hand the "bug" is partially in the user code, because the figure isn't referenced and thus can be garbage collected.

But I'm not sure why that garbage collection triggers the CairoSurface to be also destroyed. I think that part might be an issue in Gtk4.jl. With this MWE I've also not managed to create a segfault, so something "regenerates" the cairo surface?

I saw here that the canvas get be reinitialized when resized, so maybe it happens there

@nHackel
Copy link
Contributor Author

nHackel commented Mar 7, 2025

My solution for our UI package for now will be to have something akin to:

mutable struct CairoMakieCanvas
  canvas::GtkCanvas
  figure::CairoMakie.Figure
end

function draw(fig::Figure, canvas::CairoMakieCanvas)
    canvas.figure = fig
    @guarded draw(canvas.canvas) do widget
	# ... CairoMakie.cairo_draw and so 
    end
end

and update the figure and draw

@jwahlstrand
Copy link
Member

When the canvas is resized, the CairoImageSurface backing store has to be replaced, since there's no way to resize the surface. I wonder if we should be creating the CairoMakie screen using a separate CairoImageSurface that is then copied onto the GtkCanvas backing store in the draw function. I'll bet it would not affect performance much.

Alternatively, could you call cairo_surface_reference when creating the screen, to keep the surface alive while it's being drawn on? You could unreference it after the draw using cairo_surface_destroy.

@nHackel
Copy link
Contributor Author

nHackel commented Mar 8, 2025

I'm not sure if I understand the situation correctly. The replacement due to resize would happen in either case right? And also with the the workaround I mentioned in my comment?

In the MWE the stacktrace seemed to only be printed in the case without the global keyword. If resizing replaces the store in either case, then maybe the GC'ed figure also has another effect.

I played around a bit more with the workaround I posted and so far I wasn't able to crash the GUI again. I'll only be able to a do proper stress test on monday with live image reconstructions and multiple sensor readings in parallel

@nHackel
Copy link
Contributor Author

nHackel commented Mar 9, 2025

Ah okay I see the issue with having to redraw the surface now. What is still unclear to me why we get an issue when we construct the figure locally and don't have an outside reference to it. Maybe that issue is not being reproduced by the MWE correctly.

For the reference issue we could try an approach with a package extension on CairoMakie and using ScopedValues. In the extension we could define a new draw method as follows:

const figure_cache = ScopedValue(CairoMakie.Figure)()

function Gtk4.draw(fig::CairoMakie.Figure, canvas::GtkCanvas)
  @with figure_cache => fig begin
    @guarded draw(canvas) do widget
      f = figure_cache[]
      screen = CairoMakie.Screen(f.scene, config, Gtk4.cairo_surface(canvas))
      CairoMakie.resize!(f.scene, Gtk4.width(widget), Gtk4.height(widget))
      CairoMakie.cairo_draw(screen, f.scene)
    end
  end
end

And then I think each instance of this function call should have its own figure_cache. I haven't tried it out yet (because Julia is broken on my PC atm), but once I do I can make a PR if I see that helping us with our problem

@jwahlstrand
Copy link
Member

For me the stacktraces appear whether or not the global keyword is used... it makes sense that the surface should get GC'ed in either case. I was just wondering if the crash issue was being caused by code writing to the surface after it was GC'ed, in which case adding a reference should be able to fix it.

Do the crashes occur more when you use multiple threads?

@nHackel
Copy link
Contributor Author

nHackel commented Mar 9, 2025

Yes so with the CairoMakie figures we would get a few assertion errors for Makie where the surface was a C_NULL. If we had a lot of drawing updates either because we had many plots being updated "at the same time" or updates with a quick succession we had our segfaults. Those two cases I can test tomorrow more thoroughly with the CairoMakieCanvas workaround. The scoped value one I can test in 1-2 weeks.

In those cases we also always have more than two threads, but I think most updates to the drawing function happen within @idle_add, so via the main loop task. And we also have julia -t x,1, so only one thread in the Gtk4 thread pool.

The issue where I tested my workaround I could reproduce by opening two windows with I think 5 GtkCanvases each. That just SEGFAULT/SEBARBT reliably recently and it happened with just one thread too. All the data and code one would need for that is open-source, so I can give you a reproducer for that, but it's not a small MWE

@nHackel
Copy link
Contributor Author

nHackel commented Mar 9, 2025

Oh and we just added a reference to the figure, not to the surface. Now I'm unsure if that actually fixed it for us or if we just dont encounter our race condition anymore

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants