-
-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add direct Enzyme support #476
Comments
How about we try to fix bugs together in DI instead of always undermining it? We could start with this one you mention: EnzymeAD/Enzyme.jl#1942 (I managed to reproduce it with pure Enzyme, independently from DI). I understand your concerns about Enzyme performance / correctness being misrepresented by DI. But whenever someone wants access to multiple AD backends, as is the case for most SciML libraries, DI is the natural way to go. It offers unified syntax, and it can do things that no backend on its own can do at the moment, like sparse autodiff1. If you show me a meaningful example where LinearSolve fails with DI + Enzyme (I have no doubt there are plenty), I'll be happy to try and fix it. I have already altered DI's design significantly to fit Enzyme's peculiarities (e.g. by introducing Footnotes
|
Oh for sure, and I'm not at all saying to remove DI. It's a fantastic package that makes a lot of things easier, some of which you describe above! But also at the end of the day we both want to make things easier for our users. From my point of view, we should help the ecosystem by writing extensions/examples for Enzyme that maximize performance and compatibility. I don't see how this is different from you opening up PR's/issues on various repos asking if DI can be helpful? In some cases, like LinearSolve.jl, that's an extension to add EnzymeRules to simplify the code being differentiated (also learning to better performance). In other cases, where a package dispatches to an autodiff backend, it makes sense to call Enzyme directly. The fact that the other day we saw a segfault when using AutoEnzyme in DI for a small docs example where Enzyme directly worked is rather worrying to me and implies that presently the overhead of DI might be making problems for Enzyme users more generically. As a result, packages have been usually adopting a dual approach, calling Enzyme.autodiff/related directly when given an AutoEnzyme or other object, and DI.gradient/related for other non-specialized ADTypes. This lets users get the best of both worlds, performance and compatibility when available, and general support for AD packages, as well as all the nice things like sparsity that DI provides. I'm all up for fixing issues as they arise, but also in the case where we have a backend which is more performant and stable, we should use it and not force the burden of debugging our packages on users, when unnecessary. Often times, someone will try something once, and if it fails or runs unnecessarily slow they'll just never use things again. I get that your goal is to try to have as many packages as possible start using DI, but there's more than enough from for both DI and Enzyme extensions in the Julia autodiff ecosystem! :) |
As for the issue you opened seemingly after my initial comment, the MWE implies that DI is introducing type instability through the splat operator. This may result in performance slowdowns as well as some code no longer being able to be differentiated (as is clearly the case there). I'll quickly look into fixing it, but also historically we have been advising people to remove type instabilities (as well as unions), which here would require not using DI |
My main message is that this doesn't always make sense. There is a convenience tradeoff between (1) using DI for everything and (2) using DI for everything except Enzyme + using Enzyme's native interface. And of course option (2) seems easy to wou because you know Enzyme inside and out, but that's not the case for most users, even power users from SciML.
This was a very specific case due to JuMP's unusual requirement that arguments should be splatted into the function Still, whenever you say "DI is problematic because it can segfault on Enzyme", it also implies "Enzyme itself is problematic because it can segfault". Indeed, the pure Enzyme MWE for splatting shows that this type instability is not handled gracefully, and just crashes the console. Sure, Enzyme provides a multi-argument way around this bug (which is inaccessible through DI's single-argument interface), but it remains an Enzyme bug because it doesn't happen with other backends.
On the other hand, DI also allows users to experiment with Enzyme at virtually no cost. Until LinearSolve switched to DI, it had no way of using Enzyme for its internal Jacobians. Now it's as simple as a backend argument switch, because the necessary bindings are already in DI.
But then what's the point of going through all the trouble of supporting Enzyme in DI, if you're just gonna go around telling people not to use it that way?
Fair enough, so how would you handle
Except that in doing so, we force another burden on the users: maintenance of tens of copies of essentially identical Enzyme extensions. My hope is that, if we keep putting our minds together, we could just write this extension once in DI and be good enough for most use cases. |
I think the last bit addressing the "maintenance of tens of copies of essentially identical Enzyme extensions" hits the nail on the head. If maintainers had unlimited time, adding individual package extensions would be great. The appeal of DI is the centralization of that maintenance burden into a single package. We keep up with any breaking changes in any backends for package developers, freeing up dev time. Adding specialized package extensions on top of DI is probably a valuable approach for performance critical hot loops. Individual package developers will have to decide whether the gained performance-delta is worth the increased maintenance burden. Here, Avik and Chris will have to make that call. At the end of the day, to advance the Julia AD ecosystem, that maintenance burden should be centralized as much as possible. DI made big steps toward Enzyme with the new argument activities. |
Oh for sure, but my argument here isn't that users ought use Enzyme directly (though that also may be wise/helpful), but libraries that already hardcode a bunch of autodiff support should. Such libraries already have a bunch of code for various AD tools, so there is already precedence for having such code around (and of course users all benefit without writing additional code). I guess I bucket library devs and end users in two separate categories.
Yeah for sure, but also if the use of DI makes it more likely to hit Enzyme issues, its natural to just use Enzyme directly, no?
Oh definitely and that's one of the most significant advantages of DI (both swapping out backends, and also the sparse support). I'm only suggesting here that Nonlinearsolve.jl add some special cases when something is calling Dense Enzyme to improve performance/compatibility.
I think you're conflating LinearSolve and NonlinearSolve. Like I said above "In some cases, like LinearSolve.jl, that's an extension to add EnzymeRules to simplify the code being differentiated (also learning to better performance)", some repos need an enzyme extension for a custom rule (https://github.com/SciML/LinearSolve.jl/blob/main/ext/LinearSolveEnzymeExt.jl). This was needed to get some things differentiating at the time (that now might work without). My guess it's that it's like 2-5x (and possibly more with threading on) with the Enzyme extension?
Yeah I wouldn't special case this, just dense. Sparse can always call DI (unless say a future sparse backend wants to specialize). |
Yeah that might be the main difference between our mindsets, because I see library devs as users of AD systems ;) But I get where you come from.
All I'm saying is that this needs to be decided on a case by case basis and with examples, to justify the cost of additional implementations living side by side.
Sorry yes I meant NonlinearSolve in my remark. Of course extensions for rule systems are still warranted and essential, like those in LinearSolve. This discussion was solely focused on extensions for calling into AD, not for making stuff AD-compatible. |
Sure, but at this point having seen segfaults in the wild for some tiny example code (in addition to the performance issues discussed slack), at least my default is that there should be a separate Enzyme backend unless DI can be shown to be @code_typed equivalent to relevant Enzyme calls. In other words, if the function DI passes to Enzyme autodiff utilities is equivalent to the original user function (and not with a closure, unwrapping, etc that could dramatically change how julia compiles the code and add type instabilities, drop alias info, etc) the default being not having an Enzyme extension is reasonable, if there is an indirection, it should default to using Enzyme directly (since that indirection, like above, can frequently be the source of Enzyme failing to differentiate a function). The reason I'd push this is because Enzyme has a defined scope of Julia code that it works on. Adding indirection can cause Julia to compile code outside of that scope, causing crashes (like above), in addition to all the perf issues. I agree in the long term it would be nice for Enzyme to handle all code always, but otherwise that's equivalent to asking for significantly more feature dev for something which is already supported natively. Our goal was explicitly to start with a small but well defined set of code (e.g. originally just code you could use in a @cuda kernel from GPUCompiler), and do exceptionally well on it. This lets us organically grow an ecosystem of users who can use Enzyme without being stuck trying to "boil the ocean" before anything done. We've been growing that scope with time, but again if something is likely to cause code to move outside of it, I'd recommend someone to use the version which works (and open an issue). |
Let me rephrase that the way I see it: "Given infinite developer, money and time resources, provided everyone is perfectly at ease with autodiff in general and Enzyme in particular, and assuming that optimal performance in every single case is worth more than convenience, conciseness and maintainability, then there should be a separate Enzyme backend unless DI can be shown to be But in the current situation of the Julia ecosystem, I think this is a completely unreasonable request, and it would be essentially equivalent to const DifferentiationInterface = Enzyme Most users and package devs don't even need a fraction of what you're demanding for DI + Enzyme to be useful. I'm going to stop the conversation here because it makes me sad and I admire your work too much to keep fighting. In any case, it's up to package developers to decide what they want to do with the tools that we offer. |
But that's exactly my point! I'm just trying to propose a threshold that would indicate where an extension would be useful or not, and catch the relevant usage and performance bugs. To be clear, my biggest concern here is not performance -- but code which fails with DI but works with Enzyme directly (performance is good too but not crashing should be the first priority). My argument is that if the extra indirection is added, it probably presents issues (and should be catchable). But most cases this isn't the case. So let's see how it works as a rule of thumb on some samples (again not sure if its the best rule of thumb, but its what I could think of just now): using ADTypes
import DifferentiationInterface as DI
# Note I modified Enzyme.autodiff with noinline so we could see if and how the call occured (without getting inlined into an llvmcall at which point the info is lost)
# Specifically
# @noinline function autodiff(
# rmode::ReverseMode{ReturnPrimal,RuntimeActivity,RABI,Holomorphic,ErrIfFuncWritten},
# f::FA,
# ::Type{A},
# args::Vararg{Annotation,Nargs},
#
# and
#
# @noinline function autodiff(
# ::ForwardMode{ReturnPrimal,RABI,ErrIfFuncWritten,RuntimeActivity},
# f::FA,
# ::Type{A},
# args::Vararg{Annotation,Nargs},
using Enzyme: Enzyme
x = [2.0, 3.0]
a = [3.0, 4.0]
function mul(a, b)
dot(a, b)
end
function grad(f::F, a, x) where F
DI.gradient(f, AutoEnzyme(), x, DI.Constant(a))
end
@code_typed grad(mul, a, x)
julia> @code_typed grad(mul, a, x)
#
# CodeInfo(
# 1 ── %1 = Base.arraysize(x, 1)::Int64
# │ %2 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Vector{Float64}, svec(Any, Int64), 0, :(:ccall), Vector{Float64}, :(%1), :(%1)))::Vector{Float64}
# │ %3 = Base.arraysize(%2, 1)::Int64
# │ %4 = Base.slt_int(%3, 0)::Bool
# │ %5 = Core.ifelse(%4, 0, %3)::Int64
# │ %6 = Base.slt_int(%5, 1)::Bool
# └─── goto #3 if not %6
# 2 ── goto #4
# 3 ── goto #4
# 4 ┄─ %10 = φ (#2 => true, #3 => false)::Bool
# │ %11 = φ (#3 => 1)::Int64
# │ %12 = φ (#3 => 1)::Int64
# │ %13 = Base.not_int(%10)::Bool
# └─── goto #10 if not %13
# 5 ┄─ %15 = φ (#4 => %11, #9 => %23)::Int64
# │ %16 = φ (#4 => %12, #9 => %24)::Int64
# │ Base.arrayset(false, %2, 0.0, %15)::Vector{Float64}
# │ %18 = (%16 === %5)::Bool
# └─── goto #7 if not %18
# 6 ── goto #8
# 7 ── %21 = Base.add_int(%16, 1)::Int64
# └─── goto #8
# 8 ┄─ %23 = φ (#7 => %21)::Int64
# │ %24 = φ (#7 => %21)::Int64
# │ %25 = φ (#6 => true, #7 => false)::Bool
# │ %26 = Base.not_int(%25)::Bool
# └─── goto #10 if not %26
# 9 ── goto #5
# 10 ┄ goto #11
# 11 ─ goto #12
# 12 ─ goto #13
# 13 ─ %32 = %new(Duplicated{Vector{Float64}}, x, %2)::Duplicated{Vector{Float64}}
# │ %33 = %new(Const{Vector{Float64}}, a)::Const{Vector{Float64}}
# │ invoke Enzyme.autodiff($(QuoteNode(ReverseMode{false, false, FFIABI, false, true}()))::ReverseMode{false, false, FFIABI, false, true}, $(QuoteNode(Const{typeof(mul)}(mul)))::Const{typeof(mul)}, Active::Type{Active}, %32::Duplicated{Vector{Float64}}, %33::Const{Vector{Float64}})::Tuple{Tuple{Nothing, Nothing}}
# └─── goto #14
# 14 ─ return %2
# ) => Vector{Float64} We can see that mul was directly send to an Enzyme call so this should be fine as a rule of thumb. Let's look at the multi arg case earlier (which is fixed above by DI support), to confirm that its absence would trigger the rule of thumb. function grad2(mul, a, x)
DI.gradient(Base.Fix1(mul, a), AutoEnzyme(), x)
end
julia> @code_typed grad2(mul, a, x)
# CodeInfo(
# 1 ── %1 = %new(Base.Fix1{typeof(mul), Vector{Float64}}, mul, a)::Base.Fix1{typeof(mul), Vector{Float64}}
# │ %2 = Base.arraysize(x, 1)::Int64
# │ %3 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Vector{Float64}, svec(Any, Int64), 0, :(:ccall), Vector{Float64}, :(%2), :(%2)))::Vector{Float64}
# │ %4 = Base.arraysize(%3, 1)::Int64
# │ %5 = Base.slt_int(%4, 0)::Bool
# │ %6 = Core.ifelse(%5, 0, %4)::Int64
# │ %7 = Base.slt_int(%6, 1)::Bool
# └─── goto #3 if not %7
# 2 ── goto #4
# 3 ── goto #4
# 4 ┄─ %11 = φ (#2 => true, #3 => false)::Bool
# │ %12 = φ (#3 => 1)::Int64
# │ %13 = φ (#3 => 1)::Int64
# │ %14 = Base.not_int(%11)::Bool
# └─── goto #10 if not %14
# 5 ┄─ %16 = φ (#4 => %12, #9 => %24)::Int64
# │ %17 = φ (#4 => %13, #9 => %25)::Int64
# │ Base.arrayset(false, %3, 0.0, %16)::Vector{Float64}
# │ %19 = (%17 === %6)::Bool
# └─── goto #7 if not %19
# 6 ── goto #8
# 7 ── %22 = Base.add_int(%17, 1)::Int64
# └─── goto #8
# 8 ┄─ %24 = φ (#7 => %22)::Int64
# │ %25 = φ (#7 => %22)::Int64
# │ %26 = φ (#6 => true, #7 => false)::Bool
# │ %27 = Base.not_int(%26)::Bool
# └─── goto #10 if not %27
# 9 ── goto #5
# 10 ┄ goto #11
# 11 ─ goto #12
# 12 ─ goto #13
# 13 ─ %33 = %new(Duplicated{Vector{Float64}}, x, %3)::Duplicated{Vector{Float64}}
# │ %34 = %new(Const{Base.Fix1{typeof(mul), Vector{Float64}}}, %1)::Const{Base.Fix1{typeof(mul), Vector{Float64}}}
# │ invoke Enzyme.autodiff($(QuoteNode(ReverseMode{false, false, FFIABI, false, true}()))::ReverseMode{false, false, FFIABI, false, true}, %34::Const{Base.Fix1{typeof(mul), Vector{Float64}}}, Active::Type{Active}, %33::Duplicated{Vector{Float64}})::Tuple{Tuple{Nothing}}
# └─── goto #14
# 14 ─ return %3
# ) => Vector{Float64}
# We can see that this doesn't directly forward mul into the call, instead passing in Base.Fix1{typeof(mul), Vector{Float64}}} and might cause problems under our rule of thumb. Thus, libraries that require Fix1/etc to use DI probably need an Enzyme Ext. Let's try the hessian case causing issues f(x::T...) where {T} = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
f_nosplat(x::AbstractVector) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
function h1(x)
backend = AutoEnzyme()
DI.hessian(f_nosplat, backend, x) # works
end
function h2(x)
DI.hessian(splat(f), backend, x) # segfaults
end
julia> @code_typed h1(x)
CodeInfo(
1 ─ %1 = invoke DifferentiationInterface._prepare_hessian_aux($(QuoteNode(Val{16}()))::Val{16}, f_nosplat::typeof(f_nosplat), $(QuoteNode(AutoEnzyme()))::AutoEnzyme{Nothing, Nothing}, x::Vector{Float64})::DifferentiationInterface.HVPGradientHessianPrep{16, NTuple{16, Vector{Float64}}, NTuple{16, Vector{Float64}}, DifferentiationInterface.ForwardOverReverseHVPPrep{DifferentiationInterface.var"#inner_gradient#46"{typeof(f_nosplat), AutoEnzyme{Nothing, Nothing}, DifferentiationInterface.Rewrap{0, Tuple{}}}, DifferentiationInterface.NoPushforwardPrep}, DifferentiationInterfaceEnzymeExt.EnzymeGradientPrep{Vector{Float64}}}
│ %2 = invoke DifferentiationInterface.hessian(f_nosplat::typeof(f_nosplat), %1::DifferentiationInterface.HVPGradientHessianPrep{16, NTuple{16, Vector{Float64}}, NTuple{16, Vector{Float64}}, DifferentiationInterface.ForwardOverReverseHVPPrep{DifferentiationInterface.var"#inner_gradient#46"{typeof(f_nosplat), AutoEnzyme{Nothing, Nothing}, DifferentiationInterface.Rewrap{0, Tuple{}}}, DifferentiationInterface.NoPushforwardPrep}, DifferentiationInterfaceEnzymeExt.EnzymeGradientPrep{Vector{Float64}}}, $(QuoteNode(AutoEnzyme()))::AutoEnzyme{Nothing, Nothing}, x::Vector{Float64})::Matrix{Float64}
└── return %2
) => Matrix{Float64} okay this isn't terribly helpful so let's recurse in a bit to the inner calls function h1(f::F, x, dx) where F
@inline DI.hvp(f, AutoEnzyme(), x, dx)
end
@code_typed h1(f_nosplat, x, (x,))
CodeInfo(
1 ── %1 = Base.getfield(dx, 1, true)::Vector{Float64}
│ %2 = %new(Duplicated{Vector{Float64}}, x, %1)::Duplicated{Vector{Float64}}
│ %3 = invoke Enzyme.autodiff($(QuoteNode(ForwardMode{false, FFIABI, true, false}()))::ForwardMode{false, FFIABI, true, false}, $(QuoteNode(Const{DifferentiationInterface.var"#inner_gradient#46"{typeof(f_nosplat), AutoEnzyme{Nothing, Nothing}, DifferentiationInterface.Rewrap{0, Tuple{}}}}(DifferentiationInterface.var"#inner_gradient#46"{typeof(f_nosplat), AutoEnzyme{Nothing, Nothing}, DifferentiationInterface.Rewrap{0, Tuple{}}}(f_nosplat, AutoEnzyme(), DifferentiationInterface.Rewrap{0, Tuple{}}(())))))::Const{DifferentiationInterface.var"#inner_gradient#46"{typeof(f_nosplat), AutoEnzyme{Nothing, Nothing}, DifferentiationInterface.Rewrap{0, Tuple{}}}}, Duplicated{Vector{Float64}}::Type{Duplicated{Vector{Float64}}}, %2::Duplicated{Vector{Float64}})::@NamedTuple{1::Vector{Float64}}
└─── goto #3
2 ── nothing::Nothing
3 ┄─ %6 = Base.getfield(%3, 1)::Vector{Float64}
└─── goto #4
4 ── goto #5
5 ── goto #6
6 ── goto #7
7 ── %11 = Core.tuple(%6)::Tuple{Vector{Float64}}
└─── goto #8
8 ── goto #9
9 ── goto #10
10 ─ return %11
) => Tuple{Vector{Float64}}
This indicates that the wrapper of DI.inner_gradient could be an issue, meriting further investigation. It looks like this is a closure capturing the function argument which actually might provide problems for user code. Let's reimplement from DI: https://github.com/gdalle/DifferentiationInterface.jl/blob/efb5acf1f3df51cd5f85edc7fb693703a4cd5bf0/DifferentiationInterface/src/second_order/hvp.jl#L91 In particular, let's recurse a bit more to debug and try to fix. # copying from DI
# function inner_gradient(_x, unannotated_contexts...)
# annotated_contexts = rewrap(unannotated_contexts...)
# return gradient(f, nested(inner(backend)), _x, annotated_contexts...)
# end
function h1(f::F, x, dx) where F
prep = @inline DI.prepare_hvp(f, AutoEnzyme(), x, (x,))
@inline DI.hvp(f, prep, AutoEnzyme(), x, dx)
end
@code_typed h1(f_nosplat, x, (x,))
# ...
# much more inlining and debugging later
#
function my_inner_gradient(_x, f::F) where F
return DI.gradient(f, AutoEnzyme(), _x)
end
function my_hvp(f::F, _x, dx) where F
return (DI.pushforward(my_inner_gradient, DI.NoPushforwardPrep(), AutoEnzyme(), _x, dx, DI.Constant(f)),)
end
function my_h1(x, dx)
my_hvp(f_nosplat, x, dx)
end
@code_typed my_h1(x, (x,))
# CodeInfo(
# 1 ─ %1 = Base.getfield(dx, 1, true)::Vector{Float64}
# │ %2 = %new(Duplicated{Vector{Float64}}, x, %1)::Duplicated{Vector{Float64}}
# │ %3 = invoke Enzyme.autodiff($(QuoteNode(ForwardMode{false, FFIABI, true, false}()))::ForwardMode{false, FFIABI, true, false}, $(QuoteNode(Const{typeof(my_inner_gradient)}(my_inner_gradient)))::Const{typeof(my_inner_gradient)}, Duplicated{Vector{Float64}}::Type{Duplicated{Vector{Float64}}}, %2::Duplicated{Vector{Float64}}, $(QuoteNode(Const{typeof(f_nosplat)}(f_nosplat)))::Const{typeof(f_nosplat)})::@NamedTuple{1::Vector{Float64}}
# └── goto #3
# 2 ─ nothing::Nothing
# 3 ┄ %6 = Base.getfield(%3, 1)::Vector{Float64}
# └── goto #4
# 4 ─ goto #5
# 5 ─ goto #6
# 6 ─ goto #7
# 7 ─ %11 = Core.tuple(%6)::Tuple{Vector{Float64}}
# └── goto #8
# 8 ─ %13 = Core.tuple(%11)::Tuple{Tuple{Vector{Float64}}}
# └── goto #9
# 9 ─ return %13
# ) => Tuple{Tuple{Vector{Float64}}}
Great, with our fix looks like we're directly passing in functions without creating closures/indirection! This is probably good to go! (opened an issue on DI to fix this JuliaDiff/DifferentiationInterface.jl#555 Let's check out my_h2 function my_h2(x, dx)
my_hvp(splat(f), x, dx)
end
@code_typed my_h2(x, (x,))
# julia> @code_typed my_h2(x, (x,))
# CodeInfo(
# 1 ─ %1 = Base.getfield(dx, 1, true)::Vector{Float64}
# │ %2 = %new(Duplicated{Vector{Float64}}, x, %1)::Duplicated{Vector{Float64}}
# │ %3 = invoke Enzyme.autodiff($(QuoteNode(ForwardMode{false, FFIABI, true, false}()))::ForwardMode{false, FFIABI, true, false}, $(QuoteNode(Const{typeof(my_inner_gradient)}(my_inner_gradient)))::Const{typeof(my_inner_gradient)}, Duplicated{Vector{Float64}}::Type{Duplicated{Vector{Float64}}}, %2::Duplicated{Vector{Float64}}, $(QuoteNode(Const{Base.Splat{typeof(f)}}(splat(f))))::Const{Base.Splat{typeof(f)}})::@NamedTuple{1::Vector{Float64}}
# └── goto #3
# 2 ─ nothing::Nothing
# 3 ┄ %6 = Base.getfield(%3, 1)::Vector{Float64}
# └── goto #4
# 4 ─ goto #5
# 5 ─ goto #6
# 6 ─ goto #7
# 7 ─ %11 = Core.tuple(%6)::Tuple{Vector{Float64}}
# └── goto #8
# 8 ─ %13 = Core.tuple(%11)::Tuple{Tuple{Vector{Float64}}}
# └── goto #9
# 9 ─ return %13
# ) => Tuple{Tuple{Vector{Float64}}}
# Yeah so it looks like we're sending in a closure Const{Base.Splat{typeof(f)}}(splat(f))) to autodiff, so this could (and does) cause more problems |
Thank you for this analysis, it will be helpful to improve DI! |
To be clear, this is the code_typed analysis I mentioned above as the back of envelope threshold for whether an Enzyme Ext makes sense in addition to DI. Obviously there will be mitigating factors in either direction (code re-use, extreme performance/compat), but this is what I'm suggesting be done in a way that generalizes to the types of arguments/functions intended for use. If it passes the smoke test, an Enzyme Ext probably isn't necessary, if not it probably does. However, as seen in the past few days, even seemingly simple use cases can fail this (including DI internally shown above not needing the closure). Thus my personal default is that it probably is necessary to have an extension until this analysis is performed (which would probably pass for most users which have simpler cases). I am in no way trying to undermine DI. I am, however, trying to keep users from having broken (first priority) or slow (second priority) code with the default setup. |
I had interpreted your requirement as "DI should generate the exact same
Thank you for clarifying. I think in the end we are mostly in agreement, except on the default recommendation:
Both are understandable from our respective standpoints, I think they will simply be appealing to different categories of devs and that's okay. |
And I really appreciate you taking the time to dive into DI internals and debug, the same way I try to do it for Enzyme. I do think we can end up with something pretty decent with some more elbow grease. |
imo this is exactly backwards. we should never do extra work to implement a feature that adds code complexity and maintenance burden without direct evidence that it's needed. Edit: and any PR made as a result of such analysis should have a |
Any discussion that is too general is hard to be correct on. Context matters. NonlinearSolve uses Jacobians for the core Newton methods, and jvp/vjp operations within some line search algorithms. For this context, DI looks pretty good. In fact, that should be DI's bread and butter operation really, and it seems to do rather well already. If it's within 10% of optimal, then I don't see why we wouldn't drop the maintenance burden. We of course need to hold DI accountable for keeping up its performance and continually improving it, but in this context it looks reasonable. That doesn't mean it's reasonable everywhere. Fully supporting Optimization.jl and SciMLSensitivity.jl will take a lot more work in DI, those libraries demand different things. But pointing out the hvp has overhead in DI for example has no bearing on this library since it's not used here. The moment DI is ready for a given library, we should use it, consolidate the code, and reduce the maintenance. It won't be ready for all of them at the same time. |
I totally agree and context does matter and I opened this issue to suggest that it's worth doing the investigation here given a recent case that changed my impression of the relative stability. The specific context which made me think it was worth revisiting places like this repo to investigate a potential extension was an issue on JuMP where using DI caused a segfault where using Enzyme directly ran fine. This was for the intro docs example so if DI causes a crash for simple functions it gives me a bit of pause that merits investigation (and the rule of thumb I mention above of "does it add wrappers" hopefully serves as a good way to judge if DI will introduce overhead which increases that likelihood). using Enzyme, DifferentiationInterface
f(x::T...) where {T} = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2
function ∇²f(x::Vararg{T,N}) where {T,N}
H = Matrix{T}(undef, N, N)
direction(i) = ntuple(j -> Enzyme.Active(T(i == j)), N)
∇f(x...) = Enzyme.autodiff(Enzyme.Reverse, f, Enzyme.Active, x...)[1]
hess = Enzyme.autodiff(
Enzyme.Forward,
∇f,
Enzyme.BatchDuplicated.(Enzyme.Active.(x), ntuple(direction, N))...,
)[1]
for j in 1:N, i in 1:j
H[j, i] = hess[j][i]
end
return H
end
x = [2.0, 3.0]
∇²f(x...) # returns hessian
backend = AutoEnzyme()
DifferentiationInterface.hessian(splat(f), backend, x) # segfaults While I personally think the additional risk of crash is more critical than a performance gap, it's easier to do the performance test so I took the example from the readme in this repo and the performance gap is ~10x (though @gdalle let me know if something should be changed for the setup). It's also probably worth doing the wrapper analysis as well to determine stability. using DifferentiationInterface, ADTypes, Enzyme, BenchmarkTools, StaticArrays
f(u, p) = u .* u .- 2
u0 = @SVector[1.0, 1.0]
p = [1.0, 2.0]
prep = DifferentiationInterface.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p))
@btime DifferentiationInterface.jacobian($f, $prep, AutoEnzyme(), $u0, Constant($p))
# 265.006 ns (4 allocations: 592 bytes)
# 2×2 Matrix{Float64}:
# 2.0 0.0
# 0.0 2.0
function idx(f::F, x, p, i) where F
return f(x, p)[i]
end
function jac1(f::F, x, p) where F
dx = Enzyme.autodiff(Reverse, idx, Const(f), Active(x), Const(p), Const(1))[1][2]
dx2 = Enzyme.autodiff(Reverse, idx, Const(f), Active(x), Const(p), Const(2))[1][2]
return hcat(dx, dx2)
end
@btime jac1($f, $u0, $p)
# 17.817 ns (0 allocations: 0 bytes)
# 2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
# 2.0 0.0
# 0.0 2.0 |
First of all, thank you @wsmoses for focusing on a concrete benchmark instead of generalities. It helped me understand some things and even fix a few bugs. Therefore, the code below is run on the branch from JuliaDiff/DifferentiationInterface.jl#557 (I'll explain the contents of that PR in a minute). I won't go back to the JuMP example because the main culprit here is the obligation to splat arguments into the function, even if there are thousands of them. DI's behavior here may or may not be improved by a solution to JuliaDiff/DifferentiationInterface.jl#555, we'll see. But let me discuss your second example in more detail. Warning I just remembered that DI's design principlesTo understand DI's behavior on this example, it helps to list a few rules of thumb that I used when writing the package:
Obviously none of that applies here, which explains a big part of the bad performance. Now onto the nitty gritty details. Jacobians without cheatingSetup codeusing ADTypes
using BenchmarkTools
import DifferentiationInterface as DI
using DifferentiationInterface: Constant
using Enzyme: Enzyme, Active, Const, Reverse
using StaticArrays
import ForwardDiff
f(u0, p) = u0 .* u0 .- 2
u0 = @SVector[1.0, 1.0]
p = [1.0, 2.0] If you want people to compute Jacobians with Enzyme, the first thing they will try is
julia> DI.jacobian(f, AutoEnzyme(), u0, Constant(p))
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0
julia> DI.jacobian(f, prep, AutoEnzyme(; mode=Enzyme.Reverse), u0, Constant(p))
ERROR: setindex!(::SVector{2, Float64}, value, ::Int) is not defined.
Hint: Use `MArray` or `SizedArray` to create a mutable static array Reverse julia> Enzyme.jacobian(Reverse, f, u0, Const(p))
ERROR: MethodError: no method matching jacobian(::EnzymeCore.ReverseMode{…}, ::typeof(f), ::SVector{…}, ::Const{…})
Closest candidates are:
jacobian(::EnzymeCore.ReverseMode{ReturnPrimal, RuntimeActivity, RABI, Holomorphic, ErrIfFuncWritten}, ::F, ::X; n_outs, chunk) where {ReturnPrimal, F, X, RABI<:EnzymeCore.ABI, ErrIfFuncWritten, RuntimeActivity, Holomorphic}
@ Enzyme ~/.julia/packages/Enzyme/Vjlrr/src/Enzyme.jl:2087
jacobian(::EnzymeCore.ForwardMode, ::Any...; kwargs...)
@ Enzyme ~/.julia/packages/Enzyme/Vjlrr/src/Enzyme.jl:2029 Reverse julia> f_noparam(u0) = u0 .* u0 .- 2
f_noparam (generic function with 1 method)
julia> Enzyme.jacobian(Reverse, f_noparam, u0)
ERROR: setindex!(::SVector{2, Float64}, value, ::Int) is not defined.
Hint: Use `MArray` or `SizedArray` to create a mutable static array
... On the other hand, forward julia> Enzyme.jacobian(Forward, f_noparam, u0)[1]
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
2.0 0.0
0.0 2.0 Jacobians with cheatingTo squeeze maximum performance out of Enzyme, you didn't use partial_output(f::F, u0, p, i) where {F} = f(u0, p)[i]
partial_output_shuffled(u0, f::F, p, i) where {F} = f(u0, p)[i]
function enzyme_jac_cheating(f::F, u0, p) where {F}
du01 = Enzyme.autodiff(
Reverse, partial_output, Const(f), Active(u0), Const(p), Const(1)
)[1][2]
du02 = Enzyme.autodiff(
Reverse, partial_output, Const(f), Active(u0), Const(p), Const(2)
)[1][2]
return hcat(du01, du02)
end
function di_jac_cheating(f::F, u0, p) where {F}
du01 = DI.gradient(
partial_output_shuffled, AutoEnzyme(), u0, Constant(f), Constant(p), Constant(1)
)
du02 = DI.gradient(
partial_output_shuffled, AutoEnzyme(), u0, Constant(f), Constant(p), Constant(2)
)
return hcat(du01, du02)
end What do we get in the benchmark? julia> @btime enzyme_jac_cheating($f, $u0, $p)
17.600 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
2.0 0.0
0.0 2.0
julia> @btime di_jac_cheating($f, $u0, $p)
26.199 ns (2 allocations: 64 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
2.0 0.0
0.0 2.0 I guess that's not too bad, although I'll have to investigate where the allocations come from. Why is DI slow?On the branch from JuliaDiff/DifferentiationInterface.jl#557 (where I modified the batch size to adapt to the array length), we get the following measurements. julia> prep = DI.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p));
julia> @btime DI.jacobian($f, $prep, AutoEnzyme(), $u0, Constant($p))
132.796 ns (3 allocations: 256 bytes)
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0 As you notice, the return type is a julia> J = @MMatrix(zeros(2, 2))
2×2 MMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
0.0 0.0
0.0 0.0
julia> @btime DI.jacobian!($f, $J, $prep, AutoEnzyme(), $u0, Constant($p))
34.976 ns (0 allocations: 0 bytes)
2×2 MMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
2.0 0.0
0.0 2.0 In this case, ForwardDiff remains our safest bet by far, and it is also accessible from DI: julia> @btime DI.jacobian($f, AutoForwardDiff(; chunksize=2), $u0, Constant($p))
2.464 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
2.0 0.0
0.0 2.0 Why is DI wrong?These experiments also made me realize some issues with how DI handles |
I don't think that calling autodiff directly is cheating... Enzyme.autodiff is the standard interface and we explicitly warn folks about the sugar functions being more limited (though I do agree we should still add this support here). I agree it's more natural for end users to try them first, but for a package extension trying to maximize performance and compatibility, using the core function is fine imo. from https://enzyme.mit.edu/index.fcgi/julia/stable/#Gradient-Convenience-functions: While the convenience functions discussed below use autodiff internally, they are generally more limited in their functionality. Beyond that, these convenience functions may also come with performance penalties; especially if one makes a closure of a multi-argument function instead of calling the appropriate multi-argument autodiff function directly. In any case, I think your experiment supports my suggestion that there be an Enzyme Extension here. I don't mind if that extension is implemented with DI.gradient instead of Enzyme.autodiff as above (though the gap there is stil ~50% slowdown), but the point is that some special case handling here is required for the Enzyme ADType. |
But this special handling does not scale well at all if the function is costly or has high-dimensional outputs, can we at least agree on that? Computing the whole vector only to throw most of it away seems very strange to me. Unless Enzyme is able to somehow optimize that away? function idx(f::F, x, p, i) where F
return f(x, p)[i]
end |
Often yes it can, but it depends on the program. And sure, then in your special case handler if you're worried about it you can use the index'd algorithm for low dim and something else for high. I just wrote it that way because it was the easier setup, but you could totally set something up nicely with batch mode with a similar behavior. |
Since I didn't realize DI does forward mode for jacobian's by default, you get something a lot more concise as follows. If you want to truly speed race, here's a couple of ways to tune that puts it on par with forwarddiff. julia> @btime hcat(Enzyme.autodiff(Forward, $f, $(BatchDuplicated(u0, Enzyme.onehot(u0))), $(Const(p)))[1]...)
3.740 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
2.0 0.0
0.0 2.0
julia> @btime hcat(Enzyme.autodiff(Enzyme.set_abi(Forward,InlineABI), $f, $(BatchDuplicated(u0, Enzyme.onehot(u0))), $(Const(p)))[1]...)
2.340 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
2.0 0.0
0.0 2.0
julia> @btime DifferentiationInterface.jacobian($f, AutoForwardDiff(; chunksize=2), $u0, Constant($p))
2.460 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
2.0 0.0
0.0 2.0 But again per below (using latest DI release), I think it's clear that some special handling is required for AutoEnzyme here -- be it calling DI.gradient or Enzyme.autodiff julia> @btime DifferentiationInterface.jacobian($f, AutoEnzyme(), $u0, Constant($p))
449.050 ns (23 allocations: 1.69 KiB)
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0
julia> @btime DifferentiationInterface.jacobian($f, AutoEnzyme(; mode=Enzyme.Reverse), $u0, Constant($p))
ERROR: setindex!(::SVector{2, Float64}, value, ::Int) is not defined.
Hint: Use `MArray` or `SizedArray` to create a mutable static array
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] setindex!(a::SVector{2, Float64}, value::Float64, i::Int64)
@ StaticArrays ~/.julia/packages/StaticArrays/MSJcA/src/indexing.jl:3
[3] macro expansion
@ ~/.julia/packages/StaticArrays/MSJcA/src/broadcast.jl:159 [inlined]
[4] _broadcast!
@ ~/.julia/packages/StaticArrays/MSJcA/src/broadcast.jl:143 [inlined]
[5] _copyto!
@ ~/.julia/packages/StaticArrays/MSJcA/src/broadcast.jl:70 [inlined]
[6] copyto!
@ ~/.julia/packages/StaticArrays/MSJcA/src/broadcast.jl:63 [inlined]
[7] materialize!
@ ./broadcast.jl:914 [inlined]
[8] materialize!
@ ./broadcast.jl:911 [inlined]
[9] (::DifferentiationInterfaceEnzymeExt.var"#5#6")(d0::SVector{2, Float64}, d::SVector{2, Float64})
@ DifferentiationInterfaceEnzymeExt ~/.julia/packages/DifferentiationInterface/v1iM4/ext/DifferentiationInterfaceEnzymeExt/reverse_onearg.jl:39
[10] foreach(::Function, ::@NamedTuple{…}, ::NTuple{…})
@ Base ./abstractarray.jl:3098
[11] batch_seeded_autodiff_thunk(::EnzymeCore.ReverseModeSplit{true, true, false, 0, true, FFIABI, false, true, false}, ::NTuple{16, SVector{…}}, ::Const{typeof(f)}, ::Type{BatchDuplicated}, ::BatchDuplicated{SVector{…}, 16}, ::Const{Vector{…}})
@ DifferentiationInterfaceEnzymeExt ~/.julia/packages/DifferentiationInterface/v1iM4/ext/DifferentiationInterfaceEnzymeExt/reverse_onearg.jl:38
[12] value_and_pullback(f::typeof(f), ::DifferentiationInterface.NoPullbackPrep, backend::AutoEnzyme{ReverseMode{false, false, FFIABI, false, false}, Nothing}, x::SVector{2, Float64}, ty::NTuple{16, SVector{2, Float64}}, contexts::Constant{Vector{Float64}})
@ DifferentiationInterfaceEnzymeExt ~/.julia/packages/DifferentiationInterface/v1iM4/ext/DifferentiationInterfaceEnzymeExt/reverse_onearg.jl:128
[13] pullback(f::typeof(f), prep::DifferentiationInterface.NoPullbackPrep, backend::AutoEnzyme{ReverseMode{false, false, FFIABI, false, false}, Nothing}, x::SVector{2, Float64}, ty::NTuple{16, SVector{2, Float64}}, contexts::Constant{Vector{Float64}})
@ DifferentiationInterfaceEnzymeExt ~/.julia/packages/DifferentiationInterface/v1iM4/ext/DifferentiationInterfaceEnzymeExt/reverse_onearg.jl:142
[14] (::DifferentiationInterface.var"#42#43"{16, Tuple{typeof(f)}, AutoEnzyme{ReverseMode{…}, Nothing}, SVector{2, Float64}, Tuple{Constant{…}}, DifferentiationInterface.NoPullbackPrep, Int64, Vector{NTuple{…}}})(a::Int64)
@ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/first_order/jacobian.jl:269
[15] iterate
@ ./generator.jl:47 [inlined]
[16] _collect
@ ./array.jl:854 [inlined]
[17] collect_similar
@ ./array.jl:763 [inlined]
[18] map
@ ./abstractarray.jl:3285 [inlined]
[19] _jacobian_aux(f_or_f!y::Tuple{…}, prep::DifferentiationInterface.PullbackJacobianPrep{…}, backend::AutoEnzyme{…}, x::SVector{…}, contexts::Constant{…})
@ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/first_order/jacobian.jl:268
[20] jacobian(f::typeof(f), prep::DifferentiationInterface.PullbackJacobianPrep{16, NTuple{…}, NTuple{…}, DifferentiationInterface.NoPullbackPrep}, backend::AutoEnzyme{ReverseMode{…}, Nothing}, x::SVector{2, Float64}, contexts::Constant{Vector{…}})
@ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/first_order/jacobian.jl:157
[21] jacobian(f::typeof(f), backend::AutoEnzyme{ReverseMode{false, false, FFIABI, false, false}, Nothing}, x::SVector{2, Float64}, contexts::Constant{Vector{Float64}})
@ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/fallbacks/no_prep.jl:49
[22] var"##core#1571"(f#1568::typeof(f), u0#1569::SVector{2, Float64}, p#1570::Vector{Float64})
@ Main ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:561
[23] var"##sample#1572"(::Tuple{typeof(f), SVector{2, Float64}, Vector{Float64}}, __params::BenchmarkTools.Parameters)
@ Main ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:570
[24] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; maxevals::Int64, kwargs::@Kwargs{})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:187
[25] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters)
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:182
[26] #invokelatest#2
@ ./essentials.jl:892 [inlined]
[27] invokelatest
@ ./essentials.jl:889 [inlined]
[28] #lineartrial#46
@ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:51 [inlined]
[29] lineartrial
@ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:50 [inlined]
[30] tune!(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, verbose::Bool, pad::String, kwargs::@Kwargs{})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:300
[31] tune!
@ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:289 [inlined]
[32] tune!(b::BenchmarkTools.Benchmark)
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:289
[33] top-level scope
@ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:666
Some type information was truncated. Use `show(err)` to see complete types. |
I am 100% convinced that on this small instance, it is possible to speed up Enzyme significantly beyond what DI can do at the moment. And I also see that it comes at the cost of some careful special-casing, likely based on function runtime and dimensionality. In any case, I think we should let @avik-pal make a decision, and whatever it is that decision can also be revisited in the future. Meanwhile, I'm gonna do my best to bring DI+Enzyme to a satisfactory level. |
For sure, and happy to help you get there :) Generalizing slightly from this, my personal takeaway from this at the moment when folks ask if we should add an Enzyme extension is:
All of course subject to change as things progress moving forward, but that's presently what I'd recommend. |
Non-allocating fast handling of static arrays is of course pretty essential. Though that is something that DI should be able to specialize on, and if we need to use ForwardDiff for that case because that's the setup that currently is optmized I think that's probably fine. Small systems like 10 or less with static arrays is a very common case of course so its performance matters a lot (there's lots of models which are repeated nonlinear solves of such systems, solving them a few hundred thousand times). Generally to achieve this we've had some specialization specifically for static arrays (for example SimpleNonlinearSolve.jl vs NonlienarSolve.jl) as it's just necessary, and so for this we can also just ensure that the static array optimized solvers are targeting whatever the best target is. |
As discussed with @ExpandingMan, specializing every DI backend on static arrays would double the number of extensions, so it's not a great idea for code maintenance. But realistically I guess we only need ForwardDiff, FiniteDiff and Enzyme? |
yeah, you only really care about StaticArrays for forward mode. |
I guess a question to me is whether that would need to be done for more array types or is it just StaticArrays? For example using a view
julia> u = ones(4); u0 = view(u, 2:3)
2-element view(::Vector{Float64}, 2:3) with eltype Float64:
1.0
1.0
julia> @btime hcat(Enzyme.autodiff(Enzyme.set_abi(Forward,InlineABI), $f, $(BatchDuplicated(u0, (view([0., 1., 0., 0.], 2:3), view([0., 0., 1.0, 0.], 2:3)))), $(Const(p)))[1]...)
128.774 ns (4 allocations: 336 bytes)
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0
julia> function idx(f::F, x, p, i) where F
return f(x, p)[i]
end
idx (generic function with 1 method)
julia> function jac1(f::F, x, p) where F
dx = Enzyme.gradient(Reverse, idx, Const(f), x, Const(p), Const(1))[2]
dx2 = Enzyme.gradient(Reverse, idx, Const(f), x, Const(p), Const(2))[2]
return hcat(dx, dx2)
end
jac1 (generic function with 1 method)
julia> @btime jac1($f, $u0, $p)
5.079 μs (59 allocations: 3.31 KiB)
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0
julia> @btime DifferentiationInterface.jacobian($f, AutoEnzyme(), $u0, Constant($p))
ERROR: MethodError: Cannot `convert` an object of type Vector{Float64} to an object of type SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}
Closest candidates are:
convert(::Type{T}, ::T) where T
@ Base Base.jl:84
convert(::Type{T}, ::LinearAlgebra.Factorization) where T<:AbstractArray
@ LinearAlgebra ~/git/Enzyme.jl/julia-1.10.5/share/julia/stdlib/v1.10/LinearAlgebra/src/factorization.jl:108
convert(::Type{T}, ::LinearAlgebra.AbstractQ) where T<:AbstractArray
@ LinearAlgebra ~/git/Enzyme.jl/julia-1.10.5/share/julia/stdlib/v1.10/LinearAlgebra/src/abstractq.jl:49
...
Stacktrace:
[1] (::Base.Fix1{typeof(convert), Type{SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}}})(y::Vector{Float64})
@ Base ./operators.jl:1118
[2] map
@ ./tuple.jl:294 [inlined]
[3] pushforward(f::typeof(f), ::DifferentiationInterface.NoPushforwardPrep, backend::AutoEnzyme{Nothing, Nothing}, x::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, tx::NTuple{16, Vector{Float64}}, contexts::Constant{Vector{Float64}})
@ DifferentiationInterfaceEnzymeExt ~/.julia/packages/DifferentiationInterface/v1iM4/ext/DifferentiationInterfaceEnzymeExt/forward_onearg.jl:73
[4] (::DifferentiationInterface.var"#40#41"{16, Tuple{typeof(f)}, AutoEnzyme{Nothing, Nothing}, SubArray{Float64, 1, Vector{…}, Tuple{…}, true}, Tuple{Constant{…}}, DifferentiationInterface.NoPushforwardPrep, Int64, Vector{NTuple{…}}})(a::Int64)
@ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/first_order/jacobian.jl:236
[5] iterate
@ ./generator.jl:47 [inlined]
[6] _collect
@ ./array.jl:854 [inlined]
[7] collect_similar
@ ./array.jl:763 [inlined]
[8] map
@ ./abstractarray.jl:3285 [inlined]
[9] _jacobian_aux(f_or_f!y::Tuple{…}, prep::DifferentiationInterface.PushforwardJacobianPrep{…}, backend::AutoEnzyme{…}, x::SubArray{…}, contexts::Constant{…})
@ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/first_order/jacobian.jl:235
[10] jacobian(f::typeof(f), prep::DifferentiationInterface.PushforwardJacobianPrep{…}, backend::AutoEnzyme{…}, x::SubArray{…}, contexts::Constant{…})
@ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/first_order/jacobian.jl:157
[11] jacobian(f::typeof(f), backend::AutoEnzyme{Nothing, Nothing}, x::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, contexts::Constant{Vector{Float64}})
@ DifferentiationInterface ~/.julia/packages/DifferentiationInterface/v1iM4/src/fallbacks/no_prep.jl:49
[12] var"##core#21078"(f#21075::typeof(f), u0#21076::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, p#21077::Vector{Float64})
@ Main ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:561
[13] var"##sample#21079"(::Tuple{typeof(f), SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, Vector{Float64}}, __params::BenchmarkTools.Parameters)
@ Main ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:570
[14] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; maxevals::Int64, kwargs::@Kwargs{})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:187
[15] _lineartrial(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters)
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:182
[16] #invokelatest#2
@ ./essentials.jl:892 [inlined]
[17] invokelatest
@ ./essentials.jl:889 [inlined]
[18] #lineartrial#46
@ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:51 [inlined]
[19] lineartrial
@ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:50 [inlined]
[20] tune!(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, verbose::Bool, pad::String, kwargs::@Kwargs{})
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:300
[21] tune!
@ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:289 [inlined]
[22] tune!(b::BenchmarkTools.Benchmark)
@ BenchmarkTools ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:289
[23] top-level scope
@ ~/.julia/packages/BenchmarkTools/QNsku/src/execution.jl:666
Some type information was truncated. Use `show(err)` to see complete types.
julia> @btime DifferentiationInterface.jacobian($f, AutoEnzyme(; mode=Enzyme.Reverse), $u0, Constant($p))
27.761 μs (431 allocations: 25.58 KiB)
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0 |
This is due to a combination of two things:
I have opened an issue here: JuliaDiff/DifferentiationInterface.jl#559 I'm very happy we're finding bugs in DI, feel free to also open issues on the repo instead of just mentioning them in this (already very long) thread. |
For a regular old plain array it also looks like DI adds a ~10x slowdown.
julia> u0 = ones(2);
julia> @btime hcat(Enzyme.autodiff(Enzyme.set_abi(Forward,InlineABI), $f, $(BatchDuplicated(u0, Enzyme.onehot(u0))), $(Const(p)))[1]...)
107.521 ns (4 allocations: 336 bytes)
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0
julia> function idx(f::F, x, p, i) where F
return f(x, p)[i]
end
idx (generic function with 1 method)
julia> function jac1(f::F, x, p) where F
dx = Enzyme.gradient(Reverse, idx, Const(f), x, Const(p), Const(1))[2]
dx2 = Enzyme.gradient(Reverse, idx, Const(f), x, Const(p), Const(2))[2]
return hcat(dx, dx2)
end
jac1 (generic function with 1 method)
julia> @btime jac1($f, $u0, $p)
218.175 ns (7 allocations: 576 bytes)
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0
julia> @btime DifferentiationInterface.jacobian($f, AutoEnzyme(), $u0, Constant($p))
1.442 μs (45 allocations: 3.98 KiB)
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0
julia> @btime DifferentiationInterface.jacobian($f, AutoEnzyme(; mode=Enzyme.Reverse), $u0, Constant($p))
2.169 μs (61 allocations: 5.23 KiB)
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0 |
For the regular array doing DI prep shaved off some time but it remains ~6x slower for some reason julia> prep = DifferentiationInterface.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p))
DifferentiationInterface.PushforwardJacobianPrep{16, NTuple{16, Vector{Float64}}, NTuple{16, Vector{Float64}}, DifferentiationInterface.NoPushforwardPrep}([([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0])], [([2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])], DifferentiationInterface.NoPushforwardPrep(), 2)
julia> @btime DifferentiationInterface.jacobian($f, $prep, AutoEnzyme(), $u0, Constant($p))
670.331 ns (21 allocations: 1.91 KiB)
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0 |
Again, you're comparing apples to oranges. Your optimized Enzyme routine calls Honestly though, I'm tired of this back and forth, which doesn't prove much except that things are faster when we hardcode them. We already knew that. The whole point of DI is to avoid hardcoding everything everywhere, but I can't seem to get that point across. |
I'd like to contribute an outside voice that may be helpful here. The benefits of a generic AD interface for packages like NonlinearSolve.jl are obvious. There are multiple back-ends available at different levels of maturity and if a moment ever comes when there is one and only one obvious choice for autodiff in Julia, it is well in the future. There are also potential benefits (particularly for sake of testing) to being able to easily swap out finite differencing. Of course developing a generic interface that comes with no performance penalties is very hard to do. DI achieves this very well in the vast majority of "simple" cases. There is likely a large class of cases which currently come with some significant performance penalty in DI which have relatively straightforward fixes and can and should be fixed. There is then a (probably much smaller) class of cases which are not reasonable to expect any generic interface to achieve without significant performance penalties. I suspect NonlinearSolve falls into the second class, but this is a decision that it will have to be made by its developers based on its specific needs and benchmarks. Now, as for StaticArrays, while ultimately in most cases they work amazingly well, there is no denying that they are also a major source of headaches for package developers. Normally users of StaticArrays are looking for a whole bunch of extra idiosyncratic guarantees that it may or may not be reasonable for AD packages to make. Indeed, most StaticArrays users will consider any functions with heap allocations "broken", which is not necessarily compatible with the philosophy of an AD implementation or interface. I have come to the conclusion that it's not entirely reasonable to expect AD to work on them without significant penalties in the generic case, and they often require slightly more specialization, so my own approach has been StaticDiffInterface.jl. This may well be entirely temporary, as the most important cases (Enzyme and ForwardDiff) tend to work pretty well now, but at least at this moment I both very much want a generic interface and do not really trust StaticArrays to not get broken (either by AD back-ends or an interface that is not specifically for them). The fact that DI is extensible enough that it allows for this approach illustrates the usefulness of a generic interface. Despite whatever difficulties they present, I don't think StaticArrays changes the overall picture of interface vs direct implementation in any significant way. In most cases (and I have tested quite a lot) you can get perfectly good performance through an interface. There will undoubtedly be cases where that fails. If those are important for NonlinearSolve, it should use a direct AD dependency. Finally, I think showing benchmarks here is only useful if one can show the clear applicability to common use cases in NonlinearSolve. Again, nobody should doubt that there will be cases where DI is fine and cases where it costs a significant penalty. It's perfectly fine that both types of cases exist, it's merely a matter of deciding whether or not a particular package is truly stuck with edge cases that a generic interface cannot solve and how crucial it is for that package to be generic across AD back-ends. |
It is a bit odd though. julia> prep = DifferentiationInterface.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p))
DifferentiationInterface.PushforwardJacobianPrep{16, NTuple{16, Vector{Float64}}, NTuple{16, Vector{Float64}}, DifferentiationInterface.NoPushforwardPrep}([([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0])], [([2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])], DifferentiationInterface.NoPushforwardPrep(), 2)
julia> @btime DifferentiationInterface.jacobian($f, $prep, AutoEnzyme(), $u0, Constant($p))
670.331 ns (21 allocations: 1.91 KiB)
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0 That form should be lowering to exactly: julia> @btime hcat(Enzyme.autodiff(Enzyme.set_abi(Forward,InlineABI), $f, $(BatchDuplicated(u0, Enzyme.onehot(u0))), $(Const(p)))[1]...)
107.521 ns (4 allocations: 336 bytes)
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0 so I'm not sure I understand what would cause the difference. Can someone profile what the actual overhead is here? That must be able to be cut out.
I would actually be surprised if NonlinearSolve is in the latter case. Optimization.jl, yes. SciMLSensitivity.jl, yes. NonlinearSolve.jl is just doing Jacobians and jvps of functions
Agreed. If we have to specialize on static arrays, so be it. That's straightforwad. I expect StaticArrays are weird because most things with arrays assume the array interface is satisfied, which assumes DI should handle the common array case well, and use ArrayInterface.jl to take the optimal path on all generic array types (i.e. paths for indexing vs non-indexable, etc.). That's what it should focus on getting for now. There's still a ton of generic code optimizing for it to do, and Array, so reducing its scope to just those cases and taking StaticArrays out of it is a good use of time. |
I explained the overhead in great detail in this comment, section "Why is DI slow?". TLDR:
DI does not assume that the batch size
You can see the whole code in this file: Your remark actually made me realize a suboptimal aspect, fixed in JuliaDiff/DifferentiationInterface.jl#561 (x2 speedup). I would gladly call
It may be straightforward, but it doesn't come for free. Anything I do in DI, I have to test properly, against as many backends as I can support. That's how I get 99% code coverage (by the way I just added code coverage to Enzyme in EnzymeAD/Enzyme.jl#1954). The productive part of this thread was coming up with actual benchmarks where DI is slowed down. I have taken note of them, opened some issues to keep track. Barring a few exceptions (JuMP-mandated splatting), most of them are definitely fixable, because I designed the API to be flexible enough. It's just a matter of time and energy. |
I would just keep JuMP entirely separate for this reason. Not supporting full Julia functions is just a non-starter and we shouldn't have things splat because JuMP needs it 😅. That's one of the reasons we had to start making Optimization.jl for the neural network support, splatting for a NN is just... yeah that's not happening.
That stat is very misleading though. In a real sense it has a lot less coverage than something like SparseDiffTools because the coverage metric is only whether a line of code is hit, not for all of the different dispatches and types that can hit it. Testing the full gambit of FillArrays, MultiScaleArrays, ArrayPartition, CuSparseMatrixCSR, etc. is definitely not covered from what I can tell right now.
ForwardDiff benchmarked it at 12 before. I'm skeptical it's 10 and not 8, if the SIMD is good it should max out when all of the operations fully SIMD right? Saving a few primals at that point usually isn't a big deal.
That's probably the solution here. NonlinearSolve literally just wants a Jacobian. If we need a faster Jacobian setup for Enzyme, we should let Enzyme.jl control its definition via Enzyme.jacobian! and DI/NonlinearSolve.jl should just use it. I'm sure Billy can make that fast and we'll gladly just use the fast version of that. The upside is that I know Enzyme has coming special tools for sparsity handling, and so it would be great if Enzyme was handling that the way it wanted to and DI could just facilitate shuttling down to the Enzyme-specific calls to abstract away the interface. In the end, someone has to make a Jacobian function if we want a Jacobian, so let's just make sure the Enzyme.jacobian is the good one. |
Let's not even discuss SparseDiffTools given the bad stuff we found in #473 and JuliaDiff/SparseDiffTools.jl#307. I've been working tirelessly so that we can retire it as soon as possible, and that day is coming.
That is entirely correct. I have tests for StaticArrays and JLArrays (GPU emulators) but I can't test every array type in the ecosystem while keeping the CI time reasonable (keep in mind everything is multiplied by 12 backends and 8 differentiation operators).
That's why I used the
If Billy codes a faster Jacobian, DI can also fall back on it, and everyone is happy. But I gather the Enzyme philosophy sofar has been "high-level functions are in general not optimally fast" (see this comment). This is the root of all our disagreements above, because low-level operators need to be hardcoded in each case for Enzyme to reach top speed.
I'll be happy to add bindings to Spadina once it's accessible from Julia, documented and tested. |
I'm happy to report a x15 speedup in DI's handling of this benchmark case, and we get the right Jacobian type now. DI v0.6.8 (before this discussion): julia> prep = DI.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p));
julia> @btime DI.jacobian(f, $prep, AutoEnzyme(), $u0, Constant($p))
375.357 ns (4 allocations: 592 bytes)
2×2 Matrix{Float64}:
2.0 0.0
0.0 2.0 DI v0.6.10 (from the branch JuliaDiff/DifferentiationInterface.jl#565): julia> prep = DI.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p));
julia> @btime DI.jacobian(f, $prep, AutoEnzyme(), $u0, Constant($p))
25.890 ns (1 allocation: 48 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
2.0 0.0
0.0 2.0 It's not fully allocation-free yet, but this shows that progress on such cases is not limited by the API, just by the current generic implementation. |
Awesome! By chance, did you figure out the source of the other ~10x gap? Is it just that allocation, or is it something more? Note that below I used DI#main and Enzyme#main. Per your request I added multi-arg support to ForwardMode functions. However, note that the semantics of Enzyme.jacobian are sufficiently different from DI.jacobian that offhand I don't think you can directly forward one to the other (in particular how const/duplicated/etc function's works differs from your present semantics, as well as the return type being multi dimensional). julia> using Enzyme, BenchmarkTools, DifferentiationInterface, ADTypes, StaticArrays
julia> f(u, p) = u .* u .- 2
f (generic function with 1 method)
julia> u0 = @SVector[1.0, 1.0]
2-element SVector{2, Float64} with indices SOneTo(2):
1.0
1.0
julia> p = [1.0, 2.0]
2-element Vector{Float64}:
1.0
2.0
julia> prep = DifferentiationInterface.prepare_jacobian(f, AutoEnzyme(), u0, Constant(p))
DifferentiationInterface.PushforwardJacobianPrep{2, Tuple{SVector{2, Float64}, SVector{2, Float64}}, Tuple{MVector{2, Float64}, MVector{2, Float64}}, DifferentiationInterface.NoPushforwardPrep}(Tuple{SVector{2, Float64}, SVector{2, Float64}}[([1.0, 0.0], [0.0, 1.0])], Tuple{MVector{2, Float64}, MVector{2, Float64}}[([1.0e-323, 6.23553276621004e-310], [1.0e-323, 6.23553276621004e-310])], DifferentiationInterface.NoPushforwardPrep(), 2)
julia> @btime DifferentiationInterface.jacobian($f, $prep, AutoEnzyme(), $u0, Constant($p))
19.158 ns (1 allocation: 48 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
2.0 0.0
0.0 2.0
julia> @btime hcat(Enzyme.autodiff(Enzyme.set_abi(Forward,InlineABI), $f, $(BatchDuplicated(u0, Enzyme.onehot(u0))), $(Const(p)))[1]...)
2.330 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
2.0 0.0
0.0 2.0
julia> @btime Enzyme.jacobian(Enzyme.set_abi(Forward,InlineABI), $f, $u0, $(Const(p)))
2.330 ns (0 allocations: 0 bytes)
([2.0 0.0; 0.0 2.0], nothing) @gdalle again just to re-iterate the goal of DI is fantastic. The idea of having an interface that encapsulates all of the common use cases is a laudable and critical task. However, each end user may have additional information which lets them run code more efficiently directly and/or let more code be successfully differentiated. Adding that generally to DI would mean a separate function for each potential context e.g. (known scalar output jacobian, know array output jacobian, symmetric input jacobian, coupled input output, etc). Covering every case is clearly a non goal as it would be boiling the ocean. The question is a trade off between what level of generality that gets you the most use cases bang for your buck. There will always be cases where the more information is worth the trade, and others where it's not. This also will simultaneously depend on the state of the artifact's implementation, which will change over time and likely lead to more users being able to switch over to DI. Your improvements here are fantastic, but still I think at the current state of the artifact and end application, it merits the ~10 lines of code needed to add special handling for Enzyme. Again, this could be calling DI.gradient in the special handling if you want, I don't care if the special handling is implemented with DI subfunctions instead of DI.jacobian. But it does look like it's merited -- and of course may chance in the future if these gaps disappear! But that's no reason to block a PR you don't like, we can always remove it once it's no longer needed. On a side note: Frankly this entire conversation has been exhausting and makes me think twice about engaging with DI in the future. Here's what seemed to happen in this thread from my view: I offer the suggestion that it may be warranted to add a ~10 line special case handler for my package and you responded like it was a personal attack and claiming to undermine you, demanding a justification why Enzyme deserves an extension here. I give a high level description for why/when it can be beneficial and you now demand "Do you have benchmarks or errors to show that DI is insufficient here? Not in JuMP, in this very repo." I benchmark the function in the readme of this repo and you claim this is "cheating" and "hard-coded" by calling the actual autodiff API. At this point I've shown this is sufficiently general that there's now a draft PR (#478), and a nicer Enzyme.jacobian function. I don't know what more you'll want next, and I don't think it's productive to continue this thread, so I'm going to close the issue. I'm glad that you've used these examples to improve DI, and I hope it continues to improve in the future. Feel free to take the PR, or feel free to ignore it. |
Yes, I eventually figured it out! DI's default On that PR branch, DI's generic implementation using Enzyme's pushforwards is within a factor 2 of native Enzyme (note that it doesn't use your new julia> mode = Enzyme.set_abi(Enzyme.Forward, Enzyme.InlineABI)
EnzymeCore.ForwardMode{false, EnzymeCore.InlineABI, false, false}()
julia> backend = AutoEnzyme(; mode);
julia> prep = prepare_jacobian(f, backend, u0, Constant(p));
julia> @btime jacobian(f, $prep, $backend, $u0, Constant($p))
3.815 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
2.0 0.0
0.0 2.0
julia> @btime hcat(Enzyme.autodiff($mode, $f, $(Enzyme.BatchDuplicated(u0, Enzyme.onehot(u0))), $(Enzyme.Const(p)))[1]...,)
2.996 ns (0 allocations: 0 bytes)
2×2 SMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
2.0 0.0
0.0 2.0
As discussed privately on Slack, I'll strive to be more Enzyme-friendly in the future, and learn from your custom implementations rather than dismiss them. I can even help out with #478 if necessary. |
AutoEnzyme should probably be specialized and not fall back to DI.
In addition to being slower in some cases, it's been shown to cause errors (even segfaults) when using AutoEnzyme in DI whereas using Enzyme directly ran successfully.
see jump-dev/JuMP.jl#3836 (comment)
The text was updated successfully, but these errors were encountered: