We have recently been working on a new PR to enable Conv-Bias-Elu fusion in XLA, which takes advantage of cuDNN's runtime-compiled fusion kernels. This will be the first PR to exploit this new cuDNN feature in XLA; more patterns will follow after it gets merged.
We have run some simple benchmarks with synthetic data, and in most cases the results show roughly 5% to 40% performance improvement over the current behavior, where XLA fuses only the Conv-Bias part and then runs the Elu separately.
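For illustration, here is a minimal sketch of the kind of Conv-Bias-Elu computation this fusion targets, together with a rough synthetic-data timing loop. The shapes, dtype, and timing approach are assumptions made for this example and are not the actual benchmark behind the numbers above:

```python
# Minimal sketch of a Conv-Bias-Elu computation that the proposed fusion targets,
# plus a rough synthetic-data timing loop. Shapes, dtype, and the timing approach
# are illustrative assumptions, not the benchmark used for the numbers above.
import time
import numpy as np
import tensorflow as tf

@tf.function(jit_compile=True)  # compile with XLA so the fusion passes can kick in
def conv_bias_elu(x, w, b):
    y = tf.nn.conv2d(x, w, strides=1, padding="SAME")
    y = tf.nn.bias_add(y, b)
    return tf.nn.elu(y)

# Synthetic NHWC input, filter, and bias (hypothetical shapes).
x = tf.constant(np.random.rand(32, 56, 56, 64).astype(np.float16))
w = tf.constant(np.random.rand(3, 3, 64, 64).astype(np.float16))
b = tf.constant(np.random.rand(64).astype(np.float16))

conv_bias_elu(x, w, b)  # warm-up: triggers XLA compilation and cuDNN autotuning

start = time.time()
for _ in range(100):
    out = conv_bias_elu(x, w, b)
_ = out.numpy()  # force execution to finish before stopping the clock
print("avg step time:", (time.time() - start) / 100)
```

The warm-up call matters when measuring, since the first execution includes XLA compilation and, with runtime-fused engines, the per-engine cuDNN compilation discussed below.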
We also plan to turn this feature on by default, though it may lead to longer compilation times, since cuDNN needs to compile every kernel/engine during autotuning; this overhead is currently ~1.5 s per engine on Ampere.
Any thoughts and feedback are welcome. Thanks. For the detailed implementation, please refer to this PR.
As a side note, in native TF we have already enabled this feature by setting `TF_CUDNN_USE_RUNTIME_FUSION`; the current list of supported patterns is:

More supported patterns from cuDNN can be found in the online cuDNN guide.
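For completeness, a minimal sketch of how the native-TF path can be enabled; setting the variable before importing TensorFlow is an assumption on my part so that it is picked up at initialization, and `1` is the usual value for such boolean TF switches:

```python
# Enable cuDNN runtime-compiled fusion kernels in native (non-XLA) TF.
# The variable is set before importing TensorFlow so it is visible when
# TF initializes its GPU/cuDNN state.
import os
os.environ["TF_CUDNN_USE_RUNTIME_FUSION"] = "1"

import tensorflow as tf  # imported after the switch is set
```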