Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TensorRT] Fix perf issue for DDS nodes run by TRT 10 #23424

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

chilo-ms
Copy link
Contributor

There is a known performance issue with the DDS ops (NonMaxSuppression, NonZero and RoiAlign) when running TRT EP with TRT 10. The issue arises because when cudaStreamSynchronize is called after inference, GPU memory is released back to the OS. As a result, for the next inference run, TRT reallocates GPU memory again from the OS, introducing overhead and leading to performance degradation.
The solution is to increase the memory pool threshold, allowing TRT to retain the allocated memory and reduce this overhead.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

Comment on lines 2483 to 2485

std::set<std::string> exclude_ops_set;
std::set<std::string> exclude_ops_set; // currently not support to exclude ops

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::set<std::string> exclude_ops_set;
std::set<std::string> exclude_ops_set; // currently not support to exclude ops
std::set<std::string> exclude_ops_set; // currently not support to exclude ops

Comment on lines 2670 to 2675

#if NV_TENSORRT_MAJOR >= 10
// TRT EP will take appropriate actions later to prevent performance degradation if the graph has DDS op that run by TRT 10.
is_dds_op_in_graph_ = IsDDSOpInSubGraph(graph, result, dds_op_set_);
#endif

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#if NV_TENSORRT_MAJOR >= 10
// TRT EP will take appropriate actions later to prevent performance degradation if the graph has DDS op that run by TRT 10.
is_dds_op_in_graph_ = IsDDSOpInSubGraph(graph, result, dds_op_set_);
#endif
#if NV_TENSORRT_MAJOR >= 10
// TRT EP will take appropriate actions later to prevent performance degradation if the graph has DDS op that run by TRT 10.
is_dds_op_in_graph_ = IsDDSOpInSubGraph(graph, result, dds_op_set_);
#endif

Comment on lines 2775 to 2778
std::vector<NodeComputeInfo>& node_compute_funcs) {

#if NV_TENSORRT_MAJOR >= 10
// There is a known performance issue with the DDS ops (NonMaxSuppression, NonZero and RoiAlign) when running TRT EP with TRT 10.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::vector<NodeComputeInfo>& node_compute_funcs) {
#if NV_TENSORRT_MAJOR >= 10
// There is a known performance issue with the DDS ops (NonMaxSuppression, NonZero and RoiAlign) when running TRT EP with TRT 10.
std::vector<NodeComputeInfo>& node_compute_funcs) {
#if NV_TENSORRT_MAJOR >= 10
// There is a known performance issue with the DDS ops (NonMaxSuppression, NonZero and RoiAlign) when running TRT EP with TRT 10.

Comment on lines 2786 to 2788
#endif

for (auto& fused_node_graph : fused_nodes_and_graphs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#endif
for (auto& fused_node_graph : fused_nodes_and_graphs) {
#endif
for (auto& fused_node_graph : fused_nodes_and_graphs) {

Comment on lines 643 to 645
* Check if DDS op is in the ComputeCapability/subgraph.
*/
bool IsDDSOpInSubGraph(const GraphViewer& graph,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Check if DDS op is in the ComputeCapability/subgraph.
*/
bool IsDDSOpInSubGraph(const GraphViewer& graph,
* Check if DDS op is in the ComputeCapability/subgraph.
*/
bool IsDDSOpInSubGraph(const GraphViewer& graph,

Comment on lines 263 to 266
bool TensorrtExecutionProvider::IsDDSOpInSubGraph(const GraphViewer& graph,
std::vector<std::unique_ptr<ComputeCapability>>& compute_capabilities,
std::unordered_set<std::string>& dds_op_set) const {
auto is_dds_op = [&](const auto& node) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool TensorrtExecutionProvider::IsDDSOpInSubGraph(const GraphViewer& graph,
std::vector<std::unique_ptr<ComputeCapability>>& compute_capabilities,
std::unordered_set<std::string>& dds_op_set) const {
auto is_dds_op = [&](const auto& node) {
bool TensorrtExecutionProvider::IsDDSOpInSubGraph(const GraphViewer& graph,
std::vector<std::unique_ptr<ComputeCapability>>& compute_capabilities,
std::unordered_set<std::string>& dds_op_set) const {
auto is_dds_op = [&](const auto& node) {

Comment on lines 269 to 271
};

for (auto& compute_capability : compute_capabilities) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
};
for (auto& compute_capability : compute_capabilities) {
};
for (auto& compute_capability : compute_capabilities) {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant