Pass an error message to the failure node #6181

Open: 9 commits to merge into base branch master

@popojk (Contributor) commented Jan 18, 2025

Tracking issue

Closes #4574

Why are the changes needed?

flytekit adds an err promise to the failure node's input interface so that the failure node can see why the workflow failed. However, propeller does not pass the error message to the failure node as an input, so err in the failure node is currently always None.

What changes were proposed in this pull request?

  1. In FailureNodeLookup, we added a new attribute, OriginalError, to store the runtime execution error.
  2. In NodeExecutor, preExecute now calls a new function, ResolveOnFailureNodeInput, when the node being executed is an on-failure node. The new function injects the execution error if the failure node declares an err input.
  3. Updated flytepropeller/pkg/utils/assert/literals.go to support more literal map assertions in tests.
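
The injection rule in step 2 can be illustrated with a small Python model. This is only a sketch, not propeller's Go implementation: ErrorValue and resolve_on_failure_node_input are hypothetical names, plain dicts stand in for literal maps, and the sketch returns a copy rather than updating the inputs in place.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ErrorValue:
    # Hypothetical stand-in for the error literal: a message and a node ID.
    message: str
    failed_node_id: str


def resolve_on_failure_node_input(
    node_inputs: Dict[str, Any], node_id: str, error_message: Optional[str]
) -> Dict[str, Any]:
    # Work on a copy so the caller's inputs are left untouched (a choice of this sketch).
    resolved = dict(node_inputs)
    # Inject only when the node declares an "err" input and an error was recorded.
    if "err" in resolved and error_message is not None:
        resolved["err"] = ErrorValue(message=error_message, failed_node_id=node_id)
    return resolved


# A failure node that declares "err" receives the error; one that does not is unchanged.
with_err = resolve_on_failure_node_input(
    {"name": "my_cluster", "err": None}, "fn", "node failure"
)
without_err = resolve_on_failure_node_input({"name": "my_cluster"}, "fn", "node failure")
```

Failure nodes that do not declare err keep working as before, because nothing is injected for them.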

How was this patch tested?

We wrote a workflow that triggers the clean_up on-failure node, whose err input defaults to None. When the workflow runs on the backend, we expect the error message to be passed to the on-failure node.

import typing
from click.testing import CliRunner

from flytekit import task, workflow, ImageSpec, WorkflowFailurePolicy
from flytekit.clis.sdk_in_container import pyflyte
from flytekit.types.error.error import FlyteError

@task
def create_cluster(name: str):
    print(f"Creating cluster: {name}")


@task
def t1(a: int, b: str):
    print('Execute P1')
    print(f"{a} {b}")
    raise ValueError("Fail!")


@task
def delete_cluster(name: str):
    print(f"Deleting cluster {name}")


@task
def clean_up(name: str, err: typing.Optional[FlyteError] = None):  # err is always None for now
    print('execute clean up')
    print(f"Deleting cluster {name} due to {err}")
    print(err)


@workflow(on_failure=clean_up)
def wf(name: str = "my_cluster"):
    c = create_cluster(name=name)
    t = t1(a=1, b="2")
    d = delete_cluster(name=name)
    c >> t >> d

The error message was passed to the err input as expected.

[image not preserved]

Check all the applicable boxes

  • All new and existing tests passed.
  • All commits are signed-off.

Summary by Bito

Implementation of error message propagation to failure nodes in Flyte workflows, enhancing error handling capabilities. Changes include adding OriginalError field to FailureNodeLookup, modifications to node executor, and splitting error resolution logic. The implementation includes new assertions in failure node lookup tests and improvements to literal type comparison functionality in assert utilities. These updates enable better debugging capabilities and workflow failure information access.

Unit tests added: True

Estimated effort to review (1-5, lower is better): 2

@flyte-bot (Collaborator) commented Jan 18, 2025

Code Review Agent Run #f10373

Actionable Suggestions - 5
  • flytepropeller/pkg/utils/assert/literals.go - 1
    • Consider adding nil check for GetStructure · Line 24-24
  • flytepropeller/pkg/controller/executors/failure_node_lookup.go - 1
    • Consider initializing OriginalError field · Line 14-14
  • flytepropeller/pkg/controller/nodes/resolve_test.go - 1
    • Consider handling MakeLiteral error return · Line 503-503
  • flytepropeller/pkg/controller/executors/failure_node_lookup_test.go - 1
    • Consider checking type assertion result · Line 55-55
  • flytepropeller/pkg/controller/nodes/executor.go - 1
    • Consider handling GetOriginalError error return · Line 771-773
Review Details
  • Files reviewed - 8 · Commit Range: bde2f15..109201f
    • flytepropeller/pkg/controller/executors/failure_node_lookup.go
    • flytepropeller/pkg/controller/executors/failure_node_lookup_test.go
    • flytepropeller/pkg/controller/nodes/executor.go
    • flytepropeller/pkg/controller/nodes/resolve.go
    • flytepropeller/pkg/controller/nodes/resolve_test.go
    • flytepropeller/pkg/controller/nodes/subworkflow/subworkflow.go
    • flytepropeller/pkg/controller/workflow/executor.go
    • flytepropeller/pkg/utils/assert/literals.go
  • Files skipped - 0
  • Tools
    • Golangci-lint (Linter) - ✖︎ Failed
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful

AI Code Review powered by Bito Logo

codecov bot commented Jan 18, 2025

Codecov Report

Attention: Patch coverage is 69.89247% with 28 lines in your changes missing coverage. Please review.

Project coverage is 37.09%. Comparing base (4dd64d8) to head (a9d3e3a).
Report is 22 commits behind head on master.

Files with missing lines                                 Patch %   Lines
flytepropeller/pkg/utils/assert/literals.go              60.00%    10 Missing and 4 partials ⚠️
flytepropeller/pkg/controller/nodes/executor.go          18.18%    8 Missing and 1 partial ⚠️
flytepropeller/pkg/controller/nodes/resolve.go           90.24%    2 Missing and 2 partials ⚠️
...er/pkg/controller/nodes/subworkflow/subworkflow.go    0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6181      +/-   ##
==========================================
+ Coverage   37.02%   37.09%   +0.07%     
==========================================
  Files        1317     1318       +1     
  Lines      132534   132797     +263     
==========================================
+ Hits        49069    49265     +196     
- Misses      79219    79270      +51     
- Partials     4246     4262      +16     
Flag                        Coverage Δ
unittests-datacatalog       51.58% <ø> (ø)
unittests-flyteadmin        54.31% <ø> (+0.05%) ⬆️
unittests-flytecopilot      30.99% <ø> (ø)
unittests-flytectl          62.29% <ø> (ø)
unittests-flyteidl          7.23% <ø> (-0.01%) ⬇️
unittests-flyteplugins      53.87% <ø> (+0.01%) ⬆️
unittests-flytepropeller    42.81% <69.89%> (+0.17%) ⬆️
unittests-flytestdlib       55.35% <ø> (+0.21%) ⬆️


@flyte-bot (Collaborator) commented Jan 18, 2025

Changelist by Bito

This pull request implements the following key changes.

Key Change: Feature Improvement - Error Message Propagation to Failure Nodes
Files Impacted:
  • failure_node_lookup.go - Added OriginalError field and GetOriginalError method to store execution errors
  • executor.go - Added logic to resolve error input for failure nodes
  • resolve.go - Implemented error input resolution functions
  • subworkflow.go - Updated failure node lookup initialization with error handling
  • executor.go - Modified failure node lookup to include execution error

Key Change: Testing - Enhanced Test Coverage for Error Handling
Files Impacted:
  • failure_node_lookup_test.go - Added tests for failure node lookup with error handling
  • resolve_test.go - Added tests for error input resolution
  • literals.go - Added assertion utilities for testing error literals

assert.FailNow(t, "Not yet implemented for types %v", reflect.TypeOf(lt1.GetType()))
}

assert.Equal(t, lt1.GetStructure().GetTag(), lt2.GetStructure().GetTag())
Collaborator

Consider adding nil check for GetStructure

Consider adding a nil check for GetStructure() before accessing GetTag() to avoid potential panic if structure is nil

Code suggestion
Check the AI-generated fix before applying
Suggested change
- assert.Equal(t, lt1.GetStructure().GetTag(), lt2.GetStructure().GetTag())
+ structure1 := lt1.GetStructure()
+ structure2 := lt2.GetStructure()
+ if structure1 != nil && structure2 != nil {
+     assert.Equal(t, structure1.GetTag(), structure2.GetTag())
+ }

Code Review Run #f10373


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Message: "node failure",
}
expectedLiterals := make(map[string]*core.Literal, 1)
errorLiteral, _ := coreutils.MakeLiteral(&core.Error{Message: execErr.Message, FailedNodeId: nID,})
Collaborator

Consider handling MakeLiteral error return

Consider adding error handling for MakeLiteral() call since it returns an error that is currently being ignored with _

Code suggestion
Check the AI-generated fix before applying
Suggested change
- errorLiteral, _ := coreutils.MakeLiteral(&core.Error{Message: execErr.Message, FailedNodeId: nID,})
+ errorLiteral, err := coreutils.MakeLiteral(&core.Error{Message: execErr.Message, FailedNodeId: nID,})
+ if err != nil {
+     t.Fatalf("Failed to create error literal: %v", err)
+     return
+ }

Code Review Run #f10373


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

nl := NewTestNodeLookup(
map[string]v1alpha1.ExecutableNode{v1alpha1.StartNodeID: n, failureNodeID: n},
map[string]v1alpha1.ExecutableNodeStatus{v1alpha1.StartNodeID: ns, failureNodeID: ns},
)

assert.NotNil(t, nl)

- failureNodeLookup := NewFailureNodeLookup(nl, n, ns)
+ failureNodeLookup := NewFailureNodeLookup(nl, n, ns, originalErr).(FailureNodeLookup)
Collaborator

Consider checking type assertion result

Consider adding a test case to verify behavior when originalErr is nil. The type assertion to FailureNodeLookup assumes the cast will always succeed.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- failureNodeLookup := NewFailureNodeLookup(nl, n, ns, originalErr).(FailureNodeLookup)
+ failureNodeLookup, ok := NewFailureNodeLookup(nl, n, ns, originalErr).(FailureNodeLookup)
+ assert.True(t, ok)

Code Review Run #f10373


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines 771 to 773
originalErr, _ := failureNodeLookup.GetOriginalError()
if originalErr != nil {
ResolveOnFailureNodeInput(ctx, nodeInputs, node.GetID(), originalErr)
Collaborator

Consider handling GetOriginalError error return

Consider adding error handling for GetOriginalError() call. The error return value is currently being ignored which could mask potential issues.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- originalErr, _ := failureNodeLookup.GetOriginalError()
- if originalErr != nil {
-     ResolveOnFailureNodeInput(ctx, nodeInputs, node.GetID(), originalErr)
+ originalErr, err := failureNodeLookup.GetOriginalError()
+ if err != nil {
+     return handler.PhaseInfoFailure(core.ExecutionError_SYSTEM, "FailureNodeError", err.Error(), nil), nil
+ } else if originalErr != nil {
+     ResolveOnFailureNodeInput(ctx, nodeInputs, node.GetID(), originalErr)

Code Review Run #f10373


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Signed-off-by: Alex Wu <[email protected]>
Signed-off-by: Alex Wu <[email protected]>
Signed-off-by: Alex Wu <[email protected]>
@popojk popojk force-pushed the Pass-an-error-message-to-the-failure-node branch from 109201f to ab9dfd2 Compare January 18, 2025 08:22
Signed-off-by: Alex Wu <[email protected]>
@flyte-bot (Collaborator) commented Jan 18, 2025

Code Review Agent Run #4749b8

Actionable Suggestions - 4
  • flytepropeller/pkg/controller/nodes/executor.go - 1
  • flytepropeller/pkg/controller/executors/failure_node_lookup_test.go - 1
    • Missing test assertion for error field · Line 30-33
  • flytepropeller/pkg/controller/nodes/resolve_test.go - 1
    • Consider extracting literal creation logic · Line 474-525
  • flytepropeller/pkg/controller/nodes/resolve.go - 1
Review Details
  • Files reviewed - 8 · Commit Range: bde2f15..4414f89
    • flytepropeller/pkg/controller/executors/failure_node_lookup.go
    • flytepropeller/pkg/controller/executors/failure_node_lookup_test.go
    • flytepropeller/pkg/controller/nodes/executor.go
    • flytepropeller/pkg/controller/nodes/resolve.go
    • flytepropeller/pkg/controller/nodes/resolve_test.go
    • flytepropeller/pkg/controller/nodes/subworkflow/subworkflow.go
    • flytepropeller/pkg/controller/workflow/executor.go
    • flytepropeller/pkg/utils/assert/literals.go
  • Files skipped - 0
  • Tools
    • Golangci-lint (Linter) - ✖︎ Failed
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful


Comment on lines 772 to 775
if originalErr != nil {
ResolveOnFailureNodeInput(ctx, nodeInputs, node.GetID(), originalErr)
}
}
Collaborator

Consider nil check for nodeInputs

Consider checking if nodeInputs is nil before attempting to resolve failure node input. The current code could potentially cause a nil pointer dereference if nodeInputs is nil.

Code suggestion
Check the AI-generated fix before applying
Suggested change
-     if originalErr != nil {
-         ResolveOnFailureNodeInput(ctx, nodeInputs, node.GetID(), originalErr)
-     }
- }
+     if originalErr != nil && nodeInputs != nil {
+         ResolveOnFailureNodeInput(ctx, nodeInputs, node.GetID(), originalErr)
+     }
+ }

Code Review Run #4749b8


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines +30 to +33
execErr := &core.ExecutionError{
Message: "node failure",
}
nodeLoopUp := NewFailureNodeLookup(nl, en, ns, execErr)
Collaborator

Missing test assertion for error field

Consider adding assertions to verify that the execErr is correctly set in the FailureNodeLookup struct. Currently, the test doesn't validate the OriginalError field.

Code suggestion
Check the AI-generated fix before applying
 @@ -35,4 +35,5 @@
 	typed := nodeLoopUp.(FailureNodeLookup)
 	assert.Equal(t, nl, typed.NodeLookup)
 	assert.Equal(t, en, typed.FailureNode)
 	assert.Equal(t, ns, typed.FailureNodeStatus)
 +	assert.Equal(t, execErr, typed.OriginalError)

Code Review Run #4749b8


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines 474 to 525
noneLiteral, _ := coreutils.MakeLiteral(nil)
inputLiterals := make(map[string]*core.Literal, 1)
inputLiterals["err"] = &core.Literal{
    Value: &core.Literal_Scalar{
        Scalar: &core.Scalar{
            Value: &core.Scalar_Union{
                Union: &core.Union{
                    Value: noneLiteral,
                    Type: &core.LiteralType{
                        Type: &core.LiteralType_Simple{
                            Simple: core.SimpleType_NONE,
                        },
                        Structure: &core.TypeStructure{
                            Tag: "none",
                        },
                    },
                },
            },
        },
    },
}
inputLiteralMap := &core.LiteralMap{
    Literals: inputLiterals,
}
nID := "fn"
execErr := &core.ExecutionError{
    Message: "node failure",
}
expectedLiterals := make(map[string]*core.Literal, 1)
errorLiteral, _ := coreutils.MakeLiteral(&core.Error{Message: execErr.GetMessage(), FailedNodeId: nID})
expectedLiterals["err"] = &core.Literal{
    Value: &core.Literal_Scalar{
        Scalar: &core.Scalar{
            Value: &core.Scalar_Union{
                Union: &core.Union{
                    Value: errorLiteral,
                    Type: &core.LiteralType{
                        Type: &core.LiteralType_Simple{
                            Simple: core.SimpleType_ERROR,
                        },
                        Structure: &core.TypeStructure{
                            Tag: "FlyteError",
                        },
                    },
                },
            },
        },
    },
}
expectedLiteralMap := &core.LiteralMap{
    Literals: expectedLiterals,
}
Collaborator

Consider extracting literal creation logic

Consider extracting the literal creation logic into a helper function to improve readability and maintainability. The current implementation has deeply nested literal construction that could be simplified.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- (the 52 lines quoted above, lines 474 to 525)
+ inputLiterals := createTestInputLiterals()
+ inputLiteralMap := &core.LiteralMap{
+     Literals: inputLiterals,
+ }
+ nID := "fn"
+ execErr := &core.ExecutionError{
+     Message: "node failure",
+ }
+ expectedLiterals := createExpectedLiterals(execErr, nID)
+ expectedLiteralMap := &core.LiteralMap{
+     Literals: expectedLiterals,
+ }

Code Review Run #4749b8


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines 110 to 111
literals := nodeInputs.GetLiterals()
if literal, exists := literals["err"]; exists {
Collaborator

Consider nil check for execErr parameter

Consider adding error handling for the case when execErr is nil. Currently, the function assumes execErr is non-nil when accessing GetMessage().

Code suggestion
Check the AI-generated fix before applying
Suggested change
- literals := nodeInputs.GetLiterals()
- if literal, exists := literals["err"]; exists {
+ literals := nodeInputs.GetLiterals()
+ if literal, exists := literals["err"]; exists {
+     if execErr == nil {
+         return
+     }

Code Review Run #4749b8


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Signed-off-by: Alex Wu <[email protected]>
@flyte-bot (Collaborator) commented Jan 19, 2025

Code Review Agent Run #f0fe6a

Actionable Suggestions - 1
  • flytepropeller/pkg/utils/assert/literals.go - 1
    • Consider additional nil check edge cases · Line 23-27
Review Details
  • Files reviewed - 3 · Commit Range: 4414f89..3ed4177
    • flytepropeller/pkg/controller/nodes/executor.go
    • flytepropeller/pkg/controller/nodes/resolve_test.go
    • flytepropeller/pkg/utils/assert/literals.go
  • Files skipped - 0
  • Tools
    • Golangci-lint (Linter) - ✖︎ Failed
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful


Comment on lines 23 to 27
structure1 := lt1.GetStructure()
structure2 := lt2.GetStructure()
if structure1 != nil && structure2 != nil {
assert.Equal(t, structure1.GetTag(), structure2.GetTag())
}
Collaborator

Consider additional nil check edge cases

Consider adding a test case to verify behavior when one structure is nil and the other is not. Currently, the code only checks when both are nil or both are non-nil.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- structure1 := lt1.GetStructure()
- structure2 := lt2.GetStructure()
- if structure1 != nil && structure2 != nil {
-     assert.Equal(t, structure1.GetTag(), structure2.GetTag())
- }
+ structure1 := lt1.GetStructure()
+ structure2 := lt2.GetStructure()
+ assert.Equal(t, structure1 == nil, structure2 == nil, "Both structures should be either nil or non-nil")
+ if structure1 != nil && structure2 != nil {
+     assert.Equal(t, structure1.GetTag(), structure2.GetTag())
+ }

Code Review Run #f0fe6a


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Signed-off-by: Alex Wu <[email protected]>
@flyte-bot (Collaborator) commented Jan 19, 2025

Code Review Agent Run #46f37b

Actionable Suggestions - 3
  • flytepropeller/pkg/controller/nodes/resolve.go - 1
  • flytepropeller/pkg/controller/nodes/executor.go - 1
    • Consider improving error handling for ResolveOnFailureNodeInput · Line 775-778
  • flytepropeller/pkg/controller/nodes/resolve_test.go - 1
Review Details
  • Files reviewed - 3 · Commit Range: 3ed4177..bdafc54
    • flytepropeller/pkg/controller/nodes/executor.go
    • flytepropeller/pkg/controller/nodes/resolve.go
    • flytepropeller/pkg/controller/nodes/resolve_test.go
  • Files skipped - 0
  • Tools
    • Golangci-lint (Linter) - ✖︎ Failed
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful


execErr := &core.ExecutionError{
Message: "node failure",
}
nodeLoopUp := NewFailureNodeLookup(nl, en, ns, execErr)
Member

Could we add an assertion below?

assert.Equal(t, execErr, typed.OriginalError)

Comment on lines +23 to +24
structure1 := lt1.GetStructure()
structure2 := lt2.GetStructure()
Member

If only one of the structures is nil, we should also raise an error.
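
The comparison rule requested here can be sketched in Python as follows. This is only an illustration of the rule, not the Go assertion helper in literals.go: TypeStructure and structures_match are hypothetical names, and a False return stands in for a failed test assertion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypeStructure:
    # Hypothetical stand-in for a literal type's structure; only the tag is modelled.
    tag: str


def structures_match(s1: Optional[TypeStructure], s2: Optional[TypeStructure]) -> bool:
    # A structure present on one side only is a mismatch, not something to skip.
    if (s1 is None) != (s2 is None):
        return False
    # Both absent: nothing further to compare.
    if s1 is None or s2 is None:
        return True
    # Both present: the tags must agree.
    return s1.tag == s2.tag


both_absent = structures_match(None, None)
one_absent = structures_match(None, TypeStructure("FlyteError"))
same_tag = structures_match(TypeStructure("FlyteError"), TypeStructure("FlyteError"))
```

Checking the one-sided case first means a missing structure can no longer pass silently, which is the gap in a check that runs only when both sides are non-nil.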

Signed-off-by: Alex Wu <[email protected]>
@popojk (Contributor, Author) commented Jan 28, 2025

Hi @pingsutw, I've added the test cases you mentioned to this PR. Could you please review it and let me know if you have any additional comments? Thanks!

@flyte-bot (Collaborator) commented Jan 28, 2025

Code Review Agent Run #66823e

Actionable Suggestions - 0
Review Details
  • Files reviewed - 2 · Commit Range: bdafc54..a9d3e3a
    • flytepropeller/pkg/controller/executors/failure_node_lookup_test.go
    • flytepropeller/pkg/utils/assert/literals.go
  • Files skipped - 0
  • Tools
    • Golangci-lint (Linter) - ✖︎ Failed
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful


@pingsutw (Member) left a comment

LGTM, thank you! Could you also update this doc when you get a chance?

Development

Successfully merging this pull request may close these issues.

Pass an error message to the failure node
3 participants