Added replaceGcsFilesWithLocalFiles #33006
Assigning reviewers. @kennknowles added as fallback since no labels match configuration.
The PR bot will only process comments in the main thread (not review comments).
Thanks. This is a working solution.
Alternatively, I wonder whether the client could issue a GCS copy call from the source path to the actual staging path when staging happens. This would eliminate the need to download the file locally. It could be a separate task.
ResourceId source = FileSystems.matchNewResource(filePath, false);
try (ReadableByteChannel reader = FileSystems.open(source);
    FileOutputStream writer = new FileOutputStream(tempFile)) {
  ByteStreams.copy(Channels.newInputStream(reader), writer);
I don't totally understand the use of this method, but surely you should use the FileSystems.copy method, which does this:
public static void copy(
Our FileSystems.copy has too many layers. Here, we simply download the file to a given temp file, which is also removed via deleteOnExit. I think we do not need our internal tool here, since the operation is quite straightforward.
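The download-to-temp-file approach described here can be sketched with JDK classes alone. This is only an illustration, not the PR's code: the class name, the `copyToTempFile` helper, and the `staged-` prefix are hypothetical, and an in-memory channel stands in for the channel that FileSystems.open would return.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

class TempFileDownloadSketch {

  /** Copies everything readable from the channel into a new temp file. */
  static File copyToTempFile(ReadableByteChannel reader, String suffix) throws IOException {
    // Hypothetical prefix; the temp file is removed when the JVM exits.
    File tempFile = File.createTempFile("staged-", suffix);
    tempFile.deleteOnExit();
    // Closing the stream also closes the underlying channel.
    try (InputStream in = Channels.newInputStream(reader)) {
      Files.copy(in, tempFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
    return tempFile;
  }

  public static void main(String[] args) throws IOException {
    // In-memory stand-in for a remote file's channel.
    byte[] payload = "example jar bytes".getBytes(StandardCharsets.UTF_8);
    ReadableByteChannel reader = Channels.newChannel(new ByteArrayInputStream(payload));
    File local = copyToTempFile(reader, ".jar");
    System.out.println("Copied " + local.length() + " bytes to " + local.getPath());
  }
}
```

Note that deleteOnExit only schedules deletion at normal JVM shutdown, so a crashed process can leave the temp file behind.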
What you implemented is just the same as copy, but now we really have two copies to maintain. The layers should be robust.
java.lang.IllegalArgumentException: Expect srcResourceIds and destResourceIds have the same scheme, but received gs, file.
Looks like our copy needs the same scheme. :)
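To illustrate why a gs source and a local destination are rejected, here is a simplified stand-in for a same-scheme precondition. It is a sketch and not Beam's actual implementation: the class and method names are hypothetical, and treating a path without a scheme as `file` is an assumption made for this example. Only the message text is taken from the exception quoted above.

```java
import java.net.URI;
import java.util.Locale;

class SchemeCheckSketch {

  /** Returns the URI scheme of a path spec, assuming "file" when none is present. */
  static String schemeOf(String spec) {
    String scheme = URI.create(spec).getScheme();
    return scheme == null ? "file" : scheme.toLowerCase(Locale.ROOT);
  }

  /** Rejects a copy whose source and destination live on different file systems. */
  static void checkSameScheme(String src, String dest) {
    String srcScheme = schemeOf(src);
    String destScheme = schemeOf(dest);
    if (!srcScheme.equals(destScheme)) {
      throw new IllegalArgumentException(
          String.format(
              "Expect srcResourceIds and destResourceIds have the same scheme, but received %s, %s.",
              srcScheme, destScheme));
    }
  }

  public static void main(String[] args) {
    // Same scheme on both sides: passes silently.
    checkSameScheme("gs://bucket/a.jar", "gs://bucket/staging/a.jar");
    try {
      // Remote source, local destination: rejected.
      checkSameScheme("gs://bucket/a.jar", "/tmp/a.jar");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```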
Will close this PR and create a new one to do this when staging the GCS files.
After some offline discussions with @kennknowles: adding this to the staging phase needs more time and changes in more places, since the current staging logic relies on local files (e.g., for computing hashes). So we decided to merge this PR for now. The improvements to the staging phase can be done later.
Fixes #32531
Improve GCS file handling in DataflowRunner
The proposed changes are limited to DataflowRunner and are kept simple. When a GCS file is detected in filesToStage, we first download it to a local temp file and replace the GCS file with this local temp file.
Changes
replaceGcsFilesWithLocalFiles() method
Notes
Tests
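The replacement step described above could look roughly like the following sketch. The PR's real method signature is not shown in this conversation, so everything here is an assumption: entries of filesToStage are treated as plain path strings, GCS files are detected by a `gs://` prefix, and the download is passed in as a hypothetical `downloadToTemp` function so the example runs without any GCS access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class ReplaceGcsFilesSketch {

  /**
   * Returns a copy of filesToStage in which every gs:// entry is replaced by the
   * local path produced by downloadToTemp. Other entries are kept unchanged.
   */
  static List<String> replaceGcsFilesWithLocalFiles(
      List<String> filesToStage, Function<String, String> downloadToTemp) {
    List<String> result = new ArrayList<>(filesToStage.size());
    for (String file : filesToStage) {
      if (file.startsWith("gs://")) {
        result.add(downloadToTemp.apply(file)); // remote file: swap in the local temp copy
      } else {
        result.add(file); // already local: leave untouched
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> files = Arrays.asList("/local/app.jar", "gs://bucket/dep.jar");
    // Fake downloader used only for this demo; a real one would fetch the file.
    Function<String, String> fakeDownload =
        gcsPath -> "/tmp/" + gcsPath.substring(gcsPath.lastIndexOf('/') + 1);
    System.out.println(replaceGcsFilesWithLocalFiles(files, fakeDownload));
  }
}
```

Keeping the downloader as a parameter makes the list rewriting testable on its own, while the later staging logic continues to see only local files.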