Fix TextIO not fully reading a GCS file when decompressive transcoding happens #33384
For GCS, we determine the splittability based on whether the file meets decompressive transcoding criteria. When decompressive transcoding occurs, the size returned from metadata (gzip file size) does not match the size of the content returned (original data). In this case, we set the source to unsplittable to ensure all its content is read.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #33384    +/-   ##
=========================================
  Coverage     57.38%   57.39%
  Complexity     1475     1475
=========================================
  Files           973      973
  Lines        154978   154997    +19
  Branches       1076     1076
=========================================
+ Hits          88939    88956    +17
- Misses        63829    63831     +2
  Partials       2210     2210
… in gcs client lib
The change is so simple now. Nice! If the GCS client library breaks us later, then we can issue an update, but I just wanted to ask if it was going to be stable.
c710c92 to 78b30d9
Add a cross-link to googleapis/python-storage#1406, as we will need to clean up the workaround after the previous issue is fixed.
Assigning reviewers. R: @tvalentyn for label python.
Thanks!
When decompressive transcoding occurs, the size returned from metadata (i.e. the gzipped file size) does not match the size of the content returned (i.e. the original data size). This mismatch causes data loss.
In this case, we force the source to be unsplittable to ensure all its content is read. To address this, we leverage the GCS client library's ability to retrieve raw data, even when the object meets the criteria for decompressive transcoding. By setting raw_download=True when initializing the BlobReader, we ensure the complete data is retrieved.
This change should not impact performance. The GCS client library already retrieves raw data from GCS and performs any necessary decompression client-side, mimicking the effects of server-side decompressive transcoding. Therefore, the decompression workload always occurs on the client side, which is consistent both before and after the fix.
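The size mismatch described above can be reproduced without touching GCS. The following is a minimal sketch using only the Python standard library. The in-memory streams and the `read_range` helper are hypothetical stand-ins for illustration, not Beam or GCS client code. It shows how a reader bounded by the metadata size truncates transcoded content, while a raw read followed by client-side decompression recovers everything.

```python
import gzip
import io

# Stand-in for an object that is stored gzip-compressed in a bucket.
original = b"".join(b"record %04d\n" % i for i in range(1000))
stored = gzip.compress(original)

# Object metadata reports the stored (compressed) size.
metadata_size = len(stored)


def read_range(stream, start, stop):
    """Hypothetical helper: read bytes [start, stop) as a size-bounded source would."""
    stream.seek(start)
    return stream.read(stop - start)


# Transcoded read: the stream yields decompressed bytes, but the reader
# still stops at the metadata size, so the tail of the data is lost.
transcoded_stream = io.BytesIO(original)
truncated = read_range(transcoded_stream, 0, metadata_size)

# Raw read: the stream yields the stored bytes, so the metadata size is
# accurate, and decompression happens afterwards on the client.
raw_stream = io.BytesIO(stored)
complete = gzip.decompress(read_range(raw_stream, 0, metadata_size))

print(len(original), len(truncated), len(complete))
```

In the PR itself, the raw read corresponds to passing raw_download=True when initializing the BlobReader; the sketch only mirrors that behavior with local byte streams.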
fixes #18390
fixes #31040