
There is a bug with the new version of boto used by s3fs that prevents writes to non-AWS S3 buckets #1546

Open
ryanovas opened this issue Jan 21, 2025 · 14 comments

Comments

@ryanovas

ryanovas commented Jan 21, 2025

Apache Iceberg version

0.8.1 (latest release)

Please describe the bug 🐞

Here is a link to the relevant boto issue: boto/boto3#4398

Attempting to use table.append or table.overwrite with the botocore 1.36.x version installed by default alongside s3fs causes a very confusing error: botocore.exceptions.ClientError: An error occurred (MissingContentLength) when calling the PutObject operation: None

The workaround is to downgrade back to botocore 1.35.99 manually.

I'm unsure whether there's work to be done on the pyiceberg side to resolve this, but I'm posting this issue for others who, like me, are googling this error and struggling to find answers.

I might also suggest that pyiceberg consider a library that doesn't rely on the AWS-controlled boto stack and is more S3-agnostic (something like s3cmd), for those of us using Digital Ocean, MinIO, Backblaze, etc.

@Fokko
Contributor

Fokko commented Jan 21, 2025

Thanks for raising awareness here @ryanovas. It looks like the lock is at 1.36.1:

iceberg-python/poetry.lock

Lines 406 to 415 in c84dd8d

[[package]]
name = "boto3"
version = "1.36.1"
description = "The AWS SDK for Python"
optional = false
python-versions = ">=3.8"
files = [
{file = "boto3-1.36.1-py3-none-any.whl", hash = "sha256:eb21380d73fec6645439c0d802210f72a0cdb3295b02953f246ff53f512faa8f"},
{file = "boto3-1.36.1.tar.gz", hash = "sha256:258ab77225a81d3cf3029c9afe9920cd9dec317689dfadec6f6f0a23130bb60a"},
]

This is interesting, since we test against MinIO in our test suite and the issue didn't show up there. Do you know how to replicate this?

@ryanovas
Author

In my case I was using a Digital Ocean Spaces bucket and Lakekeeper, and then tried to follow the getting-started guide from the docs with the taxi data. When I got to the append step I started getting the errors.
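
For what it's worth, a minimal sketch of the kind of flow that triggers it; the catalog URI, endpoint, credentials, and table name are all placeholders:

import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Sketch only: placeholders for a REST catalog (e.g. Lakekeeper) backed by an
# S3-compatible object store such as Digital Ocean Spaces.
catalog = load_catalog(
    "default",
    **{
        "uri": "https://<your-rest-catalog>",
        "s3.endpoint": "https://<your-s3-compatible-endpoint>",
        "s3.access-key-id": "<access-key>",
        "s3.secret-access-key": "<secret-key>",
    },
)

df = pa.table({"id": pa.array([1, 2, 3], type=pa.int64())})
table = catalog.create_table("demo.numbers", schema=df.schema)
table.append(df)  # fails with MissingContentLength on botocore >= 1.36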

@maarek

maarek commented Jan 31, 2025

This occurs for me as well against OCI Object Storage using the S3-compatible API. With a plain Boto3 request I can set the necessary config to disable the checksum calculation they added. boto/boto3#4398 (comment)

from botocore.config import Config

config = Config(
    request_checksum_calculation='WHEN_REQUIRED',   # required for Content-Length error
    response_checksum_validation='WHEN_REQUIRED',   # required for Content-Length error
)
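
For a plain boto3 client that config can be passed directly; a minimal sketch, with the endpoint URL and bucket as placeholders:

import boto3
from botocore.config import Config

# Sketch: relax the new default checksum behaviour so PutObject works against
# S3-compatible providers that don't understand the extra headers yet.
config = Config(
    request_checksum_calculation="WHEN_REQUIRED",
    response_checksum_validation="WHEN_REQUIRED",
)
s3 = boto3.client(
    "s3",
    endpoint_url="https://<your-s3-compatible-endpoint>",
    config=config,
)
s3.put_object(Bucket="<bucket>", Key="test.txt", Body=b"hello")  # no MissingContentLength error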

@ryanovas
Author

The issue on botocore was closed because there are apparently params you can pass to turn off the new checksum behaviour. But I'm not sure how we can apply those flags here, given that botocore is buried so many layers deep. Perhaps we just need to wait until S3-compatible providers update their implementations to send the matching headers? That doesn't seem ideal though; we can't stay on 1.35 forever.

@kevinjqliu
Contributor

> The issue on botocore was closed because there are apparently params you can pass to turn off the new checksum behaviour. But I'm not sure how we can apply those flags here, given that botocore is buried so many layers deep.

We pass configs down to s3fs; I think it would be useful to allow passing arbitrary configs to s3fs so that we don't need to explicitly define all of s3fs's S3FileSystem args.

@ryanovas do you know if that option is available in s3fs yet? I couldn't find a match.

@ryanovas
Author

I think that would be the move as well: let those of us on S3-compatible storage pass in anything extra we need.

Sadly I don't know much about s3fs. A quick search of their issues brings up fsspec/s3fs#855, but I don't think that's the same error (or that just turning off SSL, as they suggest there, is ever a good idea).

It might be necessary to open an issue with them if the option isn't present in their docs. I can try to help out with this more next week.

@maarek

maarek commented Feb 1, 2025

It is discussed briefly in fsspec/s3fs#931, though it might be best to track this as a separate issue if necessary. From the docs it looks like the setting should be passed into config_kwargs in s3fs, and if arbitrary args are allowed in pyiceberg, the rest can be left up to the user.
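
A sketch of what that looks like when constructing s3fs directly; the endpoint and bucket are placeholders, and config_kwargs is forwarded into the botocore client config:

import s3fs

# Sketch: config_kwargs is passed through aiobotocore into the botocore Config,
# so the checksum defaults can be relaxed for S3-compatible providers.
fs = s3fs.S3FileSystem(
    client_kwargs={"endpoint_url": "https://<your-s3-compatible-endpoint>"},
    config_kwargs={
        "request_checksum_calculation": "WHEN_REQUIRED",
        "response_checksum_validation": "WHEN_REQUIRED",
    },
)
with fs.open("<bucket>/test.txt", "wb") as f:
    f.write(b"hello")  # should no longer fail with MissingContentLength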

@Fokko
Contributor

Fokko commented Feb 3, 2025

I'm open to locking it to <1.36 since a lot of people are experiencing issues here.

@ryanovas
Author

ryanovas commented Feb 3, 2025

IMO locking the version seems like an OK temporary fix, but the better fix is likely figuring out how to get the extra params down to botocore. I commented in the s3fs issue with a link to the botocore issue that mentions the configs to disable this on the botocore side.

Edit: According to the s3fs people there is an env var we can try, and if it fixes the issue they will figure out a way to expose it programmatically; then on the pyiceberg side it should be possible to resolve this without pinning the version. I will attempt to try this in the next day or two unless someone beats me to it.

@ryanovas
Author

ryanovas commented Feb 3, 2025

Just an update: setting the environment variable as follows fixes the issue: export AWS_REQUEST_CHECKSUM_CALCULATION='WHEN_REQUIRED'
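
For anyone who'd rather set this from Python than the shell, a small sketch; it has to happen before the S3 client is created (i.e. before the catalog/FileIO is first used):

import os

# Sketch: must be set before botocore/aiobotocore constructs the S3 client.
os.environ["AWS_REQUEST_CHECKSUM_CALCULATION"] = "WHEN_REQUIRED"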

I updated the s3fs issue as well. I guess now we wait for next steps. In the meantime, however, this fixes it for people without having to downgrade.

@ryanovas
Author

ryanovas commented Feb 6, 2025

In the s3fs issue it looks like there's a config we can pass. I believe it was already suggested by @maarek, but it's confirmed in this comment on the issue: fsspec/s3fs#931 (comment)

@Fokko
Contributor

Fokko commented Feb 6, 2025

Thanks for keeping track of this @ryanovas. I think we can do an if-else based on the botocore version that's installed: if it is ≥1.36, we should pass in this parameter.

The relevant code can be found here:

fs = S3FileSystem(client_kwargs=client_kwargs, config_kwargs=config_kwargs)
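
A rough sketch of that version check; the helper name is made up and this is not the actual pyiceberg code:

import botocore
from packaging import version

def _checksum_config_kwargs(config_kwargs: dict) -> dict:
    # Hypothetical helper: on botocore >= 1.36 (where the new checksum defaults
    # landed), relax them unless the user has already set something explicitly.
    if version.parse(botocore.__version__) >= version.parse("1.36.0"):
        config_kwargs.setdefault("request_checksum_calculation", "WHEN_REQUIRED")
        config_kwargs.setdefault("response_checksum_validation", "WHEN_REQUIRED")
    return config_kwargs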

@ryanovas
Author

ryanovas commented Feb 6, 2025

I might suggest we allow a way to set a flag, maybe when we init a catalog, or something similar, since the issue is only present for S3-compatible storage and AWS users probably want to keep the extra checksum headers. Also, it's botocore that has the version incompatibility, and it sits a couple of layers deep (s3fs -> aiobotocore -> botocore), so it feels a little fragile to tie this to the botocore version.
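
Purely to illustrate the shape of that idea, a sketch with a made-up property name (the s3.request-checksum-calculation key does not exist in pyiceberg today; s3.endpoint does):

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    **{
        "uri": "https://<your-rest-catalog>",
        "s3.endpoint": "https://<your-s3-compatible-endpoint>",
        # Hypothetical flag for S3-compatible stores; not an existing pyiceberg property.
        "s3.request-checksum-calculation": "WHEN_REQUIRED",
    },
)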

If you agree, and can point me to where we could add this extra flag, I could try to make a PR to address this. Also, is there a contributing guide anywhere?

@Fokko
Contributor

Fokko commented Feb 6, 2025

@ryanovas That's a good point. I'm fine with a flag at the catalog as well.

We have a contributing guide here. We want to document the flag here (source can be found here), and the config would need to be set here.

Happy to help, let me know if you run into anything!
