Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to use COPY TO to write Parquet file to S3 if IAM only enabled on bucket/account - workaround/fix included - allows for local parquet file writes #460

Open
2 tasks done
sysadminmike opened this issue Nov 28, 2024 · 5 comments
Labels
COPY Issues related to COPY enhancement New feature or request

Comments

@sysadminmike
Copy link

sysadminmike commented Nov 28, 2024

What happens?

If the buckets policy only allows IAM enabled access it is not possible to use COPY TO s3://... to write out a parquet file.

The below diff make it possible to write to a local filesystem (or s3 and IAM working with something like https://github.com/awslabs/mountpoint-s3)

This could be used to support any format duckdb COPY TO supports instead of postgres as this is just passed down to duckdb just by adding "local://" at the start of the file location eg:

COPY (SELECT * FROM mytable) TO 'local://tmp/test.parquet' (FORMAT 'parquet')

The diff:

# diff pg_duckdb/src/utility/copy.cpp-orig pg_duckdb/src/utility/copy.cpp
4d3
<
30a30,31
> static constexpr char local_filename_prefix[] = "local://";
>
202a204,205
> #include ‹string›
>
>212a216
>             !starts_with(copy_stmt->filename, local_filename_prefix) &&
219a224,229
>       }
>
>         if (starts_with(copy_stmt-›filename, local_filename_prefix)) {
>           std::string filename(copy_stmt->filename);
>           filename.erase(0, 7); // 7 is the length of "local:/" 
>           strcpy (copy_stmt->filename, filename.c_str());

To Reproduce

Setup an s3 bucket with IAM policy only and try to do a COPY TO it

OS:

Linux

pg_duckdb Version (if built from source use commit hash):

na

Postgres Version (if built from source use commit hash):

17

Hardware:

No response

Full Name:

Michael Wolman

Affiliation:

Not applicable

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?

  • Yes, I have
@sysadminmike sysadminmike changed the title Unable to use COPY TO to write Parquet file to S3 if IAM only enabled on bucket/account - workaround/fix included Unable to use COPY TO to write Parquet file to S3 if IAM only enabled on bucket/account - workaround/fix included - allows for local parquet file writes Nov 28, 2024
@JelteF
Copy link
Collaborator

JelteF commented Nov 28, 2024

I think this is pretty much the same issue as #265

I think a better way than to detect the local:// prefix is to detect format 'parquet' in the options, and then forward the COPY to duckdb.

Possibly also detect the .parquet extension in the filename.

@JelteF JelteF added the enhancement New feature or request label Nov 28, 2024
@sysadminmike
Copy link
Author

The local://gives you the option to use duckdb engine or postgres engine for the COPY TO for json/cvs or other formats they either or both support (and may have different option?). Perhaps easier to extend for other file type in the future compared to just .parquet and allow for any file name

Maybe better to replace local:// with localduck:// perhaps so you know your invoking the duckdb engine to handle it.

@ilanpillemer
Copy link

ilanpillemer commented Nov 29, 2024

we use parquet files with the extension of .pq so if we get the local:// or localduck:// added that is much better for our generalised use cases too (it gives us more options to just use engine for other formats)

@JelteF
Copy link
Collaborator

JelteF commented Dec 2, 2024

I think a better approach would be to add a copy option to allow forcing duckdb usage for COPY.

COPY (SELECT * FROM mytable) TO '/tmp/test.csv' (duckdb true)

@sysadminmike
Copy link
Author

That is a good option to allow user to switch between duckdb and postgres to do the copy.

The issue preventing access to an S3 bucket with just IAM enabled still exists - I have found using s3-mount very good especially with local cache enabled for speeding up queries to parquet files already accessed but it may not be an option for everyone.

@JelteF JelteF added the COPY Issues related to COPY label Dec 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
COPY Issues related to COPY enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants