Fix #7 split tables to separate files for import #15
Conversation
* postgres_to_redshift.rb (#copy_table): Separate files by arbitrary chunk size when uploading to S3 for import by Redshift. (#upload_table): New parameter `chunk`.
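To make the change easier to follow, here is a rough sketch of a chunked export loop in this style. The `Tempfile`/`Zlib::GzipWriter` setup and the `chunk` argument to `upload_table` come from the diff and commit message; the `source_connection` handle, the exact COPY statement, and the surrounding control flow are assumptions for illustration, not the PR's literal code.

```ruby
require "pg"
require "zlib"
require "tempfile"

# Stream a table out of Postgres, gzip it, and hand each ~5 GB chunk to
# upload_table with an incrementing chunk number.
# `source_connection` is assumed to be a PG::Connection to the source DB,
# and `upload_table(table, file, chunk)` is the helper named in the commit
# message (its exact signature is an assumption here).
def copy_table(table)
  chunksize = 5 * 1024 * 1024 * 1024   # ~5 GB of uncompressed data per chunk
  chunk     = 1
  tmpfile   = Tempfile.new("psql2rs")
  zip       = Zlib::GzipWriter.new(tmpfile)

  # Placeholder COPY statement; the real one selects the table's columns.
  source_connection.copy_data("COPY #{table} TO STDOUT WITH DELIMITER '|'") do
    while (row = source_connection.get_copy_data)
      zip.write(row)
      next unless zip.pos > chunksize  # pos = uncompressed bytes written so far

      # Chunk is full: flush it, upload it, and start a fresh tempfile.
      zip.finish
      tmpfile.rewind
      upload_table(table, tmpfile, chunk)
      tmpfile.unlink
      chunk += 1
      tmpfile = Tempfile.new("psql2rs")
      zip     = Zlib::GzipWriter.new(tmpfile)
    end
  end

  # Upload whatever is left in the final (possibly partial) chunk.
  zip.finish
  tmpfile.rewind
  upload_table(table, tmpfile, chunk)
  tmpfile.unlink
end
```

The diff below shows the corresponding change in `postgres_to_redshift.rb`.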
@@ -83,28 +83,41 @@ def bucket
def copy_table(table)
  tmpfile = Tempfile.new("psql2rs")
  zip = Zlib::GzipWriter.new(tmpfile)
  chunksize = 5 * 1024 * 1024 * 1024
Might be nice to put these constants at the top of the file and use them for readability:
KILOBYTE = 1024
MEGABYTE = KILOBYTE * 1024
GIGABYTE = MEGABYTE * 1024
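For example, a sketch of how the constants might be applied (assuming the class is named `PostgresToRedshift`, as the file name suggests), so the hard-coded chunk size reads at a glance:

```ruby
class PostgresToRedshift
  KILOBYTE = 1024
  MEGABYTE = KILOBYTE * 1024
  GIGABYTE = MEGABYTE * 1024

  def copy_table(table)
    tmpfile   = Tempfile.new("psql2rs")
    zip       = Zlib::GzipWriter.new(tmpfile)
    chunksize = 5 * GIGABYTE  # instead of 5 * 1024 * 1024 * 1024
    # ...
  end
end
```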
This is great! Yeah, I'm not sure how much it helps on its own, but it is pretty much a prerequisite if you start doing […]. Let me take a look at this and get back to you.
Also, wanted to say thanks again for all the great contributions!
Ok, I added a commit to define the byte size constants, and also made some separate cleanup commits. I'm happy to squash them together.
Sorry, I haven't had time to test this yet. I don't have a sandbox to use at the moment, so I need to spin one up again. I'll check in a CloudFormation template for integration testing this once I get one set up.
Thanks!
Cool, thanks for merging and releasing! Bummer we can't figure out why the Postgres connections become stale and unusable. Further research can continue in #16, though.
Indeed. I'm about to head out on vacation, but I'll see what I can do about that.
The only change needed to get this working was uploading multiple files to S3 with a number suffix (`.1`, `.2`, ...). I chose an arbitrary chunk size, which produced ~250 MB files for me.

For now, all tables are split up this way, not just the excessively large ones. The Redshift `COPY` command didn't need to be modified to notice the new convention, since Redshift auto-magically loads all files matching a prefix. And per-table folders are not required, as I had pondered in #7.

I preserved the gem's current file naming convention as the prefix for the S3 files: `export\TABLE.psv.gz`. Before uploading a table to S3, I delete all previous uploads with this prefix. Previously there was only one file with that name, so only the exact S3 key would be deleted. Since the old name matches the prefix, any files of the old sort are expunged as well. This helps with S3 buckets that hold pre-existing, un-split, un-suffixed files uploaded by this script: all old-style S3 files will be deleted and replaced by the new number-suffixed files. The fear was that an old-style S3 file would stick around and be re-imported into Redshift; it seems that won't be a problem.

The code I wrote is a bit ugly and could use some rework, but it shows it can be done.
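A minimal sketch of the S3 side of this scheme, assuming the newer `aws-sdk-s3` interface and an `export/` key prefix; the gem itself may use an older AWS SDK, and `table_name`, `chunk_files`, `target_table`, and the environment variable names are illustrative assumptions, not the PR's actual code.

```ruby
require "aws-sdk-s3"

s3     = Aws::S3::Resource.new(region: ENV.fetch("AWS_REGION", "us-east-1"))
bucket = s3.bucket(ENV.fetch("S3_DATABASE_EXPORT_BUCKET"))
prefix = "export/#{table_name}.psv.gz"

# Deleting by prefix removes both old un-suffixed exports and any previous
# numbered chunks, since they all share the same key prefix.
bucket.objects(prefix: prefix).batch_delete!

# Upload each chunk with a numeric suffix: export/TABLE.psv.gz.1, .2, ...
chunk_files.each_with_index do |path, index|
  bucket.object("#{prefix}.#{index + 1}").upload_file(path)
end

# Redshift's COPY loads every object matching the prefix, so the existing
# command works unchanged for the suffixed files (credentials elided):
copy_sql = <<~SQL
  COPY #{target_table}
  FROM 's3://#{bucket.name}/#{prefix}'
  CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
  GZIP DELIMITER '|'
SQL
```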
Unfortunately, it didn't solve my problems with 66 GB tables. I'm still getting an `Invalid connection! (PG::Error)` from the source Postgres database (RDS, in my case).