Fix #7 split tables to separate files for import #15

Merged
toothrot merged 4 commits into toothrot:master from ashawley:issue-7-split-imports
Jun 12, 2016

Conversation

ashawley
Contributor

The only change needed to get this working was uploading multiple files to S3 with a numeric suffix (.1, .2, ...). I chose an arbitrary chunk size, which produced files of roughly 250 MB for me.

For now, all the tables are split up this way, not just the excessively large ones. The Redshift COPY command didn't need to be modified to pick up the new convention, since Redshift auto-magically loads all files matching a prefix. Nor are folders for every table required, as I had pondered in #7.
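
To illustrate the prefix behavior, a single COPY issued against Redshift picks up export/TABLE.psv.gz.1, .2, and so on. This is a hedged sketch, not the gem's exact statement: the bucket name, credentials placeholders, and the target_conn connection are illustrative.

  require "pg"

  # Hypothetical connection to the Redshift cluster (connection string is a placeholder).
  target_conn = PG.connect("postgres://user:pass@example-cluster:5439/dev")

  # Redshift treats the FROM value as a key prefix, so one COPY loads every
  # numbered chunk (export/TABLE.psv.gz.1, .2, ...) uploaded for the table.
  target_conn.exec(<<-SQL)
    COPY target_table
    FROM 's3://my-bucket/export/TABLE.psv.gz'
    CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
    GZIP DELIMITER '|'
  SQL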

I preserved the gem's current file naming convention as the prefix for the S3 files -- export/TABLE.psv.gz. Before uploading a table to S3, I delete all previous uploads with this prefix. Previously there was only one file with that name, so only the exact S3 key would be deleted. Since the old name matches the prefix, any files of the old sort get expunged as well, which helps with S3 buckets that hold pre-existing, un-split, un-suffixed files uploaded by this script: all old-style S3 files will be deleted and replaced by the new number-suffixed files. The fear was that an old-style S3 file would stick around and be re-imported into Redshift, but that doesn't appear to be a problem.
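
Roughly, the upload side works like the sketch below. This is a minimal illustration, not the PR's actual code: it assumes the aws-sdk-s3 and pg gems, and the copy_table_in_chunks name, the bucket handle, and the chunk size are all placeholders.

  require "aws-sdk-s3"
  require "pg"
  require "zlib"
  require "tempfile"

  CHUNK_SIZE = 5 * 1024 * 1024 * 1024 # arbitrary split point, in uncompressed bytes

  # Stream a table out of Postgres, gzip it, and upload it to S3 as numbered
  # chunks: export/TABLE.psv.gz.1, .2, ... Old objects under the prefix
  # (including un-suffixed ones from earlier runs) are deleted first.
  def copy_table_in_chunks(source_conn, bucket, table)
    prefix = "export/#{table}.psv.gz"
    bucket.objects(prefix: prefix).batch_delete!

    chunk = 1
    tmpfile = Tempfile.new("psql2rs")
    zip = Zlib::GzipWriter.new(tmpfile)

    source_conn.copy_data("COPY #{table} TO STDOUT WITH DELIMITER '|'") do
      while (row = source_conn.get_copy_data)
        zip.write(row)
        next unless zip.pos > CHUNK_SIZE # pos counts uncompressed bytes written

        # Close out this chunk, upload it with the next numeric suffix,
        # and start a fresh tempfile for the following chunk.
        zip.close
        bucket.object("#{prefix}.#{chunk}").upload_file(tmpfile.path)
        tmpfile.unlink
        chunk += 1
        tmpfile = Tempfile.new("psql2rs")
        zip = Zlib::GzipWriter.new(tmpfile)
      end
    end

    # Upload the final (possibly partial) chunk.
    zip.close
    bucket.object("#{prefix}.#{chunk}").upload_file(tmpfile.path)
    tmpfile.unlink
  end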

The code I wrote is a bit ugly and could use some re-work, but it shows it can be done.

Unfortunately, it didn't solve my problems with 66 GB tables. I'm getting an "Invalid connection!" (PG::Error) from the source Postgres database (RDS, in my case).

	* postgres_to_redshift.rb (#copy_table): Separate files by
	arbitrary chunk size when uploading to S3 for import by Redshift.
	(#upload_table): New parameter `chunk`.
@@ -83,28 +83,41 @@ def bucket
def copy_table(table)
  tmpfile = Tempfile.new("psql2rs")
  zip = Zlib::GzipWriter.new(tmpfile)
  chunksize = 5 * 1024 * 1024 * 1024
@toothrot
Owner


Might be nice to throw these constants in the top of the file, and use them for readability:

  KILOBYTE = 1024
  MEGABYTE = KILOBYTE * 1024
  GIGABYTE = MEGABYTE * 1024
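
With those in place, the assignment in the diff above would read (a sketch of the suggestion, not the committed code):

  chunksize = 5 * GIGABYTE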

@toothrot
Owner

This is great! Yeah, I'm not sure how much it helps on its own, but it is pretty much a prerequisite if you start doing LIMIT / OFFSET in the COPY.

Let me take a look at this and get back to you.
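
For illustration, the LIMIT / OFFSET idea might look something like the hypothetical per-chunk query below. This is not part of this PR; it assumes the pg gem, a sortable id column, and a source_conn connection, and the copy_chunk name is made up.

  # Hypothetical: copy one window of rows per chunk instead of streaming the
  # whole table in a single COPY. ORDER BY keeps the windows stable, though
  # LIMIT/OFFSET can get slow on very large tables.
  def copy_chunk(source_conn, table, chunk_number, rows_per_chunk = 1_000_000)
    offset = chunk_number * rows_per_chunk
    sql = "COPY (SELECT * FROM #{table} ORDER BY id " \
          "LIMIT #{rows_per_chunk} OFFSET #{offset}) TO STDOUT WITH DELIMITER '|'"
    source_conn.copy_data(sql) do
      while (row = source_conn.get_copy_data)
        yield row
      end
    end
  end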

@toothrot
Owner

Also, wanted to say thanks again for all the great contributions!

@ashawley
Contributor Author

Ok, I added a commit to define the byte size constants, and also made some separate cleanup commits. I'm happy to squash them together.

@toothrot
Owner

Sorry, I haven't had time to test this yet. I don't have a sandbox to use at the moment, so I need to spin one up again.

I'll check in a CloudFormation template for integration testing this once I get one set up.

@toothrot toothrot merged commit 13f063e into toothrot:master Jun 12, 2016
@ashawley ashawley deleted the issue-7-split-imports branch June 13, 2016 03:44
@toothrot
Owner

thanks!

@ashawley
Contributor Author

Cool, thanks for merging and releasing! Bummer we can't figure out why the Postgres connections become stale and unusable. Further research can continue in #16, though.

@toothrot
Owner

Indeed. I'm about to head out on vacation, but I'll see what I can do about that.
