Fix #7 split tables to separate files for import #15

Merged
toothrot merged 4 commits into toothrot:master from ashawley:issue-7-split-imports
Jun 12, 2016

Conversation

ashawley
Contributor

The only change needed to get this working was uploading multiple files to S3 with a numeric suffix (.1, .2, ...). I chose an arbitrary chunk size, which produced files of roughly 250 MB for me.

For now, all the tables are split up this way, not just the excessively large ones. The Redshift COPY command didn't need to be modified to pick up the new convention, since Redshift auto-magically loads all files matching a prefix. Nor are folders for every table required, as I had pondered in #7.
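
To illustrate the prefix behavior, a single COPY issued against Redshift picks up export/TABLE.psv.gz.1, .2, and so on. This is a hedged sketch, not the gem's exact statement: the bucket name, credentials placeholders, and the target_conn connection are illustrative.

  require "pg"

  # Hypothetical connection to the Redshift cluster (connection string is a placeholder).
  target_conn = PG.connect("postgres://user:pass@example-cluster:5439/dev")

  # Redshift treats the FROM value as a key prefix, so one COPY loads every
  # numbered chunk (export/TABLE.psv.gz.1, .2, ...) uploaded for the table.
  target_conn.exec(<<-SQL)
    COPY target_table
    FROM 's3://my-bucket/export/TABLE.psv.gz'
    CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
    GZIP DELIMITER '|'
  SQL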

I preserved the gem's current file naming convention as the prefix for the S3 files -- export/TABLE.psv.gz. Before uploading a table to S3, I delete all previous uploads with this prefix. Previously there was only one file with that name, so only the exact S3 key would be deleted. Since the old name matches the prefix, any files of the old sort get expunged as well, which helps with S3 buckets that hold pre-existing, un-split, un-suffixed files uploaded by this script: all old-style S3 files will be deleted and replaced by the new number-suffixed files. The fear was that an old-style S3 file would stick around and be re-imported into Redshift, but that doesn't appear to be a problem.
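
Roughly, the upload side works like the sketch below. This is a minimal illustration, not the PR's actual code: it assumes the aws-sdk-s3 and pg gems, and the copy_table_in_chunks name, the bucket handle, and the chunk size are all placeholders.

  require "aws-sdk-s3"
  require "pg"
  require "zlib"
  require "tempfile"

  CHUNK_SIZE = 5 * 1024 * 1024 * 1024 # arbitrary split point, in uncompressed bytes

  # Stream a table out of Postgres, gzip it, and upload it to S3 as numbered
  # chunks: export/TABLE.psv.gz.1, .2, ... Old objects under the prefix
  # (including un-suffixed ones from earlier runs) are deleted first.
  def copy_table_in_chunks(source_conn, bucket, table)
    prefix = "export/#{table}.psv.gz"
    bucket.objects(prefix: prefix).batch_delete!

    chunk = 1
    tmpfile = Tempfile.new("psql2rs")
    zip = Zlib::GzipWriter.new(tmpfile)

    source_conn.copy_data("COPY #{table} TO STDOUT WITH DELIMITER '|'") do
      while (row = source_conn.get_copy_data)
        zip.write(row)
        next unless zip.pos > CHUNK_SIZE # pos counts uncompressed bytes written

        # Close out this chunk, upload it with the next numeric suffix,
        # and start a fresh tempfile for the following chunk.
        zip.close
        bucket.object("#{prefix}.#{chunk}").upload_file(tmpfile.path)
        tmpfile.unlink
        chunk += 1
        tmpfile = Tempfile.new("psql2rs")
        zip = Zlib::GzipWriter.new(tmpfile)
      end
    end

    # Upload the final (possibly partial) chunk.
    zip.close
    bucket.object("#{prefix}.#{chunk}").upload_file(tmpfile.path)
    tmpfile.unlink
  end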

The code I wrote is a bit ugly and could use some re-work, but it shows it can be done.

Unfortunately, it didn't solve my problems with 66 GB tables. I'm getting an "Invalid connection!" (PG::Error) from the source Postgres database (RDS, in my case).

	* postgres_to_redshift.rb (#copy_table): Separate files by
	arbitrary chunk size when uploading to S3 for import by Redshift.
	(#upload_table): New parameter `chunk`.
@@ -83,28 +83,41 @@ def bucket
def copy_table(table)
  tmpfile = Tempfile.new("psql2rs")
  zip = Zlib::GzipWriter.new(tmpfile)
  chunksize = 5 * 1024 * 1024 * 1024
@toothrot
Owner


Might be nice to throw these constants in the top of the file, and use them for readability:

  KILOBYTE = 1024
  MEGABYTE = KILOBYTE * 1024
  GIGABYTE = MEGABYTE * 1024
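
With those in place, the assignment in the diff above would read (a sketch of the suggestion, not the committed code):

  chunksize = 5 * GIGABYTE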

@toothrot
Owner

This is great! Yeah, I'm not sure how much it helps on its own, but it is pretty much a prerequisite if you start doing LIMIT / OFFSET in the COPY.

Let me take a look at this and get back to you.
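
For illustration, the LIMIT / OFFSET idea might look something like the hypothetical per-chunk query below. This is not part of this PR; it assumes the pg gem, a sortable id column, and a source_conn connection, and the copy_chunk name is made up.

  # Hypothetical: copy one window of rows per chunk instead of streaming the
  # whole table in a single COPY. ORDER BY keeps the windows stable, though
  # LIMIT/OFFSET can get slow on very large tables.
  def copy_chunk(source_conn, table, chunk_number, rows_per_chunk = 1_000_000)
    offset = chunk_number * rows_per_chunk
    sql = "COPY (SELECT * FROM #{table} ORDER BY id " \
          "LIMIT #{rows_per_chunk} OFFSET #{offset}) TO STDOUT WITH DELIMITER '|'"
    source_conn.copy_data(sql) do
      while (row = source_conn.get_copy_data)
        yield row
      end
    end
  end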

@toothrot
Owner

Also, wanted to say thanks again for all the great contributions!

@ashawley
Contributor Author

Ok, I added a commit to define the byte size constants, and also made some separate cleanup commits. I'm happy to squash them together.

@toothrot
Owner

Sorry, I haven't had time to test this yet. I don't have a sandbox to use at the moment, so I need to spin one up again.

I'll check in a CloudFormation template for integration testing this once I get one set up.

@toothrot toothrot merged commit 13f063e into toothrot:master Jun 12, 2016
@ashawley ashawley deleted the issue-7-split-imports branch June 13, 2016 03:44
@toothrot
Owner

thanks!

@ashawley
Contributor Author

Cool, thanks for merging and releasing! Bummer we can't figure out why the Postgres connections become stale and unusable. Further research can continue in #16, though.

@toothrot
Owner

Indeed. I'm about to head out on vacation, but I'll see what I can do about that.
