
Index compaction for S3 backed databases #853

Open
SimoneLazzaris opened this issue Jul 8, 2021 · 1 comment
Labels
enhancement New feature or request

Comments

@SimoneLazzaris
Collaborator

SimoneLazzaris commented Jul 8, 2021

What would you like to be added or enhanced

If an immudb database is using S3 storage, it is currently impossible to perform an index compaction:

$ immuadmin database clean
rpc error: code = Unknown desc = comapction is unsupported when remote storage is used

I'd like to have some form of compaction for those databases too.

Why is this needed

Indexes on immudb grow large; normally one should periodically compact them using immuadmin database clean, but this is not possible with S3 storage.

In many cases the index grows faster than the actual data, so without index compaction the local disk space occupied by immudb databases is larger when using S3.

Additional context

As an example, a database on local storage containing 1 million log lines occupies 298 MB on disk after compaction.
The same database, using S3, takes 998 MB of disk space, most of it for the indexes.

@SimoneLazzaris SimoneLazzaris added the enhancement New feature or request label Jul 8, 2021
@byo
Contributor

byo commented Jul 9, 2021

@SimoneLazzaris apart from the unsupported index cleanup, there was also a bug with the index not being uploaded to S3; I've fixed this in #855, so the local disk size should no longer be an issue.

Index compaction should still be implemented, because a fragmented index may slow down over time. It's not trivial, though.
When done on a local disk, a new index is created in a separate folder and then renamed to the location where the currently used index is stored. S3 has no such built-in rename; emulating one would take a significant amount of time and would not be atomic.

I was thinking about an alternative solution that does not require renaming:

- The index would not have a strict folder location such as `<db_name>/index`; instead, it could be placed either in that folder (for backwards compatibility) or in a folder suffixed with an increasing number: `<db_name>/index_0000000x`.
- When opening a DB, the proper index folder is selected: the one with the highest number suffix, with an extra check that the compaction finished.
- When a compaction is started, a new folder with a suffix higher than the previous one is created; once the compacted index is stored there, the new folder is used and the old one removed. This requires putting a guard file into the folder during index compaction, so the folder is skipped if the compaction did not finalize.

By doing so, a rename operation won't be needed.
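The folder-selection step above can be sketched as follows. This is a minimal illustration, not immudb's actual implementation; the folder names and the guard-file convention (a marker left in a folder whose compaction never finished) are assumptions for the sake of the example:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// pickIndexFolder chooses which index folder to open: the
// "index_NNNNNNNN" folder with the highest numeric suffix that is not
// flagged by a guard file (i.e. whose compaction finished), falling
// back to the plain "index" folder for backwards compatibility.
// hasGuard reports whether a folder still contains the guard file.
func pickIndexFolder(folders []string, hasGuard func(string) bool) string {
	type candidate struct {
		name string
		n    int
	}
	var numbered []candidate
	fallback := ""
	for _, f := range folders {
		if f == "index" {
			fallback = f
			continue
		}
		if suffix, ok := strings.CutPrefix(f, "index_"); ok {
			if n, err := strconv.Atoi(suffix); err == nil {
				numbered = append(numbered, candidate{f, n})
			}
		}
	}
	// Try the highest suffix first.
	sort.Slice(numbered, func(i, j int) bool { return numbered[i].n > numbered[j].n })
	for _, c := range numbered {
		if !hasGuard(c.name) { // guard file present: compaction never finalized, skip
			return c.name
		}
	}
	return fallback
}

func main() {
	// index_00000003 was left mid-compaction, so index_00000002 wins.
	guards := map[string]bool{"index_00000003": true}
	folders := []string{"index", "index_00000002", "index_00000003"}
	fmt.Println(pickIndexFolder(folders, func(f string) bool { return guards[f] }))
}
```

Compaction would then write into a fresh folder with the next suffix, drop the guard file only after the compacted index is fully uploaded, and finally delete the old folder, so no rename is ever needed.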

@jeroiraz What do you think?
