
Index compaction for S3 backed databases #853

Open
SimoneLazzaris opened this issue Jul 8, 2021 · 1 comment
Labels
enhancement New feature or request

Comments

@SimoneLazzaris
Collaborator

SimoneLazzaris commented Jul 8, 2021

What would you like to be added or enhanced

If an immudb database is using S3 storage, it is currently impossible to perform an index compaction:

$ immuadmin database clean
rpc error: code = Unknown desc = comapction is unsupported when remote storage is used

I'd like to have some form of compaction for those databases too.

Why is this needed

Indexes on immudb grow large; normally one should periodically compact them using immuadmin database clean, but this is not possible with S3 storage.

In many cases the index grows faster than the actual data, so without index compaction the local disk space occupied by immudb databases is larger when using S3.

Additional context

As an example, a database on local storage containing 1 million log lines occupies 298 MB on disk after compaction.
The same database, using S3, takes 998 MB of disk space, most of it for the indexes.

@SimoneLazzaris SimoneLazzaris added the enhancement New feature or request label Jul 8, 2021
@byo
Contributor

byo commented Jul 9, 2021

@SimoneLazzaris apart from the unsupported index cleanup, there was also a bug with the index not being uploaded to S3; I've fixed this in #855, so the local disk size should no longer be an issue.

Index compaction should still be implemented, because a fragmented index may slow down over time. It's not trivial, though.
When done on a local disk, a new index is created in a separate folder and then renamed to the location where the currently used index is stored. S3 has no such built-in rename; emulating one would take a significant amount of time and would not be atomic.

I was thinking about an alternative solution that does not require renaming:

- The index would not have a strict folder location such as `<db_name>/index`; instead, it could be placed either in that folder (for backwards compatibility) or in a folder suffixed with an increasing number: `<db_name>/index_0000000x`.
- When opening a DB, the proper index folder is selected: the one with the highest number suffix, with an extra check that the compaction finished.
- When a compaction is started, a new folder with a suffix higher than the previous one is created; once the compacted index is stored there, the new folder is used and the old one removed. This requires putting a guard file into the folder during index compaction, so the folder is skipped if the compaction did not finalize.

By doing so, a rename operation won't be needed.
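The folder-selection step above can be sketched as follows. This is a minimal illustration, not immudb's actual implementation; the folder names and the guard-file convention (a marker left in a folder whose compaction never finished) are assumptions for the sake of the example:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// pickIndexFolder chooses which index folder to open: the
// "index_NNNNNNNN" folder with the highest numeric suffix that is not
// flagged by a guard file (i.e. whose compaction finished), falling
// back to the plain "index" folder for backwards compatibility.
// hasGuard reports whether a folder still contains the guard file.
func pickIndexFolder(folders []string, hasGuard func(string) bool) string {
	type candidate struct {
		name string
		n    int
	}
	var numbered []candidate
	fallback := ""
	for _, f := range folders {
		if f == "index" {
			fallback = f
			continue
		}
		if suffix, ok := strings.CutPrefix(f, "index_"); ok {
			if n, err := strconv.Atoi(suffix); err == nil {
				numbered = append(numbered, candidate{f, n})
			}
		}
	}
	// Try the highest suffix first.
	sort.Slice(numbered, func(i, j int) bool { return numbered[i].n > numbered[j].n })
	for _, c := range numbered {
		if !hasGuard(c.name) { // guard file present: compaction never finalized, skip
			return c.name
		}
	}
	return fallback
}

func main() {
	// index_00000003 was left mid-compaction, so index_00000002 wins.
	guards := map[string]bool{"index_00000003": true}
	folders := []string{"index", "index_00000002", "index_00000003"}
	fmt.Println(pickIndexFolder(folders, func(f string) bool { return guards[f] }))
}
```

Compaction would then write into a fresh folder with the next suffix, drop the guard file only after the compacted index is fully uploaded, and finally delete the old folder, so no rename is ever needed.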

@jeroiraz What do you think?
