Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shares-storage: replace GOB encoding with SSZ, deduplicate validator Index field #1837

Merged
merged 26 commits into from
Jan 21, 2025

Conversation

iurii-ssv
Copy link
Contributor

@iurii-ssv iurii-ssv commented Nov 3, 2024

Rebased onto #1778 (until it's merged - this diff is more representative of the contents of this PR) now rebased onto latest commit from stage

This PR additionally closes #1473 (see this comment for why below), but primarily

this PR closes #1575 (targeting both v2 and genesis code - hence it should work regardless of whether it's deployed before/after Alan fork - this is because both v2 and genesis use the same code from package storage, as far as I can tell, and we don't use gob anywhere else)

we could, but I'm assuming we don't want to re-sync everything from Ethereum smart contract just so we can re-populate database with shares in ssz format,

The question is - do we want to support a proper full-blown rollback back to gob,

just so we have some basic "safety" for production (and once we are confident there were no issues with migration - maybe couple of weeks later - we can get rid of gob completely and even erase the old shares key-range with a separate easy-to-apply DB migration); but also when I'm gonna roll this PR out to stage - if somebody redeploys something else to the same operator/cluster after this change has been running there for a while (and before his branch includes my change) - whole cluster will just break for him (stage branch too, until this PR is merged),

hence the approach I'm taking in this PR (for now) covers most of rollback scenarios (but not all):

  • DB migration creates a copy of old shares key-range (if successful - applies only once), this is kind of necessary waste in terms of space until we delete old shares prefix range
  • and then, newer code uses just ssz while older code uses only gob (and works with old shares data that are still present in DB)

the issue with rolling back to gob this PR doesn't address is that new shares might have been added during the time we were running this new ssz code version - hence for proper rollback to gob we'd need to copy those shares over to the old key-range too; and vise versa can also happen if node versions run are gob(current)->ssz(migration successful)->gob(rollback)->(added new shares to DB)->ssz - we won't see those newly added shares because they'll end up in old gob-related key-range (while migration intended to move them over if successful - applies only once, and has been previously run to success)

it's unlikely we'll need to "rollback with a delay" once everything has been proven to work on stage (and actually needing 100% rollback-supporting approach in practice) ... but you never know, WDYT ?

Before merging:

  • test on stage (can we add new validator to make sure this use-case works fine ?)
  • (assuming we aren't doing 100% rollback-supporting approach - if we do, this will be done at the same time the next step below "delete old shares prefix range" is done, after this code has been running in prod for a while - both can be added to migration_5) once we are done with testing on stage add a commit to enable commented out completed(txn) code in this PR
  • (after we are 100% ssz is working and we are not going back to gob) delete old shares prefix range in a separate migration (migration_6 or higher) since share data isn't super precious we decided to delete it right away as a part of migration_5

@iurii-ssv
Copy link
Contributor Author

I probably should have asked sooner - why we want SSZ for encoding shares in the first place, it doesn't feel like the simplest option we can go for (see one, two, three) - why not just go with json ? It doesn't look like we pass Share(s) around in p2p messages (hence compressing it as much as possible is probably not a concern ... and if it really does consume a lot of DB storage as json perhaps there is a simpler way to compress it than using ssz).

Anyway, this was my 1st time using ssz - just saying might not be the best option if we are to use it more / somewhere else (for future reference, here is a list of all kinds of alternatives I guess - https://github.com/alecthomas/go_serialization_benchmarks).

Another relevant question, I see other entities in storage package are using JSON encoding, could you clarify why those didn't use gob and whether we'll want to migrate those to ssz as well ?

@iurii-ssv
Copy link
Contributor Author

iurii-ssv commented Nov 4, 2024

the issue with rolling back to gob this PR doesn't address is that new shares might have been added during the time we were running this new ssz code version - hence for proper rollback to gob we'd need to copy those shares over to the old key-range too; and vise versa can also happen if node versions run are gob(current)->ssz(migration successful)->gob(rollback)->(added new shares to DB)->ssz - we won't see those newly added shares because they'll end up in old gob-related key-range (while migration intended to move them over if successful - applies only once, and has been previously run to success)

Just for future reference (or in case we'll need it), the way to properly solve it would be:

gob shares prefix rage will be the source of truth (because this is the only way older code works), new szz supporting code needs to

  • drop and re-populate shares_ssz prefix range every time from shares prefix range
  • update both shares and shares_ssz at the same time (same txn) for every update there is to a Share
  • in DB migration, only do complete(txn) after we are actually done with it and getting rid of gob for good (removing a chance to get back to it "properly") - that will finalize migration 5; might as well update DB migration 5 to delete shares prefix range too in the same commit

@iurii-ssv iurii-ssv force-pushed the shares-storage-replace-gob-encoding-with-ssz branch from 34bef36 to c279ecf Compare November 14, 2024 12:24
@iurii-ssv
Copy link
Contributor Author

iurii-ssv commented Nov 14, 2024

Added a commit 2565389 that addresses #1473, bundling it with this PR because it needs DB migration (and it's just simpler to do both #1575 and #1473 as 1 DB migration).

The genearal idea is to "flatten" SSVShare struct. Primarily, we want to flatten SSVShare.Metadata struct (because we want to get rid of SSVShare.Metadata.BeaconMetadata to get rid of duplicate Index value that's now represented by SSVShare.ValidatorIndex),

but I'm going 1 step further and also inlining Metadata struct fields into SSVShare struct because:

  • it's not 100% clear what counts as metadata and what doesn't not (from SSVShare struct perspective there isn't a real benefit to having that distinction)
  • .Metadata modifier only adds bulk to the code making it harder to read

since Metadata is embedded into SSVShare this additional step ^ is trivial to implement (and if we really want to designate certain fields of SSVShare as metadata - we can make a couple of code-lines change on top of what I have in this PR, I just think it makes this code less readable/understandable).

@iurii-ssv iurii-ssv changed the title shares-storage: replace GOB encoding with SSZ shares-storage: replace GOB encoding with SSZ, deduplicate validator Index field Nov 14, 2024
@iurii-ssv iurii-ssv force-pushed the shares-storage-replace-gob-encoding-with-ssz branch from 47c5e6a to efcc1f5 Compare January 9, 2025 11:23
@Tom-ssvlabs Tom-ssvlabs requested a review from y0sher January 14, 2025 11:21
@iurii-ssv iurii-ssv force-pushed the shares-storage-replace-gob-encoding-with-ssz branch from efcc1f5 to 5be2303 Compare January 14, 2025 17:23
Copy link
Contributor

@y0sher y0sher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting change. I agree we need to clarify next time what is the rationale behind such change and it'll help define the scope and make it easier to do decisions.

@iurii-ssv iurii-ssv force-pushed the shares-storage-replace-gob-encoding-with-ssz branch 3 times, most recently from b25008e to 8addd97 Compare January 16, 2025 14:19
@iurii-ssv iurii-ssv force-pushed the shares-storage-replace-gob-encoding-with-ssz branch 2 times, most recently from 82c64d7 to 3cd1e7b Compare January 16, 2025 18:22
Comment on lines 77 to 78
ValidatorPubKey []byte `ssz-max:"48"`
SharePubKey []byte `ssz-max:"48"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain why its needed? the size is actually fixed usually and not a max..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should follow the spec here, and this seems to be in line

Copy link
Contributor

@y0sher y0sher Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@iurii-ssv iurii-ssv Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain why its needed? the size is actually fixed usually and not a max..

I think I saw various tests failing with exact size (as opposed to max) - so I thought maybe it is meant to support pubkeys of different format (with size lower than 48 as well),

eg:

=== RUN   TestSubmitProposal
2025-01-20T10:36:10.106079Z	DEBUG	logger is ready	{"level": "debug", "encoder": "capital", "format": "console", "file_options": null}
2025-01-20T10:36:10.106420Z	DEBUG	logger is ready	{"level": "debug", "encoder": "capital", "format": "console", "file_options": null}
    controller_test.go:157: 
        	Error Trace:	/Users/iurii/work/ssv/operator/fee_recipient/controller_test.go:157
        	            				/Users/iurii/work/ssv/operator/fee_recipient/controller_test.go:45
        	Error:      	Received unexpected error:
        	            	failed to serialize share: marshal ssz: storageShare.SharePubKey (bytes array does not have the correct length): expected 48 and 0 found
        	Test:       	TestSubmitProposal
--- FAIL: TestSubmitProposal (0.00s)

but if you are saying the size always must be 48 I guess we simply need to adjust every test to have those pubkeys initialized

let me see if I can do that ^

on the other hand, it could be quite a bit of work to try and maintain this kind of validation on DB-encoding level (it's a burden on all unit-tests that use Share) - so perhaps leaving it as max here and validating it at the moment of "share creation" is a simpler way to go about it, WDYT ?

Copy link
Contributor

@y0sher y0sher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good! one comment
will do spec alignment once @moshe-blox approves

// Share represents a storage share.
// The better name of the struct is storageShareGOB,
// but we keep the name Share to avoid conflicts with gob encoding.
type Share struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if possible its better to not export on migrations package

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally the idea was to copy whatever structures we have from "the original code" and adjust as necessary (keeping it as close to originals as possible so it's easy to drop-in replace it),

but now since all adjustments are made we can make it package-private - done in 04d5c63

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@moshe-blox I guess we cannot change this after all since gob complains it cannot decode data fetched from DB (the error goes away once I revert back to using public fields 0219b21):

{
  "level": "\u001b[31mFATAL\u001b[0m",
  "time": "2025-01-20T14:35:08.440521Z",
  "msg": "could not setup db",
  "error": "failed to run migrations: migration \"migration_5_change_share_format_from_gob_to_ssz\" failed: GetAll: decode gob share: decode storageShareGOB: gob: type mismatch: no fields matched compiling decoder for storageShareGOB",
  "errorVerbose": "GetAll: decode gob share: decode storageShareGOB: gob: type mismatch: no fields matched compiling decoder for storageShareGOB\nmigration \"migration_5_change_share_format_from_gob_to_ssz\" failed\ngithub.com/ssvlabs/ssv/migrations.Migrations.Run\n\t/go/src/github.com/ssvlabs/ssv/migrations/migrations.go:98\ngithub.com/ssvlabs/ssv/migrations.Run\n\t/go/src/github.com/ssvlabs/ssv/migrations/migrations.go:33\ngithub.com/ssvlabs/ssv/cli/operator.setupDB\n\t/go/src/github.com/ssvlabs/ssv/cli/operator/node.go:515\ngithub.com/ssvlabs/ssv/cli/operator.init.func2\n\t/go/src/github.com/ssvlabs/ssv/cli/operator/node.go:136\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:989\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:1117\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:1041\ngithub.com/ssvlabs/ssv/cli.Execute\n\t/go/src/github.com/ssvlabs/ssv/cli/cli.go:27\nmain.main\n\t/go/src/github.com/ssvlabs/ssv/cmd/ssvnode/main.go:20\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695\nfailed to run migrations\ngithub.com/ssvlabs/ssv/cli/operator.setupDB\n\t/go/src/github.com/ssvlabs/ssv/cli/operator/node.go:517\ngithub.com/ssvlabs/ssv/cli/operator.init.func2\n\t/go/src/github.com/ssvlabs/ssv/cli/operator/node.go:136\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:989\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:1117\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/[email protected]/command.go:1041\ngithub.com/ssvlabs/ssv/cli.Execute\n\t/go/src/github.com/ssvlabs/ssv/cli/cli.go:27\nmain.main\n\t/go/src/github.com/ssvlabs/ssv/cmd/ssvnode/main.go:20\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"
}

@moshe-blox
Copy link
Contributor

moshe-blox commented Jan 19, 2025

great work 💪

i think we should have a unit test for the migration or perhaps at least on the conversion funcs, to make sure we're not losing/mutating any field by accident during migration (keep in mind that some won't perform this migration in the first release version due to upgrading late to a future version)

for testing the migration, i have an idea:

  1. duplicate subset of the old registry/storage package into register/storage/gob or a subfolder in migrations
  2. create the old gob sharesStorage with an in-memory database
  3. save some slice of shares
  4. Migrations{migration_5}.Run()
  5. create the new ssz sharesStorage with the same database
  6. load all shares and compare to the originally saved shares

unless we expect non-test-related code to change, this can be done in parallel as we progress with merging and QA'ing this PR

however we never tested migrations before, so pls lmk if you have another idea

other than that, i would've preferred separation like before, except for Liquidated which as you've noted is unrelated:

type SSVShare struct {
    spectypes.Share // Contract data, with the unfortunate exception of ValidatorIndex however it's only set once
    Liquidated bool // Contract data, but the spec is missing it and its unrelated to Beacon
    BeaconMetadata  // Beacon data only
}

i think it's a bit less confusing to work with given the metadata and contract are unrelated and update in parallel, plus eventually we'd likely store metadata in separate database collection (to optimize updates & prevent race conditions)

however i suspect it would be a significant refactor, and i'm not strongly convicted on this (would love to hear counter-arguments), so perhaps this should be postponed to a future PR (maybe together with the metadata database separation)

@iurii-ssv
Copy link
Contributor Author

i would've preferred separation like before, except for Liquidated which as you've noted is unrelated:

type SSVShare struct {
    spectypes.Share // Contract data, with the unfortunate exception of ValidatorIndex however it's only set once
    Liquidated bool // Contract data, but the spec is missing it and its unrelated to Beacon
    BeaconMetadata  // Beacon data only
}

i think it's a bit less confusing to work with given the metadata and contract are unrelated and update in parallel, plus eventually we'd likely store metadata in separate database collection (to optimize updates & prevent race conditions)

however i suspect it would be a significant refactor, and i'm not strongly convicted on this (would love to hear counter-arguments), so perhaps this should be postponed to a future PR (maybe together with the metadata database separation)

I don't really see any obvious way to group Share fields that would be strongly preferable over all other ones we can go with, the one you've suggested is OK (but I don't think it 100% accurately reflects what actually happens - the way it's currently implemented, and I haven't changed it - ValidatorIndex gets updated every time UpdateValidatorsMetadata is called, meaning even though it shouldn't change - it might, unless we implement additional measures to prevent it),

and I'm not sure grouping by "where data comes from" (especially with ValidatorIndex coming from Beacon metadata) is better than grouping by "whether data is mutable or not" (I would prefer this one),

which is why to me it seems the most "straightforward" way to go about it is to flatten Share struct as much as possible, possibly even removing the dependency on spectypes.Share (essentially leaving it up to code reader to find out on his own how each of those fields changes by simply following the code, without "giving any hints by grouping" because they are inaccurate),

at least for now, if we really want to group Share fields somehow - it would probably be better to separate it in PR on top of this one (since it's pretty large as it is)

@y0sher y0sher merged commit 3afee55 into stage Jan 21, 2025
5 of 7 checks passed
@y0sher y0sher deleted the shares-storage-replace-gob-encoding-with-ssz branch January 21, 2025 12:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants