feat: Add offline scaling support for background ddl with arrangement backfill #20006

Open · wants to merge 12 commits into main from peng/dirty-scale-bkgd-ddl

Conversation

@shanicky (Contributor) commented Jan 2, 2025

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

WIP STATE

It seems that No Shuffle backfill is still difficult to support in offline scaling, and this PR is still in a WIP state.

I found it somewhat challenging to directly support background jobs in online scaling, so I proposed this PR to add support for background jobs in offline scaling. At the same time, I attempted to have online scaling skip jobs related to background DDL instead of omitting them completely.

This PR makes several specific changes:

  1. Because the stream job fragments change after offline scaling, they need to be retrieved again (see the sketch below).
  2. Because the scaling happens before the created event, the frontend (FE) receives the scaling mapping first.
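
The sketch below illustrates point 1 in a minimal, self-contained form. All names here (`JobFragments`, `MetaStore`, `reload_job_fragments`) are illustrative stand-ins rather than RisingWave's actual meta APIs; the point is only that any fragment info cached before offline scaling is stale and must be re-read afterwards.

```rust
// Illustrative stand-ins only; the real code works with RisingWave's meta catalog,
// not these types.
use std::collections::HashMap;

#[derive(Clone, Debug)]
struct JobFragments {
    actor_ids: Vec<u32>,
}

struct MetaStore {
    fragments_by_job: HashMap<u32, JobFragments>,
}

impl MetaStore {
    /// Re-read the (possibly rewritten) fragments of a stream job after offline scaling.
    fn reload_job_fragments(&self, job_id: u32) -> Option<JobFragments> {
        self.fragments_by_job.get(&job_id).cloned()
    }
}

fn main() {
    let store = MetaStore {
        fragments_by_job: HashMap::from([(7, JobFragments { actor_ids: vec![1, 2, 3] })]),
    };
    // Any copy cached before offline scaling is stale; always reload from the store.
    let fresh = store.reload_job_fragments(7).expect("job 7 should exist");
    println!("fragments after scaling: {fresh:?}");
}
```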

Checklist

  • I have written necessary rustdoc comments.

@shanicky force-pushed the peng/dirty-scale-bkgd-ddl branch from df36762 to 868aa26 on January 16, 2025 10:06

gru-agent (bot) commented Jan 23, 2025

This pull request has been modified. If you want me to regenerate unit tests for any of the related files, please find the file in the "Files Changed" tab and add a comment @gru-agent. (The GitHub "Comment on this file" feature is in the upper right corner of each file in the "Files Changed" tab.)

    ..
} in fragments {
    let mut stream_node = stream_node.to_protobuf();

Member:
I think we can use fragment_type_mask instead to reduce memory usage.

Contributor (Author):
It seems that the fragment_type_mask cannot determine the type of stream scan; at least it couldn't in previous versions. We only have the stream node's node body to make that judgment. cc @kwannoel

Contributor:
We can shortcut for snapshot backfill

Contributor:
Snapshot backfill uses fragment_type_mask iirc
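
For context, here is a minimal, self-contained sketch of the distinction discussed in this thread. The enums and helpers below are stand-ins, not RisingWave's actual NodeBody / StreamScanType protobuf definitions or FragmentTypeFlag bits: a fragment type mask can say that a fragment contains some stream scan, but the concrete backfill strategy still has to be read from the stream node's body.

```rust
// Stand-in enums; not RisingWave's actual `NodeBody` / `StreamScanType` protos.
#[derive(Debug, PartialEq)]
enum StreamScanType {
    Backfill,            // no-shuffle backfill
    ArrangementBackfill, // backfill against the upstream arrangement
    SnapshotBackfill,
}

enum NodeBody {
    StreamScan { scan_type: StreamScanType },
    Other,
}

/// The mask only tells us that the fragment contains *some* stream scan ...
fn is_stream_scan(fragment_type_mask: u32, stream_scan_bit: u32) -> bool {
    fragment_type_mask & stream_scan_bit != 0
}

/// ... so the concrete backfill strategy is read from the node body.
fn scan_type(body: &NodeBody) -> Option<&StreamScanType> {
    match body {
        NodeBody::StreamScan { scan_type } => Some(scan_type),
        NodeBody::Other => None,
    }
}

fn main() {
    let body = NodeBody::StreamScan {
        scan_type: StreamScanType::ArrangementBackfill,
    };
    assert!(is_stream_scan(0b0100, 0b0100));
    assert_eq!(scan_type(&body), Some(&StreamScanType::ArrangementBackfill));
}
```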

@shanicky (Contributor, Author) commented Jan 23, 2025

I found that performing offline scaling directly during the recovery process causes the background DDL job status to remain at 100%. Do I need to do anything else? cc @kwannoel

 Id |                   Statement                   |     Progress
----+-----------------------------------------------+-------------------
  7 | CREATE MATERIALIZED VIEW m AS SELECT * FROM t | 100.00% (100/100)
(1 row)

@kwannoel (Contributor) commented Jan 23, 2025

> I found that performing offline scaling directly during the recovery process causes the background DDL job status to remain at 100%. Do I need to do anything else? cc @kwannoel
>
>  Id |                   Statement                   |     Progress
> ----+-----------------------------------------------+-------------------
>   7 | CREATE MATERIALIZED VIEW m AS SELECT * FROM t | 100.00% (100/100)
> (1 row)

Did the jobs get created successfully? If not, is the state inside the CreateMviewProgressTracker up to date? Did all the new actors get included in it?

You can also add some tracing logs to see which actors have not reported that they're finished yet. You can share the test; I'll try to take a look tomorrow.

@shanicky (Contributor, Author)

> I found that performing offline scaling directly during the recovery process causes the background DDL job status to remain at 100%. Do I need to do anything else? cc @kwannoel
>
>  Id |                   Statement                   |     Progress
> ----+-----------------------------------------------+-------------------
>   7 | CREATE MATERIALIZED VIEW m AS SELECT * FROM t | 100.00% (100/100)
> (1 row)
>
> Did the jobs get created successfully? If not, is the state inside the CreateMviewProgressTracker up to date? Did all the new actors get included in it?
>
> You can also add some tracing logs to see which actors have not reported that they're finished yet. You can share the test; I'll try to take a look tomorrow.

Ah, I've found the problem. If the meta node undergoes recovery without a restart, the states in the progress tracker are not updated; they need to be reconstructed.
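
A minimal sketch of that idea, with stand-in fields and signatures rather than the actual CreateMviewProgressTracker API: on every recovery (not only after a full meta restart) the tracker is rebuilt from the current set of unfinished background jobs and their post-scaling actors, instead of carrying over stale entries.

```rust
use std::collections::HashMap;

// Stand-in for the real tracker in the meta barrier manager.
#[derive(Default)]
struct CreateMviewProgressTracker {
    /// job id -> backfill actors that still have to report "finished"
    pending_actors: HashMap<u32, Vec<u32>>,
}

impl CreateMviewProgressTracker {
    /// Drop whatever was tracked before and re-register every unfinished job
    /// with the actor set it has *after* recovery / offline scaling.
    fn recover(unfinished_jobs: HashMap<u32, Vec<u32>>) -> Self {
        Self { pending_actors: unfinished_jobs }
    }
}

fn main() {
    // Actors of job 7 after offline scaling (the pre-recovery actor ids are gone).
    let unfinished = HashMap::from([(7, vec![11, 12])]);
    let tracker = CreateMviewProgressTracker::recover(unfinished);
    assert_eq!(tracker.pending_actors[&7], vec![11, 12]);
}
```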

@shanicky changed the title from "feat: Add support for background ddl in recovery and certain cases of online scaling" to "feat: Add offline scaling support for background ddl with arrangement backfill" on Jan 24, 2025

@yezizp2012 (Member) left a comment

lgtm
