Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clickhouse ReplicatedReplacingMergeTree Support #67

Open
bsidio opened this issue Dec 27, 2024 · 4 comments
Open

Clickhouse ReplicatedReplacingMergeTree Support #67

bsidio opened this issue Dec 27, 2024 · 4 comments

Comments

@bsidio
Copy link

bsidio commented Dec 27, 2024

Any plans to support Clickhouse Clusters?

@bakwc
Copy link
Owner

bakwc commented Dec 28, 2024

I can add support for clusters (using Distributed engine to make parallel queries), but not sure about ReplicatedReplacingMergeTree. For the first iteration probably it will be a regular ReplacingMergeTree on every node + Distributed table on top of them + sharding on mysql_ch_replicator level (it will insert records to a correct node).
Will it works?

@bsidio
Copy link
Author

bsidio commented Jan 7, 2025

@bakwc That would be one way to do it, I was thinking of adding cluster details through config and let clickhouse handle write to cluster nodes

if cluster = enabled
cluster_name = clusterName
cluster_engine = ReplicatedMergeTree

For example - if cluster is enabled

self.execute_command(f'DROP DATABASE IF EXISTS {db_name}')

would become
self.execute_command(f'DROP DATABASE IF EXISTS {db_name} ON CLUSTER {cluster name}')

and another example
CREATE_TABLE_QUERY = '''
CREATE TABLE {if_not_exists} {db_name}.{table_name} ON CLUSTER {cluster_name}
(
{fields},
_version UInt64,
{indexes}
)
ENGINE = ReplacingMergeTree(_version)
{partition_by}ORDER BY {primary_key}
SETTINGS index_granularity = 8192
'''
The write to individual nodes will be handled by clickhouse.

The ON CLUSTER {cluster_name} and another feature could be to allow user to define Engine via config {cluster_engine}

Please let me know if you want to additional info, happy to contribute :)

@bakwc
Copy link
Owner

bakwc commented Jan 7, 2025

Ah, I see. Yeah, it could be implemented, the only downside is that replication itself could become a bottleneck as it will be only 1 replication process for a large cluster. But we could start with this approach, yeah.

@ufou
Copy link

ufou commented Jan 17, 2025

We also have this requirement, we have one large table to sync from mysql to CH, at the moment the table is created on CH with 2 steps:

Firstly with (eg):

CREATE TABLE test.sharded on CLUSTER ch1
(...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/sharded', '{replica}')

Secondly with:

CREATE TABLE test.distrib ON CLUSTER ch1
(...)
ENGINE = Distributed('ch1', 'test', sharded, intHash64(key_id))

I guess the other option (which may not work) would be to allow for the database/table on CH to preexist? A flag in the config could allude to this and skip trying to create database/table?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants