Clickhouse ReplicatedReplacingMergeTree Support #67

bsidio · 2024-12-27T23:55:31Z

Any plans to support Clickhouse Clusters?

bakwc · 2024-12-28T12:30:22Z

I can add support for clusters (using Distributed engine to make parallel queries), but not sure about ReplicatedReplacingMergeTree. For the first iteration probably it will be a regular ReplacingMergeTree on every node + Distributed table on top of them + sharding on mysql_ch_replicator level (it will insert records to a correct node).
Will it works?

bsidio · 2025-01-07T06:09:43Z

@bakwc That would be one way to do it, I was thinking of adding cluster details through config and let clickhouse handle write to cluster nodes

if cluster = enabled
cluster_name = clusterName
cluster_engine = ReplicatedMergeTree

For example - if cluster is enabled

self.execute_command(f'DROP DATABASE IF EXISTS {db_name}')

would become
self.execute_command(f'DROP DATABASE IF EXISTS {db_name} ON CLUSTER {cluster name}')

and another example
CREATE_TABLE_QUERY = '''
CREATE TABLE {if_not_exists} {db_name}.{table_name} ON CLUSTER {cluster_name}
(
{fields},
_version UInt64,
{indexes}
)
ENGINE = ReplacingMergeTree(_version)
{partition_by}ORDER BY {primary_key}
SETTINGS index_granularity = 8192
'''
The write to individual nodes will be handled by clickhouse.

The ON CLUSTER {cluster_name} and another feature could be to allow user to define Engine via config {cluster_engine}

Please let me know if you want to additional info, happy to contribute :)

bakwc · 2025-01-07T07:16:08Z

Ah, I see. Yeah, it could be implemented, the only downside is that replication itself could become a bottleneck as it will be only 1 replication process for a large cluster. But we could start with this approach, yeah.

ufou · 2025-01-17T09:32:04Z

We also have this requirement, we have one large table to sync from mysql to CH, at the moment the table is created on CH with 2 steps:

Firstly with (eg):

CREATE TABLE test.sharded on CLUSTER ch1
(...)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/sharded', '{replica}')

Secondly with:

CREATE TABLE test.distrib ON CLUSTER ch1
(...)
ENGINE = Distributed('ch1', 'test', sharded, intHash64(key_id))

I guess the other option (which may not work) would be to allow for the database/table on CH to preexist? A flag in the config could allude to this and skip trying to create database/table?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clickhouse ReplicatedReplacingMergeTree Support #67

Clickhouse ReplicatedReplacingMergeTree Support #67

bsidio commented Dec 27, 2024

bakwc commented Dec 28, 2024

bsidio commented Jan 7, 2025

bakwc commented Jan 7, 2025

ufou commented Jan 17, 2025

Clickhouse ReplicatedReplacingMergeTree Support #67

Clickhouse ReplicatedReplacingMergeTree Support #67

Comments

bsidio commented Dec 27, 2024

bakwc commented Dec 28, 2024

bsidio commented Jan 7, 2025

bakwc commented Jan 7, 2025

ufou commented Jan 17, 2025