-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[clickhouse] Long running QA tests for replicated cluster #6953
Labels
Comments
karencfv
added
clickhouse
Related to the ClickHouse metrics DBMS
Testing & Analysis
Tests & Analyzers
labels
Oct 30, 2024
21 tasks
karencfv
added a commit
that referenced
this issue
Nov 6, 2024
## Overview This commit implements a new clickhouse-admin endpoint to retrieve and parse information from `[system.distributed_ddl_queue](https://clickhouse.com/docs/en/operations/system-tables/distributed_ddl_queue)`. ## Purpose As part of [stage 1](https://rfd.shared.oxide.computer/rfd/0468) of rolling out replicated ClickHouse, we'll be needing to monitor the installed but not visible replicated cluster on our dogfood rack. This endpoint will aid in said monitoring by providing information about distributed ddl queries. ## Testing ```console $ cargo run --bin=clickhouse-admin-server -- run -c ./smf/clickhouse-admin-server/config.toml -a [::1]:8888 -l [::1]:22001 -b /Users/karcar/src/omicron/out/clickhouse/clickhouse Compiling omicron-clickhouse-admin v0.1.0 (/Users/karcar/src/omicron/clickhouse-admin) Finished `dev` profile [unoptimized + debuginfo] target(s) in 3.79s Running `target/debug/clickhouse-admin-server run -c ./smf/clickhouse-admin-server/config.toml -a '[::1]:8888' -l '[::1]:22001' -b /Users/karcar/src/omicron/out/clickhouse/clickhouse` note: configured to log to "/dev/stdout" {"msg":"listening","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:41.746887Z","hostname":"ixchel","pid":55286,"local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:197"} {"msg":"accepted connection","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:55.674044Z","hostname":"ixchel","pid":55286,"local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:1105","remote_addr":"[::1]:59735"} {"msg":"Retrieved data from `system.distributed_ddl_queue`","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:56.148241Z","hostname":"ixchel","pid":55286,"component":"ClickhouseCli","file":"clickhouse-admin/types/src/lib.rs:1018","output":"\"{\\\"entry\\\":\\\"query-0000000001\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22001,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"query_duration_ms\\\":\\\"3\\\"}\\n{\\\"entry\\\":\\\"query-0000000001\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22002,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"query_duration_ms\\\":\\\"4\\\"}\\n{\\\"entry\\\":\\\"query-0000000000\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\",\\\"output_format_json_quote_64bit_integers\\\":\\\"0\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22002,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"query_duration_ms\\\":\\\"4\\\"}\\n{\\\"entry\\\":\\\"query-0000000000\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\",\\\"output_format_json_quote_64bit_integers\\\":\\\"0\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22001,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"query_duration_ms\\\":\\\"4\\\"}\\n\""} {"msg":"request completed","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:56.148459Z","hostname":"ixchel","pid":55286,"uri":"/distributed-ddl-queue","method":"GET","req_id":"fb46b182-2573-4daa-a791-118dad20a3c3","remote_addr":"[::1]:59735","local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950","latency_us":473916,"response_code":"200"} ``` ```console $ curl http://[::1]:8888/distributed-ddl-queue [{"entry":"query-0000000001","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random"},"query_create_time":"2024-11-01 16:17:08","host":"::1","port":22001,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:17:08","query_duration_ms":"3"},{"entry":"query-0000000001","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random"},"query_create_time":"2024-11-01 16:17:08","host":"::1","port":22002,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:17:08","query_duration_ms":"4"},{"entry":"query-0000000000","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random","output_format_json_quote_64bit_integers":"0"},"query_create_time":"2024-11-01 16:16:45","host":"::1","port":22002,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:16:45","query_duration_ms":"4"},{"entry":"query-0000000000","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random","output_format_json_quote_64bit_integers":"0"},"query_create_time":"2024-11-01 16:16:45","host":"::1","port":22001,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:16:45","query_duration_ms":"4"}] ``` ### Caveat I have purposely not written an integration test as the port situation is getting out of hand 😅 I'm working on a better implementation of running these integration tests that involves less hardcoded ports. I will include an integration test for this endpoint in that PR Related: #6953
karencfv
added a commit
that referenced
this issue
Nov 19, 2024
…ic_log` tables (#7100) ## Overview In order to reliably roll out a replicated ClickHouse cluster, we'll be running a set of long running testing (stage 1 of RFD [#468](https://rfd.shared.oxide.computer/rfd/0468#_stage_1)). ClickHouse provides a `system` database with several tables with information about the system. For monitoring purposes, the [`system.asynchronous_metric_log`](https://clickhouse.com/docs/en/operations/system-tables/asynchronous_metric_log) and [`system.metric_log`](https://clickhouse.com/docs/en/operations/system-tables/metric_log) tables are particularly useful. With them we can retrieve information about queries per second, CPU usage, memory usage etc. Full lists of available metrics [here](https://clickhouse.com/docs/en/operations/system-tables/metrics) and [here](https://clickhouse.com/docs/en/operations/system-tables/asynchronous_metrics) During our long running testing I'd like to give these tables a TTL of 30 days. Once we are confident the system is stable and we roll out the cluster to all racks, we can reduce TTL to 7 or 14 days. This PR will only enable the tables themselves. There will be follow up PRs to actually retrieve the data we'll be monitoring ## Manual testing ### Queries per second ```console oxz_clickhouse_eecd32cc-ebf2-4196-912f-5bb440b104a0.local :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(ProfileEvent_Query) FROM system.metric_log WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400 GROUP BY t ORDER BY t WITH FILL STEP 60 SETTINGS date_time_output_format = 'iso' SELECT toStartOfInterval(event_time, toIntervalSecond(60)) AS t, avg(ProfileEvent_Query) FROM system.metric_log WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400)) GROUP BY t ORDER BY t ASC WITH FILL STEP 60 SETTINGS date_time_output_format = 'iso' Query id: 1b91946b-fe8b-4074-bc94-f071f72f55f5 ┌────────────────────t─┬─avg(ProfileEvent_Query)─┐ │ 2024-11-18T06:40:00Z │ 1.3571428571428572 │ │ 2024-11-18T06:41:00Z │ 1.3666666666666667 │ │ 2024-11-18T06:42:00Z │ 1.3666666666666667 │ │ 2024-11-18T06:43:00Z │ 1.3666666666666667 │ │ 2024-11-18T06:44:00Z │ 1.3666666666666667 │ ``` ### Disk usage ```console oxz_clickhouse_eecd32cc-ebf2-4196-912f-5bb440b104a0.local :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(value) FROM system.asynchronous_metric_log WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400 AND metric = 'DiskUsed_default' GROUP BY t ORDER BY t WITH FILL STEP 60 SETTINGS date_time_output_format = 'iso' SELECT toStartOfInterval(event_time, toIntervalSecond(60)) AS t, avg(value) FROM system.asynchronous_metric_log WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400)) AND (metric = 'DiskUsed_default') GROUP BY t ORDER BY t ASC WITH FILL STEP 60 SETTINGS date_time_output_format = 'iso' Query id: bcf9cd9b-7fea-4aea-866b-d69e60a7c0b6 ┌────────────────────t─┬─────────avg(value)─┐ │ 2024-11-18T06:42:00Z │ 860941425.7777778 │ │ 2024-11-18T06:43:00Z │ 865134523.7333333 │ │ 2024-11-18T06:44:00Z │ 871888896 │ │ 2024-11-18T06:45:00Z │ 874408891.7333333 │ │ 2024-11-18T06:46:00Z │ 878761984 │ │ 2024-11-18T06:47:00Z │ 881646933.3333334 │ │ 2024-11-18T06:48:00Z │ 883998788.2666667 │ ``` Related #6953
karencfv
added a commit
that referenced
this issue
Nov 26, 2024
## Overview This commit adds an endpoint to retrieve timeseries from the `system` database. For the time being we will only add support for the `metric_log` and `asynchronous_metric_log` tables. This endpoint is still a bit bare bones, but will be a good start to begin monitoring the ClickHouse servers. ## Examples ### Queries per second ```console $ curl "http://[::1]:8888/timeseries/metric_log/ProfileEvent_Query/avg" | jq % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 2883 100 2883 0 0 5912 0 --:--:-- --:--:-- --:--:-- 5907 [ { "time": "2024-11-20T04:09:00Z", "value": 0.0 }, { "time": "2024-11-20T04:10:00Z", "value": 0.0 }, { "time": "2024-11-20T04:11:00Z", "value": 0.06666666666666667 } ] ``` ### Disk usage ```console $ curl "http://[::1]:8888/timeseries/asynchronous_metric_log/DiskUsed_default/avg?interval=120&time_range=3600×tamp_format=unix_epoch" | jq % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 1786 100 1786 0 0 3728 0 --:--:-- --:--:-- --:--:-- 3728 [ { "time": "1732513320", "value": 120491427254.85716 }, { "time": "1732513440", "value": 120382774033.06668 }, { "time": "1732513560", "value": 120364752622.93332 } ] ``` Related: #6953
karencfv
added a commit
that referenced
this issue
Dec 19, 2024
### Overview As part of Stage 1 of [RFD468](https://rfd.shared.oxide.computer/rfd/0468) we'll be observing how a ClickHouse cluster behaves in comparison with a single node server. This commit introduces a basic tool that lets us visualize internal ClickHouse metric information. As a starting point, Clickana only has 4 charts, and the user may not choose what these are. Additionally, it is only capable of rendering data by making API calls. I'd like to make the tool more flexible; other capabilities will be added in follow up PRs. ### Usage ```console clickana --help Usage: clickana [OPTIONS] --clickhouse-addr <CLICKHOUSE_ADDR> Options: -l, --log-path <LOG_PATH> Path to the log file [env: CLICKANA_LOG_PATH=] [default: /tmp/clickana.log] -a, --clickhouse-addr <CLICKHOUSE_ADDR> Address where a clickhouse admin server is listening on -s, --sampling-interval <SAMPLING_INTERVAL> The interval to collect monitoring data in seconds [default: 60] -t, --time-range <TIME_RANGE> Range of time to collect monitoring data in seconds [default: 3600] -r, --refresh-interval <REFRESH_INTERVAL> The interval at which the dashboards will refresh [default: 60] -h, --help Print help ``` ### Manual Testing ``` root@oxz_clickhouse_015f9c34:~# /opt/oxide/clickana/bin/clickana -a [fd00:1122:3344:101::e]:8888 ``` <img width="1208" alt="Screenshot 2024-12-12 at 4 11 15 PM" src="https://github.com/user-attachments/assets/53658b02-3729-4b29-ac28-0a387c3143ac" /> ### Next Steps - Let the user set which metrics they would like to visualise in each chart. This may be nice to do through a TOML file or something. We could let them choose which unit to represent them in as well perhaps. - Have more metrics available. - It'd be nice to have the ability to take the timeseries as JSON instead of calling the API as well. This could be useful in the future to have some insight into our customer's racks for debugging purposes. We could include ClickHouse internal metric timeseries as part of the support bundles and they could be visualised via Clickana. WDYT @smklein ? Related: #6953
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Overview
As part of the work to roll out replicated ClickHouse we'll be needing some long running testing to ensure stability of the replicated cluster. Specifically, we'll be wanting to know how stable the replicated cluster is when left alone for a while under load (i.e., days, weeks).
We'll need to monitor the system and answer the following questions periodically during a long period of time (a month or so?):
Implementation
We'll probably want to use clickhouse-admin to extract information from the system. There is a clickhouse-admin binary already installed in each of the clickhouse-{server|keeper} nodes.
To retrieve the information we need, we can leverage the following native ClickHouse tooling:
system.opentelemetry_span_log
system.replication_queue
andsystem.distributed_ddl_queue
system.system.user_processes
Altinity has a pretty cool ClickHouse stress test suite. We can probably use it for running stress tests, or take inspiration from it to create our own stress tests.
Relevant links
Tasks
system.metric_log
andsystem.asynchronous_metric_log
tables #7100The text was updated successfully, but these errors were encountered: