-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[clickhouse] Clickana monitoring dashboard tool #7207
Conversation
@@ -1200,7 +1200,7 @@ pub enum Timestamp { | |||
#[derive(Debug, Serialize, Deserialize, JsonSchema, PartialEq)] | |||
#[serde(rename_all = "snake_case")] | |||
pub struct SystemTimeSeries { | |||
pub time: Timestamp, | |||
pub time: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is seriously doing my head in. Since Timestamp is an untagged enum, serde is having a hard time deserializing. My custom deserializer didn't work, but I'll see if I can find a way
// The ClickHouse client connects via the TCP port | ||
let ch_address = { | ||
let mut addr = *address; | ||
addr.set_port(CLICKHOUSE_TCP_PORT); | ||
addr.to_string() | ||
}; | ||
|
||
let clickhouse_admin_config = | ||
PropertyGroupBuilder::new("config") | ||
.add_property("http_address", "astring", admin_address) | ||
.add_property( | ||
"ch_address", | ||
"astring", | ||
address.to_string(), | ||
ch_address.to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really sure when this broke, but it wouldn't have been caught as nothing was calling clickhouse_cli
(a wrapper around the clickhouse client
command) yet.
Just want to make a clarification. The timeseries rendered in each dashboard are per clickhouse server, not of the whole cluster aggregated into a single view. I misremembered the system tables' engines. They're MergeTree (not ReplicatedMergeTree as I thought), so unique to each server. |
Ok, so the docs aren't super great, but I've been playing around with this to confirm. They are definitely per server. oximeter_cluster-1 :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(ProfileEvent_Query)
FROM system.metric_log
WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400
GROUP BY t
ORDER BY t WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'
SELECT
toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
avg(ProfileEvent_Query)
FROM system.metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400))
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'
Query id: a9eed161-d54c-4d11-b23a-21aeca60ef28
┌────────────────────t─┬─avg(ProfileEvent_Query)─┐
│ 2024-12-18T07:02:00Z │ 0 │
│ 2024-12-18T07:03:00Z │ 0 │
│ 2024-12-18T07:04:00Z │ 0.2833333333333333 │
│ 2024-12-18T07:05:00Z │ 0.5666666666666667 │
│ 2024-12-18T07:06:00Z │ 0.6666666666666666 │
│ 2024-12-18T07:07:00Z │ 0.45 │
│ 2024-12-18T07:08:00Z │ 0.2833333333333333 │
│ 2024-12-18T07:09:00Z │ 0 │
│ 2024-12-18T07:10:00Z │ 0 │
│ 2024-12-18T07:11:00Z │ 0 │
│ 2024-12-18T07:12:00Z │ 0.03333333333333333 │
│ 2024-12-18T07:13:00Z │ 0.016666666666666666 │
│ 2024-12-18T07:14:00Z │ 0 │
│ 2024-12-18T07:15:00Z │ 0 │
│ 2024-12-18T07:16:00Z │ 0.18333333333333332 │
│ 2024-12-18T07:17:00Z │ 0.016666666666666666 │
│ 2024-12-18T07:18:00Z │ 0.43333333333333335 │
│ 2024-12-18T07:19:00Z │ 0.016666666666666666 │
│ 2024-12-18T07:20:00Z │ 0 │
│ 2024-12-18T07:21:00Z │ 0 │
│ 2024-12-18T07:22:00Z │ 0 │
│ 2024-12-18T07:23:00Z │ 0 │
│ 2024-12-18T07:24:00Z │ 0.016666666666666666 │
│ 2024-12-18T07:25:00Z │ 0.016666666666666666 │
│ 2024-12-18T07:26:00Z │ 0.6333333333333333 │
│ 2024-12-18T07:27:00Z │ 0 │
│ 2024-12-18T07:28:00Z │ 0 │
└──────────────────────┴─────────────────────────┘
27 rows in set. Elapsed: 0.009 sec. Processed 1.57 thousand rows, 10.60 KB (176.72 thousand rows/s., 1.19 MB/s.)
Peak memory usage: 51.12 KiB.
oximeter_cluster-2 :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(ProfileEvent_Query)
FROM system.metric_log
WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400
GROUP BY t
ORDER BY t WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'
SELECT
toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
avg(ProfileEvent_Query)
FROM system.metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400))
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'
Query id: 37a3b84f-8844-40b5-ac3a-df4b60ba7b1a
┌────────────────────t─┬─avg(ProfileEvent_Query)─┐
│ 2024-12-18T07:02:00Z │ 0 │
│ 2024-12-18T07:03:00Z │ 0 │
│ 2024-12-18T07:04:00Z │ 0.2833333333333333 │
│ 2024-12-18T07:05:00Z │ 0.5666666666666667 │
│ 2024-12-18T07:06:00Z │ 0.5666666666666667 │
│ 2024-12-18T07:07:00Z │ 0.2833333333333333 │
│ 2024-12-18T07:08:00Z │ 0.2833333333333333 │
│ 2024-12-18T07:09:00Z │ 0 │
│ 2024-12-18T07:10:00Z │ 0 │
│ 2024-12-18T07:11:00Z │ 0 │
│ 2024-12-18T07:12:00Z │ 0.03333333333333333 │
│ 2024-12-18T07:13:00Z │ 0.016666666666666666 │
│ 2024-12-18T07:14:00Z │ 0 │
│ 2024-12-18T07:15:00Z │ 0 │
│ 2024-12-18T07:16:00Z │ 0.016666666666666666 │
│ 2024-12-18T07:17:00Z │ 0 │
│ 2024-12-18T07:18:00Z │ 0.03333333333333333 │
│ 2024-12-18T07:19:00Z │ 0.016666666666666666 │
│ 2024-12-18T07:20:00Z │ 0 │
│ 2024-12-18T07:21:00Z │ 0 │
│ 2024-12-18T07:22:00Z │ 0 │
│ 2024-12-18T07:23:00Z │ 0 │
│ 2024-12-18T07:24:00Z │ 0.016666666666666666 │
│ 2024-12-18T07:25:00Z │ 0.016666666666666666 │
│ 2024-12-18T07:26:00Z │ 0 │
│ 2024-12-18T07:27:00Z │ 0 │
│ 2024-12-18T07:28:00Z │ 0 │
└──────────────────────┴─────────────────────────┘
27 rows in set. Elapsed: 0.021 sec. Processed 2.75 thousand rows, 16.81 KB (134.15 thousand rows/s., 818.93 KB/s.)
Peak memory usage: 64.65 KiB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @karencfv.
Just some relatively minor suggestions from me.
results.len() | ||
); | ||
} | ||
// TODO: Eventually we may want to not have a set amount of charts and make the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future PRs: I think it would be useful to cool to be able to have a little menu of charts on the side of the pane, and then you can scroll and select which ones to show without having to restart the app, or mess with a toml file.
You could also allow toggling between a set of predefined layouts to make it always look nice. So you could show, 1, 2, 4, 6, 8 charts or something and allow selecting which to show in each view. You could even remember which charts to show in each layout, so you could toggle back and forth between different layouts and see all the charts, some with more detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oooooohhhhh nice!! I like that idea. I'll add your comment in the TODO
dev-tools/clickana/src/lib.rs
Outdated
let s = self.clone(); | ||
let c = client.clone(); | ||
|
||
let task = tokio::spawn(async move { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this works, It seems somewhat heavy handed to spawn a task to get api data in parallel for each chart and then immediately join to wait for them all. Spawn is typically used for longer running tasks that stay around.
A more common way to do this is when you want concurrency but don't need to leave the current thread is to use FuturesUnordered. https://betterprogramming.pub/futuresunordered-an-efficient-way-to-manage-multiple-futures-in-rust-a24520abc3f6 has a pretty good overview.
Using FuturesUnordered
would also remove the need to clone self
and client
as they can just be borrowed immutably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks for the tip
dev-tools/clickana/src/lib.rs
Outdated
let log = self.new_logger()?; | ||
let client = ClickhouseServerClient::new(&admin_url, log.clone()); | ||
|
||
let tick_rate = Duration::from_secs(self.refresh_interval); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a "rate", but a duration. I'd suggest naming it to tick_interval
.
}; | ||
use std::fmt::Display; | ||
|
||
const GIBIBYTE_F64: f64 = 1073741824.0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems really odd to me to represent number of bytes by floats, as they are always whole numbers.
I realize that clickhouse returns floats for timeseries, but I think for types where it makes sense we should instead normalize those to integers rather than normalizing our data and computations to fit the raw data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a dataset, Ratatui requires the data points to be f64, so I think we're stuck with f64 sadly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, got it. That makes sense. Feel free to ignore these comments then :)
let mid_label_as_unit = | ||
values.avg(lower_label_as_unit, upper_label_as_unit); | ||
|
||
// To nicely display the mid value label for the Y axis, we do the following: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can get rid of this parsing if you just convert all the values to integers at ingestion time.
The only reason I guess you wouldn't want to do this is if there are metrics where there are fractions we actually care about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that this looks super weird 😄 , but I added it for the cases when there is very little variance between each point, and the bounds end up being very close to each other. The rounding made the mid point not be mid at all, and the data didn't really match up with the labels anymore
.iter() | ||
.map(|ts| { | ||
( | ||
ts.time.trim_matches('"').parse::<f64>().unwrap_or_else( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to parse a timestamp into an f64
? Can't we use an actual time type instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, the datapoints in the dataset need to be f64 for ratatui to render them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @andrewjstone !
}; | ||
use std::fmt::Display; | ||
|
||
const GIBIBYTE_F64: f64 = 1073741824.0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a dataset, Ratatui requires the data points to be f64, so I think we're stuck with f64 sadly
let mid_label_as_unit = | ||
values.avg(lower_label_as_unit, upper_label_as_unit); | ||
|
||
// To nicely display the mid value label for the Y axis, we do the following: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that this looks super weird 😄 , but I added it for the cases when there is very little variance between each point, and the bounds end up being very close to each other. The rounding made the mid point not be mid at all, and the data didn't really match up with the labels anymore
.iter() | ||
.map(|ts| { | ||
( | ||
ts.time.trim_matches('"').parse::<f64>().unwrap_or_else( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, the datapoints in the dataset need to be f64 for ratatui to render them
dev-tools/clickana/src/lib.rs
Outdated
let s = self.clone(); | ||
let c = client.clone(); | ||
|
||
let task = tokio::spawn(async move { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks for the tip
results.len() | ||
); | ||
} | ||
// TODO: Eventually we may want to not have a set amount of charts and make the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oooooohhhhh nice!! I like that idea. I'll add your comment in the TODO
Overview
As part of Stage 1 of RFD468 we'll be observing how a ClickHouse cluster behaves in comparison with a single node server. This commit introduces a basic tool that lets us visualize internal ClickHouse metric information.
As a starting point, Clickana only has 4 charts, and the user may not choose what these are. Additionally, it is only capable of rendering data by making API calls. I'd like to make the tool more flexible; other capabilities will be added in follow up PRs.
Usage
Manual Testing
Next Steps
Related: #6953