feat: metrics for services and checks #519
base: master
Some investigation into the default metrics that come with the Prometheus Go client:

1 List of Metrics from Go
A PoC to add a new type of identity (

1 Manually Add an Identity from a YAML File
$ cat identity.yaml
identities:
    bob:
        access: read
        basicauth:
            username: foo
            password: bar
$ ./pebble add-identities --from ./identity.yaml
Added 1 new identity.

2 Start Pebble
$ ./pebble run --http=:4000
2024-11-26T14:27:53.682Z [pebble] HTTP API server listening on ":4000".
2024-11-26T14:27:53.682Z [pebble] Started daemon.
2024-11-26T14:27:53.686Z [pebble] POST /v1/services 63.667µs 400
2024-11-26T14:27:53.686Z [pebble] Cannot start default services: no default services

3 Access the Metrics Endpoint with the Newly Created Identity
$ curl -u foo:bar localhost:4000/metrics
# HELP my_counter A simple counter
# TYPE my_counter counter
my_counter 4

4 Access without Identity or with an Invalid Username/Password
$ curl localhost:4000/metrics
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}
$ curl -u invalid:invalid localhost:4000/metrics
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}
According to the last spec review, the following changes have been made:
After the first round of refactoring, here are some results:

1 Basic Identity Name with Special Characters
$ cat identity.yaml
identities:
    "bob:asdf":
        access: read
        basic:
            password: bar
$ ./pebble add-identities --from ./identity.yaml
error: identity "bob:asdf" invalid: identity name "bob:asdf" contains invalid characters (only alphanumeric, underscore, and hyphen allowed)

2 Basic Identity without Username
$ cat identity.yaml
identities:
    bob:
        access: read
        basic:
            password: bar
ubuntu@primary:~/work/pebble2$ ./pebble add-identities --from ./identity.yaml
Added 1 new identity.

3 Basic Identity Type "metrics"
$ # access type: read
$ cat identity.yaml
identities:
    bob:
        access: read
        basic:
            password: bar
$ ./pebble add-identities --from ./identity.yaml
Added 1 new identity.
$ # open access is fine
$ curl -u bob:bar localhost:4000/v1/health
{"type":"sync","status-code":200,"status":"OK","result":{"healthy":true}}
$ # no access on the metrics endpoint
$ curl -u bob:bar localhost:4000/metrics
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}

$ # access type: metrics
$ cat identity.yaml
identities:
    bob:
        access: metrics
        basic:
            password: bar
$ ./pebble update-identities --from ./identity.yaml
Updated 1 identity.
$ # open access is fine
$ curl -u bob:bar localhost:4000/v1/health
{"type":"sync","status-code":200,"status":"OK","result":{"healthy":true}}
$ # accessing metrics
$ curl -u bob:bar localhost:4000/metrics
# HELP my_counter Total number of something processed.
# TYPE my_counter counter
my_counter{operation=read,status=success} 11
my_counter{operation=write,status=success} 22
my_counter{operation=read,status=failed} 11
# HELP my_gauge Current value of something.
# TYPE my_gauge gauge
my_gauge{sensor=temperature} 28.12
$ # no access on other endpoints
$ curl -u bob:bar localhost:4000/v1/changes
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}

$ # access type: admin
$ cat identity.yaml
identities:
    bob:
        access: admin
        basic:
            password: bar
$ ./pebble update-identities --from ./identity.yaml
Updated 1 identity.
$ # admin can read metrics
$ curl -u bob:bar localhost:4000/v1/metrics
# HELP my_counter Total number of something processed.
# TYPE my_counter counter
my_counter{operation=read,status=success} 176
my_counter{operation=write,status=success} 352
my_counter{operation=read,status=failed} 176
# HELP my_gauge Current value of something.
# TYPE my_gauge gauge
my_gauge{sensor=temperature} 24.48

TODO: hashing password.
Notes: We will need to address memory usage in the future. To make that easier, we decided after discussion not to store the metrics centrally in a self-implemented Prometheus-like module, but rather to store them on existing structs like
Some notes before the holiday season: Currently, the check metrics only work when the check is successful. When it fails, both counters are reset to 0 and never increase. This needs more debugging. Maybe it has something to do with
Several comments, most importantly the design of the Metric type and the WriteMetrics interface.
@IronCore864 Can you please merge from master now that the |
In the latest commits,
|
I was curious about this change listening to the mid-cycle roadmap presentation. Nobody asked for this review, please feel free to ignore it.
@@ -3,6 +3,7 @@ module github.com/canonical/pebble
go 1.22

require (
	github.com/GehirnInc/crypt v0.0.0-20230320061759-8cc1b52080c5
Is this package maintained? Are we confident about introducing it in every charm / pebble deployment out there?
It hasn't been recently changed, but the part of it we're using is small and straightforward, and the algorithm being implemented is stable and well-documented (SHA-crypt). So I would say it's "stable" rather than "unmaintained". I'm actually contemplating vendoring this (the core part of it is only ~100 LoC). But open to other ideas/concerns too.
We've done some diligence on it, see #563 (comment) for example.
Mostly minor comments, though a few structural things. Main thing that needs discussion is whether the perform_check and recover_check counts are actually going to provide the monitoring we want -- let's discuss.
}

func (r metricsResponse) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	openTelemetryWriter := metrics.NewOpenTelemetryWriter(w)
I think we should still buffer here with a bytes.Buffer, so that not every individual write/fprintf call in the metrics writer is doing a separate OS call to write to the network. In addition, it'll mean it either all gets written or none gets written in case of errors, which is probably better -- and we can write the internal error properly. So something like:
var buf bytes.Buffer
metricsWriter := metrics.NewOpenTelemetryWriter(&buf)
r.svcMgr.WriteMetrics(metricsWriter) // with error handling
r.chkMgr.WriteMetrics(metricsWriter) // with error handling
buf.WriteTo(w) // with error handling

	err := r.svcMgr.WriteMetrics(openTelemetryWriter)
	if err != nil {
		logger.Noticef("Cannot write to HTTP response: %v", err.Error())
With buffering, this is no longer writing to the HTTP response. Also, you don't need err.Error() with %v or %s. Suggestion (here, and similar below):
-	logger.Noticef("Cannot write to HTTP response: %v", err.Error())
+	logger.Noticef("Cannot write service metrics: %v", err)
	case TypeCounterInt:
		metricType = "counter"
	case TypeGaugeInt:
		metricType = "gauge"
For defensive programming, probably best to add a default case with a panic(fmt.Sprintf("invalid metric type %v", m.Type)) or similar.
Actually, let's put this type int -> string switch in a String() string method on MetricType instead, and put the default+panic in there.
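A minimal sketch of such a String method is shown below. The constant names come from the diff; the underlying int type and the iota values are assumptions.

```go
package main

import "fmt"

// MetricType mirrors the enum under review. The underlying type and the
// constant values are assumptions made for this sketch.
type MetricType int

const (
	TypeCounterInt MetricType = iota
	TypeGaugeInt
)

// String returns the exposition-format name of the metric type and panics
// on any value that is not a known constant.
func (t MetricType) String() string {
	switch t {
	case TypeCounterInt:
		return "counter"
	case TypeGaugeInt:
		return "gauge"
	default:
		// Format the raw integer: passing t itself with %v would call
		// String again and recurse.
		panic(fmt.Sprintf("invalid metric type %d", int(t)))
	}
}

func main() {
	fmt.Println(TypeCounterInt, TypeGaugeInt)
}
```

One detail worth noting: once the panic lives inside String, the message should format int(t) rather than the MetricType value, since fmt would otherwise invoke String on it again.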
	_, err := fmt.Fprintf(otw.w, "# HELP %s %s\n", m.Name, m.Comment)
	if err != nil {
		return err
	}
Let's surround this with if m.Comment != "" { ... } as presumably the comment is optional.
		return err
	}

	labels := make([]string, len(m.Labels))
To avoid memory allocations, let's write these straight to the output (buffer) instead of building a slice of strings and then allocating/copying for the strings.Join. So something like (with error handling not shown):
io.WriteString(otw.w, m.Name)
if len(m.Labels) > 0 {
	io.WriteString(otw.w, "{")
	for i, label := range m.Labels {
		if i > 0 {
			io.WriteString(otw.w, ", ")
		}
		fmt.Fprintf(otw.w, "%s=%s", label.key, label.value)
	}
	io.WriteString(otw.w, "}")
}
fmt.Fprintf(otw.w, " %d", m.ValueInt64)
io.WriteString(otw.w, "\n") // end the sample line
	Name:       "pebble_service_active",
	Type:       metrics.TypeGaugeInt,
	ValueInt64: int64(active),
	Comment:    "Indicates if the service is currently active (1) or not (0)",
Let's use the same phrasing here as for the checks one, so "Whether the service is..."
		return
	}
	buf := new(bytes.Buffer)
	openTelemetryWriter := metrics.NewOpenTelemetryWriter(buf)
Nit: given it's clear from the context, how about just writer := ...?
	openTelemetryWriter := metrics.NewOpenTelemetryWriter(buf)
	s.manager.WriteMetrics(openTelemetryWriter)
	metrics := buf.String()
	c.Assert(metrics, testutil.Contains, "pebble_service_start_count{service=test1} 1")
Should we just assert on the entire output, for simplicity, and so we're testing the HELP and TYPE parts?
	m.servicesLock.Lock()
	defer m.servicesLock.Unlock()

	for _, service := range m.services {
It's probably best to make the order stable, so let's grab the service names into a slice and sort.Strings them before looping. Otherwise the order of /v1/metrics will change each time you refresh it, which is not ideal.
Similar for the checks WriteMetrics.
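The stable-ordering suggestion can be sketched as follows. Go randomizes map iteration order, so the names are collected and sorted before any output is produced. The sortedNames helper and the sample data are hypothetical, and the generic signature is just one way to share the helper between the services map and the checks map.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedNames (hypothetical helper) returns the keys of m in ascending
// order, so callers can iterate a map deterministically.
func sortedNames[V any](m map[string]V) []string {
	names := make([]string, 0, len(m))
	for name := range m {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Stand-in for m.services: service name -> start count.
	services := map[string]int{"web": 3, "db": 1, "cache": 2}
	for _, name := range sortedNames(services) {
		fmt.Printf("pebble_service_start_count{service=%s} %d\n", name, services[name])
	}
}
```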
	m.checksLock.Lock()
	defer m.checksLock.Unlock()

	for _, info := range m.checks {
Per similar comment on ServiceManager.WriteMetrics, let's sort by check name to ensure the output is stable.
@IronCore864 Can you please update the PR description to match our new approach? Also, it probably goes without saying, but let's be sure not to merge this before the underlying identities PR that this builds on (#563) is reviewed for security and merged. |
A self-implemented metrics module for Pebble. I referred to the Prometheus Golang client a bit, but not so much to runtime/metrics, because it seems too complicated.
According to the spec, it seems we only need the counter type and the gauge type, but I added the histogram type anyway. The Prometheus Golang client has more types, but I don't think we will need them any time soon, so I left them out.
Binary size doesn't increase (no extra libs used), still ~7.8MB with -trimpath -ldflags='-s -w'.
Manual test:
Some basic unit tests are added to make sure the functionality works as expected and there are no deadlock issues.
TODO: investigate how the official Prometheus Golang client gathers basic metrics like CPU usage and stuff.