
feat: metrics for services and checks #519

Open · IronCore864 wants to merge 27 commits into master
Conversation

@IronCore864 (Contributor) commented Nov 13, 2024

A self-implemented metrics module for Pebble. I referred to the Prometheus Go client a little, but not so much to runtime/metrics, which seemed too complicated.

According to the spec, it seems we only need the counter and gauge types, but I added the histogram type anyway. The Prometheus Go client has more types, but I don't think we will need them any time soon, so I left them out.

Binary size doesn't increase (no extra libraries used); still ~7.8MB with -trimpath -ldflags='-s -w'.
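For illustration, a counter in such a module can be as small as a mutex-protected integer. This is only a sketch of the general shape, not the PR's actual API (names like Counter, Inc, and Value are made up here):

package metrics

import "sync"

// Counter is a monotonically increasing metric guarded by a mutex.
type Counter struct {
	mu    sync.Mutex
	value int64
}

// Inc adds one to the counter.
func (c *Counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value++
}

// Value returns the current count.
func (c *Counter) Value() int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.value
}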

Manual test:

$ curl localhost:8000/metrics
# HELP my_counter A simple counter
# TYPE my_counter counter
my_counter 3

Some basic unit tests are added to make sure the functionality works as expected and there are no deadlock issues.

TODO: investigate how the official Prometheus Go client gathers basic metrics like CPU usage.

@IronCore864 requested a review from benhoyt November 13, 2024 09:11
@IronCore864 marked this pull request as ready for review November 13, 2024 09:11
@IronCore864 (Contributor, Author) commented Nov 14, 2024

Trying to integrate with Prometheus to make sure they are compatible:

[Screenshots: the counter and gauge metrics scraped into Prometheus]

@IronCore864 (Contributor, Author) commented Nov 14, 2024

Some investigation into the default metrics that come with the Prometheus Go client:

1 List of Metrics from Go runtime/metrics

About 80 items.

  • /cgo/go-to-c-calls:calls
  • /cpu/*
  • /gc/cycles/*
  • /gc/gogc:percent
  • /gc/gomemlimit:bytes
  • /gc/heap/*
  • /gc/limiter/last-enabled:gc-cycle
  • /gc/pauses:seconds
  • /gc/scan/*
  • /gc/stack/starting-size:bytes
  • /godebug/*
  • /memory/*
  • /sched/gomaxprocs:threads
  • /sched/goroutines:goroutines
  • /sched/latencies:seconds
  • /sched/pauses/*
  • /sync/mutex/wait/total:seconds

2 List of Default Metrics from prometheus/client_golang

About 40 items.

  • go_gc_duration_seconds{quantile="*"}
  • go_gc_duration_seconds_sum
  • go_gc_duration_seconds_count
  • go_gc_gogc_percent
  • go_gc_gomemlimit_bytes
  • go_goroutines
  • go_info{version="go1.23.1"}
  • go_memstats_*
  • go_sched_gomaxprocs_threads
  • go_threads
  • promhttp_metric_handler_requests_in_flight
  • promhttp_metric_handler_requests_total{code="200"}
  • promhttp_metric_handler_requests_total{code="500"}
  • promhttp_metric_handler_requests_total{code="503"}

3 How Are the Default Metrics from prometheus/client_golang Fetched/Calculated

3.1 From runtime and runtime/debug

  • go_goroutines
  • go_threads
  • go_gc_duration_seconds
  • go_memstats_last_gc_time_seconds
  • go_info

3.2 From runtime.MemStats with Some Calculations

  • go_memstats_alloc_bytes
  • go_memstats_alloc_bytes_total
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_frees_total
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_objects
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_last_gc_time_seconds
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_other_sys_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes

3.3 Directly from runtime/metrics but Go-Version-Dependent

  • go_gc_gogc_percent
  • go_gc_gomemlimit_bytes
  • go_sched_gomaxprocs_threads

3.4 Prometheus-Related

  • promhttp_metric_handler_requests_in_flight
  • promhttp_metric_handler_requests_total{code="200"}
  • promhttp_metric_handler_requests_total{code="500"}
  • promhttp_metric_handler_requests_total{code="503"}

These are produced by the promhttp handler itself.

4 Summary

If we want to include these metrics in our self-implemented metrics module, a lot of work that duplicates Prometheus needs to be done, some of it Go-version-dependent, which means extra work and tests. This creates a fair amount of operational overhead.

If these metrics are not necessary, we should probably go with a self-implemented module; but if they are nice to have, a 2.2MB increase in the Pebble binary size may be worth the price in the long run.
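For reference, here is a rough sketch of what fetching a few of these by hand looks like using only the standard library (the printed names mirror the Prometheus ones above; this is illustrative, not code from this PR):

package main

import (
	"fmt"
	"runtime"
	"runtime/metrics"
)

func main() {
	// Equivalents of go_goroutines and a couple of go_memstats_* metrics,
	// read via the runtime package (sections 3.1 and 3.2 above).
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("go_goroutines %d\n", runtime.NumGoroutine())
	fmt.Printf("go_memstats_heap_alloc_bytes %d\n", ms.HeapAlloc)
	fmt.Printf("go_memstats_sys_bytes %d\n", ms.Sys)

	// Metrics read directly from runtime/metrics (section 3.3 above); the
	// available names are Go-version-dependent, which is part of the
	// maintenance cost mentioned in the summary.
	samples := []metrics.Sample{
		{Name: "/gc/gogc:percent"},
		{Name: "/sched/gomaxprocs:threads"},
	}
	metrics.Read(samples)
	for _, s := range samples {
		if s.Value.Kind() == metrics.KindUint64 {
			fmt.Printf("%s %d\n", s.Name, s.Value.Uint64())
		}
	}
}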

@IronCore864 (Contributor, Author) commented Nov 26, 2024

A PoC to add a new type of identity (basicauth) for the metrics endpoint.

1 Manually Add an Identity from a YAML File

$ cat identity.yaml
identities:
    bob:
        access: read
        basicauth:
            username: foo
            password: bar
$ ./pebble add-identities --from ./identity.yaml
Added 1 new identity.

2 Start Pebble

$ ./pebble run --http=:4000
2024-11-26T14:27:53.682Z [pebble] HTTP API server listening on ":4000".
2024-11-26T14:27:53.682Z [pebble] Started daemon.
2024-11-26T14:27:53.686Z [pebble] POST /v1/services 63.667µs 400
2024-11-26T14:27:53.686Z [pebble] Cannot start default services: no default services

3 Access the Metrics Endpoint with the Newly Created Identity

$ curl -u foo:bar localhost:4000/metrics
# HELP my_counter A simple counter
# TYPE my_counter counter
my_counter 4

4 Access without Identity or with an Invalid Username/Password

$ curl localhost:4000/metrics
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}
$ curl -u invalid:invalid localhost:4000/metrics
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}

@IronCore864 (Contributor, Author) commented

According to the last spec review, the following changes have been made:

  1. The identity type is renamed from "basicauth" to just "basic".
  2. The username field in the basic identity is removed -- we can just use the name of the identity instead.
  3. A new access type "metrics" is added.
  4. Passwords are hashed using OpenSSL (TODO; see the sketch after this list).
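For item 4, the hashes would presumably be in the SHA-512-crypt format that openssl passwd -6 produces ($6$...). The go.mod change discussed later in this thread pulls in github.com/GehirnInc/crypt, which implements the same scheme; the snippet below is only my assumption of how that library would be used (the Generate/Verify signatures are from memory and may not match exactly):

package main

import (
	"fmt"

	"github.com/GehirnInc/crypt/sha512_crypt"
)

func main() {
	crypter := sha512_crypt.New()

	// Generate a SHA-512-crypt hash of the password "bar"; passing an
	// empty salt is assumed to make the library pick a random one.
	hash, err := crypter.Generate([]byte("bar"), nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(hash) // e.g. $6$<salt>$<hash>

	// Verify returns nil when the supplied password matches the hash.
	fmt.Println(crypter.Verify(hash, []byte("bar")) == nil)
}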

After the first round of refactoring, here are some results:

1 Basic Identity Name with Special Characters

$ cat identity.yaml
identities:
    "bob:asdf":
        access: read
        basic:
            password: bar
$ ./pebble add-identities --from ./identity.yaml
error: identity "bob:asdf" invalid: identity name "bob:asdf" contains invalid characters (only
       alphanumeric, underscore, and hyphen allowed)

2 Basic Identity without Username

$ cat identity.yaml
identities:
    bob:
        access: read
        basic:
            password: bar
$ ./pebble add-identities --from ./identity.yaml
Added 1 new identity.

3 Basic Identity with Access Type "metrics" (Compared with "read" and "admin")

$ # access type: read
$ cat identity.yaml
identities:
    bob:
        access: read
        basic:
            password: bar
$ ./pebble add-identities --from ./identity.yaml
Added 1 new identity.
$ # open access is fine
$ curl -u bob:bar localhost:4000/v1/health
{"type":"sync","status-code":200,"status":"OK","result":{"healthy":true}}
$ # no access on the metrics endpoint
$ curl -u bob:bar localhost:4000/metrics
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}
$ # access type: metrics
$ cat identity.yaml
identities:
    bob:
        access: metrics
        basic:
            password: bar
$ ./pebble update-identities --from ./identity.yaml
Updated 1 identity.
$ # open access is fine
$ curl -u bob:bar localhost:4000/v1/health
{"type":"sync","status-code":200,"status":"OK","result":{"healthy":true}}
$ # accessing metrics
$ curl -u bob:bar localhost:4000/metrics
# HELP my_counter Total number of something processed.
# TYPE my_counter counter
my_counter{operation=read,status=success} 11
my_counter{operation=write,status=success} 22
my_counter{operation=read,status=failed} 11
# HELP my_gauge Current value of something.
# TYPE my_gauge gauge
my_gauge{sensor=temperature} 28.12
$ # no access on other endpoints
$ curl -u bob:bar localhost:4000/v1/changes
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}
$ # access type: admin
$ cat identity.yaml
identities:
    bob:
        access: admin
        basic:
            password: bar
$ ./pebble update-identities --from ./identity.yaml
Updated 1 identity.
$ # admin can read metrics
$ curl -u bob:bar localhost:4000/v1/metrics
# HELP my_counter Total number of something processed.
# TYPE my_counter counter
my_counter{operation=read,status=success} 176
my_counter{operation=write,status=success} 352
my_counter{operation=read,status=failed} 176
# HELP my_gauge Current value of something.
# TYPE my_gauge gauge
my_gauge{sensor=temperature} 24.48

TODO: password hashing.

@IronCore864 (Contributor, Author) commented

Notes:

We need to handle the memory usage issue in the future. To make this easier, after discussion we decided not to use a self-implemented Prometheus-like module that stores the metrics centrally, but rather to store the metrics on existing structs like serviceData and CheckInfo. In the last commit, the service-related metrics are implemented on serviceData; the check-related metrics are still to be implemented.
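To make the new approach concrete, here is a hedged sketch of the shape it takes: each manager keeps its metrics on its existing per-service/per-check data and exposes a WriteMetrics method that hands Metric values to a writer. Everything is collapsed into one package for brevity, and the type and field names are partly guessed from the review snippets further down, not copied from the PR:

package sketch

import "sync"

// Metric is a single sample handed to a metrics writer.
type Metric struct {
	Name    string
	Comment string
	Labels  map[string]string
	Value   int64
}

// Writer is implemented by e.g. an OpenTelemetry text-format writer.
type Writer interface {
	Write(m Metric) error
}

// serviceData already exists per service; metric counters live on it
// rather than in a central registry (startCount is an assumed field).
type serviceData struct {
	startCount int64
}

type ServiceManager struct {
	servicesLock sync.Mutex
	services     map[string]*serviceData
}

// WriteMetrics walks the manager's own state and emits one metric per service.
func (m *ServiceManager) WriteMetrics(w Writer) error {
	m.servicesLock.Lock()
	defer m.servicesLock.Unlock()
	for name, svc := range m.services {
		err := w.Write(Metric{
			Name:    "pebble_service_start_count",
			Comment: "Number of times the service has started",
			Labels:  map[string]string{"service": name},
			Value:   svc.startCount,
		})
		if err != nil {
			return err
		}
	}
	return nil
}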

@IronCore864 (Contributor, Author) commented Jan 24, 2025

Taking some notes before the holiday season:

Currently, the check metrics only work while the check is succeeding. When the check fails, both counters are reset to 0 and never increase again. This needs more debugging; it may have something to do with updateCheckInfo.

@IronCore864 marked this pull request as draft January 24, 2025 10:36
@benhoyt (Contributor) left a comment

Several comments, most importantly the design of the Metric type and the WriteMetrics interface.

Review comments were left on internals/daemon/api_metrics.go, internals/daemon/daemon.go, internals/overlord/checkstate/manager.go, and internals/overlord/servstate/handlers.go (all now resolved, most marked outdated).
@benhoyt (Contributor) commented Feb 11, 2025

@IronCore864 Can you please merge from master now that the interface{} to any PR is merged? #568

@IronCore864 (Contributor, Author) commented

In the latest commits, m.checks now stores pointers so that it's easier to update the metrics. Plus, tests are added:

  • OpenTelemetryWriter: some basic tests are added.
  • Check metrics: start the checks, assert that the counters are correct, and then stop the checks. This calls updateCheckInfo, which should not reset the metrics.
  • Service metrics: since the metrics are stored in serviceData, which isn't exported, testing them is difficult. As a workaround, I created a writer that writes to a buffer and then checks the OpenTelemetry-format output directly.

@IronCore864 requested a review from benhoyt February 12, 2025 12:27
@IronCore864 marked this pull request as ready for review February 12, 2025 12:27
@IronCore864 changed the title from "poc: a metrics module for pebble" to "feat: metrics for services and checks" Feb 12, 2025
@gruyaume left a comment

I was curious about this change after listening to the mid-cycle roadmap presentation. Nobody asked for this review, so please feel free to ignore it.

@@ -3,6 +3,7 @@ module github.com/canonical/pebble
go 1.22

require (
github.com/GehirnInc/crypt v0.0.0-20230320061759-8cc1b52080c5

Is this package maintained? Are we confident about introducing it in every charm / pebble deployment out there?

Contributor:
It hasn't been changed recently, but the part of it we're using is small and straightforward, and the algorithm being implemented is stable and well-documented (SHA-crypt). So I would say it's "stable" rather than "unmaintained". I'm actually contemplating vendoring this (the core part of it is only ~100 LoC). But open to other ideas/concerns too.

Contributor:
We've done some diligence on it, see #563 (comment) for example.

@benhoyt (Contributor) left a comment

Mostly minor comments, though a few structural things. Main thing that needs discussion is whether the perform_check and recover_check counts are actually going to provide the monitoring we want -- let's discuss.

}

func (r metricsResponse) ServeHTTP(w http.ResponseWriter, req *http.Request) {
openTelemetryWriter := metrics.NewOpenTelemetryWriter(w)
benhoyt:

I think we should still buffer here with a bytes.Buffer, so that not every individual write/fprintf call in the metrics writer is doing a separate OS call to write to the network. In addition, it'll mean it either all gets written or none gets written in case of errors, which is probably better -- and we can write the internal error properly. So something like:

var buf bytes.Buffer
metricsWriter := metrics.NewOpenTelemetryWriter(&buf)
r.svcMgr.WriteMetrics(metricsWriter) // with error handling
r.chkMgr.WriteMetrics(metricsWriter) // with error handling
buf.WriteTo(w) // with error handling


err := r.svcMgr.WriteMetrics(openTelemetryWriter)
if err != nil {
logger.Noticef("Cannot write to HTTP response: %v", err.Error())
benhoyt:

With buffering, this is no longer writing to the HTTP response. Also, you don't need err.Error() with %v or %s. Suggestion (here, and similar below):

Suggested change:
- logger.Noticef("Cannot write to HTTP response: %v", err.Error())
+ logger.Noticef("Cannot write service metrics: %v", err)

case TypeCounterInt:
metricType = "counter"
case TypeGaugeInt:
metricType = "gauge"
benhoyt:

For defensive programming, probably best to add a default case with a panic(fmt.Sprintf("invalid metric type %v", m.Type)) or similar.

Actually, let's put this type int -> string switch in a String() string method on MetricType instead, and put the default+panic in there.
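In other words, something like the following sketch (it assumes MetricType is an integer type with the TypeCounterInt and TypeGaugeInt constants shown above, and that fmt is already imported in the file):

// String converts the metric type to its text-format name, panicking on
// unknown values as a defensive-programming measure.
func (t MetricType) String() string {
	switch t {
	case TypeCounterInt:
		return "counter"
	case TypeGaugeInt:
		return "gauge"
	default:
		// Use int(t) rather than %v here to avoid re-entering String itself.
		panic(fmt.Sprintf("invalid metric type %d", int(t)))
	}
}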

Comment on lines +74 to +77
_, err := fmt.Fprintf(otw.w, "# HELP %s %s\n", m.Name, m.Comment)
if err != nil {
return err
}
benhoyt:

Let's surround this with if m.Comment != "" { ... } as presumably the comment is optional.
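So, something like this (a sketch based on the lines quoted above):

if m.Comment != "" {
	_, err := fmt.Fprintf(otw.w, "# HELP %s %s\n", m.Name, m.Comment)
	if err != nil {
		return err
	}
}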

return err
}

labels := make([]string, len(m.Labels))
benhoyt:

To avoid memory allocations, let's write these straight to the output (buffer) instead of building a slice of strings and then allocating/copying for the strings.Join. So something like (with error handling not shown):

io.WriteString(otw.w, m.Name)
if len(m.Labels) > 0 {
    io.WriteString(otw.w, "{")
    for i, label := range m.Labels {
        if i > 0 {
            io.WriteString(otw.w, ", ")
        }
        fmt.Fprintf(otw.w, "%s=%s", label.key, label.value)
    }
    io.WriteString(otw.w, "}")
}
fmt.Fprintf(otw.w, " %d", m.ValueInt64)
io.WriteString(otw.w, "\n")

Name: "pebble_service_active",
Type: metrics.TypeGaugeInt,
ValueInt64: int64(active),
Comment: "Indicates if the service is currently active (1) or not (0)",
benhoyt:

Let's use the same phrasing here as for the checks one, so "Whether the service is..."

return
}
buf := new(bytes.Buffer)
openTelemetryWriter := metrics.NewOpenTelemetryWriter(buf)
benhoyt:

Nit: given it's clear from the context, how about just writer := ...?

openTelemetryWriter := metrics.NewOpenTelemetryWriter(buf)
s.manager.WriteMetrics(openTelemetryWriter)
metrics := buf.String()
c.Assert(metrics, testutil.Contains, "pebble_service_start_count{service=test1} 1")
benhoyt:

Should we just assert on the entire output, for simplicity, and so we're testing the HELP and TYPE parts?

m.servicesLock.Lock()
defer m.servicesLock.Unlock()

for _, service := range m.services {
benhoyt:

It's probably best to make the order stable, so let's grab the service names into a slice and sort.Strings them before looping. Otherwise the order of /v1/metrics will change each time you refresh it, which is not ideal.

Similar for the checks WriteMetrics.
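For example, something like this sketch (assuming m.services is a map keyed by service name, and sort is imported):

names := make([]string, 0, len(m.services))
for name := range m.services {
	names = append(names, name)
}
sort.Strings(names) // stable output order for /v1/metrics
for _, name := range names {
	service := m.services[name]
	// ... emit this service's metrics as before ...
}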

m.checksLock.Lock()
defer m.checksLock.Unlock()

for _, info := range m.checks {
benhoyt:

Per similar comment on ServiceManager.WriteMetrics, let's sort by check name to ensure the output is stable.

@benhoyt (Contributor) commented Feb 13, 2025

@IronCore864 Can you please update the PR description to match our new approach?

Also, it probably goes without saying, but let's be sure not to merge this before the underlying identities PR that this builds on (#563) is reviewed for security and merged.
