
feat: metrics for services and checks #519

Open · IronCore864 wants to merge 27 commits into master
Conversation

@IronCore864 (Contributor) commented Nov 13, 2024

A self-implemented metrics module for Pebble. I referred to the Prometheus Go client a little, but not so much to runtime/metrics, which seemed too complicated.

According to the spec, it seems we only need the counter and gauge types, but I added the histogram type anyway. The Prometheus Go client has more types, but I don't think we will need them any time soon, so I left them out.

Binary size doesn't increase (no extra libraries used); still ~7.8MB with -trimpath -ldflags='-s -w'.
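For illustration, a counter in such a module can be as small as a mutex-protected integer. This is only a sketch of the general shape, not the PR's actual API (names like Counter, Inc, and Value are made up here):

package metrics

import "sync"

// Counter is a monotonically increasing metric guarded by a mutex.
type Counter struct {
	mu    sync.Mutex
	value int64
}

// Inc adds one to the counter.
func (c *Counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value++
}

// Value returns the current count.
func (c *Counter) Value() int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.value
}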

Manual test:

$ curl localhost:8000/metrics
# HELP my_counter A simple counter
# TYPE my_counter counter
my_counter 3

Some basic unit tests are added to make sure the functionality works as expected and there are no deadlock issues.

TODO: investigate how the official Prometheus Go client gathers basic metrics like CPU usage.

@IronCore864 requested a review from benhoyt November 13, 2024 09:11
@IronCore864 marked this pull request as ready for review November 13, 2024 09:11
@IronCore864 (Contributor, Author) commented Nov 14, 2024

Trying to integrate with Prometheus to make sure they are compatible:

[Screenshots: the counter and gauge metrics scraped into Prometheus]

@IronCore864 (Contributor, Author) commented Nov 14, 2024

Some investigation into the default metrics that come with the Prometheus Go client:

1 List of Metrics from Go runtime/metrics

About 80 items.

  • /cgo/go-to-c-calls:calls
  • /cpu/*
  • /gc/cycles/*
  • /gc/gogc:percent
  • /gc/gomemlimit:bytes
  • /gc/heap/*
  • /gc/limiter/last-enabled:gc-cycle
  • /gc/pauses:seconds
  • /gc/scan/*
  • /gc/stack/starting-size:bytes
  • /godebug/*
  • /memory/*
  • /sched/gomaxprocs:threads
  • /sched/goroutines:goroutines
  • /sched/latencies:seconds
  • /sched/pauses/*
  • /sync/mutex/wait/total:seconds

2 List of Default Metrics from prometheus/client_golang

About 40 items.

  • go_gc_duration_seconds{quantile="*"}
  • go_gc_duration_seconds_sum
  • go_gc_duration_seconds_count
  • go_gc_gogc_percent
  • go_gc_gomemlimit_bytes
  • go_goroutines
  • go_info{version="go1.23.1"}
  • go_memstats_*
  • go_sched_gomaxprocs_threads
  • go_threads
  • promhttp_metric_handler_requests_in_flight
  • promhttp_metric_handler_requests_total{code="200"}
  • promhttp_metric_handler_requests_total{code="500"}
  • promhttp_metric_handler_requests_total{code="503"}

3 How Are the Default Metrics from prometheus/client_golang Fetched/Calculated

3.1 From runtime and runtime/debug

  • go_goroutines
  • go_threads
  • go_gc_duration_seconds
  • go_memstats_last_gc_time_seconds
  • go_info

3.2 From runtime.MemStats with Some Calculations

  • go_memstats_alloc_bytes
  • go_memstats_alloc_bytes_total
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_frees_total
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_objects
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_last_gc_time_seconds
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_other_sys_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes

3.3 Directly from runtime/metrics but Go-Version-Dependent

  • go_gc_gogc_percent
  • go_gc_gomemlimit_bytes
  • go_sched_gomaxprocs_threads

3.4 Prometheus-Related

  • promhttp_metric_handler_requests_in_flight
  • promhttp_metric_handler_requests_total{code="200"}
  • promhttp_metric_handler_requests_total{code="500"}
  • promhttp_metric_handler_requests_total{code="503"}

These are produced by the promhttp handler itself.

4 Summary

If we want to include these metrics in our self-implemented metrics module, a lot of work that duplicates Prometheus needs to be done, some of it Go-version-dependent, which means extra work and tests. This creates a fair amount of operational overhead.

If these metrics are not necessary, we should probably go with a self-implemented module; but if they are nice to have, a 2.2MB increase in the Pebble binary size may be worth the price in the long run.
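For reference, here is a rough sketch of what fetching a few of these by hand looks like using only the standard library (the printed names mirror the Prometheus ones above; this is illustrative, not code from this PR):

package main

import (
	"fmt"
	"runtime"
	"runtime/metrics"
)

func main() {
	// Equivalents of go_goroutines and a couple of go_memstats_* metrics,
	// read via the runtime package (sections 3.1 and 3.2 above).
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("go_goroutines %d\n", runtime.NumGoroutine())
	fmt.Printf("go_memstats_heap_alloc_bytes %d\n", ms.HeapAlloc)
	fmt.Printf("go_memstats_sys_bytes %d\n", ms.Sys)

	// Metrics read directly from runtime/metrics (section 3.3 above); the
	// available names are Go-version-dependent, which is part of the
	// maintenance cost mentioned in the summary.
	samples := []metrics.Sample{
		{Name: "/gc/gogc:percent"},
		{Name: "/sched/gomaxprocs:threads"},
	}
	metrics.Read(samples)
	for _, s := range samples {
		if s.Value.Kind() == metrics.KindUint64 {
			fmt.Printf("%s %d\n", s.Name, s.Value.Uint64())
		}
	}
}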

@IronCore864 (Contributor, Author) commented Nov 26, 2024

A PoC to add a new type of identity (basicauth) for the metrics endpoint.

1 Manually Add an Identity from a YAML File

$ cat identity.yaml
identities:
    bob:
        access: read
        basicauth:
            username: foo
            password: bar
$ ./pebble add-identities --from ./identity.yaml
Added 1 new identity.

2 Start Pebble

$ ./pebble run --http=:4000
2024-11-26T14:27:53.682Z [pebble] HTTP API server listening on ":4000".
2024-11-26T14:27:53.682Z [pebble] Started daemon.
2024-11-26T14:27:53.686Z [pebble] POST /v1/services 63.667µs 400
2024-11-26T14:27:53.686Z [pebble] Cannot start default services: no default services

3 Access the Metrics Endpoint with the Newly Created Identity

$ curl -u foo:bar localhost:4000/metrics
# HELP my_counter A simple counter
# TYPE my_counter counter
my_counter 4

4 Access without Identity or with an Invalid Username/Password

$ curl localhost:4000/metrics
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}
$ curl -u invalid:invalid localhost:4000/metrics
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}

@IronCore864 (Contributor, Author) commented

According to the last spec review, the following changes have been made:

  1. The identity type is renamed from "basicauth" to just "basic".
  2. The username field in the basic identity is removed -- we can just use the name of the identity instead.
  3. A new access type "metrics" is added.
  4. Passwords are hashed using OpenSSL (TODO; see the sketch after this list).
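For item 4, the hashes would presumably be in the SHA-512-crypt format that openssl passwd -6 produces ($6$...). The go.mod change discussed later in this thread pulls in github.com/GehirnInc/crypt, which implements the same scheme; the snippet below is only my assumption of how that library would be used (the Generate/Verify signatures are from memory and may not match exactly):

package main

import (
	"fmt"

	"github.com/GehirnInc/crypt/sha512_crypt"
)

func main() {
	crypter := sha512_crypt.New()

	// Generate a SHA-512-crypt hash of the password "bar"; passing an
	// empty salt is assumed to make the library pick a random one.
	hash, err := crypter.Generate([]byte("bar"), nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(hash) // e.g. $6$<salt>$<hash>

	// Verify returns nil when the supplied password matches the hash.
	fmt.Println(crypter.Verify(hash, []byte("bar")) == nil)
}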

After the first round of refactoring, here are some results:

1 Basic Identity Name with Special Characters

$ cat identity.yaml
identities:
    "bob:asdf":
        access: read
        basic:
            password: bar
$ ./pebble add-identities --from ./identity.yaml
error: identity "bob:asdf" invalid: identity name "bob:asdf" contains invalid characters (only
       alphanumeric, underscore, and hyphen allowed)

2 Basic Identity without Username

$ cat identity.yaml
identities:
    bob:
        access: read
        basic:
            password: bar
$ ./pebble add-identities --from ./identity.yaml
Added 1 new identity.

3 Basic Identity with Access Type "metrics" (Compared with "read" and "admin")

$ # access type: read
$ cat identity.yaml
identities:
    bob:
        access: read
        basic:
            password: bar
$ ./pebble add-identities --from ./identity.yaml
Added 1 new identity.
$ # open access is fine
$ curl -u bob:bar localhost:4000/v1/health
{"type":"sync","status-code":200,"status":"OK","result":{"healthy":true}}
$ # no access on the metrics endpoint
$ curl -u bob:bar localhost:4000/metrics
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}
$ # access type: metrics
$ cat identity.yaml
identities:
    bob:
        access: metrics
        basic:
            password: bar
$ ./pebble update-identities --from ./identity.yaml
Updated 1 identity.
$ # open access is fine
$ curl -u bob:bar localhost:4000/v1/health
{"type":"sync","status-code":200,"status":"OK","result":{"healthy":true}}
$ # accessing metrics
$ curl -u bob:bar localhost:4000/metrics
# HELP my_counter Total number of something processed.
# TYPE my_counter counter
my_counter{operation=read,status=success} 11
my_counter{operation=write,status=success} 22
my_counter{operation=read,status=failed} 11
# HELP my_gauge Current value of something.
# TYPE my_gauge gauge
my_gauge{sensor=temperature} 28.12
$ # no access on other endpoints
$ curl -u bob:bar localhost:4000/v1/changes
{"type":"error","status-code":401,"status":"Unauthorized","result":{"message":"access denied","kind":"login-required"}}
$ # access type: admin
$ cat identity.yaml
identities:
    bob:
        access: admin
        basic:
            password: bar
$ ./pebble update-identities --from ./identity.yaml
Updated 1 identity.
$ # admin can read metrics
$ curl -u bob:bar localhost:4000/v1/metrics
# HELP my_counter Total number of something processed.
# TYPE my_counter counter
my_counter{operation=read,status=success} 176
my_counter{operation=write,status=success} 352
my_counter{operation=read,status=failed} 176
# HELP my_gauge Current value of something.
# TYPE my_gauge gauge
my_gauge{sensor=temperature} 24.48

TODO: password hashing.

@IronCore864 (Contributor, Author) commented

Notes:

We need to handle the memory usage issue in the future. To make this easier, after discussion we decided not to use a self-implemented Prometheus-like module that stores the metrics centrally, but rather to store the metrics on existing structs like serviceData and CheckInfo. In the last commit, the service-related metrics are implemented on serviceData; the check-related metrics are still to be implemented.
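To make the new approach concrete, here is a hedged sketch of the shape it takes: each manager keeps its metrics on its existing per-service/per-check data and exposes a WriteMetrics method that hands Metric values to a writer. Everything is collapsed into one package for brevity, and the type and field names are partly guessed from the review snippets further down, not copied from the PR:

package sketch

import "sync"

// Metric is a single sample handed to a metrics writer.
type Metric struct {
	Name    string
	Comment string
	Labels  map[string]string
	Value   int64
}

// Writer is implemented by e.g. an OpenTelemetry text-format writer.
type Writer interface {
	Write(m Metric) error
}

// serviceData already exists per service; metric counters live on it
// rather than in a central registry (startCount is an assumed field).
type serviceData struct {
	startCount int64
}

type ServiceManager struct {
	servicesLock sync.Mutex
	services     map[string]*serviceData
}

// WriteMetrics walks the manager's own state and emits one metric per service.
func (m *ServiceManager) WriteMetrics(w Writer) error {
	m.servicesLock.Lock()
	defer m.servicesLock.Unlock()
	for name, svc := range m.services {
		err := w.Write(Metric{
			Name:    "pebble_service_start_count",
			Comment: "Number of times the service has started",
			Labels:  map[string]string{"service": name},
			Value:   svc.startCount,
		})
		if err != nil {
			return err
		}
	}
	return nil
}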

@IronCore864 (Contributor, Author) commented Jan 24, 2025

Taking some notes before the holiday season:

Currently, the check metrics only work while the check is succeeding. When the check fails, both counters are reset to 0 and never increase again. This needs more debugging; it may have something to do with updateCheckInfo.

@IronCore864 marked this pull request as draft January 24, 2025 10:36
@benhoyt (Contributor) left a comment

Several comments, most importantly the design of the Metric type and the WriteMetrics interface.

Review comments were left on internals/daemon/api_metrics.go, internals/daemon/daemon.go, internals/overlord/checkstate/manager.go, and internals/overlord/servstate/handlers.go (all now resolved, most marked outdated).
@benhoyt (Contributor) commented Feb 11, 2025

@IronCore864 Can you please merge from master now that the interface{} to any PR is merged? #568

@IronCore864 (Contributor, Author) commented

In the latest commits, m.checks now stores pointers so that it's easier to update the metrics. Plus, tests are added:

  • OpenTelemetryWriter: some basic tests are added.
  • Check metrics: start the checks, assert that the counters are correct, and then stop the checks. This calls updateCheckInfo, which should not reset the metrics.
  • Service metrics: since the metrics are stored in serviceData, which isn't exported, testing them is difficult. As a workaround, I created a writer that writes to a buffer and then checks the OpenTelemetry-format output directly.

@IronCore864 requested a review from benhoyt February 12, 2025 12:27
@IronCore864 marked this pull request as ready for review February 12, 2025 12:27
@IronCore864 changed the title from "poc: a metrics module for pebble" to "feat: metrics for services and checks" Feb 12, 2025
@gruyaume left a comment

I was curious about this change after listening to the mid-cycle roadmap presentation. Nobody asked for this review, so please feel free to ignore it.

@@ -3,6 +3,7 @@ module github.com/canonical/pebble
go 1.22

require (
github.com/GehirnInc/crypt v0.0.0-20230320061759-8cc1b52080c5

Is this package maintained? Are we confident about introducing it in every charm / pebble deployment out there?

Contributor:
It hasn't been changed recently, but the part of it we're using is small and straightforward, and the algorithm being implemented is stable and well-documented (SHA-crypt). So I would say it's "stable" rather than "unmaintained". I'm actually contemplating vendoring this (the core part of it is only ~100 LoC). But open to other ideas/concerns too.

Contributor:
We've done some diligence on it, see #563 (comment) for example.

@benhoyt (Contributor) left a comment

Mostly minor comments, though a few structural things. Main thing that needs discussion is whether the perform_check and recover_check counts are actually going to provide the monitoring we want -- let's discuss.

}

func (r metricsResponse) ServeHTTP(w http.ResponseWriter, req *http.Request) {
openTelemetryWriter := metrics.NewOpenTelemetryWriter(w)
benhoyt:

I think we should still buffer here with a bytes.Buffer, so that not every individual write/fprintf call in the metrics writer is doing a separate OS call to write to the network. In addition, it'll mean it either all gets written or none gets written in case of errors, which is probably better -- and we can write the internal error properly. So something like:

var buf bytes.Buffer
metricsWriter := metrics.NewOpenTelemetryWriter(&buf)
r.svcMgr.WriteMetrics(metricsWriter) // with error handling
r.chkMgr.WriteMetrics(metricsWriter) // with error handling
buf.WriteTo(w) // with error handling


err := r.svcMgr.WriteMetrics(openTelemetryWriter)
if err != nil {
logger.Noticef("Cannot write to HTTP response: %v", err.Error())
benhoyt:

With buffering, this is no longer writing to the HTTP response. Also, you don't need err.Error() with %v or %s. Suggestion (here, and similar below):

Suggested change:
- logger.Noticef("Cannot write to HTTP response: %v", err.Error())
+ logger.Noticef("Cannot write service metrics: %v", err)

case TypeCounterInt:
metricType = "counter"
case TypeGaugeInt:
metricType = "gauge"
benhoyt:

For defensive programming, probably best to add a default case with a panic(fmt.Sprintf("invalid metric type %v", m.Type)) or similar.

Actually, let's put this type int -> string switch in a String() string method on MetricType instead, and put the default+panic in there.
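In other words, something like the following sketch (it assumes MetricType is an integer type with the TypeCounterInt and TypeGaugeInt constants shown above, and that fmt is already imported in the file):

// String converts the metric type to its text-format name, panicking on
// unknown values as a defensive-programming measure.
func (t MetricType) String() string {
	switch t {
	case TypeCounterInt:
		return "counter"
	case TypeGaugeInt:
		return "gauge"
	default:
		// Use int(t) rather than %v here to avoid re-entering String itself.
		panic(fmt.Sprintf("invalid metric type %d", int(t)))
	}
}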

Comment on lines +74 to +77
_, err := fmt.Fprintf(otw.w, "# HELP %s %s\n", m.Name, m.Comment)
if err != nil {
return err
}
benhoyt:

Let's surround this with if m.Comment != "" { ... } as presumably the comment is optional.
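So, something like this (a sketch based on the lines quoted above):

if m.Comment != "" {
	_, err := fmt.Fprintf(otw.w, "# HELP %s %s\n", m.Name, m.Comment)
	if err != nil {
		return err
	}
}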

return err
}

labels := make([]string, len(m.Labels))
benhoyt:

To avoid memory allocations, let's write these straight to the output (buffer) instead of building a slice of strings and then allocating/copying for the strings.Join. So something like (with error handling not shown):

io.WriteString(otw.w, m.Name)
if len(m.Labels) > 0 {
    io.WriteString(otw.w, "{")
    for i, label := range m.Labels {
        if i > 0 {
            io.WriteString(otw.w, ", ")
        }
        fmt.Fprintf(otw.w, "%s=%s", label.key, label.value)
    }
    io.WriteString(otw.w, "}")
}
fmt.Fprintf(otw.w, " %d", m.ValueInt64)
io.WriteString(otw.w, "\n")

Name: "pebble_service_active",
Type: metrics.TypeGaugeInt,
ValueInt64: int64(active),
Comment: "Indicates if the service is currently active (1) or not (0)",
benhoyt:

Let's use the same phrasing here as for the checks one, so "Whether the service is..."

return
}
buf := new(bytes.Buffer)
openTelemetryWriter := metrics.NewOpenTelemetryWriter(buf)
benhoyt:

Nit: given it's clear from the context, how about just writer := ...?

openTelemetryWriter := metrics.NewOpenTelemetryWriter(buf)
s.manager.WriteMetrics(openTelemetryWriter)
metrics := buf.String()
c.Assert(metrics, testutil.Contains, "pebble_service_start_count{service=test1} 1")
benhoyt:

Should we just assert on the entire output, for simplicity, and so we're testing the HELP and TYPE parts?

m.servicesLock.Lock()
defer m.servicesLock.Unlock()

for _, service := range m.services {
benhoyt:

It's probably best to make the order stable, so let's grab the service names into a slice and sort.Strings them before looping. Otherwise the order of /v1/metrics will change each time you refresh it, which is not ideal.

Similar for the checks WriteMetrics.
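For example, something like this sketch (assuming m.services is a map keyed by service name, and sort is imported):

names := make([]string, 0, len(m.services))
for name := range m.services {
	names = append(names, name)
}
sort.Strings(names) // stable output order for /v1/metrics
for _, name := range names {
	service := m.services[name]
	// ... emit this service's metrics as before ...
}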

m.checksLock.Lock()
defer m.checksLock.Unlock()

for _, info := range m.checks {
benhoyt:

Per similar comment on ServiceManager.WriteMetrics, let's sort by check name to ensure the output is stable.

@benhoyt (Contributor) commented Feb 13, 2025

@IronCore864 Can you please update the PR description to match our new approach?

Also, it probably goes without saying, but let's be sure not to merge this before the underlying identities PR that this builds on (#563) is reviewed for security and merged.
