This repository has been archived by the owner on Nov 7, 2022. It is now read-only.

Reimplement prometheus receiver #572

Conversation

fivesheep
Contributor

Problems with the current implementation of prometheus receiver

The current prometheus receiver is based on the https://github.com/orijtech/promreceiver library. In the past couple of months, we have found a number of issues with this implementation and have reported them either in Gitter or the library's git repo, including:

Why reimplement instead of submitting fixes

So far, some of the above issues have been fixed, some not. However, this is not the whole story. After diving deeper into the code, we found that the library actually has a number of serious bugs, and we concluded that without a complete rewrite it's not going to work properly:

  1. A global appender is used, which can easily run into race conditions when multiple scrape targets are configured.
  2. Metrics are not grouped in their original form: a single metric group from Prometheus can be split into multiple groups (see the illustrative snippet after this list). This is a big issue if one wants to configure a Prometheus exporter on the other end of the agent feeding a Prometheus server, since metric groups with the same name are not allowed.
  3. The library tries to cache all the data scraped from the remote metrics endpoints and then feed the delta downstream, which doesn't seem to be what is expected. Even worse, it uses a global cache to store the data, which can easily be corrupted when scraping multiple targets.
  4. The library also tries to use the cache to generate STDDEV for histograms, which is impossible because the original data doesn't provide the individual sample points.
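To make issue 2 concrete, here is a hypothetical illustration (not captured output) of what the broken grouping can look like on the exporter side: a single counter family from the target is emitted as two metric groups with the same name, which a Prometheus server scraping the exporter will reject:

# HELP promdemo_http_requests The total number of HTTP requests.
# TYPE promdemo_http_requests counter
promdemo_http_requests{code="200"} 1027
# HELP promdemo_http_requests The total number of HTTP requests.
# TYPE promdemo_http_requests counter
promdemo_http_requests{code="400"} 3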

More details on what this PR tries to address can be found in the following document, which is also part of the PR: https://github.com/fivesheep/opencensus-service/blob/reimplement-prometheus-receiver/receiver/prometheusreceiver/README.md

An example of the issues (NaN values):

With the following ocagent config:

receivers:
    prometheus:
        config:
            global:
                scrape_interval: 120s
                scrape_timeout: 8s
            scrape_configs:
            - job_name: 'cadvisor'
              scrape_interval: 15s
              static_configs:
              - targets: ['localhost:8888']

exporters:
    prometheus:
        namespace: "promdemo"
        address: "localhost:8889"

Compare the outputs from the original Prometheus endpoint and the ocagent Prometheus exporter.

Original (cadvisor):

...
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 5.6446976e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.56037276274e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.501503488e+09

From the ocagent Prometheus exporter:

...
# HELP promdemo_process_resident_memory Resident memory size in bytes.
# TYPE promdemo_process_resident_memory counter
promdemo_process_resident_memory NaN
# HELP promdemo_process_start_time Start time of the process since unix epoch in seconds.
# TYPE promdemo_process_start_time counter
promdemo_process_start_time NaN
# HELP promdemo_process_virtual_memory Virtual memory size in bytes.
# TYPE promdemo_process_virtual_memory counter
promdemo_process_virtual_memory NaN

@fivesheep fivesheep requested review from pjanotti and a team as code owners June 12, 2019 23:07
@codecov

codecov bot commented Jun 13, 2019

Codecov Report

Merging #572 into master will increase coverage by 2.03%.
The diff coverage is 94.78%.


@@            Coverage Diff             @@
##           master     #572      +/-   ##
==========================================
+ Coverage   66.41%   68.45%   +2.03%     
==========================================
  Files          86       91       +5     
  Lines        5518     5934     +416     
==========================================
+ Hits         3665     4062     +397     
- Misses       1645     1660      +15     
- Partials      208      212       +4
Impacted Files Coverage Δ
...iver/prometheusreceiver/internal/metricsbuilder.go 100% <100%> (ø)
receiver/prometheusreceiver/internal/ocastore.go 71.42% <71.42%> (ø)
receiver/prometheusreceiver/metrics_receiver.go 70.17% <74.19%> (-4.34%) ⬇️
receiver/prometheusreceiver/internal/metadata.go 75% <75%> (ø)
receiver/prometheusreceiver/internal/logger.go 77.77% <77.77%> (ø)
...eceiver/prometheusreceiver/internal/transaction.go 96.36% <96.36%> (ø)
... and 1 more


Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@songy23 songy23 requested a review from dinooliva June 13, 2019 21:31
@pjanotti

Hi @fivesheep, thanks for providing the context for this PR. It brings in the whole functionality, so it is on the larger side. We will need a bit of time to review all the pieces, but I will try to provide some feedback as we go.

@songy23
Contributor

songy23 commented Jun 13, 2019

Agree this PR is a bit too large. Consider splitting it into multiple smaller PRs, like one for metricbuilder, one for logger, etc.

@odeke-em
Member

Thank you for working on this @fivesheep and for the fixes and overhaul! @pjanotti pinged me about it and my apologies for not being responsive on my repo as I've been swamped with work. I think that to get this implementation in here, a few suggestions:
a) @fivesheep perhaps you can put the new exporter under your repos, license it for OpenCensus Authors under Apache 2.0 for the time being and when @songy23 has finished creating a repository under https://github.com/census-ecosystem it can be migrated there and reviewed independently
b) Concurrently with step a) @songy23 please help create the vanity URL contrib.go.opencensus.io/receiver/prometheus and look at adding @fivesheep's exporter behind that
c) When https://github.com/census-ecosystem/opencensus-go-prometheus-receiver has been created, then you can start transferring the code piecemeal and have the respective small code reviews
d) When step c) is completed, the new receiver can reside behind the vanity URL and the repository from a) can be sunset.

I say this because the logic will take some time to review, and during that time this opencensus-service repo will have lots of changes in it. Working on it independently ensures you won't be blocked, you can iterate faster, build test suites, etc., while correctness still comes first :)

Hope that this can help with increasing the velocity of getting things in.

@fivesheep
Contributor Author

@pjanotti @songy23 and @odeke-em thanks for the response. I would love to create an individual repo to host the library; however, I cannot make this decision by myself because of my company's open source policy. I would need to ask our open source committee whether I am allowed to do that, or go through the other open source request process, which can take a very long time to get approved.

@pjanotti

pjanotti commented Jun 14, 2019

@fivesheep we would like to eventually make this part of the core, as we are going to leverage OC service as the starting point of https://github.com/open-telemetry/opentelemetry-service/tree/master/docs and Prometheus is part of the core. The contrib was proposed as a temporary stage to make things speedier in the meantime; if that is not an option, then we can make this work directly on core.

Contributor

@rghetia rghetia left a comment


Minor comments so far. Still more to review.



### Metric Value Mapping
In OpenCensus, metrics value types can be either `int64` or `float64`, while in in Prometheus the value can be safety assume it's always `float64` based on the
Contributor


nit: s/safety/safely/

}
```

*Note: `tsOc` is a timestamp object representing the current ts*
Contributor


StartTimestamp cannot be current timestamp for counters. It should be the timestamp when metric collection started.

Contributor Author


You are right; the current ts here was actually referring to the same thing. I will update the description as per your comment to reduce the confusion.

I was trying to differentiate the timestamps provided by the original metrics endpoint from the ones generated by the scraperLoop (which is what you were referring to), and by "current" I meant the latter.

An example of such metric output is shown below:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000

Once the honor_timestamps flag of the scrape config is set to true, which is the default value, this timestamp is respected; the logic can be found at: https://github.com/prometheus/prometheus/blob/0c0638b080cf1565fc4e5b7ee3fd35c36ae5832a/scrape/scrape.go#L1084-L1091

In the implementation of the prometheus receiver, whichever timestamp is provided to the Add/AddFast methods will be used.
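For illustration, here is a minimal sketch of that behavior; the type and method names below are hypothetical and the signature is simplified, so this is not the PR's actual transaction/appender code:

package internal

import "time"

// sketchSample is a hypothetical record of one scraped sample.
type sketchSample struct {
	metric string
	tsMs   int64 // milliseconds since epoch, exactly as handed in by the scrape loop
	value  float64
}

// sketchAppender is a simplified stand-in for the receiver's appender.
type sketchAppender struct {
	samples []sketchSample
}

// Add keeps whatever timestamp the scrape loop passed in. With honor_timestamps
// enabled (the default) that is the timestamp exposed by the target; when it is
// disabled, Prometheus substitutes the scrape time before calling Add, so the
// receiver never needs to consult its own clock here.
func (a *sketchAppender) Add(metric string, tsMs int64, value float64) error {
	if tsMs == 0 {
		// purely defensive fallback; the scrape loop normally always supplies a timestamp
		tsMs = time.Now().UnixNano() / int64(time.Millisecond)
	}
	a.samples = append(a.samples, sketchSample{metric: metric, tsMs: tsMs, value: value})
	return nil
}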

Type: metricspb.MetricDescriptor_CUMULATIVE_DISTRIBUTION,
LabelKeys: []*metricspb.LabelKey{
{Key: "method"},
want2 := []data.MetricsData{
Contributor


want2 and want1 are the same.

for _, want := range wantPermutations {
if !reflect.DeepEqual(got, want) {
t.Errorf("different metric got:\n%v\nwant:\n%v\n", string(exportertest.ToJSON(got)),
string(exportertest.ToJSON(want)))
Contributor


Is the purpose here to fail if either want1 or want2 doesn't match? (Assuming that want1 and want2 are different; see the previous comment, they don't seem to be different.)

Contributor Author


This test was inherited from the original implementation of promreceiver; @odeke-em might have more insight on this test.

@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@fivesheep fivesheep force-pushed the reimplement-prometheus-receiver branch from 0cc9c12 to 7a8d3c9 Compare June 18, 2019 04:35
@googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

groups := b.currentDpGroupOrdered()
if len(groups) == 0 {
// this can happen if only sum or count is added, but not following data points
return errors.New("no data point added to summary")
Contributor


For Python, the Prometheus client code is incomplete for summaries and only exports sum and count. Consider changing this to a log message and returning nil; otherwise the entire scrape will be lost for any Python application with a Summary metric.

Contributor Author


Good catch. I have observed similar behavior very recently from a Micrometer Prometheus endpoint where a summary only contains sum and count.
Returning nil, as suggested, is one way to solve the issue; however, the downside is that we might lose some important metrics which customers are interested in. Alternatively, we could convert this kind of metric into something like two counters, or does the OpenCensus format allow Summary snapshots with no percentile values?

Contributor


Additionally, the Prometheus Go client now exports empty summaries by default.

I believe that exporting a summary snapshot with empty percentile values makes the most sense.

@songy23 @rghetia - wdyt?

Contributor


I agree with empty percentile values.
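As a rough sketch of that option (assuming the opencensus-proto metricspb types; the helper name is hypothetical and this is not the PR's actual metricsbuilder code), a summary that only exposes _sum and _count could be converted like this:

package internal

import (
	metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
	"github.com/golang/protobuf/ptypes/wrappers"
)

// buildPartialSummary is a hypothetical helper showing how a Prometheus summary
// with only sum and count could still produce an OpenCensus SummaryValue whose
// snapshot simply carries no percentile values.
func buildPartialSummary(sum float64, count int64) *metricspb.SummaryValue {
	return &metricspb.SummaryValue{
		Sum:   &wrappers.DoubleValue{Value: sum},
		Count: &wrappers.Int64Value{Value: count},
		Snapshot: &metricspb.SummaryValue_Snapshot{
			// no quantiles were scraped, so the percentile list stays empty
			PercentileValues: nil,
		},
	}
}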

Contributor Author


Will do. I am also going to drop gaugehistogram support from this PR until an official spec of this data type is provided. As hinted by @dinooliva, I checked some code from the Python Prometheus client and found it has code related to this gaugehistogram type from the PR prometheus/client_python#306, and it seems to be a bit different from regular histograms: instead of using _count/_sum it uses _gcount/_gsum as suffixes. That PR also attaches a link to a draft spec for OpenMetrics, the successor to the Prometheus exposition format, which can be found at https://docs.google.com/document/d/1KwV0mAXwwbvvifBvDKH_LU1YjyXE_wxCkHNoCGq1GX0/edit#heading=h.1cvzqd4ksd23

It also has some interesting new features, like _created to indicate the start time of a metric; however, this spec is still in the draft stage, and I am not sure when it will be published.
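For reference, a gaugehistogram exposition under the OpenMetrics draft looks roughly like the following (a hypothetical example based on the draft; the exact syntax may differ in the published spec):

# TYPE pending_requests gaugehistogram
pending_requests_bucket{le="1.0"} 2
pending_requests_bucket{le="+Inf"} 5
pending_requests_gcount 5
pending_requests_gsum 3.2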

}
})

t.Run("Rollback dose nothing", func(t *testing.T) {
Contributor


nit: does

`

// https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md#text-format-example
var testData1 = `
Contributor


Consider splitting the different metric types into different tests and adding any missing metric types (e.g. gauge).

Also consider adding more edge cases (e.g. histogram with 1 bucket).

@dinooliva
Contributor

We'd like to use this code in a project that we're currently working on, so I've created two issues (#588 and #589) for tracking the identified issues with the code.

We plan to merge the code and will fix the issues in subsequent PRs.
