Reimplement prometheus receiver #572
Conversation
Codecov Report
@@ Coverage Diff @@
## master #572 +/- ##
==========================================
+ Coverage 66.41% 68.45% +2.03%
==========================================
Files 86 91 +5
Lines 5518 5934 +416
==========================================
+ Hits 3665 4062 +397
- Misses 1645 1660 +15
- Partials 208 212 +4
Continue to review full report at Codecov.
Hi @fivesheep, thanks for providing the context for this PR. It brings the whole functionality, so it is on the larger side. We will need a bit of time to review all the pieces, but I will try to provide some feedback as we go.
Agree this PR is a bit too large. Consider splitting it into multiple smaller PRs, like one for
Thank you for working on this @fivesheep and for the fixes and overhaul! @pjanotti pinged me about it and my apologies for not being responsive on my repo as I've been swamped with work. I think that to get this implementation in here, a few suggestions: I say this because the logic will take some time to review, and during that time this opencensus-service will have lots of changes in it. Working on it independently ensures you won't be blocked, lets you iterate faster, build test suites, etc., and also that correctness supersedes :) Hope that this can help with increasing the velocity of getting things in.
@pjanotti @songy23 and @odeke-em thanks for the response. I would love to create an individual repo to host the library; however, I cannot make this decision by myself because of my company's open-source policy. I would need to ask our open-source committee if I am allowed to do that, or go through the other open-source request process, which can take a very long time to get approved.
@fivesheep we would like to eventually make this part of the core, as we are going to leverage the OC service as the starting point of https://github.com/open-telemetry/opentelemetry-service/tree/master/docs and Prometheus is part of the core. The contrib repo was proposed as a temporary stage to make things speedier in the meantime; if that is not the case then we can make this work directly on core.
minor comments so far. Still more to review.
### Metric Value Mapping
In OpenCensus, metrics value types can be either `int64` or `float64`, while in Prometheus the value can be safety assume it's always `float64` based on the
nit: s/safety/safely/
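For reference, the mapping described above amounts to always emitting the double value type on the OpenCensus side. A minimal, illustrative sketch using the `metrics/v1` proto types (this is not code from the PR, and the helper name is made up):

```go
package main

import (
	"fmt"

	metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
)

// toDoublePoint converts a scraped Prometheus sample value (always float64)
// into an OpenCensus point carrying a double value.
func toDoublePoint(v float64) *metricspb.Point {
	return &metricspb.Point{
		Value: &metricspb.Point_DoubleValue{DoubleValue: v},
	}
}

func main() {
	fmt.Println(toDoublePoint(1027).GetDoubleValue()) // 1027
}
```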
*Note: `tsOc` is a timestamp object representing the current ts* |
StartTimestamp cannot be current timestamp for counters. It should be the timestamp when metric collection started.
You are right, the `current ts` here was actually referring to the same thing; I will update the description as per your comment to reduce the confusion. I was trying to differentiate the timestamps provided by the original metric endpoint from the ones generated by the scraperLoop, which is what you were referring to, and that is what I meant by `current`.
An example of such metric output is shown below:
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"} 3 1395066363000
Once the `honor_timestamps` flag of the scrape config is set to true (which is the default value), this timestamp is respected; the logic can be found at: https://github.com/prometheus/prometheus/blob/0c0638b080cf1565fc4e5b7ee3fd35c36ae5832a/scrape/scrape.go#L1084-L1091
In the implementation of the Prometheus receiver, whichever timestamp is provided to the Add/AddFast methods will be used.
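To make that last point concrete, here is a minimal sketch of an appender that simply trusts whatever timestamp the scrape loop hands it. The types and names below are simplified and hypothetical, shaped roughly like the Prometheus `Add(labels, timestamp, value)` call; they are not the actual receiver code:

```go
package main

import (
	"fmt"
	"time"
)

// labelSet is a simplified, hypothetical stand-in for a Prometheus label set.
type labelSet map[string]string

// passthroughAppender records whichever timestamp it is given. With
// honor_timestamps: true (the default), that is the timestamp from the
// exposition text (e.g. 1395066363000 above); otherwise it is the scrape time.
type passthroughAppender struct{}

func (a *passthroughAppender) Add(ls labelSet, tMillis int64, v float64) error {
	// No substitution of the receiver's own clock here; tMillis is used as-is.
	ts := time.Unix(0, tMillis*int64(time.Millisecond)).UTC()
	fmt.Printf("labels=%v value=%v ts=%s\n", ls, v, ts)
	return nil
}

func main() {
	app := &passthroughAppender{}
	// Values copied from the example exposition output above.
	_ = app.Add(labelSet{"method": "post", "code": "200"}, 1395066363000, 1027)
}
```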
Type: metricspb.MetricDescriptor_CUMULATIVE_DISTRIBUTION,
LabelKeys: []*metricspb.LabelKey{
{Key: "method"},
want2 := []data.MetricsData{
want2 and want1 are the same.
for _, want := range wantPermutations {
	if !reflect.DeepEqual(got, want) {
		t.Errorf("different metric got:\n%v\nwant:\n%v\n", string(exportertest.ToJSON(got)),
			string(exportertest.ToJSON(want)))
is the purpose here to fail if either want1 or want2 doesn't match? (assuming that want1 and want2 are different. see prev comment. they don't seem to be different)
This test was inherited from the original implementation of pmreceiver, @odeke-em might have more insight on this test.
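For reference, if the intent were "pass as long as the output matches any one acceptable permutation", the check would need to look more like the sketch below. This is illustrative only and not code from this PR; `wantPermutations` is borrowed from the excerpt above and everything else is hypothetical:

```go
package main

import (
	"fmt"
	"reflect"
)

// matchesAnyPermutation reports whether got deep-equals at least one of the
// acceptable permutations; a test using it would only fail when none match,
// instead of failing as soon as a single permutation differs.
func matchesAnyPermutation(got interface{}, wantPermutations []interface{}) bool {
	for _, want := range wantPermutations {
		if reflect.DeepEqual(got, want) {
			return true
		}
	}
	return false
}

func main() {
	got := []int{1, 2}
	fmt.Println(matchesAnyPermutation(got, []interface{}{[]int{2, 1}, []int{1, 2}})) // true
}
```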
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info.
CLAs look good, thanks! ℹ️ Googlers: Go here for more info.
groups := b.currentDpGroupOrdered()
if len(groups) == 0 {
	// this can happen if only sum or count is added, but not following data points
	return errors.New("no data point added to summary")
For Python, the Prometheus client code is incomplete for summaries and only exports sum and count. Consider changing this to a log message and returning nil; otherwise the entire scrape will be lost for any Python application with a Summary metric.
Good catch, I have observed similar behavior very recently from a Micrometer Prometheus endpoint where a summary only contains sum and count.
Returning nil, as suggested, is one way to solve the issue; however, the downside is that we might lose some important metrics which customers are interested in. Alternatively, we could convert these kinds of metrics into something like two counters, or does the OpenCensus format allow Summary snapshots with no percentile values?
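For what it's worth, here is a rough sketch of what such a point could look like if empty percentile values are acceptable, using the `metrics/v1` types from opencensus-proto (the helper function and the numbers are purely illustrative, not code from this PR):

```go
package main

import (
	"fmt"

	metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
	"github.com/golang/protobuf/ptypes/wrappers"
)

// summaryPointWithoutPercentiles builds a summary point carrying only the
// sum and count reported by the client (e.g. the Python client), with an
// empty percentile list instead of dropping the whole scrape.
func summaryPointWithoutPercentiles(count int64, sum float64) *metricspb.Point {
	return &metricspb.Point{
		Value: &metricspb.Point_SummaryValue{
			SummaryValue: &metricspb.SummaryValue{
				Count: &wrappers.Int64Value{Value: count},
				Sum:   &wrappers.DoubleValue{Value: sum},
				Snapshot: &metricspb.SummaryValue_Snapshot{
					// No quantiles were exposed, so leave the list empty.
					PercentileValues: []*metricspb.SummaryValue_Snapshot_ValueAtPercentile{},
				},
			},
		},
	}
}

func main() {
	p := summaryPointWithoutPercentiles(900, 1004.3)
	fmt.Println(p.GetSummaryValue().GetCount().GetValue()) // 900
}
```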
I agree with empty percentile values.
Will do. I am also going to drop gaugehistogram support from this PR until an official spec of this data type is provided. As hinted by @dinooliva, I checked some code from the Python Prometheus client and found it has some code related to this gaugehistogram type from the PR prometheus/client_python#306. It seems to be a bit different from regular histograms: instead of the `_count`/`_sum` suffixes it uses `_gcount`/`_gsum`. That PR also links to a draft spec for OpenMetrics, the successor to the Prometheus exposition format, which can be found at https://docs.google.com/document/d/1KwV0mAXwwbvvifBvDKH_LU1YjyXE_wxCkHNoCGq1GX0/edit#heading=h.1cvzqd4ksd23
It also has some interesting new features like `_created` to indicate the start time of metrics; however, this spec is still in draft stage, and I am not sure when it will be published.
t.Run("Rollback dose nothing", func(t *testing.T) { |
nit: does
// https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md#text-format-example
var testData1 = `
Consider splitting the different metric types into different tests and add any missing metric types (e.g. gauge).
Also consider adding more edge cases (e.g. histogram with 1 bucket).
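Along the lines of the "histogram with 1 bucket" suggestion above, an additional test input could look something like the following sketch, in the same style as `testData1`; the metric name and numbers are made up for illustration:

```go
// A histogram whose only bucket is the mandatory +Inf one, so count and the
// +Inf bucket value must agree; useful as a minimal edge case.
var testDataSingleBucketHistogram = `
# HELP rpc_duration_seconds RPC latency distribution.
# TYPE rpc_duration_seconds histogram
rpc_duration_seconds_bucket{le="+Inf"} 144320
rpc_duration_seconds_sum 53423
rpc_duration_seconds_count 144320
`
```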
Problems with the current implementation of prometheus receiver
The current prometheus receiver is based on the https://github.com/orijtech/promreceiver library. In the past couple of months, we have found a number of issues with this implementation and have reported them either in Gitter or in the library's git repo, including:
Why reimplement instead of submitting fixes
So far, some of the above issues have been fixed, some not. However, this is not the whole story. After diving deeper into the code, we found that the library actually has a number of serious bugs, and we concluded that without a complete rewrite it is not going to work properly:
More details on what this PR is trying to address can be found in the following document, which is also part of the PR: https://github.com/fivesheep/opencensus-service/blob/reimplement-prometheus-receiver/receiver/prometheusreceiver/README.md
An example of the issues (NaN values):
With the following ocagent config:
Compare the outputs from the original Prometheus endpoint and the ocagent Prometheus exporter.
Original (cadvisor):
From the ocagent Prometheus exporter: