-
Notifications
You must be signed in to change notification settings - Fork 601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wathola Tracing for upgrade tests #6219
Conversation
Codecov Report
@@ Coverage Diff @@
## main #6219 +/- ##
=======================================
Coverage 82.18% 82.18%
=======================================
Files 231 231
Lines 7787 7787
=======================================
Hits 6400 6400
Misses 937 937
Partials 450 450 Continue to review full report at Codecov.
|
/retest |
2 similar comments
/retest |
/retest |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: mgencur The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
The failure in reconciler-tests is unrelated. I will re-run the tests later after getting some feedback on this PR. |
@mgencur: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/cc @cardil |
Thanks, @mgencur, for doing this. I will review it shortly... |
@cardil gentle ping. Two weeks have passed... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @mgencur! A lot of ❤️ for doing this.
I found only some minor nits.
test/lib/client.go
Outdated
tracingEnv corev1.EnvVar | ||
loggingEnv *corev1.EnvVar | ||
TracingCfg string | ||
LoggingCfg string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for this change?
Is it the usage in config.toml
? If so, It might have been used there as (only make it public):
tracingConfig = '{{- .TracingEnv.Value -}}'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It was for that reason. I somehow find it cleaner if the client holds the config instead of an EnvVar. The function that produces the config is called getTracingConfig
. It can then create the EnvVar (which is only done in a single place anyway) or it can pass the config to Wathola. This change doesn't bring any complexity. But I don't have a strong opinion about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not forcing this to rollback.
// Give time to send tracing information. | ||
time.Sleep(5 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this static wait, as it may introduce failures regarding grace period at later time.
It should be possible to know if tracing is already sent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's an easy way to do it. Except for querying back the Zipkin endpoint if the data was stored there (but I wouldn't really like to do this).
There's an open issue in opencensus census-instrumentation/opencensus-go#862
Anyway, the Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second.
Would that be alright? We're already thinking about not shutting down the Sender but using a different way to give it a signal to send the Finished event. So, having it implemented should then relax the need for the 1-second wait time at the end.
And I think waiting 1 second at the end is not too bad - it's a workaround for the missing feature in opencensus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I think Knative should move to OpenTelemetry as soon as possible. Maybe it will give us greater capabilities as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. There's an open issue for moving to OpenTelemetry. It's been there for a while...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this might help: census-instrumentation/opencensus-go#862 (comment)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. It's a different API for exporting "metrics" which we don't need now. And it looks like it can only export those metrics via ReadAndExport. Not the traces :-/ But I'm not 100% sure.
@@ -154,14 +183,22 @@ func (h httpSender) Supports(endpoint interface{}) bool { | |||
} | |||
|
|||
func (h httpSender) SendEvent(ce cloudevents.Event, endpoint interface{}) error { | |||
return h.SendEventWithContext(context.Background(), ce, endpoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line creates a new context object for every event sent.
We should use one context
object, and it should be supporting signals (knative.dev/pkg/signals
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Each event requires a new context because the tracing information is stored in it (via the opencensus exporter.
The "main" branch uses context.Background()
as well: https://github.com/knative/eventing/blob/main/test/upgrade/prober/wathola/sender/services.go#L162
I could possibly use return h.SendEventWithContext(signals.NewContext(), ce, endpoint)
but I am not sure what it brings. The context we're using here is for setting up information that is sent with the event - the event is sent and we're done with that (we're not waiting for a shutdown signal there). When I look at where signals.NewContext() is used it's mainly in long running "main" funcitons for catching signals. But the Sender itself has its own loop for handling signals in SendContinually.
So my changes do not change the original behaviour (we have been using context.Background().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. We could shift to using signals.NewContext()
by removing manual signals handling in SendContinually, and then create subcontext for each send. But, you got it right. It's not worth it.
The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second.
7aae60f
to
c3798cf
Compare
Rebased, made |
* The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes.
The failures in E2E tests here is unrelated:
|
if err != nil { | ||
Log.Warn("Tracing configuration is invalid, using the no-op default", zap.Error(err)) | ||
} | ||
if err = tracing.SetupStaticPublishing(Log, "", config); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking that maybe returning here the *tracing.OpenCensusTracer
allows later to call Finish
method.
func SetupStaticPublishing(logger *zap.SugaredLogger, serviceName string, cfg *config.Config) (*tracing.OpenCensusTracer, error) {
oct := tracing.NewOpenCensusTracer(tracing.WithExporter(serviceName, logger))
if err := oct.ApplyConfig(cfg); err != nil {
return nil, fmt.Errorf("unable to set OpenCensusTracing config: %w", err)
}
return oct, nil
}
This method should call Close()
method of HTTP reporter, which should break the send loop, and try sending the last batch before shutdown. I think it should be enough to send all registered spans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Let me try it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this whole approach doesn't work because on this line there's nil passed to the function: , and then on this line it throws "panic: runtime error: invalid memory address or nil pointer dereference"
It is also not possible to call tracing.SetupTracing() again (to force calling the reporter Close method) because the tracer is already registered and can't be done again.
As a side note, I've verified that 1.5 seconds is enough for the spans/traces to be sent properly.
I recommend using 1.5 seconds sleep time and call it done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed knative/pkg#2475
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Looks great! Thanks @mgencur for doing this and for knative/pkg#2475 as well!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor and non-blocking comment.
/lgtm
/approve
if c.loggingCfg != "" { | ||
pod.Containers[i].Env = append(pod.Containers[i].Env, corev1.EnvVar{Name: ti.ConfigLoggingEnv, Value: c.loggingCfg}) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can refactor logging and tracing in [1] to be consistent when ""
is provided because it looks weird that we're handling the empty string only for the logging config.
[1]:
eventing/test/test_images/utils.go
Lines 59 to 85 in eb2bfff
// ConfigureTracing can be used in test-images to configure tracing | |
func ConfigureTracing(logger *zap.SugaredLogger, serviceName string) error { | |
tracingEnv := os.Getenv(ConfigTracingEnv) | |
if tracingEnv == "" { | |
return tracing.SetupStaticPublishing(logger, serviceName, config.NoopConfig()) | |
} | |
conf, err := config.JSONToTracingConfig(tracingEnv) | |
if err != nil { | |
return err | |
} | |
return tracing.SetupStaticPublishing(logger, serviceName, conf) | |
} | |
// ConfigureTracing can be used in test-images to configure tracing | |
func ConfigureLogging(ctx context.Context, name string) context.Context { | |
loggingEnv := os.Getenv(ConfigLoggingEnv) | |
conf, err := logging.JSONToConfig(loggingEnv) | |
if err != nil { | |
logging.FromContext(ctx).Warn("Error while trying to read the config logging env: ", err) | |
return ctx | |
} | |
l, _ := logging.NewLoggerFromConfig(conf, name) | |
return logging.WithLogger(ctx, l) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant the other way around, support in ConfigureLogging
an empty loggingEnv
so that we can drop if c.loggingCfg != "" {
in this PR
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cardil, mgencur, pierDipi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue
* Wathola Tracing for upgrade tests (#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (#6289) * Print traces for missed events in upgrade tests (#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (#6331) * NPE fix (#6343) Co-authored-by: Chris Suszynski <[email protected]>
* wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue
* Wathola Tracing for upgrade tests (knative#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (knative#6289) * Print traces for missed events in upgrade tests (knative#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (knative#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (knative#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (knative#6331) * NPE fix (knative#6343) Co-authored-by: Chris Suszynski <[email protected]>
* Wathola Tracing for upgrade tests (#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (#6289) * Print traces for missed events in upgrade tests (#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (#6331) * NPE fix (#6343) Co-authored-by: Martin Gencur <[email protected]> Co-authored-by: Chris Suszynski <[email protected]>
* Wathola Tracing for upgrade tests (knative#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (knative#6289) * Print traces for missed events in upgrade tests (knative#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (knative#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (knative#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (knative#6331) * NPE fix (knative#6343) Co-authored-by: Chris Suszynski <[email protected]> Co-authored-by: Martin Gencur <[email protected]> Co-authored-by: Chris Suszynski <[email protected]>
* Wathola Tracing for upgrade tests (knative#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (knative#6289) * Print traces for missed events in upgrade tests (knative#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (knative#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (knative#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (knative#6331) * NPE fix (knative#6343) Co-authored-by: Chris Suszynski <[email protected]>
* Wathola Tracing for upgrade tests (knative#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (knative#6289) * Print traces for missed events in upgrade tests (knative#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (knative#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (knative#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (knative#6331) * NPE fix (knative#6343) Co-authored-by: Chris Suszynski <[email protected]> Co-authored-by: Martin Gencur <[email protected]> Co-authored-by: Chris Suszynski <[email protected]>
* Wathola Tracing for upgrade tests (knative#6219) * wathola exposing trace information * Run update-deps.sh * Fix license * Fix import * Ensure backwards compatibility * Assert ParentID not nil in test * Separate old and new events sender APIs * Make loggingCfg in client private * Wait only 1 second for flushing tracing info The Reporter is created with a default batch interval 1 second. So, it should be enough to wait just 1 second because the data is flushed every 1 second. * Increase the sleep time to 1.5 seconds to be safe * The ticker runs every 100ms so it could be 1100 ms until the buffer really flushes. * Use Log.Fatal when tracing is not set up properly * Increase the sleep time to 5 seconds and reference knative/pkg issue * Process empty tracing config in test images (knative#6289) * Print traces for missed events in upgrade tests (knative#6249) * Upgrade tests reporting Trace information for missed events * TMP: Induce missed event * Revert "TMP: Induce missed event" This reverts commit 2fec7c7. * Report trace also for Duplicated events * TMP: Induce missed event * TMP: Simulate duplicate events * Fix readme * Unify path for duplicate and missed events * Revert "TMP: Simulate duplicate events" This reverts commit c126521. * Revert "TMP: Induce missed event" This reverts commit fcd9185. * Do not fail upgrade tests if tracing is not configured (knative#6299) * Do not fail upgrade tests if tracing is not configured * TMP: Do not deploy Knative Monitoring * Revert "TMP: Do not deploy Knative Monitoring" This reverts commit 086a8f9. * Limit the number of exported traces (knative#6329) Exporting traces for a large number of events can exceed the timeout of the whole test suite, leading to all upgrade tests being reported as failed. * Cleanup Zipkin tracing only once in upgrade test suite (knative#6331) * NPE fix (knative#6343) Co-authored-by: Chris Suszynski <[email protected]>
Partially fixes #4481
Fixes points 1. and 2. from #6145 (comment)
The automated reporting of broken traces would be done in a separate PR.
Proposed Changes
Pre-review Checklist
Release Note
Docs