-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve batch overload protection and log handler #671
base: main
Are you sure you want to change the base?
Improve batch overload protection and log handler #671
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #671 +/- ##
==========================================
- Coverage 74.48% 74.47% -0.02%
==========================================
Files 56 62 +6
Lines 1862 2014 +152
==========================================
+ Hits 1387 1500 +113
- Misses 475 514 +39
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
It has been a week so wanted to say a big thanks! And let you know I hoped to get around to reviewing this next week at the latest. |
Hey @SergeTupchiy, sorry I haven't gotten this merged yet! Can you update to resolve the merge conflicts? |
Hi @tsloughter, |
Great, thanks. And yes, please remove that commit but don't close that PR bc I don't think a final decision has been made. |
15ea883
to
d9afb2d
Compare
Re-based, dropped 'cleanup persistent terms' commit from this branch and added a workaround fix for the failing test: https://github.com/open-telemetry/opentelemetry-erlang/pull/671/files#diff-ca895028100a6d11170ba82e246533e4bcaf1ffa2b15cfce32777a56e7cf8af4R122-R124 |
@SergeTupchiy how does this PR align with #637? Is it building on that one? |
Yes, this one includes changes done in #637 and makes the latter unnecessary. |
@SergeTupchiy sorry this still hasn't gotten a real review. I really want to get it in but it is large and changes a lot since it doesn't just change the logs handler to add olp but changes the span batch processor too. I'm wondering about breaking it up into smaller PRs so we can move it forward in pieces. What do you think? |
Ooh, so I think I get your ets table lookup strategy now. At first it worried me because I recalled we used to do similar (2-tuple with which to use kept in a counter) but then moved to the persistent term but I think I see now that it sort of blends the changes we made with the use of a reg_name with the original 2-tuple design. Because we use reg_name now we don't have to put pid in the name anymore to make them unique but then require a name lookup instead of storing in the config/state. Really wish we already had benchmarks so we could at least guarantee this doesn't have any regressions. Thinking about it I don't see why it would, but always better to measure of course :). Maybe I need to do some quick work on that before we can change the span batch processor. |
This is quite huge PR, a short summary:
otel_batch_olp
(olp - for overload protection).otel_batch_olp
implementation is quite different from the previousotel_batch_processor
and deserves additional explanations:max_queue_size
implemented as a periodic ETS tab size check) couldn't react to the changes timely due to its periodic nature. Even if the check interval is quite low, e.g. 1s, an abrupt burst of events (be it trace spans or log messages) can insert much more records to the ETS table than the configuredmax_queue_size
. The issue can be mitigated by draining ETS tab up tomax_queue_size
during the export, but it would only confirm the fact that overload protection is not perfect.The performance of atomic ops is comparable to persistent_term (at least on my basic hardware, more benchmarks are welcome), they are meant to be mutable and the proposed olp mechanism ensures that the table doesn't grow beyond
max_queue_size
value.otel_batch_olp
is re-used inotel_batch_processor
andotel_log_handler