Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OnEnding should make spans readonly, except within the processor. #1740

Open
dmathieu opened this issue Oct 3, 2024 · 11 comments
Open

OnEnding should make spans readonly, except within the processor. #1740

dmathieu opened this issue Oct 3, 2024 · 11 comments
Labels
bug Something isn't working keep

Comments

@dmathieu
Copy link
Member

dmathieu commented Oct 3, 2024

The OnEnding spec mentions:
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#onending

The SDK MUST guarantee that the span can no longer be modified by any other thread before invoking OnEnding of the first SpanProcessor

The current OnEnding implementation doesn't prevent the span from being modified in another thread, concurrently with the processor's action.

@dmathieu dmathieu added the bug Something isn't working label Oct 3, 2024
@dmathieu
Copy link
Member Author

dmathieu commented Oct 3, 2024

cc @pantuza

@dmathieu
Copy link
Member Author

dmathieu commented Oct 3, 2024

Two ways this can be done:

Java does it by storing the ID of the thread that runs OnEnding, and prevents any other thread from doing so.
https://github.com/open-telemetry/opentelemetry-java/pull/6367/files#diff-d716139d47f651d6b6558f3e6f03aa59c98e52723f472b2c6a24bd02ec5fe2ceR338-R342

Go is going to setup a wrapper span that allows bypassing the span being read-only, and will be provided to the processor.

Copy link
Contributor

github-actions bot commented Nov 3, 2024

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.

@github-actions github-actions bot added the stale label Nov 3, 2024
@dmathieu
Copy link
Member Author

dmathieu commented Nov 4, 2024

Should this be reported elsewhere?

@arielvalentin arielvalentin added keep and removed stale labels Nov 4, 2024
@arielvalentin
Copy link
Contributor

@dmathieu this is the appropriate place, though I am not sure this is the bug we will run into when our users start to make changes to the span.

Span mutation methods leverage a mutex instance variable that prevents multiple threads or fibers from mutating the span at the same time.

In Span#finish, all span processors will have to have completed their on_finish blocks before the lock is released and the span may be mutated again:

That is where I think the real problem is. I assert that when a span processor attempts to mutate the underlying span, it will run into a thread deadlock situation.

Has anyone implemented one of these processors in the wild yet?

cc: @pantuza @mwear @kaylareopelle #1713

@pantuza
Copy link
Contributor

pantuza commented Nov 6, 2024

Hi folks, I agree that span.finish() method safely calls the processors on_finishing() method. Thus, prevents other threads from trying to modify the span. Although, if any user calls the processor.on_finishing() method directly inside many threads, indeed it can generate a concurrency issue.

Probably, calling this processor method directly isn't desirable, but we are never fully aware of how users will use the library. Thus, might be necessary to protect this method individually. Does it make sense?

@arielvalentin , I did not understand how the mutex.synchronize would generate the deadlock? Can you share an example, please? The only situation I can foresee this is the one where one on_finishing() implementation of any given processor get stuck and never releases the Mutex for the next thread trying to acquire it. Is it what you were saying?

@arielvalentin
Copy link
Contributor

It's likely not a deadlock situation as much as an error. I am concerned that mutexes are not re-entrant.

Assuming that is the case then when a span processor attempts to mutate it, then it will hit an additional synchronize block and that may result in errors.

@pantuza
Copy link
Contributor

pantuza commented Nov 6, 2024

It's likely not a deadlock situation as much as an error. I am concerned that mutexes are not re-entrant.

Assuming that is the case then when a span processor attempts to mutate it, then it will hit an additional synchronize block and that may result in errors.

You are totally right! If any given processor tries to mutate the span, for example setting attributes, it will try to acquire the same Mutex as the finish() method just did. Therefore, the thread gets blocked, cause the Ruby mutexes are not re-entrants as you mentioned before.

I see two alternatives:

  1. Use a separated Mutex that would be used on span Mutations such as:
    a. add_attribute, set_attribute, add_link, add_event
    b. Current mutex being created
  2. Use some other logic that do not require the same Mutex on both operations: finish and set_attributes
    a. As the one @ended variable we already use.
    b. As other languages did: Java cited by @dmathieu above in this conversation.

@ioquatix
Copy link

If I understand correctly, the span should store Fiber.current and then compare that when making modifications, e.g.

class Span
  def initialize(fiber = Fiber.current)
    @fiber = fiber
  end

  def update
    raise RuntimeError, "Cannot modify span on non-current fiber!" unless @fiber.equal?(Fiber.current)
    # ...

@arielvalentin
Copy link
Contributor

Not exactly. We want the ability for spans to be mutable in a fiber and thread safe way.

Spans may be writable from any thread or fiber as long as access is synchronized. When a span is "finished", then it changes to a state that is Immutable, a.k.a Read Only.

The span has an OnEnding stage where it notifies processors that it's about to transition to Read Only state. This is the final opportunity to make changes to the span before enqueing it for export.

So it's a case where a fiber would have an exclusive re-entrant lock on the span when it's in its transitioning state. If we capture the fiber that initialized it and "locked" on that then the span could not be mutated at any other time or could not be finished by a different fiber.

@fbogsany
Copy link
Contributor

That's a good starting point for a workable solution, though. Similar to Go, you can pass a wrapper to the callback that has "special" access to the Span's internals, but which bypasses the mutex and instead checks that the owning Thread (or Fiber) is the current one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working keep
Projects
None yet
Development

No branches or pull requests

5 participants