Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Linux by tracking context with ServiceContext #438

Closed
wants to merge 5 commits into from

Conversation

semicoleon
Copy link
Contributor

@semicoleon semicoleon commented Jul 17, 2023

This PR adds support for Linux, and will close #38. Firstly this involved doing quite a lot of work on the package manifest(s) to get the dependencies that wouldn't build on Linux filtered out (If anyone has suggestions on better ways to structure the platform specific stuff for the Package.swift files I would love to hear them!). Second, I modified the ContextManager protocol to support the closure based context APIs required when using ServiceContext, and created a ServiceContextManager type. ContextManagers which can support arbitrary start and end locations can implement the ManualContextManager to opt in to that capability.

Supporting the old style of context management is probably a requirement, but it's worth noting that it can never work with ServiceContext. That makes existence of thoes APIs a bit of a footgun for libraries which don't control which context manager they may run with. Regardless the current shape of the API in this PR is unquestionably a breaking change.

LoggerSdk was previously using os_log to notify users of a configuration issue. I changed this to print to support Linux, but that doesn't seem ideal. Would some sort of configurable error handler on the OpenTelemetry instance make sense?

I ended up needing to fix #420 to get tests to pass without even more changes. ActivityContextManager now ensures the value passed to the remove method has been seen by the context manager before, and if not immediately returns. This requires storing the activity identifier along with the activity state so the manager isn't blindly removing the current activity.

@linux-foundation-easycla
Copy link

CLA Missing ID CLA Not Signed

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jul 18, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@codecov
Copy link

codecov bot commented Jul 20, 2023

Codecov Report

Patch coverage: 57.79% and project coverage change: -0.29 ⚠️

Comparison is base (a109095) 66.33% compared to head (ada12ce) 66.04%.

❗ Current head ada12ce differs from pull request most recent head efdc2d5. Consider uploading reports for the commit efdc2d5 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #438      +/-   ##
==========================================
- Coverage   66.33%   66.04%   -0.29%     
==========================================
  Files         331      335       +4     
  Lines       14104    14379     +275     
==========================================
+ Hits         9356     9497     +141     
- Misses       4748     4882     +134     
Impacted Files Coverage Δ
...porters/OpenTelemetryProtocolHttp/HTTPClient.swift 93.75% <ø> (ø)
...enTelemetryProtocolHttp/OtlpHttpExporterBase.swift 91.66% <ø> (ø)
...lemetryProtocolHttp/logs/OtlpHttpLogExporter.swift 63.15% <ø> (ø)
...ryProtocolHttp/metric/OltpHTTPMetricExporter.swift 59.61% <ø> (ø)
...etryProtocolHttp/trace/OtlpHttpTraceExporter.swift 58.00% <ø> (ø)
...enTelemetryApi/Context/DefaultContextManager.swift 0.00% <0.00%> (ø)
Sources/OpenTelemetryApi/Trace/Span.swift 17.91% <0.00%> (-7.63%) ⬇️
...elemetrySdk/Metrics/Stable/Data/ExemplarData.swift 44.82% <0.00%> (ø)
...etryApi/Context/OpenTelemetryContextProvider.swift 40.90% <37.00%> (-59.10%) ⬇️
...rces/OpenTelemetryApi/Baggage/BaggageBuilder.swift 50.00% <50.00%> (ø)
... and 14 more

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@GNMoseke
Copy link

This would be extremely valuable to us and I am happy to help take this PR over the finish line if needed, are there any updates on the status right now?

@bryce-b
Copy link
Member

bryce-b commented Sep 14, 2023

@semicoleon
Sorry for the delay on this PR. We have some concerns with this PR due to some unnecessary changes, e.g. In the examples. We'd prefer to see new examples for the linux variation.
Additionally, we are concerned about the changes made to the Activity context manager, as well as changes that don't seem like they are necessary to support linux (e.g. changing API names: setActiveContext to withActiveContext)
We'd like to see more incremental changes, to make things easier for use to evaluate.

@@ -111,6 +111,40 @@ public extension Span {
func setAttribute(key: SemanticAttributes, value: Bool) {
return setAttribute(key: key.rawValue, value: AttributeValue.bool(value))
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fundamentally change the context by running actions in a block, meaning context association cannot be made using the current method on iOS. We'd prefer this to be in a separate #ifdef for linux, to avoid users from mixing these APIs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow why this wouldn't work on iOS? Could you elaborate a bit?

@@ -130,4 +133,33 @@ public extension SpanBuilder {
@discardableResult func setAttribute(key: String, value: Bool) -> Self {
return setAttribute(key: key, value: AttributeValue.bool(value))
}

/// Builds and starts a span, setting it as active for the duration of the closure. The span is ended when when the closure exits (even if an error is thrown).
Copy link
Member

@bryce-b bryce-b Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as above. Also, it appears these are not part of the OTEL API Spec, and probably shouldn't be added.

@@ -22,7 +22,7 @@ public class StablePeriodicMetricReaderSdk : StableMetricReader {
scheduleTimer = DispatchSource.makeTimerSource(flags: DispatchSource.TimerFlags(), queue: scheduleQueue)

scheduleTimer.setEventHandler { [weak self] in
autoreleasepool {
maybeAutoreleasepool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how this will effect things. We'd be more comfortable if this was also in an IFDEF

@@ -5,17 +5,17 @@

import Foundation
import OpenTelemetryApi

import Atomics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we removed atomics from the framework due to compatibility issues with XCFrameworks iirc.

@@ -15,7 +15,7 @@ class PropagatedSpan: Span {
var context: SpanContext

func end() {
OpenTelemetry.instance.contextProvider.removeContextForSpan(self)
_ = OpenTelemetry.instance.contextProvider.tryRemoveContextForSpan(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you change the name? this implies it might throw.

@@ -67,18 +70,42 @@ class ActivityContextManager: ContextManager {
return (currentActivityId, activityState)
}

func removeContextValue(forKey key: OpenTelemetryContextKeys, value: AnyObject) {
let activityIdent = os_activity_get_identifier(OS_ACTIVITY_CURRENT, nil)
public func removeContextValue(forKey key: OpenTelemetryContextKeys, value: AnyObject) {
Copy link
Member

@bryce-b bryce-b Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These shouldn't be public. The context should be managed automatically, this seems like implementation details leaking.

@@ -4,6 +4,7 @@
*/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than changing the existing context manager, a separate one should be added for linux.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the bug this was addressing has been fixed upstream so this change can probably be discarded now

@youngde811
Copy link

Greetings all, and thank you for an excellent product. We are using this library on both MacOS and Linux; clearly we have a vested interest in this PR. Would it help if we forked semicoleon:linux, and ran our system tests on that code base, or are you too far from having an acceptable MR? Thank you!

@bryce-b
Copy link
Member

bryce-b commented Sep 27, 2023

@youngde811 If you were willing to pick up this PR, that would be greatly appreciated. There are some critical issue with this PR that I've enumerated that need to be addressed before we can move forward.

@semicoleon
Copy link
Contributor Author

Unfortunately I don't have time to come back to this at the moment. If someone picks it up, I'll do my best to answer any questions they have though.

I don't believe #if os(Linux) will be viable for gating API changes. At least for my use case it's also important that the same code builds on macOS, and I imagine that's true for most other prospective users of Linux support in this library. That way why I aimed to move things towards an API that worked with the ServiceContext mode of context (closure based) as well as the os_activity manager.

@youngde811
Copy link

@youngde811 If you were willing to pick up this PR, that would be greatly appreciated. There are some critical issue with this PR that I've enumerated that need to be addressed before we can move forward.

Yes, I'd be happy to finish up this work. I could use any detail you might have on the issues stalling this PR (if there is any extra information); what to avoid that would violate project requirements; etc. I'll go back over the comments above and pick up as much knowledge as I can, then we'll go from there. Thanks!

@youngde811
Copy link

youngde811 commented Sep 27, 2023 via email

@youngde811
Copy link

FYI, I have approval from leadership to take on this PR and get it done appropriately. I'll keep folks posted here. Cheers.

@youngde811
Copy link

Ok, I see some of what's going on here. So in ActivityContextManager.swift, the code is using a technique akin to a Foreign Function Interface (FFI) to gain access to the os_activity_create native symbol in its shared library. This is perfectly fine, IMHO, and should remain for MacOS users (we're deploying on both MacOS and Linux). I have a suggestion. I should start this work afresh, with an eye towards refactoring whatever is platform-specific (even if the Linux version is a no-op due to no OS support). No modules should be removed (it appears ResourceExtension was removed from the fork, yet our code uses this module as I'm sure many others do; builds on MacOS are unable to find that module, for example). Keep the port as restricted as possible while maintaining MacOS functionality and gaining Linux capability.

I'd like to create a branch from main; work there; then submit a PR that respects the project requirements. What say you all? Thanks!

@youngde811
Copy link

Ok, I see some of what's going on here. So in ActivityContextManager.swift, the code is using a technique akin to a Foreign Function Interface (FFI) to gain access to the os_activity_create native symbol in its shared library. This is perfectly fine, IMHO, and should remain for MacOS users (we're deploying on both MacOS and Linux). I have a suggestion. I should start this work afresh, with an eye towards refactoring whatever is platform-specific (even if the Linux version is a no-op due to no OS support). No modules should be removed (it appears ResourceExtension was removed from the fork, yet our code uses this module as I'm sure many others do; builds on MacOS are unable to find that module, for example). Keep the port as restricted as possible while maintaining MacOS functionality and gaining Linux capability.

I'd like to create a branch from main; work there; then submit a PR that respects the project requirements. What say you all? Thanks!

I've done some research on the Activities subsystem for MacOS (and I guess iOS?), and see that it appears to be used for associating log messages to user events, via a number of library API functions. Within ActivityContextManager, it appears that nothing directly related to Activities is returned to upstream users of this class, at least by examining the opentelemetry-swift code base. I'm wondering if a Linux implementation of ActivityContextManager could use syslog (if installed) on a distribution. Would that be sufficient? It won't be a complementary implementation, as I'm unaware of a Linux library for associating user events to log messages. However, a Linux-based ActivityContextManager that employs syslog would at least produce something useful. Thoughts?

@youngde811
Copy link

Ok, I see some of what's going on here. So in ActivityContextManager.swift, the code is using a technique akin to a Foreign Function Interface (FFI) to gain access to the os_activity_create native symbol in its shared library. This is perfectly fine, IMHO, and should remain for MacOS users (we're deploying on both MacOS and Linux). I have a suggestion. I should start this work afresh, with an eye towards refactoring whatever is platform-specific (even if the Linux version is a no-op due to no OS support). No modules should be removed (it appears ResourceExtension was removed from the fork, yet our code uses this module as I'm sure many others do; builds on MacOS are unable to find that module, for example). Keep the port as restricted as possible while maintaining MacOS functionality and gaining Linux capability.

I'd like to create a branch from main; work there; then submit a PR that respects the project requirements. What say you all? Thanks!

Oh, and never mind about branching from your main. I decided to just fork this project. I'll make the Linux support changes, then submit a PR. Thoughts on my question regarding ActivityContextManager and syslog are appreciated. Cheers!

@youngde811
Copy link

We should move this discussion to another, more appropriate location. Where would that be ideal for everyone?

@semicoleon
Copy link
Contributor Author

The activity system is being used by ActivityContextManager as essentially a unique ID in a thread local. The activity system is integrated into other APIs like libdispatch such that sending work to another thread via a dispatch queue will cause the closure running on the queue to inherit the activity ID. That makes using activity IDs for tracking context really nice because you don't have to manually associate the dispatched closure with the context from the spawning thread. Unfortunately as far as I can tell there's really no equivalent system on Linux, which is why I opted for using a task local via ServiceContext.

@youngde811
Copy link

The activity system is being used by ActivityContextManager as essentially a unique ID in a thread local. The activity system is integrated into other APIs like libdispatch such that sending work to another thread via a dispatch queue will cause the closure running on the queue to inherit the activity ID. That makes using activity IDs for tracking context really nice because you don't have to manually associate the dispatched closure with the context from the spawning thread. Unfortunately as far as I can tell there's really no equivalent system on Linux, which is why I opted for using a task local via ServiceContext.

Agreed; I've found no Linux equivalent and I don't want to kludge something together. But that information really helps, as I had just figured why the activity library is being used. Thank you!

@nachoBonafonte
Copy link
Member

Hi, yes what @semicoleon says is exactly why it is used, it is a bit hacky but it has worked really well in all kind of contexts. The problem with using task local is that it can work for manual instrumentation, but it makes really difficult doing autoinstrumentation or it gets into a nightmare mixing its usage with current os.activity implementation.

For a more interactive discussion I think the otel-swift channel in the cncf slack is a good match: https://cloud-native.slack.com/archives/C01NCHR19SB

@youngde811
Copy link

youngde811 commented Sep 28, 2023 via email

@youngde811
Copy link

So, why can’t we just use Swift’s UUID support for these unique “thread local” identifiers? Each time UUID().uuidString is invoked, a new identifier is created. Right? Isn’t that why this project used the os.activity library? Using a UUID would make the code portable…

@semicoleon
Copy link
Contributor Author

A couple of years ago a thread local might have been sufficient for Linux support, but in a post concurrency world it's basically useless.

A task can move between threads across await points, so the thread local value will simply be incorrect sometimes. That's why task locals exist.

@youngde811
Copy link

youngde811 commented Sep 28, 2023 via email

@semicoleon
Copy link
Contributor Author

How are you storing the UUID such that you can look up the correct UUID for a caller? Using a thread local doesn't really work with concurrency, as I mentioned. Using a task local requires changing the shape of the API as I do in this PR currently.

@youngde811
Copy link

How are you storing the UUID such that you can look up the correct UUID for a caller? Using a thread local doesn't really work with concurrency, as I mentioned. Using a task local requires changing the shape of the API as I do in this PR currently.

I don’t have the code in front of me currently, but doesn’t the context manager use os.activity to get a unique ID and use that as a key into a map? Wouldn’t Swift’s UUID support accomplish the same thing?

@semicoleon
Copy link
Contributor Author

The activity API provides a way for the context manager to request the currently active identifier from the system without having any additional data passed in to the manager. That's the functionality that needs to be replaced, not generation of unique values.

@youngde811
Copy link

youngde811 commented Sep 29, 2023

The activity API provides a way for the context manager to request the currently active identifier from the system without having any additional data passed in to the manager. That's the functionality that needs to be replaced, not generation of unique values.

Hmm. Sounds to me as if we’re talking about the same thing. At the end of the day, we need metadata attached to a “task” that may be used when needed in the context manager. Alright, thanks. I’ll see where this leads.

@youngde811
Copy link

youngde811 commented Sep 29, 2023

Morning folks. Thanks to input from @semicoleon and @nachoBonafonte, I did some research last night and comfortably grok os.activity; how we're employing it; and how we might mimic this for Linux. More prototyping, then I might have a PR for you folks late next week. I appreciate the assistance here; very smart team, and it's a pleasure to work with all of you. Thank you!

@youngde811
Copy link

Update: we have opentelemetry-swift running on Linux (ARM64), and should have AMD64 quickly. We decided to reject @semicoleon’s PR as being too extensive; rather, we wrote a small C library that emulates enough of os.activity to provide the needed functionality. Thus far the OTEL package is behaving correctly, on ARM64. The new C library will quickly support AMD64, and that will complete our fork.

Ultimately, @semicoleon’s use of Tasks may prove the ideal solution, but at this stage that seems too much for a single PR, and as indicated violates the API.

@nachoBonafonte
Copy link
Member

@youngde811 any news here?

@youngde811
Copy link

youngde811 commented Nov 2, 2023

@youngde811 any news here?

Yes, see my comment above from two weeks back. We’ve devised a solution that seems to work well on both arm64 and x86_64, by a) writing a small C library that offers functionality that mimics some of os.activity; and b) refactors some of the context structure in the Swift code. Disruptions to the codebase are minimal.

We’re currently exercising the tracing support to ensure it operates correctly.

@youngde811
Copy link

youngde811 commented Jan 6, 2024

Update on our Linux port. We now have a reliable implementation based on task local variables and the Swift withValue() construct. No OTEL API changes were made to existing classes/structs; rather, we have 1) a few necessary extensions in OpenTelemetryApi; and 2) a small wrapper library (not part of opentelemetry-swift) whose API consists of two principal convenience functions: withTracing() and withSpan() (both of which are basically closures). Peer and nested traces work properly; data is sent to Grafana/Tempo and looks correct. We have not yet tested remote spans, but I’ve no reason to believe they won’t work.

Over the next several weeks we will be integrating the OTEL port, along with our support library, into one of our primary applications and will thoroughly exercise the port. Dunno if you folks are still interested in this fork; just post here if you are. Cheers.

@youngde811
Copy link

youngde811 commented Jan 21, 2024

Update: Our Linux port runs very well on both Linux itself and MacOS. The approach taken is the use of Swift’s structured concurrency; in particular TaskLocal and withValue(). For our own needs, we also wrote a thin veneer library that offers two functions that abstract the underlying implementation: withTracing() and withSpan().

A bonus is we’ve eliminated the dependency on Apple’s poorly documented and OS specific activity library. We’re also in the process of getting rid of os.log dependencies.

Finally, our changes a) do not break the API; b) are self contained and small in size; and c) are not visible to clients of the library.

@youngde811
Copy link

Oh, and if you folks would like us to submit a PR for our Linux port, please let me know. Cheers.

@nachoBonafonte
Copy link
Member

The limitation of that approach is that there is no way to dynamically add instrumentation to existing code or auto-instrument external code, you must explicitly create the traces at the client code directly and cannot add it after.

Similar to that approach there is https://github.com/apple/swift-distributed-tracing, which also has a otlp exporter I think. We may want to import their traces as we do with swift-metrics and might be a solution for Linux also.

I also have concerns that it may be difficult to write the OpenTelemetry api on top of TaskLocal. Do you have a repository where we can check what changes are you doing and how user code is written?

@youngde811
Copy link

youngde811 commented Jan 22, 2024

Hello @nachoBonafonte! Thanks for the reply. So, several thoughts.

  • I'm flying a bit blind here as the limitations you mention are not immediately obvious, at least to me. Are they documented somewhere? Any architecture plans kept in a single place that defines current and future goals?
  • The port work we've done thus far seems to be robust on both Linux and MacOS, and our lightweight client API seems acceptable for our use cases. However, our goal is to return our work to the Open Source community via your project; this is very important for our ethics and loyalties to open source. Therefore, whatever we end up with must satisfy the vision you folks have for this project.
  • Our initial approach had a small Linux C library that relied on GCC-specific extensions to replace Apple's os.activity capabilities. Specifically, there are GCC functions that may be used to return the address of the current stack frame, or the frame's return address. This actually seemed to work for some cases, but I was uncomfortable that it would be sufficient for more complex code. Also, the approach made no allowances for other C compilers.

Currently, the API we've written that fronts the OTEL library is in our private GitLab repository. However, since our goal is to contribute to getting your project running on Linux, I'm pretty certain we could make that project available. Until I've cleared that with leadership, here are two functions from our Observability library:

@discardableResult
public func withTracing<T>(_ closure: @escaping () async throws -> T) async throws -> T {
    guard Tracing.whenEnabled() else {
        return try await closure()
    }

    let t = Task(priority: nil) {
        return try await closure()
    }

    let rval = try await t.value
}

@discardableResult
public func withSpan<T>(_ name: String, _ attributes: [String: String] = [:], closure: @escaping () throws -> T) async rethrows -> T {
    guard Tracing.whenEnabled() else {
        return try closure()
    }

    let span = makeSpan(spanName: name, attributes: attributes)

    defer {
        span.end()
    }

    let rval: T = try await OpenTelemetry.withValueAsync(span) {
        return try closure()
    }

    return rval
}

And here's an example of client usage:

try await withTracing() {
        print("Starting main tracing behavior:")

        try await withSpan(name) {
            for i in 0 ..< maxSpans {
                try await runChild(id: i, maxTime: maxTime)
            }
        }

        print("Finished child tracing")
    }
}

Nested spans are supported, of course. So. What do you feel is the best way to proceed here? Some sort of requirements document for current and future multi-platform goals would help us tremendously, if such a thing is available. Let me know. Thanks! Oh, and this is awesome work, BTW.

@youngde811
Copy link

@nachoBonafonte By the way. When you wrote “add dynamic instrumentation”, are you speaking of code injection? The way VM level Java agents can?

@youngde811
Copy link

youngde811 commented Jan 24, 2024

I am completing log simplification to eliminate any dependencies on os.log; should be just about tested. Once done, I’ll take a screenshot of our Grafana/Tempo data source after a load test and post it here.

@semicoleon
Copy link
Contributor Author

Closing in favor of #546

@semicoleon semicoleon closed this Jun 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Linux Support & NIO Integration ActivityContextManager drops an active span when a first child span ends
5 participants