Skip to content

Conversation

@lloeki
Copy link
Member

@lloeki lloeki commented Nov 3, 2025

Why?

Deployment mode is one of the major use cases in the wild.

What does this PR do?

Beat bundler and rubygems into submission via a two-stage injector:

  • set up GEM_HOME and GEM_PATH environment variables in ways compatible with BUNDLE_PATH
  • patch bundler to ignore deployment mode and not reset our modifications
  • patch rubygems to not call Bundler.setup before we do

Since vendored mode is not set, Bundler code that filters out other paths will be able to consider additional paths we add to gem paths.

How to test the change?

CI

Additional Notes:

A tough nut to crack. "Pure" deployment mode is supported (via BUNDLE_DEPLOYMENT=true), but not yet "standalone" vendored mode (e.g via BUNDLE_PATH=/some/where, optionally combined with BUNDLE_FROZEN=true for the same effect as deployment mode)

lloeki added 10 commits November 3, 2025 16:42
Deployment mode comes from vendored mode + frozen bundle
Since the test forwarder is written in Ruby and spawned as a separate
process, it would be subject to injection through RUBYOPT during
injection itself, creating a recursion.
Understanding the state of `Gem.path`, `GEM_PATH`, and `GEM_HOME` is
critical for debugging.
Evaluation on every fetch is costly, especially with a fork.
@lloeki lloeki marked this pull request as ready for review November 14, 2025 15:43
@lloeki lloeki requested a review from a team as a code owner November 14, 2025 15:43

mod = Module.new do
def kernel_exec(*args)
ENV['RUBYOPT'] = ENV['RUBYOPT'].gsub(%r{^(.*)(?:\s+|^)(-r(\s*)\S+/injector\.rb)(.*)$}, '\2 \1 \3')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this line attempting to accomplish?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It ensures the injector is required first by Ruby because Bundler itself massages RUBYOPT to make sure rubygems is required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Minor: Would it be simpler to prepend the injector to RUBYOPT always unconditionally? That way we'd avoid the regular expression which also took me a bit to follow)

Copy link
Member Author

@lloeki lloeki Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be clearer this way?

  • delete any -r<injector> reference
  • unconditionally prepend -r<injector>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes me think that <injector> may or may not be equal to injector.rb because reasons.

Which means that the names differing implies it may be required twice? So this part is important:

delete any -r<injector> reference

Copy link
Member

@p-datadog p-datadog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything that is wrong in the diff but I cannot evaluate it for correctness either. The regexp I commented on for example, I don't understand what it is actually doing.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a side comment that it's strange that Github does "diffs" like this... (i.e. documenting the change as "these Gemfiles have been renamed to the subsequent version", rather than "the 2.6.0 file changed, the 3.5.0+0 was added, and otherwise the files stayed the same")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow that is mighty odd indeed; since (most of?) these files are essentially the same, the rename detector completely falls apart here.

Copy link

@sarahchen6 sarahchen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable!

Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've given it a big pass. In general... I find this really hard to follow. We're doing very specific things in a very specific order to target very specific behaviors and... I'm dearly missing notes explaining why.

Without such context, I'm just looking at path changes, env variable changes, etc, which make it really hard to review beyond going "well the minimal test is passing so I hope real applications aren't meaningfully different". In particular, it's hard to for instance figure out if there's any gaps in our approach without the context explaining why and what.


def status
{
@status ||= {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Since this object gets reused, perhaps worth sprinkling a bunch of .freeze to guard against oops?

Comment on lines +29 to +36
@level ||= case ENV['DD_INTERNAL_RUBY_INJECTOR_LOG_LEVEL']
when 'DEBUG', 0 then DEBUG
when 'INFO', 1 then INFO
when 'WARN', 2 then WARN
when 'ERROR', 3 then ERROR
when 'FATAL', 4 then FATAL
else UNKNOWN
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slightly... weird? Specifically, ENV effectively behaves as a Hash[String, String?], it won't ever be 0/1/2/3/4. Maybe you meant for those to be strings?

But zooming out a bit, it seems to me the intention here is to have a mode where we log more, for debugging. So perhaps this can be simplified as a more true/false for that... 🤔

In particular, a simple variable would probably by simpler as well when asking customers to enable something to help us troubleshoot some issue...?

Copy link
Member Author

@lloeki lloeki Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it won't ever be 0/1/2/3/4. Maybe you meant for those to be strings?

You guessed right, I'll fix that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But zooming out a bit, it seems to me the intention here is to have a mode where we log more, for debugging. So perhaps this can be simplified as a more true/false for that...

The amount of intricate logs this thing can emit warrants the granularity.

when asking customers

This is absolutely not geared towards customers but to us, hence DD_INTERNAL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's for internal use, I think the more the reason to keep the mechanism as simple as possible -- log everything as verbose as possible, or not. Not sure if the in-between states are worth the extra code/complexity?

Comment on lines +44 to +51
log.info { 'inject:patch' }

bundler = import 'bundler'

bundler.patch!
else
log.info { 'inject:skip' }
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads a bit surprising to me... Why would we patch when DD_INTERNAL_RUBY_INJECTOR == false? Perhaps I'm misunderstanding the semantics here (some comments in the source would be very helpful!)

Copy link
Member Author

@lloeki lloeki Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is indeed a bit lousy:

  • DD_INTERNAL_RUBY_INJECTOR=true => perform the injection steps
  • DD_INTERNAL_RUBY_INJECTOR=false => don't perform the injection steps

The "injection steps" are the "stage 1" injection of dependencies into the user's bundle; this is typically achieved when bundle exec is called. Once that stage 1 is done bundle exec performs a call to Kernel.exec, which starts a new process, which is intercepted by the C injector again, which adds the Ruby injector to RUBYOPT, and so it is invoked again, but this time we're under the correct bundler env as set up by the Ruby injector.

So DD_INTERNAL_RUBY_INJECTOR is set to false beforehand, and the second execution in the exec'd process won't start it all over again and just reuse the current env. As a bonus this also applies to any other spawn, system, exec... call, which will merely reuse the environment (including the new gemfile) and it will all Just Work (as it does for Bundler)

There is an exception to that "It Just Works": deployment mode is persisted in config or in the env, and to break through the grip of deployment mode (or rather vendored mode, which restricts visible paths to only one) we need to monkeypatch Bundler so that the processes do not see the deployment mode config being true.

So, yes, we have to patch Bundler when DD_INTERNAL_RUBY_INJECTOR=false.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a comment in the code, not in the review! :P

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed... sort of! Rather, this should be documentation about how this thing works.

I tried to be detailed in commit messages as well.

Thanks for being my rubber duck / hallway tester about what is unclear to a newcomer! Being head-deep into this problem domain for a while I must admit a can lose sight of what is clear and obvious and what is not...

Comment on lines -59 to 61
if !status[:bundler][:use_system_gems]
result << { :name => 'bundler.use_system_gems', :reason => 'bundler.vendored' }
if !status[:fs][:writable]
result << { :name => 'fs.writable', :reason => 'fs.readonly' }
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was the only place using status[:bundler][:use_system_gems] so it can be removed now :)

Copy link
Member Author

@lloeki lloeki Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not directly: I'd rather not remove it because of logging.

This allows examining relevant Bundler internals for diagnostic. Indeed that we don't use this state directly doesn't mean it doesn't have an influence on Bundler's behaviour. use_system_gems in particular is used in multiple Bundler conditionals that are really important to understand come debug time.

Comment on lines -55 to 57
if status[:bundler][:deployment]
result << { :name => 'bundler.deployment', :reason => 'bundler.deployment' }
if status[:bundler][:settings][:path]
result << { :name => 'bundler.path', :reason => 'bundler.vendored' }
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spotted there's still references to bundler.deployment in "report.rb" and "test.rb"... they can (should?) be removed, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I will address.

if context[:bundler][:deployment]
app_bundle_path = context[:bundler][:bundle_path]

ENV['DD_INTERNAL_RUBY_INJECTOR_PATCH'] = "mode=deployment,path=#{package_gem_home}:#{app_bundle_path}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... I see this being set here to this complex string and in main.rb we check if ENV['DD_INTERNAL_RUBY_INJECTOR_PATCH'] is set and... do nothing with it? What's up with that? I'm clearly missing something... 👀

Copy link
Member Author

@lloeki lloeki Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're not missing anything, it is an attempt at passing information from the first stage (bundle exec) to the second stage (the exec'd process).

Reason is:

  • diagnostic logging and inspection from the second stage
  • consistency/sanity checks
  • full support for vendor mode and tackling read-only filesystems will require passing information from one stage to the other

Comment on lines +131 to +136
Gem.paths = { 'GEM_PATH' => "#{package_gem_home}:#{app_bundle_path}" }
ENV['GEM_PATH'] = Gem.path.join(File::PATH_SEPARATOR)
ENV['GEM_HOME'] = app_bundle_path

BUNDLER.patch!
else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So similarly to my comment on bundler.rb, it's very hard to follow along why we're doing these changes -- e.g. if someone ever needs to investigate an issue around this code we'll need to effectively start from scratch in reverse engineering all of the logic here again.

This really needs a big comment explaining what we're doing and why...

Also -- why do we .patch! bundler here + again on main.rb? What's up with that?

Copy link
Member Author

@lloeki lloeki Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed about the lack of docs.

Also -- why do we .patch! bundler here + again on main.rb? What's up with that?

This one is during the first stage (bundle exec), the one in main happens during the second stage. Since the second stage is essentially another process (albeit with the same pid because of exec), all patches have been forgotten.

'injection should abort',
'app gemfile should not include datadog',
'app lockfile should not include datadog',
# TODO: disabled due to race condition on naive deletion
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's a "naive deletion" in this context again?

Copy link
Member Author

@lloeki lloeki Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing some rm of the new Gemfile+lockfile, possibly even in an ensure.

e.g there's a risk that two Ruby processes ran at about the same time in the same bundle would attempt injection and walk upon each other while one is creating and the other removing files.

In that case there's no problem leaving the files around: a hypothetical rm would happen if injection failed, and so they would not be pointed to.

Similarly two concurrent processes simultaneously attempting and succeeding at injection, while wasteful, would produce the same content idempotently.

The problem only occurs when someone attempts to clean up, and doing the whole creation atomically is tough (e.g we can't operate in a temp folder if the original Gemfile contains relative file references; even moving both Gemfile+lockfile on success isn't atomic)

Something similar happened in testing which made the tests flaky (they would somehow incorrectly pass).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand any of these fixture changes -- why did we need the many files with a single puts, and why do we now need to have them being copy/pasted copies of all of this info?

(It's very hard to follow the why on all these changes...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These stubs are the target Ruby programs that get ran during testing, to be injected into.

Initially they were a mere puts so that there was output visually confirming they were being ran.

Then as I needed more diagnostic output to understand and/or confirm behaviour, each stub got an increasing amount of information. This information is as of today only consumed by humans but ultimately the intent is to have this information captured and checked by tests.

There is one per fixture directory because when tests are ran the fixture is being copied to a unique temporary folder from which things are run. Each directory being self-contained is much easier to understand, manage, and copy around: devising a mechanism to have them shared while being a) copied and b) ran inside of various containers is non-trivial and adds a fair amount of complexity.

In addition it is not inconceivable that the stub contents may need to be specialised depending on the fixture.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just generate this instead? E.g. have a "base_fixture.rb" that gets copied and renamed as needed? It seems a lot of copy-paste is going around here

Comment on lines +16 to +18
# To ensure no crash happens on Ruby 2.6 we package the corresponding
# `did_you_mean` version.
#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment -- specifically, it's not clear if the impact here is in our test code or on in the code in production?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both.

The packages/<foo> directory contains an injection "package" (for lack of a better word), IOW the dependencies that we want injected into the applications.

The injector code is intended to be largely Datadog-agnostic, so the simple package is a minimal one without the datadog baggage and assumptions.

The datadog package mimics what is packaged by our OCI package builder, only minimally so for testing purposes.

This change needs to be reflected (and is, here) in the actual package building process.

Vendored mode (`BUNDLE_PATH` / `use_system_gems?` => `false`) removes
all paths but the vendor path from the Gem path list.

This effectively hides all other gems, making any injection impossible.

Massage `GEM_PATH` and `GEM_HOME` to set them to values that:
- are identical in behaviour with `BUNDLE_PATH='vendor/bundle'`
- allow injection once vendored mode is relaxed

Note: this does not take an arbitrary `BUNDLE_PATH` into account yet,
instead focusing on the default for deployment mode only
(`vendor/bundle`).
When deployment mode is detected we patch Bundler to act as if it
wasn't set.

This:
- makes Bundler not set `use_system_gems?` to `false`
- makes Bundler not set the vendored path to `vendor/bundle`

But thanks to the `GEM_HOME` and `GEM_PATH` variables that have been set
beforehand, gems will be looked up in the appropriate location, all the
while having broken out of being able to see _only_ vendored gems.

Ergo, gem injection can proceed.
Since Ruby 1.9 `rubygems` is automatically required default. Ruby and
`rubygems` contain mechanisms that enforce rubygems is not just loaded,
but that it is being loaded *first*.

Right before executing the Ruby process we make sure that the Ruby
injector is appearing first, and whatever else is present still appears
too, but afterwards.
It turns out rubygems bootstraps bundler straight from `rubygems` code.
This is used so that once, say, `bundle exec` completes, running ruby
programs is also executed in that bundler context.

Indeed withotut this `bundle exec foo` itself would not work! As soon as
it `exec`s to `foo`, any previous Ruby context would be lost. But it
also means that Ruby loads bundler stuff too early.

Unset the undocumented `BUNDLER_SETUP`.
When `bundle exec` happens, it proceeds with processing Bundler things,
then `exec`s into the actual Ruby program to execute.

When it does so, this is a new process, hence any patch that we have
applied goes away, which results in deployment mode being re-armed.

Apply the `bundler` patches to override Bundler behaviours.

We can safely load `bundler` withotu forking in that case since we are
in a bundled case.
Both threads and pipes were otherwise leaking.
When bundle is unlocked `bundle exec` (obviously) fails, and thus makes
the test fail due to the exit code.

Instead, when unlocked, run the stub directly to test guardrails are in
effect.

Note: an extension of this change would be to:

- still run bundle exec and ensure it exits with a non-zero status,
  behaving as expected.
- run without bundle exec in a locked bundle with `RUBYGEMS_GEMDEPS`
  unset (as of this commit).
- run without bundle exec in a locked bundle with `RUBYGEMS_GEMDEPS` set
  to `-` and/or the path to the fixture `Gemfile`.
- `-r` is not supposed to have a space even though it can in some later
  Ruby versions
- Before 1.9 `Gem` isn't present until `rubygems` is required
There's a discrepancy when e.g. `BUNDLE_PATH` is set.
When `BUNDLE_PATH` is set to `/bundle` the result of `bundle install`
is lost, so the bundle is empty come test time.

Store this path in a volume.
There are two cases we don't handle right:
- vendor path by itself: we only patch Bundler to ignore deployment mode
- deployment mode non-default vendor path: we hardcode the path
Bundler doesn't define `Bundler::CLI` outside of `lib/bundler/cli.rb`
and defines commands as e.g `class CLI::Exec` to save nesting.

This makes requiring `lib/bundler/cli/exec.rb` standalone crash.
Test packages were pinned to a problematic version of `libddwaf` that
causes misresolutions due to multiple overlapping binary gem platforms
being available.
Ruby 2.6 decided to activate the default gem via `Kernel.gem` which makes it
subject to isolation, hence breaking under vendored mode.

This `Kernel.gem` activation happens in `gem_prelude` and can only be skipped
with the `--disable=did_you_mean` CLI flag.

Ruby 2.7 reverted that problematic behaviour, instead resorting to a plain
`require` which will either simply load from `$LOAD_PATH` or use a bundled
version.

To ensure no crash happens on Ruby 2.6 we package the corresponding
`did_you_mean` version.

See:
- gem_prelude.rb calls `Kernel.gem` on 2.6, but not 2.7:
  https://github.com/ruby/ruby/blob/ruby_2_6/gem_prelude.rb#L3-L7
  https://github.com/ruby/ruby/blob/ruby_2_7/gem_prelude.rb#L2
- gem_prelude.rb is included as a prelude script
  https://github.com/ruby/ruby/blob/ruby_2_6/common.mk#L158-L161
- prelude scripts get compiled in prelude.c:
  https://github.com/ruby/ruby/blob/ruby_2_6/common.mk#L1050-L1059
- prelude.c is generated from a template:
  https://github.com/ruby/ruby/blob/ruby_2_6/common.mk#L189
- the template embeds ISeqs of scripts
  https://github.com/ruby/ruby/blob/ruby_2_6/template/prelude.c.tmpl#L167-L168
- prelude targets are just prelude.c:
  https://github.com/ruby/ruby/blob/ruby_2_6/common.mk#L1081-L1082
@lloeki lloeki force-pushed the lloeki/deployment-mode branch from 07e0ecd to 6a1e55d Compare November 20, 2025 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants