identity: reload CA root cert channel on file change#1775
identity: reload CA root cert channel on file change#1775istio-testing merged 10 commits intoistio:masterfrom
Conversation
Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com>
Use the new RootCertManager to rebuild the TLS gRPC channel when a root cert change is noted. Need to add some fields to store the information to rebuild the channel. Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com>
Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com>
Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com>
Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com>
Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com>
|
Hi @jlojosnegros. Thanks for your PR. I'm waiting for a istio member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com>
Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com>
6472679 to
e8b03a0
Compare
|
/ok-to-test |
|
Will you be able to backport this fix to 1.28? |
Yeah sur, no problem. Should be something like #1801 |
We'd typically only cherry-pick bug fixes. This is a grey area IMO. It could be seen as a bug of omission. At any rate, If we do want to consider this a bug we should use the cherry-pick methodology native to the project by labeling this issue for cherry-pick. |
ilrudie
left a comment
There was a problem hiding this comment.
Mostly this looks good. Nice work and thanks for the contribution.
I did want to talk about the locking a little bit to make sure I understand it the same way you do an that it's the right optimization but no need to immediately jump in and start changing that bit.
| manager.mark_dirty(); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This isn't requesting a change per se. Sort of reasoning "out loud" about the locking here.
After we take_dirty() the state returns to clean immediately and we enter the await to attempt rebuild_channel. During this time nothing holds the write lock and the state is clean so we may try to use an old client. It might be able to compound as contentious readers may continue to delay being able to take the write, all the while in a "clean" state. I think the advantage is that we do not queue multiple write/rebuilds, which seems like a nice property. In the happier state, the old client is still valid and we keep on running nicely but eventually wind up with a fresh client. In the worse state, the client isn't usable and we produce errors while potentially blocking the resolution by causing read contention on the lock.
Assuming we held the rwlock before marking clean, a bunch of reads could await the write lock to resolve. On success, they get the newly minted client (probably ideal?). On fail, they get the old client and what happens happens. The downside is we could have multiple writelock awaits simultaneously queue since multiple calls to fetch could see the dirty state while we await any held read locks to clear. This might be kind of helpful, since at least write locks to not lead to further read contentions, but we might need to check if the dirty was resolved by someone else before actually rebuilding.
There was a problem hiding this comment.
Hi @ilrudie ,
Thanks for the comments!!!
Sorry for the late response
After we take_dirty() the state returns to clean immediately and we enter the await to attempt rebuild_channel. During this time nothing holds the write lock and the state is clean so we may try to use an old client. It might be able to compound as contentious readers may continue to delay being able to take the write, all the while in a "clean" state
Yeah, you are right, but the contention will be quite limited I think.
If I remember correctly, in Linux the RW lock uses "write-preferring" policy by default, so once a writer ask for the lock it only has to wait for the readers that already has the lock to unlock it, but has preference over any new reader. Given this the contention will be limited to just those readers that asked for the lock before the writer and Read lock is only hold while doing the clone which is a very fast operation (~us), so the contention time should be really small
I think the advantage is that we do not queue multiple write/rebuilds, which seems like a nice property. In the happier state, the old client is still valid and we keep on running nicely but eventually wind up with a fresh client. In the worse state, the client isn't usable and we produce errors while potentially blocking the resolution by causing read contention on the lock.
You are right again. Old client will be used while rebuild is done and that will fail if the old cert is not valid just after rotation, but I was assuming a healthy time window between rotation and certification expiry.
Assuming we held the rwlock before marking clean, a bunch of reads could await the write lock to resolve. On success, they get the newly minted client (probably ideal?). On fail, they get the old client and what happens happens. The downside is we could have multiple writelock awaits simultaneously queue since multiple calls to fetch could see the dirty state while we await any held read locks to clear. This might be kind of helpful, since at least write locks to not lead to further read contentions, but we might need to check if the dirty was resolved by someone else before actually rebuilding.
Yeah, the problem with that is we cannot hold a RWlock through an await operation, so we should change std::async::RWLock to tokio::async::RWLock which is not trivial, and as you have already said, we will have to use a double-checked locking.
Working from this idea I was thinking ...
What if we use a "rebuild_lock" (tokio::sync::Mutex) to serialize channel builds, so we avoid queuing multiple builds, it will be something like:
if take_dirty() is true
get rebuild_lock // only one rebuild in queue
if take_dirty() is true // double-check
new_client = rebuild_channel().await
*self.client.write().unwrap() = new_client // quick write lock
endif
unlock rebuild_lock
endif
This will:
- "serialize" rebuilds, so we only have one at a time
- avoid readers to be block while rebuilding channel, as write lock is only hold for the client swap
- no need to change
std::sync::RWLock
There was a problem hiding this comment.
Thanks for the discussion. I agree we should have time to rotate out the client in most reasonable cases so this is probably a suitable implementation with a good balance of complexity.
It might be nice to add how long we waited for the write lock to the debug message. This way if we suspect contention, we already have some information that can be made available to troubleshoot. I don't think we need to immediately introduce more complexity though.
There was a problem hiding this comment.
Good idea.
I have added the elapsed time in write contention to the debug log.
I have marked the PR as a draft because I was not sure about the process to "backport" this in the project. I assumed we would need to follow some formal process to do that, I just wanna see if the backport was easy to do. Thanks for the rest of the comments. I will review them and will come back asap. |
I started asking release managers in slack. I'm not sure if they will accept it or not but I labeled the PR for cherry-pick to 1.28 and 1.29 and we can sort that out in the release branches. |
Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com>
| manager.mark_dirty(); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Thanks for the discussion. I agree we should have time to rotate out the client in most reasonable cases so this is probably a suitable implementation with a good balance of complexity.
It might be nice to add how long we waited for the write lock to the debug message. This way if we suspect contention, we already have some information that can be made available to troubleshoot. I don't think we need to immediately introduce more complexity though.
Add write_lock_wait_ms to the debug log emitted after a successful root cert hot-reload, so contention on the RwLock is observable in logs without requiring additional instrumentation. Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com>
450f1d2 to
fac3731
Compare
|
According to the discussion on slack we can manually cherry-pick this changes as a bug fix but let’s add a flag to allow user to turn off the watcher. On master we can keep the current behavior |
ilrudie
left a comment
There was a problem hiding this comment.
lgtm, thanks for working on this!
* remove git tag (#1559) * update how io errors are being generated to fix clippy issues (#1564) Signed-off-by: ilrudie <ian.rudie@solo.io> * Automator: update common-files@master in istio/ztunnel@master (#1563) * Automator: update common-files@master in istio/ztunnel@master (#1568) * Allow dynamic configuration of thread count (#1566) * Allow dynamic configuration of thread count * fix flakes * don't send to empty address (#1570) * don't send to empty address * add test * Automator: update common-files@master in istio/ztunnel@master (#1571) * Automator: update common-files@master in istio/ztunnel@master (#1572) * remove invalid test cases from parsing of ZTUNNEL_WORKER_THREADS (#1576) Signed-off-by: Ian Rudie <ian.rudie@solo.io> * Automator: update common-files@master in istio/ztunnel@master (#1577) * Automator: update common-files@master in istio/ztunnel@master (#1578) * tls: add PQC compliance policy (#1561) * tls: add PQC compliance policy Signed-off-by: Jacek Ewertowski <jacek.ewertowski1@gmail.com> * Add global lazy variable PQC_ENABLED Signed-off-by: Jacek Ewertowski <jacek.ewertowski1@gmail.com> * Add unused_imports and dead_code to PQC_ENABLED declaration Signed-off-by: Jacek Ewertowski <jacek.ewertowski1@gmail.com> --------- Signed-off-by: Jacek Ewertowski <jacek.ewertowski1@gmail.com> * Automator: update common-files@master in istio/ztunnel@master (#1582) * Improved Service Resolution (#1562) * initial idea for improved resolution Signed-off-by: ilrudie <ian.rudie@solo.io> * handle preferred service namespace; unit testing Signed-off-by: Ian Rudie <ian.rudie@solo.io> --------- Signed-off-by: ilrudie <ian.rudie@solo.io> Signed-off-by: Ian Rudie <ian.rudie@solo.io> * Automator: update common-files@master in istio/ztunnel@master (#1583) * Multinetwork/Support remote networks for services with waypoints (#1565) * Multinetwork/Support remote networks for services with waypoints Currently `build_request` when it sees a service with a waypoint resolves the waypoint backend and routes request there using regular HBONE. In multi network scenario though the waypoint may have workload on a remote network and to reach it we have to go through E/W gateway and use double HBONE. This change enables handling of services with waypoint on a remote network. Some of the assumptions that were used when I prepared this change: 1. We assume uniformity of configuration (e.g., if service X in local cluster has a waypoint, then service X in remote network also has a waypoint, if waypoint is service addressable, then it's using service to address waypoint both locally and on remote network) 2 Split-horizon representation of waypoint workloads, just like with any regular workloads and services (e.g., in the local cluster instead of an actual waypoint workload pointing to a pod on another network we will have a "proxy" representation that just has network gateway). Both of those can be in hanled by the controlplane (e.g., controlplane can generate split-horizon workloads and when configuration is non-uniform, just filter out remote configs for remote networks), though we don't yet have a complete implementation. Signed-off-by: Mikhail Krinkin <mkrinkin@microsoft.com> * Return an error instead of panicking Signed-off-by: Mikhail Krinkin <mkrinkin@microsoft.com> * Update comments in src/proxy/outbound.rs Co-authored-by: Ian Rudie <ilrudie@gmail.com> * Update comments in src/proxy/outbound.rs Signed-off-by: Mikhail Krinkin <mkrinkin@microsoft.com> * Add a debug assert to provide a bit more context to the error in tests Signed-off-by: Mikhail Krinkin <mkrinkin@microsoft.com> * Fix formatting Signed-off-by: Mikhail Krinkin <mkrinkin@microsoft.com> * Added a few debug logs to be able to trace when a workload on a remote network is picked Signed-off-by: Mikhail Krinkin <mkrinkin@microsoft.com> --------- Signed-off-by: Mikhail Krinkin <mkrinkin@microsoft.com> Co-authored-by: Ian Rudie <ilrudie@gmail.com> * increasing limit for open files (#1586) * increasing limit for open files * suggestion from PR * adding comment * Update src/main.rs Co-authored-by: Daniel Hawton <daniel@hawton.org> --------- Co-authored-by: Daniel Hawton <daniel@hawton.org> * Buffer inner h2 streams (#1580) * Buffer h2 streams * Tests * naming * Review simplify code * Automator: update common-files@master in istio/ztunnel@master (#1589) * Automator: update common-files@master in istio/ztunnel@master (#1595) * Automator: update common-files@master in istio/ztunnel@master (#1596) * Automator: update common-files@master in istio/ztunnel@master (#1597) * Automator: update common-files@master in istio/ztunnel@master (#1598) * adopt rcgen 14 (#1599) * adopt rcgen 14 Signed-off-by: Ian Rudie <ian.rudie@solo.io> * fmt Signed-off-by: Ian Rudie <ian.rudie@solo.io> * fmt fuzz Signed-off-by: Ian Rudie <ian.rudie@solo.io> --------- Signed-off-by: Ian Rudie <ian.rudie@solo.io> * respect IPV6 setting for DNS server (#1601) * Automator: update common-files@master in istio/ztunnel@master (#1604) * chore - clippy cleanup (#1610) * fix build.rs Signed-off-by: Ian Rudie <ian.rudie@solo.io> * clippy Signed-off-by: Ian Rudie <ian.rudie@solo.io> * remove swap Signed-off-by: Ian Rudie <ian.rudie@solo.io> --------- Signed-off-by: Ian Rudie <ian.rudie@solo.io> * Automator: update common-files@master in istio/ztunnel@master (#1607) * Update rust version and dependencies (#1614) * Update rust version and dependencies * fix mismatch * drop criterion ver * bump * Automator: update common-files@master in istio/ztunnel@master (#1615) * Automator: update common-files@master in istio/ztunnel@master (#1618) * fix: upgrade pprof to version 0.15.0 to fix GHSA-2gh3-rmm4-6rq5/CVE-2025-53605. Creator of upstream patch to fix CVE issue: tikv/pprof-rs@3d4e696 (#1606) Signed-off-by: Kyle Steere <kyle.steere@chainguard.dev> * Automator: update common-files@master in istio/ztunnel@master (#1619) * Automator: update common-files@master in istio/ztunnel@master (#1621) * Update ztunnel profiling doc (#1620) * Update ztunnel profiling doc * reset the numbers to 1 since markdown auto-renders them in order * Automator: update common-files@master in istio/ztunnel@master (#1622) * Add open file metrics (#1626) * Add open file metrics * Lint * use /dev/fd and subtract one from fd count * import ordering * Automator: update common-files@master in istio/ztunnel@master (#1628) * initial impl for passthrough services (#1627) * initial impl for passthrough services Signed-off-by: Ian Rudie <ian.rudie@solo.io> * make gen Signed-off-by: Ian Rudie <ian.rudie@solo.io> * tests Signed-off-by: Ian Rudie <ian.rudie@solo.io> --------- Signed-off-by: Ian Rudie <ian.rudie@solo.io> * Automator: update common-files@master in istio/ztunnel@master (#1635) * retain valid certs on fetch failures (#1567) * retain valid certs on fetch failures * better unit tests. - fetches now records failed attempts as well. - validate that valid certificate are retained across fetch attempts despite ca failures * minor tweaks as per review comments. * Automator: update common-files@master in istio/ztunnel@master (#1638) * Automator: update common-files@master in istio/ztunnel@master (#1639) * Automator: update common-files@master in istio/ztunnel@master (#1640) * admin: allow symbols in jemalloc pprof (#1636) Otherwise its pretty useless. Manually tested this makes heap profiles work * Set timeouts and keepalive (#1641) * set timeouts * Listener takes socket config * Make certificate DER fields public (#1646) * logs: disable spammy DNS logs (#1649) These logs are crazy, like 5 lines per DNS query. We turned them off but an update changed the log target. Not sure what backports we need * Automator: update common-files@master in istio/ztunnel@master (#1653) * Automator: update common-files@master in istio/ztunnel@master (#1654) * Automator: update common-files@master in istio/ztunnel@master (#1655) * fix small typo (#1647) Signed-off-by: Ian Rudie <ian.rudie@solo.io> * Bumps deps (#1623) * Bumps deps * fixup * fix * fmt * rfmt * fmt and 1.90 * their way * Automator: update common-files@master in istio/ztunnel@master (#1661) * Automator: update common-files@master in istio/ztunnel@master (#1670) * Bump cargo deps (#1669) * removing CNAME record answer for wildcards (#1664) * removing CNAME record answer for wildcards * removing unnused trace * fixing unit test * removing unused fild/type * make gen * metrics: fix accept header negotiation (#1681) Basically `get_all` does NOT unconditionally split a header that has been sent on 1 line by the client. This breaks kube-prometheus-stack * Automator: update common-files@master in istio/ztunnel@master (#1685) * Set socket options outside trace! (#1689) * Automator: update common-files@master in istio/ztunnel@master (#1696) * Set keepalives on outbound connections (#1688) * Set keepalives on outbound connections * mut * Update tonic/prost crates to 0.14.x (#1687) The only relevant breaking change is that `prost::Message` is no longer a supertrait of `fmt::Debug`, so that bound is added in the few places that need it. * Automator: update common-files@master in istio/ztunnel@master (#1700) * Fix clippy lint errors (#1701) * Fix clippy lint errors Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * Ordering Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> --------- Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * Automator: update common-files@master in istio/ztunnel@master (#1702) * Automator: update common-files@master in istio/ztunnel@master (#1705) * provide test to verify socket options set (#1690) * metrics: fix dns histogram (#1706) * canonical wds service (#1704) * initial impl Signed-off-by: Ian Rudie <ian.rudie@solo.io> * improvements to the impl Signed-off-by: Ian Rudie <ian.rudie@solo.io> * implement preferred service namespace handling Signed-off-by: Ian Rudie <ian.rudie@solo.io> * lints Signed-off-by: Ian Rudie <ian.rudie@solo.io> * unit for canonical service Signed-off-by: Ian Rudie <ian.rudie@solo.io> * cleanup old commented reference code Signed-off-by: Ian Rudie <ian.rudie@solo.io> * comment about canonical's purpose Signed-off-by: Ian Rudie <ian.rudie@solo.io> * add issue link for deprecation of preferred_service_namespace Signed-off-by: Ian Rudie <ian.rudie@solo.io> * unit to assert that namespace-local Service definitions are preferred over canonical Signed-off-by: Ian Rudie <ian.rudie@solo.io> --------- Signed-off-by: Ian Rudie <ian.rudie@solo.io> * dry run auth (#1659) * dry run auth * lint * Expose client config builder for workload TLS (#1710) Add WorkloadCertificate::client_config to build a rustls ClientConfig using the existing verifier and root store, so callers can customize SNI/ALPN without duplicating TLS config construction. Existing outbound behavior remains unchanged. * feat: implement crl support in ztunnel (#1660) * feat: implements ca-crl support in zTunnel Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: drains connection only for revoked cert Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * feat: revokes affected inbound connections only Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: refactors crl watcher Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: rejects new connections only Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: clr validation only at the HBONE layer Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: use rustls-webpki for CRL validation Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: validates CRL using webpki instead of custom implementation. Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: addresses review comments Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: implements CRL validation entirely with webpki's verify_for_usage method. Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: reverts verify_for_usage Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: removes comment Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: address review comments Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: fixes lock Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: uses Option rather than extra var Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: addresses review comments Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * chore: fixes merge conflict Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> --------- Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> * add support for TLSv1.2 (#1711) by setting `TLS12_ENABLED` to `true`, ztunnel will negotiate TLSv1.2 or 1.3. Fixes #1296 until we have FIPS-140-3 support in istiod. * Automator: update common-files@master in istio/ztunnel@master (#1716) * enhance ztunnel metrics (#1695) * enhance ztunnel metrics Signed-off-by: Lucas Copi <lucas.copi@solo.io> * make gen Signed-off-by: Lucas Copi <lucas.copi@solo.io> * change socket labels, review fixes Signed-off-by: Lucas Copi <lucas.copi@solo.io> * use gauge metric for socket tracking Signed-off-by: Lucas Copi <lucas.copi@solo.io> * expand connection falure checking Signed-off-by: Lucas Copi <lucas.copi@solo.io> * fix cargo issues Signed-off-by: Lucas Copi <lucas.copi@solo.io> * make gen Signed-off-by: Lucas Copi <lucas.copi@solo.io> * remove unneeded direction label Signed-off-by: Lucas Copi <lucas.copi@solo.io> * remove unused enum Signed-off-by: Lucas Copi <lucas.copi@solo.io> * downcast error for metrics Signed-off-by: Lucas Copi <lucas.copi@solo.io> --------- Signed-off-by: Lucas Copi <lucas.copi@solo.io> * Automator: update common-files@master in istio/ztunnel@master (#1721) * Change log level from info to debug for "response received" (#1723) * Embed test data files to enable running tests outside source tree (#1720) * Embed test data files to enable running tests outside source tree Several unit tests depend on external files via relative paths, causing failures when the test binary is executed from a different directory than the source tree (e.g. in CI environments that copy binaries to isolated locations). This change: - Embed test data files at compile time using `include_str!` - Adds `temp_file_with_content()` helper function for tests requiring file paths - Uses `AuthSource::StaticToken` for CA client tests - Moves tempfile from dev-dependencies to optional dependency activated by the testing feature Affected tests: - config::tests::config_from_proxyconfig - identity::caclient::tests::{empty_chain, fetch_certificate, wrong_identity} - state::workload::tests::local_client Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * small change `make gen` Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> --------- Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * Automator: update common-files@master in istio/ztunnel@master (#1726) * Add x forwarded network header (#1728) * Initial plan * Add x-origin-source header to inner CONNECT requests in double HBONE Co-authored-by: keithmattix <1531662+keithmattix@users.noreply.github.com> * Add comment explaining single HBONE codepath and tests for x-origin-source header Co-authored-by: keithmattix <1531662+keithmattix@users.noreply.github.com> * Rename header from x-origin-source to x-istio-origin-source Co-authored-by: keithmattix <1531662+keithmattix@users.noreply.github.com> * Rename header to x-istio-origin-network and refocus test on double HBONE Co-authored-by: keithmattix <1531662+keithmattix@users.noreply.github.com> * Add test back Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * Use inbound x-istio-origin-network to know whether traffic originates from the gateway Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * fmt Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * Actually check value of origin header Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * Cargo fmt Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * Remove incorrect TODO Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * Update src/proxy.rs Co-authored-by: Ian Rudie <ilrudie@gmail.com> * Complete the rename Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * Find/replace was too ambitious Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * Fix inverted logic Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * fmt Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> --------- Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: keithmattix <1531662+keithmattix@users.noreply.github.com> Co-authored-by: Ian Rudie <ilrudie@gmail.com> * Automator: update common-files@master in istio/ztunnel@master (#1737) * Ambient Multicluster Telemetry (#1734) * Add addtl codeowners for experimental (#1732) Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * More baggage support (#1731) * baggage * Use baggage for cross-cluster * fix unit tests * Fix namespaced tests. Remove extra code. * cleanup a bit * Initial plan * Add x-origin-source header to inner CONNECT requests in double HBONE Co-authored-by: keithmattix <1531662+keithmattix@users.noreply.github.com> * Add comment explaining single HBONE codepath and tests for x-origin-source header Co-authored-by: keithmattix <1531662+keithmattix@users.noreply.github.com> * Rename header from x-origin-source to x-istio-origin-source Co-authored-by: keithmattix <1531662+keithmattix@users.noreply.github.com> * Rename header to x-istio-origin-network and refocus test on double HBONE Co-authored-by: keithmattix <1531662+keithmattix@users.noreply.github.com> * Add test back Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * Use inbound x-istio-origin-network to know whether traffic originates from the gateway Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * fmt Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * lint * Codeowners * Fix rebase --------- Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: keithmattix <1531662+keithmattix@users.noreply.github.com> Co-authored-by: Keith Mattix II <keithmattix@microsoft.com> * Cargo fmt Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * Remove experimental codeowners Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * Address review comments (#1738) * Address PR comments Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> * Cargo fmt Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> --------- Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> Co-authored-by: Steven Jin <sjinxuan@microsoft.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: keithmattix <1531662+keithmattix@users.noreply.github.com> Co-authored-by: Krinkin, Mike <mkrinkin@microsoft.com> * PQC: openssl crypto provider support (#1743) * PQC: openssl crypto provider support ztunnel must be both compiled and ran with openssl >= 3.5.0. Signed-off-by: Zuzana Miklankova <zmiklank@redhat.com> * allow unusual byte groupings for openssl version detection Signed-off-by: Zuzana Miklankova <zmiklank@redhat.com> * running 'make gen' Signed-off-by: Zuzana Miklankova <zmiklank@redhat.com> --------- Signed-off-by: Zuzana Miklankova <zmiklank@redhat.com> * Automator: update common-files@master in istio/ztunnel@master (#1750) * Automator: update common-files@master in istio/ztunnel@master (#1758) * prioritize canonical services on inbound (#1746) * prioritize canonical services on inbound * delete MatchReason * fix itertools thing * continue not return * drop continue for into_iter * as deref * gencheck * dry-run no allow policies match (#1745) * dry-run no allow policies match Signed-off-by: Ian Rudie <ian.rudie@solo.io> * authpol logging macro + env var to allow info log level if desired Signed-off-by: Ian Rudie <ian.rudie@solo.io> * move macro to better location Signed-off-by: Ian Rudie <ian.rudie@solo.io> * clean up Signed-off-by: Ian Rudie <ian.rudie@solo.io> * update tests Signed-off-by: Ian Rudie <ian.rudie@solo.io> * fmt Signed-off-by: Ian Rudie <ian.rudie@solo.io> --------- Signed-off-by: Ian Rudie <ian.rudie@solo.io> * Automator: update common-files@master in istio/ztunnel@master (#1763) * handle peer_addr() failure gracefully (#1764) * proxy: handle peer_addr() failure gracefully instead of panicking Replace .expect("must receive peer addr") with proper error handling in outbound and socks5 proxy paths. A client can send RST immediately after the TCP handshake completes, causing getpeername(2) to return ENOTCONN on the already-queued socket. The previous .expect() converted this transient OS-level error into a panic that killed the Tokio task. - outbound: match on peer_addr(), log debug and return early on error - socks5/handle_socks_connection: same pattern - socks5/negotiate_socks_connection: use ? operator (From<io::Error> for SocksError already exists) Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * i socket: avoid panic in orig_dst_addr diagnostic log peer_addr() and local_addr() inside the warn!() in orig_dst_addr() were called with .unwrap(), which could panic if the socket became unavailable before the log statement was reached. Replace with .map(...).unwrap_or_else() so the warning logs "N.A." instead of crashing when the address cannot be retrieved. Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> --------- Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * Automator: update common-files@master in istio/ztunnel@master (#1765) * Automator: update common-files@master in istio/ztunnel@master (#1770) * Bump the cargo group across 2 directories with 4 updates (#1771) Bumps the cargo group with 2 updates in the / directory: [bytes](https://github.com/tokio-rs/bytes) and [time](https://github.com/time-rs/time). Bumps the cargo group with 4 updates in the /fuzz directory: [bytes](https://github.com/tokio-rs/bytes), [tracing-subscriber](https://github.com/tokio-rs/tracing), [time](https://github.com/time-rs/time) and [crossbeam-channel](https://github.com/crossbeam-rs/crossbeam). Updates `bytes` from 1.11.0 to 1.11.1 - [Release notes](https://github.com/tokio-rs/bytes/releases) - [Changelog](https://github.com/tokio-rs/bytes/blob/master/CHANGELOG.md) - [Commits](tokio-rs/bytes@v1.11.0...v1.11.1) Updates `time` from 0.3.44 to 0.3.47 - [Release notes](https://github.com/time-rs/time/releases) - [Changelog](https://github.com/time-rs/time/blob/main/CHANGELOG.md) - [Commits](time-rs/time@v0.3.44...v0.3.47) Updates `bytes` from 1.10.0 to 1.11.1 - [Release notes](https://github.com/tokio-rs/bytes/releases) - [Changelog](https://github.com/tokio-rs/bytes/blob/master/CHANGELOG.md) - [Commits](tokio-rs/bytes@v1.11.0...v1.11.1) Updates `tracing-subscriber` from 0.3.19 to 0.3.22 - [Release notes](https://github.com/tokio-rs/tracing/releases) - [Commits](tokio-rs/tracing@tracing-subscriber-0.3.19...tracing-subscriber-0.3.22) Updates `time` from 0.3.37 to 0.3.47 - [Release notes](https://github.com/time-rs/time/releases) - [Changelog](https://github.com/time-rs/time/blob/main/CHANGELOG.md) - [Commits](time-rs/time@v0.3.44...v0.3.47) Updates `crossbeam-channel` from 0.5.14 to 0.5.15 - [Release notes](https://github.com/crossbeam-rs/crossbeam/releases) - [Changelog](https://github.com/crossbeam-rs/crossbeam/blob/master/CHANGELOG.md) - [Commits](crossbeam-rs/crossbeam@crossbeam-channel-0.5.14...crossbeam-channel-0.5.15) --- updated-dependencies: - dependency-name: bytes dependency-version: 1.11.1 dependency-type: direct:production dependency-group: cargo - dependency-name: time dependency-version: 0.3.47 dependency-type: direct:production dependency-group: cargo - dependency-name: bytes dependency-version: 1.11.1 dependency-type: indirect dependency-group: cargo - dependency-name: tracing-subscriber dependency-version: 0.3.22 dependency-type: indirect dependency-group: cargo - dependency-name: time dependency-version: 0.3.47 dependency-type: indirect dependency-group: cargo - dependency-name: crossbeam-channel dependency-version: 0.5.15 dependency-type: indirect dependency-group: cargo ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Automator: update common-files@master in istio/ztunnel@master (#1777) * fix TLSv1.2 support by adding missing ciphersuites (#1779) It turns out that Istio's SPIFFE certs use ECDSA certificates, which only became a problem once I started testing against waypoint proxies. This adds the missing CipherSuites (which are still FIPS-compliant of course) to unblock Waypoint->ZTunnel communication. * expand dependabot (#1778) Signed-off-by: Daniel Hawton <daniel.hawton@solo.io> * Dont log error on broken pipe (#1784) * Automator: update common-files@master in istio/ztunnel@master (#1788) * Automator: update common-files@master in istio/ztunnel@master (#1792) * Automator: update common-files@master in istio/ztunnel@master (#1799) * Automator: update common-files@master in istio/ztunnel@master (#1800) * Automator: update common-files@master in istio/ztunnel@master (#1802) * Automator: update common-files@master in istio/ztunnel@master (#1810) * identity: reload CA root cert channel on file change (#1775) * RootCertManager: Add new CrlCertManager-like struct Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * caclient: rebuild channel when root cert changes Use the new RootCertManager to rebuild the TLS gRPC channel when a root cert change is noted. Need to add some fields to store the information to rebuild the channel. Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * small adaptations Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * Some unit tests Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * solve some compilation problems Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * delete is_dirty as it is not used Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * some clippy adjustments Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * adding some comments Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * addressing comments Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * log write lock wait time after TLS channel rebuild Add write_lock_wait_ms to the debug log emitted after a successful root cert hot-reload, so contention on the RwLock is observable in logs without requiring additional instrumentation. Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> --------- Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * update after merge master branch Signed-off-by: zirain <zirain2009@gmail.com> --------- Signed-off-by: ilrudie <ian.rudie@solo.io> Signed-off-by: Ian Rudie <ian.rudie@solo.io> Signed-off-by: Jacek Ewertowski <jacek.ewertowski1@gmail.com> Signed-off-by: Mikhail Krinkin <mkrinkin@microsoft.com> Signed-off-by: Kyle Steere <kyle.steere@chainguard.dev> Signed-off-by: Keith Mattix II <keithmattix@microsoft.com> Signed-off-by: nilekh <1626598+nilekhc@users.noreply.github.com> Signed-off-by: Lucas Copi <lucas.copi@solo.io> Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> Signed-off-by: Zuzana Miklankova <zmiklank@redhat.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Daniel Hawton <daniel.hawton@solo.io> Signed-off-by: zirain <zirain2009@gmail.com> Co-authored-by: Ian Rudie <ian.rudie@solo.io> Co-authored-by: Istio Automation <istio-testing-bot@google.com> Co-authored-by: John Howard <john.howard@solo.io> Co-authored-by: Steven Landow <steven@landow.dev> Co-authored-by: Jacek Ewertowski <jacek.ewertowski1@gmail.com> Co-authored-by: Krinkin, Mike <krinkin.m.u@gmail.com> Co-authored-by: Ian Rudie <ilrudie@gmail.com> Co-authored-by: Gustavo Meira <grnmeira@users.noreply.github.com> Co-authored-by: Daniel Hawton <daniel@hawton.org> Co-authored-by: Steven Jin <sjinxuan@microsoft.com> Co-authored-by: Kyle Steere <kbsteere@users.noreply.github.com> Co-authored-by: Arka Bhattacharya <21124287+find-arka@users.noreply.github.com> Co-authored-by: deveshdama <87668846+deveshdama@users.noreply.github.com> Co-authored-by: Mantas Matelis <me@mantasmatelis.com> Co-authored-by: Tamir Duberstein <tamird@gmail.com> Co-authored-by: Keith Mattix II <keithmattix@microsoft.com> Co-authored-by: Mike Zappa <michael.zappa@gmail.com> Co-authored-by: Nilekh Chaudhari <1626598+nilekhc@users.noreply.github.com> Co-authored-by: Daniel Grimm <dgrimm@redhat.com> Co-authored-by: lcopi <lucas.copi@solo.io> Co-authored-by: Ram Vennam <ram.vennam@solo.io> Co-authored-by: Jose Luis Ojosnegros <jojosneg@redhat.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: keithmattix <1531662+keithmattix@users.noreply.github.com> Co-authored-by: Krinkin, Mike <mkrinkin@microsoft.com> Co-authored-by: Zuzana Miklánková <zmiklank@redhat.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Daniel Hawton <daniel.hawton@solo.io>
* RootCertManager: Add new CrlCertManager-like struct Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * caclient: rebuild channel when root cert changes Use the new RootCertManager to rebuild the TLS gRPC channel when a root cert change is noted. Need to add some fields to store the information to rebuild the channel. Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * small adaptations Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * Some unit tests Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * solve some compilation problems Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * delete is_dirty as it is not used Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * some clippy adjustments Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * adding some comments Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * addressing comments Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * log write lock wait time after TLS channel rebuild Add write_lock_wait_ms to the debug log emitted after a successful root cert hot-reload, so contention on the RwLock is observable in logs without requiring additional instrumentation. Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> --------- Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com>
* identity: reload CA root cert channel on file change (#1775) * RootCertManager: Add new CrlCertManager-like struct Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * caclient: rebuild channel when root cert changes Use the new RootCertManager to rebuild the TLS gRPC channel when a root cert change is noted. Need to add some fields to store the information to rebuild the channel. Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * small adaptations Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * Some unit tests Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * solve some compilation problems Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * delete is_dirty as it is not used Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * some clippy adjustments Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * adding some comments Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * addressing comments Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * log write lock wait time after TLS channel rebuild Add write_lock_wait_ms to the debug log emitted after a successful root cert hot-reload, so contention on the RwLock is observable in logs without requiring additional instrumentation. Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> --------- Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * minimal fixes from subsequent dependabot work Signed-off-by: Ian Rudie <ian.rudie@solo.io> * add CA_CERT_WATCHER flag to disable root cert hot-reload Introduces a boolean env var CA_CERT_WATCHER (default: true) that allows operators to disable the CA root cert file watcher at runtime. When set to false, no watcher thread is started and the gRPC channel is never rebuilt on cert rotation — the startup-time cert is retained permanently. The flag is only effective when ca_root_cert is a file path; Static and Default certs never start a watcher regardless. Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> * suppress clippy error Signed-off-by: Ian Rudie <ian.rudie@solo.io> --------- Signed-off-by: Jose Luis Ojosnegros Manchón <jojosneg@redhat.com> Signed-off-by: Ian Rudie <ian.rudie@solo.io> Co-authored-by: Jose Luis Ojosnegros <jojosneg@redhat.com>
Problem
ztunnel reads the CA root certificate once at startup and discards it. When an operator rotates the Istio CA (
istio-ca-root-certConfigMap), the file on disk is updated by Kubernetes but ztunnel keeps using the stale cert to verify TLS with istiod. Every subsequent workload certificate renewal fails withUnknownIssuer.Proposed Solution
Add a
RootCertManager, modelled after the existingCrlManager. It holds a filesystem watcher (vianotify-debouncer-full) that sets an atomic dirty flag whenever the root cert directory changes.CaClientnow retains theRootCertand aOption<Arc<RootCertManager>>. At the start of eachfetch_certificatecall, if the dirty flag is set, the TLS gRPC channel is rebuilt from the new cert before proceeding. On rebuild failure the dirty flag is re-armed so the next call retries.Changes
src/tls/control.rsRootCertManager; exposecontrol_plane_client_configaspub(crate)src/identity/caclient.rsCaClient; addrebuild_channel(); hot-reload check infetch_certificate()src/identity/manager.rsCaClient::new()signaturesrc/test_helpers/ca.rsCaServer::spawn()to new signatureNo
Cargo.tomlchanges needed:notifyandnotify-debouncer-fullare already pulled in byCrlManager.Solves: #1768