Add support for tracing profilers like Nvidia NSight System and Intel VTune by vchuravy · Pull Request #2908 · trixi-framework/Trixi.jl

vchuravy · 2026-04-01T13:35:10Z

For GPU-accelerated development we often use external profilers, such as Nsight Systems.

With this PR, the code is annotated automatically, and Nsight Systems then shows:

Time	Total Time	Instances	Avg	Med	Min	Max	StdDev	Style	Range
55.1%	42.245 s	315	134.112 ms	150.237 ms	18.632 ms	1.625 s	112.749 ms	StartEnd	Trixi:volume integral
15.8%	12.121 s	315	38.480 ms	45.344 ms	1.353 ms	336.491 ms	27.790 ms	StartEnd	Trixi:surface integral
10.7%	8.200 s	315	26.031 ms	30.073 ms	968.863 μs	281.449 ms	21.594 ms	StartEnd	Trixi:Jacobian
9.9%	7.601 s	315	24.131 ms	27.088 ms	432.096 μs	439.596 ms	28.846 ms	StartEnd	Trixi:prolong2interfaces
4.5%	3.439 s	315	10.918 ms	9.216 ms	397.425 μs	560.740 ms	40.637 ms	StartEnd	Trixi:interface flux
3.9%	2.989 s	315	9.490 ms	9.622 ms	84.561 μs	405.742 ms	27.915 ms	StartEnd	Trixi:reset ∂u/∂t
0.0%	3.323 ms	315	10.548 μs	9.980 μs	4.230 μs	32.811 μs	3.878 μs	StartEnd	Trixi:source terms
0.0%	2.947 ms	315	9.354 μs	9.450 μs	3.541 μs	27.480 μs	3.219 μs	StartEnd	Trixi:prolong2boundaries
0.0%	1.384 ms	315	4.393 μs	4.090 μs	2.630 μs	15.110 μs	1.366 μs	StartEnd	Trixi:boundary flux
0.0%	1.341 ms	315	4.257 μs	4.070 μs	2.690 μs	12.210 μs	978 ns	StartEnd	Trixi:prolong2mortars
0.0%	1.315 ms	315	4.175 μs	3.870 μs	2.740 μs	20.200 μs	1.308 μs	StartEnd	Trixi:mortar flux
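
For illustration, a range like the ones listed above can also be emitted by hand with NVTX.jl. This is only a minimal sketch and not the code from this PR: `volume_integral!` and `rhs!` are hypothetical stand-ins, and the range name is chosen just to resemble the rows in the table.

```julia
using NVTX  # assumes NVTX.jl is installed

# Hypothetical stand-in for one phase of a right-hand-side evaluation.
function volume_integral!(du, u)
    @. du = -u
    return du
end

# Hypothetical driver: wraps the phase in a named NVTX range so that a
# tracing profiler can attribute the elapsed time to it.
function rhs!(du, u)
    NVTX.@range "volume integral" begin
        volume_integral!(du, u)
    end
    return du
end

u = rand(Float32, 16)
du = similar(u)
rhs!(du, u)
```

Running such a script under `nsys profile julia script.jl` should make the range appear in the NVTX summary. As far as I understand, NVTX.jl skips the annotation when no profiler is attached, so the wrapper should cost little in normal runs.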

github-actions · 2026-04-01T13:35:22Z

ext/TrixiIntelITTExt.jl

ext/TrixiNVTXExt.jl

vchuravy · 2026-04-01T13:59:58Z

Profiler ran for 956.99 ms, capturing 118539 events.

Host-side activity: calling CUDA APIs took 173.18 ms (18.10% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│   61.46% │   588.2 ms │   293 │   2.01 ms ± 6.24   (   0.0 ‥ 33.8)   │ cuStreamSynchronize     │
│    0.20% │    1.88 ms │   238 │   7.88 µs ± 4.31   (  3.81 ‥ 32.66)  │ cuLaunchKernel          │
│    0.05% │  438.21 µs │     2 │ 219.11 µs ± 58.67  (177.62 ‥ 260.59) │ cuModuleLoadDataEx      │
│    0.02% │  196.93 µs │     9 │  21.88 µs ± 6.46   ( 12.64 ‥ 28.13)  │ cuMemcpyDtoHAsync       │
│    0.01% │  109.43 µs │    19 │   5.76 µs ± 4.09   (  2.15 ‥ 13.59)  │ cuMemAllocFromPoolAsync │
│    0.01% │   87.02 µs │     2 │  43.51 µs ± 9.61   ( 36.72 ‥ 50.31)  │ cuModuleGetFunction     │
│    0.01% │   74.86 µs │     6 │  12.48 µs ± 2.96   ( 10.01 ‥ 17.88)  │ cuMemcpyDtoDAsync       │
│    0.00% │   22.89 µs │     2 │  11.44 µs ± 4.05   (  8.58 ‥ 14.31)  │ cuCtxSynchronize        │
│    0.00% │   11.44 µs │    67 │ 170.81 ns ± 175.13 (   0.0 ‥ 715.26) │ cuCtxPushCurrent        │
│    0.00% │    7.15 µs │    67 │ 106.75 ns ± 126.46 (   0.0 ‥ 476.84) │ cuCtxPopCurrent         │
│    0.00% │    5.72 µs │    67 │   85.4 ns ± 122.43 (   0.0 ‥ 476.84) │ cuCtxGetDevice          │
│    0.00% │  715.26 ns │    12 │   59.6 ns ± 107.83 (   0.0 ‥ 238.42) │ cuDeviceGet             │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 644.59 ms (67.36% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              │
├──────────┼────────────┼───────┼──────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   57.38% │  549.15 ms │    25 │  21.97 ms ± 6.22   (  18.9 ‥ 34.26)  │ gpu_volume_integral_KAkernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 5, 1>, CuDeviceArray<Float32, 5, 1>, Type<P4estMesh<3, 3, Float64, False, PointerWrapper<p8est>, PointerWrapper<p8est_ghost_t>, 5, 2>>, False, CompressibleEulerEquations3D<Float64>, VolumeIntegralFluxDifferencing<flux_ranocha>, DG<LobattoLegendreBasis<Float32, 6, SArray<Tuple<6>, Float32, 1, 6>, CuDeviceArray<Float32, 2, 1>, CuDeviceArray<Float32, 2, 1>>, LobattoLegendreMortarL2<Float32, 6, CuDeviceArray<Float32, 2, 1>, CuDeviceArray<Float32, 2, 1>>, SurfaceIntegralWeakForm<FluxPlusDissipation<flux_central, DissipationLocalLaxFriedrichs<max_abs_speed>>>, VolumeIntegralFluxDifferencing<flux_ranocha>>, NamedTuple<__elements__, Tuple<NamedTuple<__contravariant_vectors__, Tuple<CuDeviceArray<Float32, 6, 1>>>>>)                                                                                                          │
│    3.76% │   36.02 ms │    25 │   1.44 ms ± 0.37   (  1.27 ‥ 2.26)   │ gpu_calc_surface_integral_KAkernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 5, 1>, Type<P4estMesh<3, 3, Float64, False, PointerWrapper<p8est>, PointerWrapper<p8est_ghost_t>, 5, 2>>, CompressibleEulerEquations3D<Float64>, SurfaceIntegralWeakForm<FluxPlusDissipation<flux_central, DissipationLocalLaxFriedrichs<max_abs_speed>>>, DG<LobattoLegendreBasis<Float32, 6, SArray<Tuple<6>, Float32, 1, 6>, CuDeviceArray<Float32, 2, 1>, CuDeviceArray<Float32, 2, 1>>, LobattoLegendreMortarL2<Float32, 6, CuDeviceArray<Float32, 2, 1>, CuDeviceArray<Float32, 2, 1>>, SurfaceIntegralWeakForm<FluxPlusDissipation<flux_central, DissipationLocalLaxFriedrichs<max_abs_speed>>>, VolumeIntegralFluxDifferencing<flux_ranocha>>, Float32, CuDeviceArray<Float32, 5, 1>)                                                                                                                                                   │
│    2.62% │   25.05 ms │    25 │    1.0 ms ± 0.27   (  0.89 ‥ 1.6)    │ gpu_apply_jacobian_KAkernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 5, 1>, Type<P4estMesh<3, 3, Float64, False, PointerWrapper<p8est>, PointerWrapper<p8est_ghost_t>, 5, 2>>, CompressibleEulerEquations3D<Float64>, DG<LobattoLegendreBasis<Float32, 6, SArray<Tuple<6>, Float32, 1, 6>, CuDeviceArray<Float32, 2, 1>, CuDeviceArray<Float32, 2, 1>>, LobattoLegendreMortarL2<Float32, 6, CuDeviceArray<Float32, 2, 1>, CuDeviceArray<Float32, 2, 1>>, SurfaceIntegralWeakForm<FluxPlusDissipation<flux_central, DissipationLocalLaxFriedrichs<max_abs_speed>>>, VolumeIntegralFluxDifferencing<flux_ranocha>>, CuDeviceArray<Float32, 4, 1>)                                                                                                                                                                                                                                                                             │
│    1.14% │    10.9 ms │    25 │ 435.82 µs ± 123.06 (374.56 ‥ 678.54) │ gpu_prolong2interfaces_KAkernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 5, 1>, CuDeviceArray<Float32, 5, 1>, Type<P4estMesh<3, 3, Float64, False, PointerWrapper<p8est>, PointerWrapper<p8est_ghost_t>, 5, 2>>, CompressibleEulerEquations3D<Float64>, CuDeviceArray<Int64, 2, 1>, CuDeviceArray<Tuple<Symbol, Symbol, Symbol>, 2, 1>, OneTo<Int64>)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       │
│    1.05% │   10.02 ms │     5 │    2.0 ms ± 0.0    (   2.0 ‥ 2.01)   │ gpu_max_scaled_speed_KAkernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 1, 1>, CuDeviceArray<Float32, 5, 1>, Type<P4estMesh<3, 3, Float64, False, PointerWrapper<p8est>, PointerWrapper<p8est_ghost_t>, 5, 2>>, False, CompressibleEulerEquations3D<Float64>, DG<LobattoLegendreBasis<Float32, 6, SArray<Tuple<6>, Float32, 1, 6>, CuDeviceArray<Float32, 2, 1>, CuDeviceArray<Float32, 2, 1>>, LobattoLegendreMortarL2<Float32, 6, CuDeviceArray<Float32, 2, 1>, CuDeviceArray<Float32, 2, 1>>, SurfaceIntegralWeakForm<FluxPlusDissipation<flux_central, DissipationLocalLaxFriedrichs<max_abs_speed>>>, VolumeIntegralFluxDifferencing<flux_ranocha>>, CuDeviceArray<Float32, 6, 1>, CuDeviceArray<Float32, 4, 1>)                                                                                                                                                                                                        │
│    1.04% │    9.99 ms │    25 │ 399.46 µs ± 100.3  (345.71 ‥ 612.5)  │ gpu_calc_interface_flux_KAkernel_(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 5, 1>, Type<P4estMesh<3, 3, Float64, False, PointerWrapper<p8est>, PointerWrapper<p8est_ghost_t>, 5, 2>>, False, CompressibleEulerEquations3D<Float64>, SurfaceIntegralWeakForm<FluxPlusDissipation<flux_central, DissipationLocalLaxFriedrichs<max_abs_speed>>>, Type<DG<LobattoLegendreBasis<Float32, 6, SArray<Tuple<6>, Float32, 1, 6>, CuArray<Float32, 2, DeviceMemory>, CuArray<Float32, 2, DeviceMemory>>, LobattoLegendreMortarL2<Float32, 6, CuArray<Float32, 2, DeviceMemory>, CuArray<Float32, 2, DeviceMemory>>, SurfaceIntegralWeakForm<FluxPlusDissipation<flux_central, DissipationLocalLaxFriedrichs<max_abs_speed>>>, VolumeIntegralFluxDifferencing<flux_ranocha>>>, CuDeviceArray<Float32, 5, 1>, CuDeviceArray<Int64, 2, 1>, CuDeviceArray<Tuple<Symbol, Symbol, Symbol>, 2, 1>, CuDeviceArray<Float32, 6, 1>, OneTo<Int64>) │
│    0.19% │    1.78 ms │    25 │  71.33 µs ± 19.74  ( 61.27 ‥ 110.15) │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<5, Tuple<OneTo<Int64>, OneTo<Int64>, OneTo<Int64>, OneTo<Int64>, OneTo<Int64>>>, NDRange<5, DynamicSize, DynamicSize, CartesianIndices<5, Tuple<OneTo<Int64>, OneTo<Int64>, OneTo<Int64>, OneTo<Int64>, OneTo<Int64>>>, CartesianIndices<5, Tuple<OneTo<Int64>, OneTo<Int64>, OneTo<Int64>, OneTo<Int64>, OneTo<Int64>>>>>, CuDeviceArray<Float32, 5, 1>, Broadcasted<CuArrayStyle<5, DeviceMemory>, Tuple<OneTo<Int64>, OneTo<Int64>, OneTo<Int64>, OneTo<Int64>, OneTo<Int64>>, identity, Tuple<Float32>>)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    │
│    0.07% │  692.84 µs │    20 │  34.64 µs ± 8.19   ( 30.99 ‥ 53.88)  │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 1, 1>, Broadcasted<CuArrayStyle<1, DeviceMemory>, Tuple<OneTo<Int64>>, muladd, Tuple<Float64, Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>, Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>>>)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   │
│    0.05% │  480.18 µs │    25 │  19.21 µs ± 3.26   (  17.4 ‥ 26.7)   │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 1, 1>, Broadcasted<CuArrayStyle<1, DeviceMemory>, Tuple<OneTo<Int64>>, muladd, Tuple<Float32, Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>, Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>>>)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   │
│    0.03% │  246.05 µs │    20 │   12.3 µs ± 3.23   ( 10.49 ‥ 18.84)  │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 1, 1>, Broadcasted<CuArrayStyle<1, DeviceMemory>, Tuple<OneTo<Int64>>, _, Tuple<Float32, Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>>>)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           │
│    0.01% │  134.94 µs │     5 │  26.99 µs ± 7.84   ( 23.13 ‥ 41.01)  │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 1, 1>, Broadcasted<CuArrayStyle<1, DeviceMemory>, Tuple<OneTo<Int64>>, _, Tuple<Float64, Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>>>)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           │
│    0.00% │    42.2 µs │     4 │  10.55 µs ± 0.36   ( 10.01 ‥ 10.73)  │ partial_mapreduce_grid(INFINITE_OR_GIANT, _, Bool, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>, Val<true>, CuDeviceArray<Bool, 2, 1>, CuDeviceArray<Float32, 1, 1>)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        │
│    0.00% │   39.34 µs │     6 │   6.56 µs ± 2.86   (  5.25 ‥ 12.4)   │ [copy device to device memory]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    │
│    0.00% │   16.69 µs │     9 │   1.85 µs ± 0.2    (  1.67 ‥ 2.15)   │ [copy device to pageable memory]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  │
│    0.00% │   12.64 µs │     5 │   2.53 µs ± 0.13   (  2.38 ‥ 2.62)   │ partial_mapreduce_grid(identity, max, Float32, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>, Val<true>, CuDeviceArray<Float32, 1, 1>, CuDeviceArray<Float32, 1, 1>)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         │
│    0.00% │   10.73 µs │     4 │   2.68 µs ± 0.12   (  2.62 ‥ 2.86)   │ partial_mapreduce_grid(identity, _, Bool, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, Val<true>, CuDeviceArray<Bool, 2, 1>, CuDeviceArray<Bool, 2, 1>)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        │
└──────────┴────────────┴───────┴──────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

NVTX ranges:
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                     │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────┤
│   57.60% │  551.19 ms │    25 │  22.05 ms ± 6.23   ( 18.97 ‥ 34.37)   │ Trixi.volume integral    │
│    3.95% │   37.78 ms │    25 │   1.51 ms ± 0.37   (  1.34 ‥ 2.36)    │ Trixi.surface integral   │
│    2.81% │   26.86 ms │    25 │   1.07 ms ± 0.27   (  0.95 ‥ 1.7)     │ Trixi.Jacobian           │
│    1.39% │   13.29 ms │    25 │ 531.72 µs ± 184.32 (428.92 ‥ 1206.64) │ Trixi.prolong2interfaces │
│    1.22% │   11.72 ms │    25 │  468.9 µs ± 137.85 (396.01 ‥ 941.28)  │ Trixi.interface flux     │
│    0.28% │    2.67 ms │    25 │  106.6 µs ± 25.22  ( 89.41 ‥ 173.81)  │ Trixi.reset ∂u/∂t        │
│    0.01% │  102.28 µs │    25 │   4.09 µs ± 0.71   (   3.1 ‥ 6.2)     │ Trixi.source terms       │
│    0.01% │   83.68 µs │    25 │   3.35 µs ± 0.97   (  2.38 ‥ 5.96)    │ Trixi.prolong2boundaries │
│    0.01% │   67.71 µs │    25 │   2.71 µs ± 0.28   (  1.91 ‥ 3.34)    │ Trixi.prolong2mortars    │
│    0.01% │   62.47 µs │    25 │    2.5 µs ± 0.21   (  1.91 ‥ 2.86)    │ Trixi.mortar flux        │
│    0.01% │   61.27 µs │    25 │   2.45 µs ± 0.26   (  2.15 ‥ 3.1)     │ Trixi.boundary flux      │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────┘

The output above comes from CUDA.@profile.

vchuravy · 2026-04-01T14:01:10Z

I need to investigate why I am getting:

┌ Error: Unexpected CUPTI marker color flag 0. Please file an issue.
└ @ CUDA.Profile ~/.julia/packages/CUDA/Il00B/src/profile.jl:596

codecov · 2026-04-02T01:16:01Z

Codecov Report

❌ Patch coverage is 17.24138% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.75%. Comparing base (c9e6a85) to head (0056332).

Files with missing lines	Patch %	Lines
ext/TrixiIntelITTExt.jl	0.00%	11 Missing ⚠️
ext/TrixiNVTXExt.jl	0.00%	7 Missing ⚠️
src/auxiliary/auxiliary.jl	45.45%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2908      +/-   ##
==========================================
- Coverage   97.07%   96.75%   -0.33%     
==========================================
  Files         610      612       +2     
  Lines       47500    47531      +31     
==========================================
- Hits        46110    45985     -125     
- Misses       1390     1546     +156

Flag	Coverage Δ
unittests	`96.75% <17.24%> (-0.33%)`	⬇️


ranocha

Can you please add a (brief) section to the documentation, e.g., https://trixi-framework.github.io/TrixiDocumentation/stable/performance/, describing how to use these tools for benchmarking (or at least mentioning that Trixi.jl supports them and linking to other docs for further information)?

ranocha · 2026-04-02T15:05:53Z

Please request my review when you've finished this PR.

github-actions bot reviewed Apr 1, 2026

ext/TrixiIntelITTExt.jl (outdated, resolved)

github-actions bot reviewed Apr 1, 2026

ext/TrixiNVTXExt.jl (outdated, resolved)

ranocha reviewed Apr 2, 2026

vchuravy mentioned this pull request Apr 2, 2026

Use trixi_timeit_ext also for rhs and calculate_dt #2911

Merged

Add support for tracing profilers like Nvidia NSight System and Intel…

e2ee5e0

… VTune Delay init of domain fixup: formatting add color

vchuravy force-pushed the vc/nvtx branch from 4ecda7d to e2ee5e0 Compare April 2, 2026 12:03

vchuravy mentioned this pull request Apr 2, 2026

Use AcceleratedKernels.mapreduce in max_scaled_speed and integrate_via_indices #2882

Merged

add first documentation draft

0056332

vchuravy added the gpu label Apr 2, 2026

vchuravy wants to merge 2 commits into main from vc/nvtx

2 participants

github-actions bot commented Apr 1, 2026

Review checklist

Purpose and scope

Code quality

Documentation

Testing

Performance

Verification
