
render: Improve hairline strokes and scaling strokes on WebGL and WGPU #23011

Merged
kjarosh merged 1 commit into ruffle-rs:master from darktohka:bugfix/hairline-strokes
May 4, 2026


@darktohka
Contributor

@darktohka darktohka commented Feb 11, 2026

This pull request improves both hairline strokes and scaling strokes on the Web (WGPU, WebGL renderers) and Desktop (WGPU renderer) targets.

The main idea is to keep track of the scale at which each graphic is tessellated on the rendering backends. The tessellated shapes are then stored in a tessellation cache, a simple LRU cache that keeps the most recently used tessellations (at most 4 per shared graphic). This means that the last 4 distinct scale buckets stay cached. A shape is only retessellated if it grows or shrinks by 2x relative to a cached variant (controlled by RETESSELLATION_SCALE_THRESHOLD).
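A minimal sketch of such a per-graphic cache, for illustration only. It is not the code from this PR: `ScaleCache`, `CachedTessellation`, `MAX_CACHED_SCALES`, and the `u32` mesh handle are hypothetical names, and only the "4 entries" and "2x" figures come from the description above. Scales are assumed to be positive.

```rust
/// Hypothetical constants; the values follow the "4 max" and "2x" figures
/// quoted in the PR description.
const MAX_CACHED_SCALES: usize = 4;
const RETESSELLATION_SCALE_THRESHOLD: f32 = 2.0;

/// One cached tessellation; `mesh` stands in for a backend shape handle.
#[derive(Debug, Clone, PartialEq)]
pub struct CachedTessellation {
    pub scale: f32,
    pub mesh: u32,
}

/// Cache for one shared graphic, most recently used entry first.
#[derive(Default)]
pub struct ScaleCache {
    entries: Vec<CachedTessellation>,
}

impl ScaleCache {
    /// Returns a cached mesh whose scale is within the threshold of `scale`.
    /// Otherwise tessellates a new one and evicts the least recently used entry.
    pub fn get_or_tessellate(
        &mut self,
        scale: f32,
        tessellate: impl FnOnce(f32) -> u32,
    ) -> u32 {
        let hit = self.entries.iter().position(|e| {
            // Ratio between the larger and the smaller of the two scales.
            let ratio = if scale > e.scale { scale / e.scale } else { e.scale / scale };
            ratio < RETESSELLATION_SCALE_THRESHOLD
        });
        let entry = match hit {
            Some(index) => self.entries.remove(index),
            None => CachedTessellation { scale, mesh: tessellate(scale) },
        };
        // Move the used entry to the front, then drop anything past the limit.
        self.entries.insert(0, entry);
        self.entries.truncate(MAX_CACHED_SCALES);
        self.entries[0].mesh
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

In this sketch a request at scale 1.5 reuses a tessellation made at scale 1.0, while a request at scale 2.0 triggers a new tessellation. Keeping one cache per shared graphic, rather than per instance, is what lets several instances of the same graphic reuse each other's results.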

When a shape grows disproportionately, it is re-tessellated. The tessellation precision (tolerance) is determined by the scale: the larger the scale, the more precise the tessellation. Small objects are expected to show less detail anyway.
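One way to read "precision is determined by the scale" is an inverse relationship between scale and flattening tolerance. The sketch below assumes exactly that. `BASE_TOLERANCE` and the clamp bounds are made-up values, not taken from the PR.

```rust
/// Hypothetical base tolerance, in shape units, at a scale of 1.0.
const BASE_TOLERANCE: f32 = 0.1;

/// Flattening tolerance for a given on-screen scale.
/// Assumption: tolerance shrinks in proportion to the scale, so a shape
/// drawn 4x larger is tessellated with a 4x tighter tolerance.
pub fn tolerance_for_scale(scale: f32) -> f32 {
    // Fall back to 1.0 for zero, negative, or non-finite scales.
    let scale = if scale.is_finite() && scale > 0.0 { scale } else { 1.0 };
    // Clamp so that extreme scales neither explode the vertex count
    // nor collapse curves into a handful of segments.
    (BASE_TOLERANCE / scale).clamp(0.001, 1.0)
}
```

A value computed this way would be handed to the tessellator in place of a fixed tolerance. The clamp is a design choice of this sketch to keep both extremes bounded.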

As an optimization, the tessellation cache is shared between graphic instances that use the same graphic.

Hairline stroke rendering is also improved.

This fixes issues such as (tested them): #18852 #21803 #751 #7369 #14268 #13984 #1955 #3216 #9044 #2023 #11704 #12360 #14551 #20211 #1412
Partially (composite issues - not all from these are fixed, just the strokes): #10524 #12057
Could not test (site locks, missing SWF, etc): #20345 #3216 #18855 #1625 #9309

Relevant technical discussions: #7042 #7369 #751

Before/after comparison screenshots (4 pairs; images not included).

Copilot AI review requested due to automatic review settings February 11, 2026 23:52

Copilot AI left a comment


Pull request overview

This pull request significantly improves rendering quality of hairline and scaled strokes in Ruffle's WebGL and WGPU backends by implementing scale-aware tessellation. The implementation adds an LRU tessellation cache that stores up to 4 different tessellations per graphic at different scales, retessellating only when shapes grow or shrink by more than 2x. This approach addresses numerous long-standing rendering issues where strokes appeared too thick or too thin when graphics were scaled.

Changes:

  • Introduces TessellationCache with LRU eviction to cache tessellated shapes at different scales
  • Adds register_shape_with_scale() method to render backends to support scale-aware tessellation
  • Modifies tessellator to adjust hairline stroke width and tessellation tolerance based on scale
  • Updates Graphic display objects to calculate current scale and retrieve or create appropriately scaled tessellations

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| core/src/tessellation_cache.rs | New LRU cache for storing up to 4 tessellated shapes per graphic at different scales |
| core/src/lib.rs | Adds tessellation_cache module to the core library |
| core/src/display_object/graphic.rs | Integrates tessellation cache; calculates scale from transform matrix and retrieves/creates scaled tessellations |
| render/src/backend.rs | Adds register_shape_with_scale() trait method with default implementation |
| render/wgpu/src/backend.rs | Implements scale-aware shape registration for WGPU backend |
| render/webgl/src/lib.rs | Implements scale-aware shape registration for WebGL backend |
| render/src/tessellator.rs | Adjusts hairline stroke width and tessellation tolerance based on scale to prevent artifacts |
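To illustrate what "adjusts hairline stroke width based on scale" could mean, here is a hedged sketch. It assumes a hairline should cover roughly a fixed number of screen pixels regardless of how the shape is scaled. `hairline_width_for_scale` and `target_pixels` are hypothetical and do not come from the PR's tessellator code.

```rust
/// Stroke width, in shape units, that covers about `target_pixels` on screen
/// once the shape is drawn at `scale`.
/// Assumption: the screen width of a stroke is `width * scale`.
pub fn hairline_width_for_scale(scale: f32, target_pixels: f32) -> f32 {
    // Fall back to 1.0 for zero, negative, or non-finite scales.
    let scale = if scale.is_finite() && scale > 0.0 { scale } else { 1.0 };
    target_pixels / scale
}
```

Under this assumption, a shape drawn at 2x gets a tessellated hairline half as wide in shape units, so it neither thickens when the shape is enlarged nor disappears when it is shrunk.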


Comment thread render/src/tessellator.rs
Comment thread render/src/tessellator.rs
Comment thread core/src/display_object/graphic.rs Outdated
Comment thread core/src/tessellation_cache.rs
@darktohka
Contributor Author

darktohka commented Feb 12, 2026

Tests that have improved:

  • visual/simple_shapes/strokes/scale: Large black box fixed, triangular circle fixed, very slim square fixed
  • from_shumway/acid/acid-stroke-0: Fixed invisible square
  • from_gnash/misc-ming.all/shape_test: Fixed very thick line

Slightly different but visually indistinguishable (mostly due to precision increasing):

  • visual/cache_as_bitmap/scroll_rect_scaled: Edges are differently defined
  • visual/cache_as_bitmap/masks: Edge of squares are differently defined
  • from_shumway/acid/acid-video: The visual's position is slightly off but visually indistinguishable; the Windows CI runner also renders this test differently from the Linux and Mac CI runners (3 pixels difference)
  • from_shumway/acid/acid-mask: Edges are differently defined
  • from_shumway/acid/acid-clip: Edges are differently defined
  • from_shumway/acid/acid-blend-2: Edges are differently defined
  • from_shumway/3_joystick: Different antialiasing around circle
  • avm1/edittext_stylesheet: Edges are differently defined
  • visual/edittext/edittext_caret_empty: Edges are differently defined
  • visual/edittext/edittext_gutter: Edges are differently defined
  • visual/edittext/edittext_underline_scale2: Edges are differently defined

Broken:

  • from_shumway/acid/acid-small: Wheels have a different width than they should; it subjectively looks better, but the test is marked as broken

@Lord-McSweeney Lord-McSweeney added A-rendering Area: Rendering & Graphics T-fix Type: Bug fix (in something that's supposed to work already) labels Feb 12, 2026
Comment thread tests/tests/swfs/visual/edittext/edittext_caret_empty/test.toml Outdated
Comment thread tests/tests/swfs/from_shumway/acid/acid-blend-2/test.toml Outdated
@kjarosh
Member

kjarosh commented Feb 12, 2026

@darktohka Just a general remark about visual tests: the tolerance/max_outliers are set so that tests pass on CI and on devs' machines. You should either:

  1. run all tests on master and make them pass (in a separate PR), or
  2. do not look at your local results, only at CI.

If you make changes and edit tests to pass locally in the same PR, it will result in an unmergeable mess. If there are any changes to tests, we want them to be well documented, and well-thought-out.

I'd recommend sticking to the 2nd option for now. It should be relatively easy: set 0 tolerance, push, download image diffs, then set appropriate tolerance and outliers based on the image diffs.

At the end of the day, if your PR brings us closer to Flash Player, you shouldn't need to increase tolerance/outliers. If you do, it could mean a bad test that didn't have output from Flash Player. I can then take care of those tests and fix them before merging this PR.

TL;DR: my recommendation is to revert changes to tolerance/outliers in tests and see what happens.
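For readers unfamiliar with the two knobs, the sketch below shows one plausible reading of how they interact: `tolerance` is the largest per-channel difference that is ignored, and `max_outliers` is how many values may exceed it before the comparison fails. This is an assumption for illustration, not the test framework's actual code, and `count_outliers` and `images_match` are hypothetical names.

```rust
/// Counts channel values whose absolute difference exceeds `tolerance`.
/// Both slices are raw channel data (for example RGBA bytes).
pub fn count_outliers(expected: &[u8], actual: &[u8], tolerance: u8) -> usize {
    expected
        .iter()
        .zip(actual)
        .filter(|(e, a)| e.abs_diff(**a) > tolerance)
        .count()
}

/// The images match if they have the same size and the number of
/// outliers stays within `max_outliers`.
pub fn images_match(expected: &[u8], actual: &[u8], tolerance: u8, max_outliers: usize) -> bool {
    expected.len() == actual.len()
        && count_outliers(expected, actual, tolerance) <= max_outliers
}
```

Starting from a tolerance of 0, as recommended above, makes every differing value an outlier, which is what makes the CI image diffs useful for choosing realistic thresholds.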

@darktohka
Contributor Author

Good point.

I will remove the changes to the tests from this PR and keep only the functional changes. Let's see what happens.

@darktohka darktohka force-pushed the bugfix/hairline-strokes branch from 6d8a2af to dd53f0b Compare February 12, 2026 23:00
@kjarosh
Member

kjarosh commented Feb 12, 2026

I think those failures are caused by the fact that we are using Ruffle's and not FP's output. I'll take a look in my free time and I'll try fixing them up.

@danielhjacobs
Contributor

danielhjacobs commented Feb 13, 2026

Worth noting that something similar seems to be true for #22961. That PR caused a bunch of changes to images because lyon changed a little bit about its rendering methods and most of those tests were from Ruffle, not FP. I fixed that up by downloading the images from CI, but ideally we'd replace those tests with FP images and then see whether the lyon update improves their consistency with Flash.

@kjarosh
Member

kjarosh commented Feb 17, 2026

Made 2 PRs which fix tests failing here:

Hopefully after merging them and rebasing this PR, they should stop failing, and they should even improve a bit (but don't worry about that, we can lower tolerance later).

After those, there are a few known-failure tests still failing. I'll try looking into them, but I think we're just closer to Flash Player and that's why they are failing.

@kjarosh
Member

kjarosh commented Feb 18, 2026

@darktohka Can you rebase the PR on top of main? The majority of tests should stop failing.

@darktohka darktohka force-pushed the bugfix/hairline-strokes branch from dd53f0b to fee6a8e Compare February 18, 2026 21:21
@kjarosh
Member

kjarosh commented Feb 18, 2026

Only one non-known-failure test is failing: from_shumway/acid/acid-clip. Looks like it could be a regression? After this PR we're farther away from Flash Player.

@darktohka
Contributor Author

Testing from_shumway/acid/acid-clip, we get the following results:

| Scenario | Branch | Test Target | Outliers | Max Difference |
| --- | --- | --- | --- | --- |
| Tolerance = 125 | Master | from_shumway/acid/acid-clip | 6598 | 144 |
| Tolerance = 125 | Current PR | from_shumway/acid/acid-clip | 6589 | 144 |
| Tolerance = 64 | Master | from_shumway/acid/acid-clip | 120 | 144 |
| Tolerance = 64 | Current PR | from_shumway/acid/acid-clip | 135 | 144 |

So the CI fails because it reached the threshold of 125 outliers when tolerance = 64.

Here are the image differences:

Master branch
output difference-color-linux-Vulkan-master-branch

Current PR

output difference-color-linux-Vulkan-this-pr

The other tests are improved and bring us closer to FP when you compare the current output with the Flash output. They fail only because the output moves further from the master-branch Ruffle output, which is currently incorrect.

I don't think we're going to get any closer (scientifically, 1-to-1) than this by adjusting the tessellation-based backend, since it's going to be lossy either way. It's very dependent on the tolerance and how the triangles are calculated. It's fundamentally rendering differently than Flash Player did.

However, I believe the impact we can make with actual content and games is great. So much content is improved with this change.

I have another PR in the works that will implement LineScaleMode for the WGPU/WebGL backends, but that doesn't get any closer 1-to-1 either:

Screenshot From 2026-02-19 01-32-49

@ncuxonaT

ncuxonaT commented Mar 5, 2026

This should fix the missing ninja body and barely visible outlines of coins and mines in the N, right?
Ruffle:
N_ruffle
FP:
N_flashplayer

swf: N.zip

@darktohka
Contributor Author

darktohka commented Mar 5, 2026

> This should fix the missing ninja body and barely visible outlines of coins and mines in the N, right?
>
> swf: N.zip

It fixes the barely visible outlines and improves on the mines, but does not improve the player character:

image

Strangely enough, JPEXS doesn't like the player character either:
image

@yangyangdaji

yangyangdaji commented Mar 15, 2026

Does this PR supersede and close #9981?

Could you test the case from #9981 (comment)?

@kjarosh
Member

kjarosh commented May 4, 2026

@darktohka Sorry for the delay, I should have taken care of it earlier... Code looks great! I would have implemented it roughly the same way. The only remaining thing is tests, which I will take care of and push changes here.

@kjarosh kjarosh force-pushed the bugfix/hairline-strokes branch 4 times, most recently from c763caa to 1f57a7d Compare May 4, 2026 23:00
- Add tessellation cache for storing previous tessellation results
- Base tessellation tolerance and stroke width on scale
- Retessellate objects if their scale changes by 2x

@kjarosh kjarosh force-pushed the bugfix/hairline-strokes branch from 1f57a7d to c71e7a8 Compare May 4, 2026 23:03
@kjarosh
Member

kjarosh commented May 4, 2026

So I tested out the patch on a lot of content. Mostly it doesn't seem to negatively impact performance, sometimes it even affects performance positively—I guess it's because we're not only increasing detail for large scales, but also decreasing detail for small scales.

It still doesn't fix the scaling 100%: it looks like Flash uses the x and y scales separately rather than a single combined scale. Hairline strokes also seem off in some cases.
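The difference between per-axis scales and one combined scale can be sketched as follows. `Matrix2` is a hypothetical stand-in for a 2D transform with the translation omitted. The thread does not say how the PR derives its combined scale, so the square root of the absolute determinant is used here as one common choice, purely as an assumption.

```rust
/// Linear part of a 2D affine transform, using the usual a/b/c/d naming:
/// the x axis maps to (a, b) and the y axis maps to (c, d).
#[derive(Clone, Copy)]
pub struct Matrix2 {
    pub a: f32,
    pub b: f32,
    pub c: f32,
    pub d: f32,
}

impl Matrix2 {
    /// Length of the transformed x axis.
    pub fn scale_x(self) -> f32 {
        self.a.hypot(self.b)
    }

    /// Length of the transformed y axis.
    pub fn scale_y(self) -> f32 {
        self.c.hypot(self.d)
    }

    /// One combined factor: the square root of the area scale.
    pub fn combined_scale(self) -> f32 {
        (self.a * self.d - self.b * self.c).abs().sqrt()
    }
}
```

For a shape stretched 4x horizontally and left unchanged vertically, the per-axis scales are 4 and 1 while this combined scale is 2. A stroke sized from the combined value is therefore too thin along one axis and too thick along the other, which is consistent with the remaining mismatch described above.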

However, this PR improves strokes in the majority of cases, and architecturally we're going in the right direction: retessellation cache is the right solution IMO. There are small things that could be improved with the code, but it doesn't make sense to block on them, they can be fixed as a follow-up by somebody.

Member

@kjarosh kjarosh left a comment


LGTM, thank you! That improves rendering by a lot and fixes one of the most annoying issues in Ruffle.

@kjarosh kjarosh merged commit 1c7d411 into ruffle-rs:master May 4, 2026
26 checks passed
Hancock33 added a commit to Hancock33/batocera.piboy that referenced this pull request May 9, 2026
------------------------------------------------------------------------------------------
dolphin-emu.mk b0eb643c614ddeda6400dc4033d58934a20ba5eb # Version: Commits on May 05, 2026
------------------------------------------------------------------------------------------
Merge pull request #14642 from SuperSamus/cpp-move-fixup-nocubeb

Fixup #14565 (compilation with `-DENABLE_CUBEB=OFF`),

-----------------------------------------------------------------------------------
eden.mk 4f4c298a39fee558f2a593157192afe7f821014c # Version: Commits on May 05, 2026
-----------------------------------------------------------------------------------
[hle, service] fix errors related to race conditions triggering under SMG1 and SMG2 (#3927)

-----------------------------------------------------------------------------------------------
lindbergh-loader.mk 0af606d845b70339c335785c0eba68b47b78df3c # Version: Commits on May 05, 2026
-----------------------------------------------------------------------------------------------
Update Patreon link in README.md,

--------------------------------------------------------------------------------------
openmsx.mk 22ec19b72a717446a18364fecda8e8132e0e0880 # Version: Commits on May 05, 2026
--------------------------------------------------------------------------------------
Update Node.js 20 actions to Node.js 24 versions.,

-----------------------------------------------------------------------------------
play.mk c9eccec03d1ee6840a3b818153df7fea7a6c142c # Version: Commits on Apr 16, 2026
-----------------------------------------------------------------------------------
FrameDebugger: Set initial file picker directory.,

-------------------------------------------------------------------------------------
ppsspp.mk 462b57bc1a21417b097acd06711935bdc9334c43 # Version: Commits on May 05, 2026
-------------------------------------------------------------------------------------
Merge pull request #21642 from hrydgard/dinput-code-cleanup

UWP keyboard fix, DInput code cleanup,

------------------------------------------------------------------------------------
rpcs3.mk d93d9b2c5aa859d1cf2f1381cefd204fb022163a # Version: Commits on May 05, 2026
------------------------------------------------------------------------------------
game_list: Fix ISO cache bypass in is_from_yml branch for multi-game ISOs (#18683)

Fixes regression from #18546 and #18679.

## Problem

The is_from_yml ISO branch constructed iso_archive unconditionally,

bypassing the cache check inside add_game, making the cache write-only

for yml-sourced ISOs.

## Fix

Added a lightweight index cache entry (iso_path + "//index") storing the

subdir list + mtime. On hit, skips archive construction entirely. On

miss, walks as before and writes the index,

-----------------------------------------------------
ryujinx.mk 1.3.287 # Version: Commits on May 05, 2026
-----------------------------------------------------
1.3.287

--------------------------------------------------------------------------------------
shadps4.mk 4d3827c34949d034cc47e86c943b7fd9318c48ae # Version: Commits on May 05, 2026
--------------------------------------------------------------------------------------
Avoid out-of-bounds array access when checking custom color for TV Remote (#4356),

---------------------------------------------------------------------------------------
touchhle.mk f886c577758f596b2a77ed599a9e1a3597540cb7 # Version: Commits on May 04, 2026
---------------------------------------------------------------------------------------
Remove edits to SDLActivity.java

It seems that debug builds work fine without it? I'm not sure why it was

breaking before...

Change-Id: Ibaf1cdaf55a91bdb12c02d5d5ac423ba1d112194,

-------------------------------------------------
vice.mk r46091 # Version: Commits on May 04, 2026
-------------------------------------------------
null

-------------------------------------------------------------------------------------------
xenia-canary.mk 80f2b535e9736a9772de528952877e912c328aea # Version: Commits on Feb 15, 2026
-------------------------------------------------------------------------------------------
[Kernel] Added KeSaveFloatingPointState and KeRestoreFloatingPointState from nukernel,

-----------------------------------------------------------------------------------------
xenia-edge.mk ba5fd0f4149a99e8665e989d53bbd2c6b9b7bc91 # Version: Commits on May 05, 2026
-----------------------------------------------------------------------------------------
[GPU/macOS] Tighten vblank and present pacing with mach_wait_until,

-----------------------------------------------------------------------------------
ymir.mk 374c8be5c37eb3853a9f0fc2b1eb5c263c725fe2 # Version: Commits on May 05, 2026
-----------------------------------------------------------------------------------
chore: Update Patreon supporters list,

---------------------------------------------------------------
ruffle.mk nightly-2026-05-05 # Version: Commits on May 05, 2026
---------------------------------------------------------------
## What's Changed

* ci: Add support for release version bumps other than nightly by @kjarosh in ruffle-rs/ruffle#23618

* chore: Bump esbuild version in package-lock.json by @torokati44 in ruffle-rs/ruffle#23616

* chore: Bump rollup package version in package-lock.json by @torokati44 in ruffle-rs/ruffle#23615

* chore: Bump webpack-cli to 7 in web/ by @torokati44 in ruffle-rs/ruffle#23613

* render: Improve hairline strokes and scaling strokes on WebGL and WGPU by @darktohka in ruffle-rs/ruffle#23011

## New Contributors

* @darktohka made their first contribution in ruffle-rs/ruffle#23011

**Full Changelog**: ruffle-rs/ruffle@nightly-2026-05-04...nightly-2026-05-05,

-----------------------------------------------------------------------------------------
catacombgl.mk a18035bf899d6f3093b487725b3c6e3867365231 # Version: Commits on May 05, 2026
-----------------------------------------------------------------------------------------
Adapt Catacomb 3-D menu instructions for game controller,

------------------------------------------------------------------------------------
cdogs.mk 3483ad394587f205f467a0d819b435395145b879 # Version: Commits on May 05, 2026
------------------------------------------------------------------------------------
Fix vehicle head drawing,

------------------------------------------------------------------------------------------
devilutionx.mk 3eb2b44e5a572c7ae1aaf8eaaa3856d188110d88 # Version: Commits on May 01, 2026
------------------------------------------------------------------------------------------
Ensure that buffered player info gets processed,

------------------------------------------------------------------------------------------
fallout2-ce.mk e42d8021c1fddc51ede3216f89cc9cdc75e07dc5 # Version: Commits on May 05, 2026
------------------------------------------------------------------------------------------
WIP Mapper implementation (#438)

* Add mapper CMakeTarget, tool for mapping function names to originals, load/save toolbar & update_art implemented

* edit_mapper function + stubs

* Rename exe to mapper-ce

* load_lbm_to_buf

* Add comments for read/write functions in db.h

* load_dialog, save_dialog, save_as, info_dialog and some other functions

* Fix LBM loading

* Fix mouse input not working on initial empty map, changed error in partyMemberRecoverLoadInstance to print to log, matching vanilla

* mapper.cc: basic hi-res support, NULL->nullptr

* load_lbm_to_buf rewrite, print_toolbar_name background fix

* Stubs for enter/exit playmode, art slot indexes fix, map_scr_toggle_hexes

* Fix memory corruption on screen_width > 640, fix various UI offset bugs

* mapper.cc: UI code style, toggle button fixes, rotation keys, edit button placeholders, PAGEUP key fix

* Elevation display fix, object type switching

* Spatial script placement and display, basic object selection

* Fixed dragging objects, block object showing, add all missing cases in edit_mapper with stubs, move all keys codes to constants

* chore: auto-format with clang-format

* Fix non-win builds

* Add stub calls from edit_mapper, fix objects being incorrectly deleted when unselected, fix tile number display

* Fix compile on Linux

* Attempt to fix iOS signing error

* Placing of objects and tiles, F12 to erase map, bug fixes

* Fixed block object toggling logic and add missing switch cases to edit_mapper

* Object editing added, 'p' to scroll palette fixed

* Add new files to CMakeLists

* Attempt to fix some colors + alignment in critter edit window

* chore: auto-format with clang-format

* Linux build fix attempt

* Critter inventory editing

* Vanilla grid-based inventory item picker

* Review fixes

* More review fixes and const correctness

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>,

----------------------------------------------------------------------------------------
retroarch.mk 14a5cc00a050b3d253d42ae0afa284c4a6fb129f # Version: Commits on May 05, 2026
----------------------------------------------------------------------------------------
Fix Dolphin autostate load hang by sleeping a bit first,

----------------------------------------------------------
bgfx.mk v1.143.9248-539 # Version: Commits on May 05, 2026
----------------------------------------------------------
Fix cmake syntax error when compiling shaders in Debug mode,

------------------------------------------------------------
libdof.mk c02135e90ce1acd13a5ab21a4878b1d1820bbe49 NULL-NULL
------------------------------------------------------------
Moved Permanently,

---------------------------------------------------------------------------------------
vpinball.mk 034f9408539c8bc39866305fdb9cd57721961816 # Version: Commits on May 04, 2026
---------------------------------------------------------------------------------------
BGFX: use camera relative rendering to support low precision platform (Meta Quest),

----------------------------------------------------
glslang.mk 16.3.0 # Version: Commits on May 01, 2026
----------------------------------------------------
Deprecation Notice:

* Deprecate the HLSL front-end. See issue #4210 for details.

Changes in this release:

* Support GL_NV_explicit_typecast

* Raise the maximum limit for specialization constant IDs

* Add explicit 8-bit and 16-bit type support for bitfieldReverse

* Implement system include directives for the standalone wrapper

* Check for invalid usage of gl_WorkGroupSize components

* HLSL: Provide string error context only if token is a string

* Fix layoutDescriptorStride bitfield truncation for large stride values

* GL_EXT_long_vector with 2-4 components no longer require LongVector capability

* Fix alignment of guard blocks

* Fix ShaderDebugInfo having invalid line numbers when generating SPIRV 1.0

* Replace ostringstream with string concat during #include preprocessing

* Check for bad parameters on long vector type

* HLSL: Check for bad integer argument on Load*, Store*, Interlocked*

* HLSL: handle type error for ternary operator

* HLSL: Ensure scope is popped even when method body fails to parse

* Avoid unnecessary copies in SpirvIntrinsics.cpp

* Unconditionally emit debug source for include files when using non-semantic debug info

* Support bfloat16 and float8 tensors

* Add small type capabilities for GLSL.STD.450

* Add initial support for NonSemantic.Shader.DebugInfo 101

* Fix access chains for GL_ARM_tensors with raw descriptor heap accesses

* Support GL_KHR_compute_shader_derivatives

* Require a quad or linear layout qualifier to be specified for GL_KHR_compute_shader_derivatives

* Support SPV_KHR_constant_data and SPV_KHR_abort

----------------------------------------------------------------------------------------
doomretro.mk 827c09d875a53f4a6ad6464d30448c51496ab6b9 # Version: Commits on May 05, 2026
----------------------------------------------------------------------------------------
Update releasenotes.md,

--------------------------------------------------------------------------------------
yquake2.mk f8939a0561ac992837ab006c144fd972d9cd1628 # Version: Commits on May 04, 2026
--------------------------------------------------------------------------------------
game: scale ammo on fire

Scale explosion effect is unsupported by protocol.,

------------------------------------------------------------------------------------------
xash3d-fwgs.mk e6a44b70e08c379fc6dc059ae7cfeca799fb7c58 # Version: Commits on May 04, 2026
------------------------------------------------------------------------------------------
engine: client: always load client.dll last to crash on nullptr in mods that fetch cvar pointers early, add comment for anyone who would modify this file,

--------------------------------------------------------------------------------------------------
libretro-beetle-psx.mk 882e55b8cb3a1b4c3b91d71a2c156a9b33f279b8 # Version: Commits on May 05, 2026
--------------------------------------------------------------------------------------------------
mednafen: drop clamp.h; fold + optimize audio saturation; fix Vulkan static-after-extern shadow

Two changes that travel together because they touch the same audit

pass.

(1) clamp.h dropped, callers folded inline

==========================================

clamp.h was a 29-line file with one 4-line static inline

function (`clamp(int32_t *val, ssize_t min, ssize_t max)`) that

saturated a value in place. 12 call sites across spu.c (7),

cdc.cpp (4), and gte.c (1). All but one saturated to the

signed 16-bit audio range [-32768, 32767]; the gte.c outlier

saturates to [-32768 + lm * 32768, 32767] where lm is a bit

from the GTE opcode. Folded inline at every call site, where

each fold also gets a comment explaining what kind of

saturation is happening (audio output sample, ADPCM IIR-filter

intermediate, GTE projected coordinate, etc.).

While auditing the call sites for the fold, three real

optimisation opportunities surfaced:

 (a) cdc.cpp ApplyVolume short-circuit on Muted:

     Historical body computed L/R volume-matrix mix

     unconditionally, ran two clamps, then conditionally zeroed

     both channels if Muted was set. Muted is the resting state

     any time CD audio isn't actively playing - probably the

     majority of frames in many games. Reordered to test Muted

     first and bail with samples[]=0 in that case; mix and

     clamp only run when the result is going to be used. Saves

     4 multiplies + 4 shifts + 2 adds + 4 saturating compares

     per sample on the muted path. Same final samples[] in both

     paths so behaviour is identical.

 (b) cdc.cpp GetCDAudio resampler eliminates out_tmp[2] stack

     scratch:

     The fractional-rate path used an int32 out_tmp[2] stack

     accumulator, accumulated each channel's 25-tap windowed-

     sinc convolution into it, clamped, then copied to

     samples[i]. Folded into a per-channel local int32 acc that

     accumulates and writes straight to samples[i] - same ops,

     one fewer stack temp.

 (c) spu.c per-sample mix loop eliminates output[2] stack

     scratch and tightens the IntermediateBuffer overflow

     guard:

     The mix loop computed per-LR `output[lr]` from accum[lr]

     and the global volume sweep, clamped, and on the next line

     wrote `(output[lr] * 3 + 2) >> 2` to IntermediateBuffer.

     output[] only existed to carry one int32 per channel

     between those two lines. Fused: the post-volume-sweep

     value is computed inside the IntermediateBuffer write

     expression directly, saving 8 bytes of stack and one

     round-trip per sample. As a side effect the

     IntermediateBufferPos overflow guard now covers the

     volume-sweep step too - previously only the buffer write

     was guarded and the sweep + clamp ran every sample even

     when the buffer was full (debugger edge case).

     SPU_Sweep_ReadVolume is pure (returns sweep->Current), so

     skipping it on the buffer-full path is behaviour-

     preserving.

The two reverb resampler helpers (Reverb4422 / Reverb2244)

collapse from `clamp(&out, ...); return out;` to a pair of

inline ifs followed by `return out;`. Each is a simple

collapse, no semantic change.

The voice-decode clamp inside the SPU's ADPCM nibble loop is a

straight inline-the-clamp; no opportunity for a structural

optimisation there because the saturated value feeds into both

tb[i] and the M1/M2 history (PS1 silicon clamps at int16 for

its IIR filter state), so the temporary is genuinely needed.

Per-TU text-section sizes at -O2 (size /tmp/X.o):

                  before   after    delta

  spu.o            34846    34910    +64

  gte.o            20055    20055      0

  cdc.o            29443    29379    -64

                                    ----

                                       0   net

Same total binary size; the optimisations balance the slight

structural growth from the IntermediateBuffer-guard rework.

(2) rsx_lib_vulkan.cpp: rename file-static crop_overscan to

    avoid extern-vs-static shadow

======================================================

fc4d742 ("core: prune dead globals; consolidate cross-TU

extern decls") replaced rsx_lib_vulkan.cpp's local-extern

redecls of cross-TU globals with a `#include

"beetle_psx_globals.h"`. The header includes

`extern int crop_overscan;`. Unfortunately the file had a

`static int crop_overscan;` declaration at file scope from

long before fc4d742 - a long-standing shadow of the global

that nothing else in the TU referenced.

g++ (correctly) refuses the resulting static-after-extern:

  rsx/rsx_lib_vulkan.cpp:55:12: error: 'crop_overscan' was

    declared 'extern' and later 'static' [-fpermissive]

   55 | static int crop_overscan;

      |            ^~~~~~~~~~~~~

Renamed the file-static to `vulkan_crop_overscan` plus its 8

internal use sites; the BEETLE_OPT(crop_overscan) macro key on

line 360 stays as-is (it's the env-var name, not the variable

name). Behaviour preserved bit-perfect: the file still reads

the BEETLE_OPT(crop_overscan) env var into its own private

copy and uses that locally, exactly as before. The cross-TU

global crop_overscan from beetle_psx_globals.h is left for

other TUs (libretro.cpp, gpu.cpp, input.cpp, rsx_intf.cpp,

rsx_lib_gl.cpp) which have always read it directly. The two

parallel-but-separate values track identically because

libretro.cpp's check_variables() reads the same env var into

the global at the same time rsx_lib_vulkan reads it into the

static.
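
The shape of the fix can be sketched in a single file. Only the names `crop_overscan` and `vulkan_crop_overscan` come from the text; the `sketch_` helpers are hypothetical, and the global is defined here only so the example links on its own.

```c
#include <assert.h>

/* What the shared header provides, per the text above. */
extern int crop_overscan;

/* Defined here only so the sketch links; another TU owns it in the real tree. */
int crop_overscan = 0;

/* A file-scope `static int crop_overscan;` at this point would be rejected,
 * because the name was already declared with external linkage. Renaming the
 * private copy removes the clash. */
static int vulkan_crop_overscan;

/* Hypothetical helpers: the private copy is written and read locally. */
void sketch_vulkan_read_option(int option_value) { vulkan_crop_overscan = option_value; }
int  sketch_vulkan_crop(void) { return vulkan_crop_overscan; }

/* Usage: the two variables are independent storage. */
void sketch_check_rename(void)
{
    crop_overscan = 1;
    sketch_vulkan_read_option(2);
    assert(crop_overscan == 1);
    assert(sketch_vulkan_crop() == 2);
}
```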

Verification

============

  - All 9 sampled CXX TUs (gpu.cpp, frontio.cpp, cdc.cpp,

    cpu.cpp, guncon.cpp, justifier.cpp, gamepad.cpp,

    general.cpp, mempatcher.cpp) compile clean at -O2.

  - All 10 sampled C TUs (dma.c, gte.c, timer.c, spu.c, sio.c,

    irq.c, mdec.c, error.c, mednafen-endian.c, Deinterlacer.c)

    compile clean at -O2.

  - rsx_lib_vulkan.cpp structural check passes - no

    static-vs-extern conflicts, no undeclared-symbol errors

    (the file still needs Vulkan SDK headers not on this

    sandbox to compile fully, but those errors are unrelated

    and identical before/after this change).

  - Direct grep confirms zero remaining `clamp(` calls outside

    GLSL shader code (`clamp(uint(coords.x), 0, 0xff)` in

    rsx/shaders_gl/command_fragment.glsl.h is GLSL's built-in,

    not C).,

---------------------------------------------------------------------------------------------
libretro-fbneo.mk f7574b86e0eeece0e8c633b77dd9833840155dd9 # Version: Commits on May 05, 2026
---------------------------------------------------------------------------------------------
(libretro) update files,

--------------------------------------------------------------------------------------------------
libretro-gearcoleco.mk c4ae7b25b35ab1060fa84cc5464dd899b43651d2 # Version: Commits on May 04, 2026
--------------------------------------------------------------------------------------------------
Update publish to mcp registry workflow,

-------------------------------------------------------------------------------------------------
libretro-geargrafx.mk c4b8b8eab4427ebfe4a5f08af8b349ff3b4a21bc # Version: Commits on May 04, 2026
-------------------------------------------------------------------------------------------------
Update publish to mcp registry workflow,

--------------------------------------------------------------------------------------------------
libretro-gearsystem.mk 4dedd026c1c861158e1f17b8616bdf11d7cd9ad2 # Version: Commits on May 04, 2026
--------------------------------------------------------------------------------------------------
Update publish to mcp registry workflow,

---------------------------------------------------------------------------------------------
libretro-noods.mk 626628ca270e41528c20ebbedb69408eca326834 # Version: Commits on May 05, 2026
---------------------------------------------------------------------------------------------
Libretro: fix saves on non unix platforms,

----------------------------------------------------------------------------------------------
libretro-ppsspp.mk 462b57bc1a21417b097acd06711935bdc9334c43 # Version: Commits on May 05, 2026
----------------------------------------------------------------------------------------------
Merge pull request #21642 from hrydgard/dinput-code-cleanup

UWP keyboard fix, DInput code cleanup,

-------------------------------------------------------------------------------------------
libretro-ps2.mk 0f2c9a7c615357e6d82a4520e502f94ff27ca77b # Version: Commits on May 05, 2026
-------------------------------------------------------------------------------------------
Buildfixes: restore __forceinline on non-mingw toolchains

The d2d1ebc / fdb0eec / c9d5ee4 series stubbed __fi / __ri /

__releaseinline (and removed __forceinline from a few SPU2 hot-path

functions) to make the libretro Makefile build link under mingw.  That

was correct for the failing target, but it was applied universally and

silently disabled cross-TU inlining on every working toolchain too -

MSVC, Linux gcc, macOS clang.  The hot paths that lost their always-

inline (SPU2 Mix / TimeUpdate / spu2M_Write / UpdateSpdifMode and

everything reached through __fi / __ri elsewhere in the codebase) are

all on the audio mix and EE/IOP-recompiler-adjacent paths where the

inlining is the point of the decoration.

The actual breakage is mingw-only.  mingw-w64's _mingw.h defines

__forceinline as `extern __inline__ __attribute__((__always_inline__,

__gnu_inline__))`, which under GNU inline rules means "inline at every

callsite AND DO NOT emit an out-of-line copy".  In a non-LTO build

that turns every cross-TU caller of a __forceinline-decorated free

function (dmaSIF1, vtlb_GetPhyPtr, x86Emitter::xPUSH, the four SPU2

ones above, ...) into an undefined reference.  cmake builds avoid this

because PCSX2_LTO=ON merges all TUs at link time; the libretro

Makefile builds do not use LTO.

MSVC's __forceinline always emits an out-of-line copy, and Linux/macOS

gcc/clang's __attribute__((always_inline, unused)) also emits one.

On those toolchains the historical decoration is correct.

So we keep the historical __forceinline definition and the historical

__fi / __ri / __releaseinline = __forceinline mapping for everyone,

and special-case __MINGW32__ to bind __fi / __ri / __releaseinline to

empty.  __forceinline itself stays untouched on mingw - the system

headers (winbase.h, processthreadsapi.h, synchapi.h, _mingw.h)

declare strnlen_s / _InterlockedIncrement / NtCurrentTeb / etc as

__forceinline and rely on gnu_inline semantics for ODR.
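
The per-toolchain mapping can be sketched as below. This is not the content of common/Pcsx2Defs.h: `SKETCH_FI` is a hypothetical stand-in for `__fi`, chosen so the example does not redefine any reserved name, and the stringifying helper only mimics the "check the preprocessed output" step.

```c
#include <assert.h>
#include <string.h>

#if defined(__MINGW32__)
#  define SKETCH_FI                 /* mingw: the short alias expands to nothing */
#elif defined(_MSC_VER)
#  define SKETCH_FI __forceinline
#else
#  define SKETCH_FI __attribute__((always_inline, unused))
#endif

/* Two-level stringification so SKETCH_FI is expanded before being quoted. */
#define SKETCH_STR2(x) #x
#define SKETCH_STR(x)  SKETCH_STR2(x)

/* Returns the text the alias expands to on the current toolchain. */
const char *sketch_fi_expansion(void) { return SKETCH_STR(SKETCH_FI); }

/* Usage: empty on mingw, non-empty everywhere else. */
void sketch_check_fi(void)
{
#if defined(__MINGW32__)
    assert(strcmp(sketch_fi_expansion(), "") == 0);
#else
    assert(strcmp(sketch_fi_expansion(), "") != 0);
#endif
}
```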

Verified by preprocessing common/Pcsx2Defs.h on both compilers:

  Linux gcc -DNDEBUG: __fi -> __attribute__((always_inline, unused))

  mingw-w64 gcc      : __fi -> empty, __forceinline left alone

Verified by running nm against fresh .o files compiled with both

compilers in NDEBUG mode:

  Linux:  spu2M_Write / TimeUpdate / UpdateSpdifMode / Mix all emit

          out-of-line T symbols (cross-TU linkable).

  mingw:  same four symbols emit T (cross-TU linkable, link will

          succeed for the libretro Makefile build).

Also restored the __forceinline that was dropped from SPU2 Mixer.cpp's

Mix() and from spu2sys.cpp's three __forceinline functions, but spelt

as __fi instead of __forceinline directly so the mingw-stub path

applies cleanly.

Net effect on the Windows MSVC, Linux, macOS, and cmake builds: code

emission goes back to whatever it was before d2d1ebc (perf restored).

Net effect on the libretro Makefile mingw build: identical to ab74e3d

(still links, still runs as far as it currently does).,

---------------------------------------------------------------------------------------------------
libretro-snes9x-next.mk d9cba8a41b3407ebb929816a7033e0407fd7b2d0 # Version: Commits on May 05, 2026
---------------------------------------------------------------------------------------------------
tile.c: hoist invariant RealScreenColors assignment out of backdrop renderers

The 28 DrawBackdrop16* renderers each began with

    GFX.RealScreenColors = IPPU.ScreenColors;

    GFX.ScreenColors = GFX.ClipColors ? BlackColourMap : GFX.RealScreenColors;

The first line is invariant across the whole backdrop pass: backdrop

has no per-tile palette slice (unlike SELECT_PALETTE for regular tiles)

and no Direct Colour Mode override (unlike Mode 7 entry points), so it

always sets RealScreenColors to IPPU.ScreenColors. Lift that line out

of every renderer body into the DrawBackdrop() and DRAW_BACKDROP_NO_MATH()

macros in ppu.c, set once before the per-clip-region loop.

The second line stays inside each renderer (BlackColourMap is private

to tile.c) and is genuinely per-clip-region (ClipColors changes each

iteration of the macro's loop).

Saves N-1 redundant assignments per backdrop pass where N is the

number of clip regions; perf-negligible. Net -19 lines.
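
The hoist can be sketched as follows. The struct, the `sketch_` names and the function-based caller are hypothetical simplifications; the real code uses the GFX / IPPU globals and macros in ppu.c.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short colour_t;

struct sketch_gfx {
    const colour_t *RealScreenColors;
    const colour_t *ScreenColors;
    int             ClipColors;
};

const colour_t sketch_black_map[256] = { 0 };   /* stand-in for BlackColourMap */

/* Renderer after the change: only the per-clip-region line remains. */
void sketch_draw_backdrop(struct sketch_gfx *gfx)
{
    gfx->ScreenColors = gfx->ClipColors ? sketch_black_map : gfx->RealScreenColors;
}

/* Caller after the change: the invariant assignment runs once, before the loop. */
void sketch_backdrop_pass(struct sketch_gfx *gfx, const colour_t *ippu_colors,
                          const int *clip_flags, size_t regions)
{
    gfx->RealScreenColors = ippu_colors;        /* hoisted out of the renderer */
    for (size_t i = 0; i < regions; i++) {
        gfx->ClipColors = clip_flags[i];        /* changes on each iteration */
        sketch_draw_backdrop(gfx);
    }
}

/* Usage: the last clip region decides which map ScreenColors ends on. */
void sketch_check_hoist(void)
{
    static const colour_t palette[256] = { 1 };
    const int clip_then_plain[2] = { 1, 0 };
    const int plain_then_clip[2] = { 0, 1 };
    struct sketch_gfx gfx = { NULL, NULL, 0 };

    sketch_backdrop_pass(&gfx, palette, clip_then_plain, 2);
    assert(gfx.RealScreenColors == palette);
    assert(gfx.ScreenColors == palette);

    sketch_backdrop_pass(&gfx, palette, plain_then_clip, 2);
    assert(gfx.ScreenColors == sketch_black_map);
}
```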

src/ppu.c  +9

src/tile.c -28,

----------------------------------------------------------------------------------------------
libretro-stella.mk 93a070e927573584bb3059028a5514ec22f2b0ce # Version: Commits on May 05, 2026
----------------------------------------------------------------------------------------------
More ostringstream cleanups.,

---------------------------------------------------------------------------------------------
libretro-vba-m.mk 26fe5b40ca10931bf5e4bfde671a85625247e1a4 # Version: Commits on May 05, 2026
---------------------------------------------------------------------------------------------
ci: disable SDL3 PPA on Ubuntu runners for now

Disable getting the SDL3 backport from a PPA on the Ubuntu CI runners

for now due to issues with launchpad.

Signed-off-by: Rafael Kitover <rkitover@gmail.com>,

-------------------------------------------------------------------------------------------
glsl-shaders.mk 42fa8a98ab19bdaffb53280746a30819eb21f807 # Version: Commits on May 05, 2026
-------------------------------------------------------------------------------------------
crt-geom-mini; optimize to be closer to crt-geom, tiny-ntsc add saturation parameter (#562)

* Update crt-geom-mini.glsl

* Update tiny_ntsc.glsl

* Update crt-geom-mini.glslp,

--------------------------------------------------------------------------------------------
slang-shaders.mk 2ba50bfaeae630741216a9b60b5147485657316f # Version: Commits on May 05, 2026
--------------------------------------------------------------------------------------------
vectorscale: pack-positions pre-pass + geometric crossing intersection (#909)

* vectorscale: pack-positions pre-pass + inline crossing intersection

Adds a per-CP pre-pass (pack-positions) that denormalizes render

geometry into a single PackedPositions texture and folds the crossing

curve-curve intersection into the same pass. The rasterizer reads its

full per-CP geometry from PackedPositions and skips ghost extension,

neighbor-index decoding, and t_branch solving in its hot loop.

New shader: pack-positions.slang

For each CP slot, packs into 3 horizontally-adjacent texels:

  col 0 = (pp.x, pp.y, prev_ci_or_-1, _)

  col 1 = (cp.x, cp.y, t_branch, validity 0=skip 1=normal 2=line)

  col 2 = (np.x, np.y, next_ci_or_-1, _)

(pp, cp, np) is the ghost-extended (pp = 2·prev - cp etc.) Bezier

control triple. t_branch is computed per CP type:

- IS_CROSSING: 2D Newton iteration on F(t,s) = B_a(t) - B_b(s) = 0,

  starting from (0.5, 0.5). The optimizer keeps crossings near the

  grid corner so the initial guess is within ~0.1 of the answer;

  4 iterations drive the residual below f32 epsilon. Reads neighbor

  positions from both this slot's chain (N-S or E-W) and the partner

  slot's chain.

  This replaces the legacy ghost-aware inverse-correction that moved

  each crossing CP so the rendered curve passed through the grid

  corner at t=0.5. The CP now stays at its optimizer-final position

  and the rasterizer's wedge AA anchors at the geometric intersection

  B_a(t) = B_b(s).

- 2-CP chain (degenerate stem with both ends as endpoint markers):

  t_branch = 0.5; render geometry pre-built as a straight line so the

  rasterizer dispatches to its closed-form line solver via is_line.

- One-sided clamped Bezier (prev or next is endpoint): closed-form

  cubic projection of the interior B-spline midpoint onto the clamped

  span — finds the t at which the rendered clamped curve reaches the

  same physical "before/after sc" boundary an interior B-spline would

  at t=0.5.

- Else: t_branch = 0.5.
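
The IS_CROSSING case above can be sketched on the CPU as follows. This is an illustration, not the shader: it assumes quadratic Beziers over a control triple, uses doubles rather than f32, and all names are hypothetical. It runs Newton's method on F(t,s) = B_a(t) - B_b(s) from (0.5, 0.5) for a fixed number of iterations.

```c
#include <assert.h>

typedef struct { double x, y; } sketch_vec2;

/* Quadratic Bezier over the control triple p[0], p[1], p[2]. */
static sketch_vec2 sketch_bezier(const sketch_vec2 p[3], double t)
{
    double u = 1.0 - t;
    sketch_vec2 r = { u * u * p[0].x + 2.0 * u * t * p[1].x + t * t * p[2].x,
                      u * u * p[0].y + 2.0 * u * t * p[1].y + t * t * p[2].y };
    return r;
}

/* First derivative of the same curve. */
static sketch_vec2 sketch_bezier_d(const sketch_vec2 p[3], double t)
{
    double u = 1.0 - t;
    sketch_vec2 r = { 2.0 * u * (p[1].x - p[0].x) + 2.0 * t * (p[2].x - p[1].x),
                      2.0 * u * (p[1].y - p[0].y) + 2.0 * t * (p[2].y - p[1].y) };
    return r;
}

/* Newton on F(t,s) = B_a(t) - B_b(s), starting from (0.5, 0.5). */
void sketch_crossing(const sketch_vec2 a[3], const sketch_vec2 b[3],
                     int iters, double *t, double *s)
{
    *t = 0.5;
    *s = 0.5;
    for (int i = 0; i < iters; i++) {
        sketch_vec2 pa = sketch_bezier(a, *t),   pb = sketch_bezier(b, *s);
        sketch_vec2 da = sketch_bezier_d(a, *t), db = sketch_bezier_d(b, *s);
        double fx = pa.x - pb.x, fy = pa.y - pb.y;
        /* Jacobian columns: dF/dt = B_a'(t), dF/ds = -B_b'(s). */
        double det = db.x * da.y - da.x * db.y;
        if (det > -1e-12 && det < 1e-12)
            break;                               /* near-parallel tangents */
        *t += (fx * db.y - db.x * fy) / det;     /* Cramer's rule for J d = -F */
        *s += (da.y * fx - da.x * fy) / det;
    }
}

/* Usage: a gentle horizontal arch crossing a gentle vertical curve. */
void sketch_check_crossing(void)
{
    const sketch_vec2 a[3] = { { -1.0, 0.0 }, { 0.0, 0.2 }, { 1.0, 0.0 } };
    const sketch_vec2 b[3] = { { 0.1, -1.0 }, { -0.1, 0.0 }, { 0.1, 1.0 } };
    double t, s;

    sketch_crossing(a, b, 4, &t, &s);
    sketch_vec2 pa = sketch_bezier(a, t), pb = sketch_bezier(b, s);
    double ex = pa.x - pb.x, ey = pa.y - pb.y;
    assert(ex > -1e-9 && ex < 1e-9);
    assert(ey > -1e-9 && ey < 1e-9);
}
```

The fixed iteration count works here for the reason given in the text: the starting guess is already close to the root, so Newton's quadratic convergence needs only a few steps.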

Modified: update-tjunction.slang

Drop the IS_CROSSING ghost-aware inverse-correction branch; crossings

pass through unchanged. Drops the now-unused Opt2 sampler binding,

read_orig_pos helper, and Opt2Size UBO field.

Modified: cell-rasterizer.slang

Replace read_pos + read_neighbors + ghost extension + 2-CP-chain

construction + t_branch cubic-solver in test_one_cp with a single

read_packed_cp(ci) call returning a PackedCp struct. Per-active-probe

fetch count: ~6 → 4 (1 flag + 3 packed reads). resolve_hit's

neighbor-direction lookups for color resolution are unchanged.

Modified: vectorscale.slangp

11 passes (was 10). pack-positions inserted between the final

update-tjunction iteration (FinalPositions) and cell-rasterizer.

PackedPositions framebuffer is 3.0 × source-relative wide.

* vectorscale: cubic solver — FMA on discriminants, Newton polish, faster trig

Three numerical improvements to closest_on_span:

1. FMA on discriminants. b²−4ac is the textbook catastrophic-cancellation

   case when b² ≈ 4ac (near-double-root); fma(b, b, -4·c·a) computes the

   sum with a single rounding instead of two, recovering ~1 extra bit and

   preventing disc from rounding to the wrong sign at the branch boundary.

   Same trick for the cubic disc q²/4 + p³/27 at the disc≈0 (near-triple-

   root) boundary between Cardano and trig branches.

2. Newton polish on every analytical root. Cardano + acos/cos/pow(_, 1/3)

   come back at ~5 ULP; one Newton step on D'(t) drives the root to

   ~1 ULP. polish_root_c skips when D''(t) is small or |step| ≥ 0.5 to

   avoid divergence at near-double/triple-root cases.

3. Faster trig branch. Replaces pow(sqrt(-p³/27), 1/3) (3 multiplies +

   sqrt + pow(_, 1/6)) with the equivalent 2·sqrt(-p/3). Reduces work and

   avoids precision loss of pow(_, 1/6).
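
The identity behind item 3 can be written out for the depressed cubic t³ + pt + q = 0 with p < 0, which is what the discriminant q²/4 + p³/27 above implies. The root formula in the second line is the standard trigonometric solution. That the factor of 2 was previously applied to the cube-root term is an assumption; the message does not show the old expression in full.

```latex
\left(\sqrt{-\tfrac{p^{3}}{27}}\right)^{1/3}
  = \left(-\tfrac{p^{3}}{27}\right)^{1/6}
  = \left(\left(-\tfrac{p}{3}\right)^{3}\right)^{1/6}
  = \sqrt{-\tfrac{p}{3}},
  \qquad p < 0

t_k = 2\sqrt{-\tfrac{p}{3}}\,
      \cos\!\left(\tfrac{1}{3}\arccos\!\left(\tfrac{3q}{2p}\sqrt{-\tfrac{3}{p}}\right)
      - \tfrac{2\pi k}{3}\right),
  \qquad k = 0, 1, 2
```

The amplitude 2·sqrt(-p/3) therefore needs one division and one square root, with no fractional power.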

* vectorscale: split cell rasterizer into single-AA + multi-AA passes

Replaces the monolithic cell-rasterizer.slang with two passes that share

the same algorithm but separate the AA work for occupancy on register-

constrained GPUs.

1. cell-rasterizer-single-aa.slang — tracks one best hit + the second-

   best hit's distance² (no full 2nd hit data). Resolves color, applies

   single-curve AA on the resolved hit. Writes RGB = AA color and

   A = sentinel (1.0 if 2nd hit is within aa_threshold so multi-curve

   AA could fire, 0.0 otherwise). Hit struct is slim (5 scalars: d2, t,

   cp_idx, prev_ci, next_ci) — geometry refetched via read_packed_cp at

   consumer sites (texture cache hits ~100% since test_one_cp just read

   the same texels).

2. cell-rasterizer-multi-aa.slang — reads SingleAA. If A < 0.5, passes

   RGB through unchanged (most pixels). Otherwise redoes find_hits

   (top-3) and runs wedge AA + dual-curve AA gates as in the original

   monolithic rasterizer, falling back to SingleAA's RGB if neither

   fires (single-curve AA already applied). pos/neg colors are scoped

   to each AA branch via out-params on resolve_hit instead of struct

   fields, keeping them out of the cross-branch live set.
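
The inter-pass sentinel can be sketched as CPU-side control flow. This is not shader code: the `sketch_` names, the callback standing in for the top-3 search plus the wedge and dual-curve gates, and the boolean "second hit is near" input are all hypothetical.

```c
#include <assert.h>

typedef struct { float r, g, b, a; } sketch_rgba;

int sketch_expensive_calls;   /* counts how often the costly path runs */

/* Pass 1: keep the single-curve AA colour in RGB, write the sentinel to A. */
sketch_rgba sketch_single_aa(sketch_rgba colour, int second_hit_near)
{
    colour.a = second_hit_near ? 1.0f : 0.0f;
    return colour;
}

/* Stand-in for the multi-curve gates; returns nonzero if one of them fired. */
typedef int (*sketch_gate_fn)(sketch_rgba *out);

/* Pass 2: cheap pass-through unless the sentinel asks for the full search. */
sketch_rgba sketch_multi_aa(sketch_rgba single, sketch_gate_fn gates)
{
    if (single.a < 0.5f)
        return single;                 /* early exit */
    sketch_expensive_calls++;
    sketch_rgba out = single;
    if (gates(&out))
        return out;
    return single;                     /* neither gate fired: keep pass-1 RGB */
}

static int sketch_gate_fires(sketch_rgba *out) { out->r = 0.5f; return 1; }
static int sketch_gate_idle(sketch_rgba *out)  { (void)out; return 0; }

/* Usage: only pixels flagged by pass 1 reach the costly path. */
void sketch_check_sentinel(void)
{
    sketch_rgba base = { 1.0f, 0.0f, 0.0f, 0.0f };
    sketch_expensive_calls = 0;

    sketch_rgba far_px = sketch_multi_aa(sketch_single_aa(base, 0), sketch_gate_fires);
    assert(far_px.r == 1.0f && sketch_expensive_calls == 0);

    sketch_rgba near_px = sketch_multi_aa(sketch_single_aa(base, 1), sketch_gate_fires);
    assert(near_px.r == 0.5f && sketch_expensive_calls == 1);

    sketch_rgba idle_px = sketch_multi_aa(sketch_single_aa(base, 1), sketch_gate_idle);
    assert(idle_px.r == 1.0f && sketch_expensive_calls == 2);
}
```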

Two presets:

- vectorscale.slangp: chains single-aa → multi-aa. Output is equivalent

  to the original monolithic rasterizer; most pixels take the cheap

  early-exit path on pass 2.

- vectorscale-single-aa.slangp: single-aa pass alone. Faster on register-

  constrained GPUs but jaggy at junctions and dual-curve crossings.

The sentinel is purely an inter-pass signal — the standalone single-aa

preset writes it to viewport alpha where display ignores it.

Measured on Apple Silicon: monolithic was 254 VGPRs (1/8 occupancy with

240 bytes spill); single-aa pass alone is ~120 VGPRs (clears the 128

threshold for ~30% occupancy, ~3x faster end-to-end). The chained

two-pass setup matches monolithic output with the early-exit speedup.,