Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perl_do_print: stringify an SVt_IV IV/UV more efficiently #22927

Open
wants to merge 6 commits into
base: blead
Choose a base branch
from

Conversation

richardleach
Copy link
Contributor

https://stackoverflow.com/questions/79244888 points out that pp_print
is really slow at stringifying integers. It's bizarrely much faster to add
additional operations to do the stringification first.

Here are some sample timings:

1003 ms | 'for (my $i=1; $i<=10000000;$i++) { say $i }'
 630 ms | 'for (my $i=1; $i<=10000000;$i++) { say "$i" }'
 695 ms | 'for (my $i=1; $i<=10000000;$i++) { say $i."" }'
 553 ms | 'my $i = "str"; for ($i=1; $i<=10000000;$i++) { say $i }'

Note from the last timing that stringification in pp_print is not
always slow, only when the IV is stored in a SVt_IV. This traces back
to Perl_do_print which has a special path for just this case:

    if (SvTYPE(sv) == SVt_IV && SvIOK(sv)) {
        assert(!SvGMAGICAL(sv));
        if (SvIsUV(sv))
            PerlIO_printf(fp, "%" UVuf, (UV)SvUVX(sv));
        else
            PerlIO_printf(fp, "%" IVdf, (IV)SvIVX(sv));
        return !PerlIO_error(fp);
    }

(There are no code comments to say why it does this. Perhaps deliberately
preserving the type for some reason?)

PerlIO_printf calls PerlIO_vprintf, which calls Perl_vnewSVpvf, which
creates a new (effectively) temporary SV, then sends that to
Perl_sv_vcatpvfn_flags for stringification and storage of the IV.
PerlIO_vprintf then writes out the buffer from the temp SV and frees it.

All the other routes essentially call SvPV_const on $i or a PADTMP.
This invokes sv_2pv_flags and its helper, S_uiv_2buf, does the work.
No temporary SVs are created and freed in such cases.

This PR changes Perl_do_print to also use a small buffer and
S_uiv_2buf to do the stringification more efficiently, whilst still
avoiding a type upgrade, Re-measuring after this PR:

 470 ms 'for (my $i=1; $i<=10000000;$i++) { say $i }'

  • This set of changes requires a perldelta entry, and it is included.

@richardleach richardleach changed the title Perl_do_print: stringify an SVt_IV IV more efficiently Perl_do_print: stringify an SVt_IV IV/UV more efficiently Jan 17, 2025
sv_inline.h Outdated Show resolved Hide resolved
doio.c Outdated Show resolved Hide resolved
doio.c Outdated Show resolved Hide resolved
doio.c Outdated Show resolved Hide resolved
Copy link
Contributor

@khwilliamson khwilliamson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note most of my comments were about the original that got moved. I think there should be a commit with just the move; then another to enhance the resulting one.

LGTM besides mauke's points

sv_inline.h Outdated Show resolved Hide resolved
doio.c Outdated Show resolved Hide resolved
sv_inline.h Outdated
word_ptr = (U16*)ptr;
word_table = (U16*)int2str_table.arr;

if (UNLIKELY(is_uv))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the branch prediction? In the next commit, I don't see why this would be unlikely to be called

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was pre-existing. I agree though and will remove the UNLIKELY.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UVs are super rare. I'd argue >99% of all integers are stored as IVs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that something like my $foo = 100000 isn't an UV, despite not having a sign and being representable as an UV.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there likely any benefit in defaulting sign to zero and changing the code to something like:

   if (!is_uv) {
       if (iv >= 0) {
           uv = iv;
        } else {
            /* This is NEGATE_2UV(iv), which can be found in handy.h. */
            /* sv_inline.h does not include handy.h because the latter
             * would then get included twice into .c files. */
            uv = (ASSUME((iv) < 0), (UV)-((iv) + 1) + 1U);
            sign = 1;
        }
   }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On GCC generated code will be identical: https://gcc.godbolt.org/z/bj9x4xnTz

BTW, "extract the sign and convert to UV" is a common pattern in Perl code. I experimented with a similar snippet extracted from pp_multiply, I tried to make the IV part branchless, but it turned out to be a pessimization. On my PC, the branchless version is 2x slower. The benchmark assumes 100% correct prediction, which I think would be close enough to real life.

Here's the code I benchmarked:

/* branchless.c */
_Bool auvok, buvok;
unsigned long long alow, blow;

/* this may not look branchless, but it actually compiles to branchless code */
void to_uv(long long aiv, long long biv) {
    alow = aiv >= 0 ? aiv : -(unsigned long long)aiv;
    auvok = aiv >= 0;

    blow = biv >= 0 ? biv : -(unsigned long long)biv;
    buvok = biv >= 0;
}

/* branching.c */
_Bool auvok, buvok;
unsigned long long alow, blow;

void to_uv(long long aiv, long long biv) {
    if (aiv >= 0) {
        alow = aiv;
        auvok = 1; /* effectively it's a UV now */
    } else {
        /* abs, auvok == false records sign */
        alow = -(unsigned long long)aiv;
    }

    if (biv >= 0) {
        blow = biv;
        buvok = 1; /* effectively it's a UV now */
    } else {
        /* abs, buvok == false records sign */
        blow = -(unsigned long long)biv;
    }
}

/* main.c */
void to_uv(long long aiv, long long biv);

int main() {
    for (long long i = 0; i < 1000000000; ++i) {
        to_uv(i, i+1);
    }
    return 0;
}

(main and to_uv need to be in separate files to prevent constant folding)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark assumes 100% correct prediction, which I think would be close enough to real life.

Well. It depends. If the code is mixing negative and positive numbers then the branch prediction will be all over the place. I think that multiplication is typically done with two positive integers. But then again it's not like mixing signs is something unusual.

It's always difficult to decide which trade-off is right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I tried a few variations with these three, non-UV cases:

  1. 'for (my $i=1; $i<=10000000;$i++) { say $i }'

UNCHANGED - - if (UNLIKELY(is_uv))

            486.07 msec task-clock                       #    0.969 CPUs utilized          
     2,062,905,670      cycles                           #    4.244 GHz                    
     7,040,218,542      instructions                     #    3.41  insn per cycle         
     1,664,254,169      branches                         #    3.424 G/sec       

NO_UNLIKELY - if (is_uv)

            469.99 msec task-clock                       #    0.934 CPUs utilized          
     2,008,565,577      cycles                           #    4.274 GHz                    
     7,029,774,550      instructions                     #    3.50  insn per cycle         
     1,654,155,079      branches                         #    3.520 G/sec 

REFACTOR - initialize sign to zero, then if (!is_uv) {

            460.94 msec task-clock                       #    0.939 CPUs utilized          
     2,018,454,841      cycles                           #    4.379 GHz                    
     7,030,210,591      instructions                     #    3.48  insn per cycle         
     1,654,255,171      branches                         #    3.589 G/sec 

REFACTOR+LIKELY - initialize sign to zero, then if (LIKELY(!is_uv)) {

           468.79 msec task-clock                       #    0.999 CPUs utilized          
     2,069,028,677      cycles                           #    4.414 GHz                    
     7,039,775,425      instructions                     #    3.40  insn per cycle         
     1,664,153,959      branches                         #    3.550 G/sec
  1. 'for (my $i=1; $i<=10000000;$i++) { say -$i }'

UNCHANGED - - if (UNLIKELY(is_uv))

            526.16 msec task-clock                       #    0.972 CPUs utilized          
     2,217,557,251      cycles                           #    4.215 GHz                    
     7,553,385,689      instructions                     #    3.41  insn per cycle         
     1,755,583,715      branches                         #    3.337 G/sec

NO_UNLIKELY - if (is_uv)

            501.48 msec task-clock                       #    0.952 CPUs utilized          
     2,109,110,712      cycles                           #    4.206 GHz                    
     7,563,289,130      instructions                     #    3.59  insn per cycle         
     1,765,558,885      branches                         #    3.521 G/sec 

REFACTOR - initialize sign to zero, then if (!is_uv) {

            482.06 msec task-clock                       #    0.928 CPUs utilized          
     2,129,078,811      cycles                           #    4.417 GHz                    
     7,563,276,997      instructions                     #    3.55  insn per cycle         
     1,765,556,805      branches                         #    3.663 G/sec   

REFACTOR+LIKELY - initialize sign to zero, then if (LIKELY(!is_uv)) {

            468.20 msec task-clock                       #    0.999 CPUs utilized          
     2,166,095,827      cycles                           #    4.626 GHz                    
     7,553,254,402      instructions                     #    3.49  insn per cycle         
     1,755,552,941      branches                         #    3.750 G/sec
  1. 'my $x; for (my $i=1; $i<=10000000;$i++) { ($i % 2) ? say $x : say -$x}'

UNCHANGED - - if (UNLIKELY(is_uv))

            701.13 msec task-clock                       #    0.948 CPUs utilized          
     2,997,635,522      cycles                           #    4.275 GHz                    
     9,450,504,662      instructions                     #    3.15  insn per cycle         
     2,422,527,385      branches                         #    3.455 G/sec 

NO_UNLIKELY - if (is_uv)

            670.31 msec task-clock                       #    0.958 CPUs utilized          
     2,867,434,790      cycles                           #    4.278 GHz                    
     9,450,450,298      instructions                     #    3.30  insn per cycle         
     2,422,513,128      branches                         #    3.614 G/sec

REFACTOR - initialize sign to zero, then if (!is_uv) {

            692.92 msec task-clock                       #    0.976 CPUs utilized          
     3,013,950,916      cycles                           #    4.350 GHz                    
     9,450,506,408      instructions                     #    3.14  insn per cycle         
     2,422,528,389      branches                         #    3.496 G/sec 

REFACTOR+LIKELY - initialize sign to zero, then if (LIKELY(!is_uv)) {

            671.45 msec task-clock                       #    0.975 CPUs utilized          
     3,009,114,252      cycles                           #    4.482 GHz                    
     9,450,461,477      instructions                     #    3.14  insn per cycle         
     2,422,517,655      branches                         #    3.608 G/sec 

Results varied by +/-50msec (at least) per run, and I'm not saying the above timings were mean or median. It seems hard to distinguish real differences from noise. The variants without branch prediction hints have about the same number of instructions and branches, as do the variants with hints, which isn't a surprise. Maybe there's a slight benefit to intializing int sign = 0; and simplifying the source code, so I'm going to go with that. I'll leave the branch prediction hint in for now, it can always be changed later.

doio.c Outdated Show resolved Hide resolved
@richardleach
Copy link
Contributor Author

I've made changes in response to each of the comments so far. Please check that you're happy with those changes and the overall state of the PR. (I'll rebase after tomorrow's blead-point release.)

sv_inline.h Outdated
/* This is NEGATE_2UV(iv), which can be found in handy.h. */
/* sv_inline.h does not include handy.h because the latter
* would then get included twice into .c files. */
uv = (ASSUME((iv) < 0), (UV)-((iv) + 1) + 1U);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handy.h, like most .h files has a guard against getting included twice. Did you try it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit not, but something goes wrong with DynaLoader if I do:

cc -c   -fwrapv -DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2 -Wall -Werror=pointer-arith -Werror=vla -Wextra -Wno-long-long -Wno-declaration-after-statement -Wc++-compat -Wwrite-strings -Wno-use-after-free -O2 -g   -DVERSION=\"1.57\" -DXS_VERSION=\"1.57\"  "-I../.."  -DLIBC="/lib/x86_64-linux-gnu/libc.so.6" DynaLoader.c
In file included from ../../perl.h:7935,
                 from DynaLoader.xs:136:
../../sv_inline.h: In function ‘S_uiv_2buf’:
../../sv_inline.h:1066:14: warning: implicit declaration of function ‘NEGATE_2U’ [-Wimplicit-function-declaration]
 1066 |         uv = NEGATE_2UV(iv);
      |              ^~~~~~~~~~
rm -rf ../../DynaLoader.o

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, this code already uses TYPE_CHARS which is defined in handy.h.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NEGATE_2U is within an #ifdef PERL_CORE section, but DynaLoader only defines #define PERL_EXT. I'll take a look at it later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I relaxed the #ifdef around NEGATE_2U. Shout if this seems inappropriate.

@tonycoz
Copy link
Contributor

tonycoz commented Jan 22, 2025

Since this is an optimization, could you add a benchmark that exercises this to t/perf/benchmarks?

This allows other parts of core to do integer stringification using
the same, fast function.
`Perl_do_print`'s pre-existing method for stringification of an IV within
an SVt_IV involves creating a temporary SVt_PV, using `sv_vcatpvfn_flags`
to do the stringification, then freeing the SVt_PV once the buffer has
been written out.

This is considerably slower than using `S_uiv_2buf`, the helper
function used by `sv_2pv_flags`. So this commit modifies `Perl_do_print`
to use `sv_2pv_flags`.
@richardleach
Copy link
Contributor Author

Since this is an optimization, could you add a benchmark that exercises this to t/perf/benchmarks?

In the end, I couldn't see how to do anything portable since the benchmarks have to run under miniperl. Open to suggestions!

@bulk88
Copy link
Contributor

bulk88 commented Jan 26, 2025

@richardleach
Move S_uiv_2buf from sv.c to sv_inline.h

S_uiv_2buf() should really be made public API, I have an unfinished branch to do that but you picked up the idea first.

itoa() is MS only, svcatpvf is okay, locale mutex locking, makes atleast on windows, the OSs print fmt string librarys slower and slower over the years. So there isnt any portable fast, num 2 ascii func in the perl api. but sv_2pv_flags is hiding its S_uiv_2buf() from core and cpan

Copy link
Contributor

@bulk88 bulk88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire commit (but not anything else on this branch)is a very bad idea, dozens/100s of unique addressed static const union { char arr[200]; U16 dummy;} int2str_table will be made in libperl.so/.dll and in CPAN XS modules. Just make it a normal exported function.

If you want leave the S_*() inner function in sv.c and rest of core uses the extern symbol. Or to get around ELFs sym tab bloat keep 1 wrapper as exported for cpan, then a 2nd or middle end for visibility hidden, ony inside libperl.

Whether LTO does or doesn't optimize it in some way inside 1 binary is upto the CC unless some one visibly identifies a code gen flaw and a b4/after explanation. This fn has 5 args, its too big in my armchair opinion to ever get inlined (ran out of c stack CPU registers/now spilling to c stack in ram). Its ultimately upto the CC, but per C spec, you will get 10s of unique copies of those global vars.

@bulk88
Copy link
Contributor

bulk88 commented Jan 26, 2025

    if (SvTYPE(sv) == SVt_IV && SvIOK(sv)) {
        assert(!SvGMAGICAL(sv));
        if (SvIsUV(sv))
            PerlIO_printf(fp, "%" UVuf, (UV)SvUVX(sv));
        else
            PerlIO_printf(fp, "%" IVdf, (IV)SvIVX(sv));
        return !PerlIO_error(fp);
    }

(There are no code comments to say why it does this. Perhaps deliberately preserving the type for some reason?)

Paranoia about portability to trinary CPUs and 1s complement? Cray (???) was the last supported perl port with weirdness in electronics/fundamentals. IDK if ISO C ever allowed no sign bit CPUs/CCs, or the CRC/parity/ecc bit, so IV would 31 bits, UV is 32 bits. Sign extend/overflow lt gt are software emulated by CC then.

SvPVX macros have similar const type propagation since NC 2005ish, going on. The SvPVX code looks as if Perl 5 codebase is ready to move to C++ or JVM, since that was alot of work, well organized, for C/C++ features that don't exist/got withdrawn.

@richardleach
Copy link
Contributor Author

@richardleach Move S_uiv_2buf from sv.c to sv_inline.h

S_uiv_2buf() should really be made public API

I had thought about that, but wasn't sure how useful it would be. If no-one objects, I'm happy to put in back into sv.c and make it public.

@richardleach
Copy link
Contributor Author

This fn has 5 args, its too big in my armchair opinion to ever get inlined

Even on a DEBUGGING build, gcc quite happily inlines it in both Perl_sv_2pv_flags and Perl_do_print. For example:

doio.c:2220:15: missed:   will not early inline: Perl_do_print/319->S_uiv_2buf/279, growth 27 exceeds --param early-inlining-insns
doio.c:2220:15: optimized:  Inlined S_uiv_2buf/279 into Perl_do_print/319 which now has time 97.741547 and size 209, net change of -28.

@richardleach
Copy link
Contributor Author

Its ultimately upto the CC, but per C spec, you will get 10s of unique copies of those global vars.

Do you know what mainstream compilers do this? (gcc does not seem to. I only see int2str_table in sv.o and doio.o.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants