-
Notifications
You must be signed in to change notification settings - Fork 560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Perl_do_print: stringify an SVt_IV IV/UV more efficiently #22927
base: blead
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note most of my comments were about the original that got moved. I think there should be a commit with just the move; then another to enhance the resulting one.
LGTM besides mauke's points
sv_inline.h
Outdated
word_ptr = (U16*)ptr; | ||
word_table = (U16*)int2str_table.arr; | ||
|
||
if (UNLIKELY(is_uv)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the branch prediction? In the next commit, I don't see why this would be unlikely to be called
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was pre-existing. I agree though and will remove the UNLIKELY.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UVs are super rare. I'd argue >99% of all integers are stored as IVs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that something like my $foo = 100000
isn't an UV, despite not having a sign and being representable as an UV.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there likely any benefit in defaulting sign to zero and changing the code to something like:
if (!is_uv) {
if (iv >= 0) {
uv = iv;
} else {
/* This is NEGATE_2UV(iv), which can be found in handy.h. */
/* sv_inline.h does not include handy.h because the latter
* would then get included twice into .c files. */
uv = (ASSUME((iv) < 0), (UV)-((iv) + 1) + 1U);
sign = 1;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On GCC generated code will be identical: https://gcc.godbolt.org/z/bj9x4xnTz
BTW, "extract the sign and convert to UV" is a common pattern in Perl code. I experimented with a similar snippet extracted from pp_multiply
, I tried to make the IV part branchless, but it turned out to be a pessimization. On my PC, the branchless version is 2x slower. The benchmark assumes 100% correct prediction, which I think would be close enough to real life.
Here's the code I benchmarked:
/* branchless.c */
_Bool auvok, buvok;
unsigned long long alow, blow;
/* this may not look branchless, but it actually compiles to branchless code */
void to_uv(long long aiv, long long biv) {
alow = aiv >= 0 ? aiv : -(unsigned long long)aiv;
auvok = aiv >= 0;
blow = biv >= 0 ? biv : -(unsigned long long)biv;
buvok = biv >= 0;
}
/* branching.c */
_Bool auvok, buvok;
unsigned long long alow, blow;
void to_uv(long long aiv, long long biv) {
if (aiv >= 0) {
alow = aiv;
auvok = 1; /* effectively it's a UV now */
} else {
/* abs, auvok == false records sign */
alow = -(unsigned long long)aiv;
}
if (biv >= 0) {
blow = biv;
buvok = 1; /* effectively it's a UV now */
} else {
/* abs, buvok == false records sign */
blow = -(unsigned long long)biv;
}
}
/* main.c */
void to_uv(long long aiv, long long biv);
int main() {
for (long long i = 0; i < 1000000000; ++i) {
to_uv(i, i+1);
}
return 0;
}
(main
and to_uv
need to be in separate files to prevent constant folding)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmark assumes 100% correct prediction, which I think would be close enough to real life.
Well. It depends. If the code is mixing negative and positive numbers then the branch prediction will be all over the place. I think that multiplication is typically done with two positive integers. But then again it's not like mixing signs is something unusual.
It's always difficult to decide which trade-off is right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I tried a few variations with these three, non-UV cases:
'for (my $i=1; $i<=10000000;$i++) { say $i }'
UNCHANGED - - if (UNLIKELY(is_uv))
486.07 msec task-clock # 0.969 CPUs utilized
2,062,905,670 cycles # 4.244 GHz
7,040,218,542 instructions # 3.41 insn per cycle
1,664,254,169 branches # 3.424 G/sec
NO_UNLIKELY - if (is_uv)
469.99 msec task-clock # 0.934 CPUs utilized
2,008,565,577 cycles # 4.274 GHz
7,029,774,550 instructions # 3.50 insn per cycle
1,654,155,079 branches # 3.520 G/sec
REFACTOR - initialize sign to zero, then if (!is_uv) {
460.94 msec task-clock # 0.939 CPUs utilized
2,018,454,841 cycles # 4.379 GHz
7,030,210,591 instructions # 3.48 insn per cycle
1,654,255,171 branches # 3.589 G/sec
REFACTOR+LIKELY - initialize sign to zero, then if (LIKELY(!is_uv)) {
468.79 msec task-clock # 0.999 CPUs utilized
2,069,028,677 cycles # 4.414 GHz
7,039,775,425 instructions # 3.40 insn per cycle
1,664,153,959 branches # 3.550 G/sec
'for (my $i=1; $i<=10000000;$i++) { say -$i }'
UNCHANGED - - if (UNLIKELY(is_uv))
526.16 msec task-clock # 0.972 CPUs utilized
2,217,557,251 cycles # 4.215 GHz
7,553,385,689 instructions # 3.41 insn per cycle
1,755,583,715 branches # 3.337 G/sec
NO_UNLIKELY - if (is_uv)
501.48 msec task-clock # 0.952 CPUs utilized
2,109,110,712 cycles # 4.206 GHz
7,563,289,130 instructions # 3.59 insn per cycle
1,765,558,885 branches # 3.521 G/sec
REFACTOR - initialize sign to zero, then if (!is_uv) {
482.06 msec task-clock # 0.928 CPUs utilized
2,129,078,811 cycles # 4.417 GHz
7,563,276,997 instructions # 3.55 insn per cycle
1,765,556,805 branches # 3.663 G/sec
REFACTOR+LIKELY - initialize sign to zero, then if (LIKELY(!is_uv)) {
468.20 msec task-clock # 0.999 CPUs utilized
2,166,095,827 cycles # 4.626 GHz
7,553,254,402 instructions # 3.49 insn per cycle
1,755,552,941 branches # 3.750 G/sec
'my $x; for (my $i=1; $i<=10000000;$i++) { ($i % 2) ? say $x : say -$x}'
UNCHANGED - - if (UNLIKELY(is_uv))
701.13 msec task-clock # 0.948 CPUs utilized
2,997,635,522 cycles # 4.275 GHz
9,450,504,662 instructions # 3.15 insn per cycle
2,422,527,385 branches # 3.455 G/sec
NO_UNLIKELY - if (is_uv)
670.31 msec task-clock # 0.958 CPUs utilized
2,867,434,790 cycles # 4.278 GHz
9,450,450,298 instructions # 3.30 insn per cycle
2,422,513,128 branches # 3.614 G/sec
REFACTOR - initialize sign to zero, then if (!is_uv) {
692.92 msec task-clock # 0.976 CPUs utilized
3,013,950,916 cycles # 4.350 GHz
9,450,506,408 instructions # 3.14 insn per cycle
2,422,528,389 branches # 3.496 G/sec
REFACTOR+LIKELY - initialize sign to zero, then if (LIKELY(!is_uv)) {
671.45 msec task-clock # 0.975 CPUs utilized
3,009,114,252 cycles # 4.482 GHz
9,450,461,477 instructions # 3.14 insn per cycle
2,422,517,655 branches # 3.608 G/sec
Results varied by +/-50msec (at least) per run, and I'm not saying the above timings were mean or median. It seems hard to distinguish real differences from noise. The variants without branch prediction hints have about the same number of instructions and branches, as do the variants with hints, which isn't a surprise. Maybe there's a slight benefit to intializing int sign = 0;
and simplifying the source code, so I'm going to go with that. I'll leave the branch prediction hint in for now, it can always be changed later.
27a35a2
to
fca15f2
Compare
I've made changes in response to each of the comments so far. Please check that you're happy with those changes and the overall state of the PR. (I'll rebase after tomorrow's blead-point release.) |
fca15f2
to
3c6483c
Compare
sv_inline.h
Outdated
/* This is NEGATE_2UV(iv), which can be found in handy.h. */ | ||
/* sv_inline.h does not include handy.h because the latter | ||
* would then get included twice into .c files. */ | ||
uv = (ASSUME((iv) < 0), (UV)-((iv) + 1) + 1U); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handy.h, like most .h files has a guard against getting included twice. Did you try it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I admit not, but something goes wrong with DynaLoader if I do:
cc -c -fwrapv -DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2 -Wall -Werror=pointer-arith -Werror=vla -Wextra -Wno-long-long -Wno-declaration-after-statement -Wc++-compat -Wwrite-strings -Wno-use-after-free -O2 -g -DVERSION=\"1.57\" -DXS_VERSION=\"1.57\" "-I../.." -DLIBC="/lib/x86_64-linux-gnu/libc.so.6" DynaLoader.c
In file included from ../../perl.h:7935,
from DynaLoader.xs:136:
../../sv_inline.h: In function ‘S_uiv_2buf’:
../../sv_inline.h:1066:14: warning: implicit declaration of function ‘NEGATE_2U’ [-Wimplicit-function-declaration]
1066 | uv = NEGATE_2UV(iv);
| ^~~~~~~~~~
rm -rf ../../DynaLoader.o
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused, this code already uses TYPE_CHARS
which is defined in handy.h
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NEGATE_2U
is within an #ifdef PERL_CORE
section, but DynaLoader only defines #define PERL_EXT
. I'll take a look at it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I relaxed the #ifdef around NEGATE_2U
. Shout if this seems inappropriate.
Since this is an optimization, could you add a benchmark that exercises this to |
This allows other parts of core to do integer stringification using the same, fast function.
3c6483c
to
d2bdba3
Compare
`Perl_do_print`'s pre-existing method for stringification of an IV within an SVt_IV involves creating a temporary SVt_PV, using `sv_vcatpvfn_flags` to do the stringification, then freeing the SVt_PV once the buffer has been written out. This is considerably slower than using `S_uiv_2buf`, the helper function used by `sv_2pv_flags`. So this commit modifies `Perl_do_print` to use `sv_2pv_flags`.
d2bdba3
to
573b8f7
Compare
In the end, I couldn't see how to do anything portable since the benchmarks have to run under miniperl. Open to suggestions! |
@richardleach S_uiv_2buf() should really be made public API, I have an unfinished branch to do that but you picked up the idea first. itoa() is MS only, svcatpvf is okay, locale mutex locking, makes atleast on windows, the OSs print fmt string librarys slower and slower over the years. So there isnt any portable fast, num 2 ascii func in the perl api. but sv_2pv_flags is hiding its S_uiv_2buf() from core and cpan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entire commit (but not anything else on this branch)is a very bad idea, dozens/100s of unique addressed static const union { char arr[200]; U16 dummy;} int2str_table
will be made in libperl.so/.dll and in CPAN XS modules. Just make it a normal exported function.
If you want leave the S_*() inner function in sv.c and rest of core uses the extern symbol. Or to get around ELFs sym tab bloat keep 1 wrapper as exported for cpan, then a 2nd or middle end for visibility hidden, ony inside libperl.
Whether LTO does or doesn't optimize it in some way inside 1 binary is upto the CC unless some one visibly identifies a code gen flaw and a b4/after explanation. This fn has 5 args, its too big in my armchair opinion to ever get inlined (ran out of c stack CPU registers/now spilling to c stack in ram). Its ultimately upto the CC, but per C spec, you will get 10s of unique copies of those global vars.
Paranoia about portability to trinary CPUs and 1s complement? Cray (???) was the last supported perl port with weirdness in electronics/fundamentals. IDK if ISO C ever allowed no sign bit CPUs/CCs, or the CRC/parity/ecc bit, so IV would 31 bits, UV is 32 bits. Sign extend/overflow lt gt are software emulated by CC then. SvPVX macros have similar const type propagation since NC 2005ish, going on. The SvPVX code looks as if Perl 5 codebase is ready to move to C++ or JVM, since that was alot of work, well organized, for C/C++ features that don't exist/got withdrawn. |
I had thought about that, but wasn't sure how useful it would be. If no-one objects, I'm happy to put in back into sv.c and make it public. |
Even on a DEBUGGING build, gcc quite happily inlines it in both
|
Do you know what mainstream compilers do this? (gcc does not seem to. I only see |
https://stackoverflow.com/questions/79244888 points out that
pp_print
is really slow at stringifying integers. It's bizarrely much faster to add
additional operations to do the stringification first.
Here are some sample timings:
Note from the last timing that stringification in
pp_print
is notalways slow, only when the IV is stored in a SVt_IV. This traces back
to
Perl_do_print
which has a special path for just this case:(There are no code comments to say why it does this. Perhaps deliberately
preserving the type for some reason?)
PerlIO_printf
callsPerlIO_vprintf
, which callsPerl_vnewSVpvf
, whichcreates a new (effectively) temporary SV, then sends that to
Perl_sv_vcatpvfn_flags
for stringification and storage of the IV.PerlIO_vprintf
then writes out the buffer from the temp SV and frees it.All the other routes essentially call
SvPV_const
on$i
or aPADTMP
.This invokes
sv_2pv_flags
and its helper,S_uiv_2buf
, does the work.No temporary SVs are created and freed in such cases.
This PR changes
Perl_do_print
to also use a small buffer andS_uiv_2buf
to do the stringification more efficiently, whilst stillavoiding a type upgrade, Re-measuring after this PR: