Error: Assertion failed: convert_res_bytewise_FP: Illegal combination of nonzero carry = 1 #19

Open
tdulcet opened this issue Apr 7, 2024 · 4 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@tdulcet
Member

tdulcet commented Apr 7, 2024

Latest Mlucas v21.0.1, AVX2 build, Assignment: PRP=1,2,700001,-1

Output:

$ ./Mlucas -cpu 0

    Mlucas 21.0.1

    https://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
System total RAM = 15893, free RAM = 15174
INFO: 15174 MB of free system RAM detected.
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 11.4.0.
HWLOC Version = 2.5.0;
        Hardware topology: 7 levels, 1 sockets, 6 cores, 12 logical processors (threads)
INFO: Build uses AVX2 instruction set.
INFO: Using prefetch.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
INFO: MLUCAS_PATH is set to ""
INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing IMUL routines...
INFO: Testing 64-bit 2^p (mod q) functions with 100000 random (p, q odd) pairs...
INFO: System has 12 available processor cores.
INFO: testing FFT radix tables...
Set affinity for the following 1 cores: 0.
User did not set LowMem in mlucas.ini ... allowing all test types.
User did not set CheckInterval in mlucas.ini ... using default.
NTHREADS = 1
Setting ITERS_BETWEEN_CHECKPOINTS = 10000.
 looking for worktodo.txt file...
 worktodo.txt file found...reading next assignment...
 worktodo.txt entry: PRP=1,2,700001,-1

INFO: Maximum recommended exponent for FFT length (96 Kdbl) = 1983260; p[ = 700001]/pmax_rec = 0.3529547311.
Initial DWT-multipliers chain length = [long] in carry step.
INFO: primary restart file p700001 not found...looking for secondary...
INFO: no restart file found...starting run from scratch.
mers_mod_square: Init threadpool of 1 threads
Using 1 threads in carry step
At iter ITERS_BETWEEN_GCHECK_UPDATES = 1000: RES_SHIFT = 551631
<snip>
INFO: Maximum recommended exponent for FFT length (96 Kdbl) = 1983260; p[ = 700001]/pmax_rec = 0.3529547311.
Initial DWT-multipliers chain length = [hiacc] in carry step.
 INFO: restart file p700001 found...reading...
INFO: Maximum recommended exponent for FFT length (96 Kdbl) = 1983260; p[ = 700001]/pmax_rec = 0.3529547311.
Initial DWT-multipliers chain length = [hiacc] in carry step.
 INFO: restart file p700001 found...reading...
INFO: Maximum recommended exponent for FFT length (96 Kdbl) = 1983260; p[ = 700001]/pmax_rec = 0.3529547311.
Initial DWT-multipliers chain length = [hiacc] in carry step.
 INFO: restart file p700001 found...reading...
ERROR: at line 5668 of file ../src/Mlucas.c
Assertion failed: convert_res_bytewise_FP: Illegal combination of nonzero carry = 1, most sig. word =             -21.0000

.stat file:

INFO: primary restart file p700001 not found...looking for secondary...
INFO: primary restart file p700001 not found...looking for secondary...
INFO: no restart file found...starting run from scratch.
M700001: using FFT length 96K = 98304 8-byte floats, initial residue shift count = 51327
This gives an average    7.120778401692708 bits per digit
The test will be done in form of a 3-PRP test.
Using complex FFT radices        24         8        16        16
At iter ITERS_BETWEEN_GCHECK_UPDATES = 1000: RES_SHIFT = 551631
<snip>
Restarting M700001 at iteration = 380000. Res64: 03980C61BBAC8628, residue shift count = 460455
M700001: using FFT length 96K = 98304 8-byte floats, initial residue shift count = 460455
This gives an average    7.120778401692708 bits per digit
The test will be done in form of a 3-PRP test.
[2024-04-07 04:21:14] M700001 Iter# = 390000 [55.71% complete] clocks = 00:00:13.071 [  1.3072 msec/iter] Res64: D0BA77D74E6BBBEA. AvgMaxErr = 0.000000004. MaxErr = 0.000000006. Residue shift count = 644618.
Restarting M700001 at iteration = 390000. Res64: D0BA77D74E6BBBEA, residue shift count = 644618
M700001: using FFT length 96K = 98304 8-byte floats, initial residue shift count = 644618
This gives an average    7.120778401692708 bits per digit
The test will be done in form of a 3-PRP test.
[2024-04-07 04:21:27] M700001 Iter# = 400000 [57.14% complete] clocks = 00:00:12.907 [  1.2908 msec/iter] Res64: 6BAE11EC9CF55E94. AvgMaxErr = 0.000000004. MaxErr = 0.000000006. Residue shift count = 28904.
Restarting M700001 at iteration = 400000. Res64: 6BAE11EC9CF55E94, residue shift count = 28904
M700001: using FFT length 96K = 98304 8-byte floats, initial residue shift count = 28904
This gives an average    7.120778401692708 bits per digit
The test will be done in form of a 3-PRP test.
[2024-04-07 04:21:40] M700001 Iter# = 410000 [58.57% complete] clocks = 00:00:12.994 [  1.2995 msec/iter] Res64: 5754D94558C49ED9. AvgMaxErr = 0.000000004. MaxErr = 0.000000007. Residue shift count = 285098.

Edit: Ken also had this issue when attempting to PRP test a known prime:

worktodo.txt entry: PRP=1,2,77232917,-1,70,0

INFO: Maximum recommended exponent for FFT length (4608 Kdbl) = 87540871; p[ = 77232917]/pmax_rec = 0.8822498122.
Initial DWT-multipliers chain length = [long] in carry step.
INFO: restart file p77232917 found...reading...
ERROR: Function convert_res_bytewise_FP, at line 5680 of file ../src/Mlucas.c
Assertion '0' failed: convert_res_bytewise_FP: Illegal combination of nonzero carry = 1, most sig. word = 16448.0000
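For readers unfamiliar with what a "nonzero carry" means at this stage, here is a toy C sketch of normalizing a residue stored as balanced (signed) digits into nonnegative digits. It is not Mlucas's convert_res_bytewise_FP routine: the function name normalize_balanced, the fixed base, and the use of long integers are all assumptions made for illustration. The logs above report a non-integer average bits per digit, so the real word sizes presumably vary.

```c
#include <assert.h>
#include <stdio.h>

/* Toy model (hypothetical, not Mlucas code): normalize n digits stored
 * least-significant first, each nominally in [-base/2, base/2], into the
 * range [0, base) in place. Returns the carry out of the top digit. */
long normalize_balanced(long *digit, int n, long base)
{
    long carry = 0;
    assert(n > 0 && base > 1);
    for (int i = 0; i < n; i++) {
        long t = digit[i] + carry;   /* fold in the carry from the word below */
        carry = 0;
        while (t < 0)     { t += base; carry--; }  /* borrow from next word */
        while (t >= base) { t -= base; carry++; }  /* overflow into next word */
        digit[i] = t;
    }
    return carry;
}

/* Print the digits most-significant first, followed by the final carry. */
void print_digits(const long *digit, int n, long carry)
{
    for (int i = n - 1; i >= 0; i--)
        printf("%ld ", digit[i]);
    printf("(carry out = %ld)\n", carry);
}
```

In this toy model, digits that start inside the balanced range can only produce a final carry of 0 or -1 (the latter meaning the represented value was negative). A final carry of +1 appears only if some input digit was out of range. Whether that is what triggers the assertion reported here is not established in this thread.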
@tdulcet added the bug label on Apr 7, 2024
@tdulcet changed the title from "Assertion failed: convert_res_bytewise_FP: Illegal combination of nonzero carry" to "Error: Assertion failed: convert_res_bytewise_FP: Illegal combination of nonzero carry" on Apr 7, 2024
@xanthe-cat
Collaborator

I ran the same PRP on my M1/ASIMD build where it selected a much smaller FFT of 36K; are you able to use a smaller FFT than 96K?
First p700001.stat:

INFO: primary restart file p700001 not found...looking for secondary...
INFO: no restart file found...starting run from scratch.
M700001: using FFT length 36K = 36864 8-byte floats, initial residue shift count = 51327
This gives an average   18.988742404513889 bits per digit
The test will be done in form of a 3-PRP test.
Using complex FFT radices        36        32        16
At iter ITERS_BETWEEN_GCHECK_UPDATES = 1000: RES_SHIFT = 551631
[2024-04-09 08:21:22] M700001 Iter# = 10000 [ 1.43% complete] clocks = 00:00:04.149 [  0.4150 msec/iter] Res64: F73F55AC8F92C1F0. AvgMaxErr = 0.036662518. MaxErr = 0.062500000. Residue shift count = 470088.
...
[2024-04-09 08:24:04] M700001 Iter# = 390000 [55.71% complete] clocks = 00:00:04.214 [  0.4214 msec/iter] Res64: D0BA77D74E6BBBEA. AvgMaxErr = 0.036756567. MaxErr = 0.062500000. Residue shift count = 644618.
[2024-04-09 08:24:08] M700001 Iter# = 400000 [57.14% complete] clocks = 00:00:04.188 [  0.4188 msec/iter] Res64: 6BAE11EC9CF55E94. AvgMaxErr = 0.036776587. MaxErr = 0.062500000. Residue shift count = 28904.
[2024-04-09 08:24:13] M700001 Iter# = 410000 [58.57% complete] clocks = 00:00:04.225 [  0.4225 msec/iter] Res64: 5754D94558C49ED9. AvgMaxErr = 0.036733561. MaxErr = 0.054687500. Residue shift count = 285098.
...
[2024-04-09 08:26:20] M700001 Iter# = 700000 [100.00% complete] clocks = 00:00:04.651 [  0.4652 msec/iter] Res64: 3D70083B9439BA98. AvgMaxErr = 0.036834389. MaxErr = 0.062500000. Residue shift count = 51327.
[2024-04-09 08:26:20] M700001 Iter# = 700001 [100.00% complete] clocks = 00:00:00.000 [  0.7658 msec/iter] Res64: C8C0467CC5E32F55. AvgMaxErr = 0.027343750. MaxErr = 0.027343750. Residue shift count = 102654.
M700001 is not prime. Program: E21.0.1. Final residue shift count = 102654.
If using the manual results submission form at mersenne.org, paste the following JSON-formatted results line:
{"status":"C", "exponent":700001, "worktype":"PRP-3", "res64":"32C007D4F98B0542", "residue-type":1, "fft-length":36864, "shift-count":102654, "error-code":"00000000", "program":{"name":"Mlucas", "version":"21.0.1"}, "timestamp":"2024-04-08 22:26:20 UTC"}

Your run appears to have the 10000-iteration restart bug which I might separately flag as an issue. The output from Mlucas looked like:

cxc@192-168-1-3 obj_asimd % ./Mlucas         

    Mlucas 21.0.1

    http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
INFO: 16384 MB of available system RAM detected.
CPU Family = ARM Embedded ABI, OS = OS X, 64-bit Version, compiled with Gnu-C-compatible [llvm/clang], Version 14.0.0 (clang-1400.0.29.202).
INFO: Build uses ARMv8 advanced-SIMD instruction set.
INFO: Using prefetch.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
INFO: MLUCAS_PATH is set to ""
INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation. 
INFO: testing IMUL routines...
INFO: Testing 64-bit 2^p (mod q) functions with 100000 random (p, q odd) pairs...
INFO: System has 8 available processor cores.
INFO: testing FFT radix tables...
User did not set LowMem in mlucas.ini ... allowing all test types.
User did not set CheckInterval in mlucas.ini ... using default.
No CPU set or threadcount specified ... running single-threaded.
Set affinity for the following 1 cores: 0.
Setting ITERS_BETWEEN_CHECKPOINTS = 10000.
 looking for worktodo.txt file...
 worktodo.txt file found...reading next assignment...
 worktodo.txt entry: PRP=1,2,700001,-1,75,0

INFO: Maximum recommended exponent for FFT length (36 Kdbl) = 759433; p[ = 700001]/pmax_rec = 0.9217416151.
Initial DWT-multipliers chain length = [long] in carry step.
INFO: primary restart file p700001 not found...looking for secondary...
INFO: no restart file found...starting run from scratch.
mers_mod_square: Init threadpool of 1 threads
Using 1 threads in carry step
At iter ITERS_BETWEEN_GCHECK_UPDATES = 1000: RES_SHIFT = 551631
M700001 is not prime. Program: E21.0.1. Final residue shift count = 102654.
If using the manual results submission form at mersenne.org, paste the following JSON-formatted results line:
{"status":"C", "exponent":700001, "worktype":"PRP-3", "res64":"32C007D4F98B0542", "residue-type":1, "fft-length":36864, "shift-count":102654, "error-code":"00000000", "program":{"name":"Mlucas", "version":"21.0.1"}, "timestamp":"2024-04-08 22:26:20 UTC"}

@tdulcet
Member Author

tdulcet commented Apr 9, 2024

> I ran the same PRP on my M1/ASIMD build where it selected a much smaller FFT of 36K; are you able to use a smaller FFT than 96K?

Thanks for testing it. When passing the -fft 36K option, it still used 96K, but I was able to fudge the ms/iter speeds in mlucas.cfg so that the 36K FFT length was faster. This caused it to use 36K, which did work as expected:

$ ./Mlucas -cpu 0:3

    Mlucas 21.0.1

    http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
System total RAM = 15893, free RAM = 15326
INFO: 15326 MB of free system RAM detected.
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 11.4.0.
HWLOC Version = 2.5.0;
        Hardware topology: 7 levels, 1 sockets, 6 cores, 12 logical processors (threads)
INFO: Build uses AVX2 instruction set.
INFO: Using prefetch.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: Using FMADD-based 100-bit modmul routines for factoring.
INFO: MLUCAS_PATH is set to ""
INFO: using 64-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing IMUL routines...
INFO: Testing 64-bit 2^p (mod q) functions with 100000 random (p, q odd) pairs...
INFO: System has 12 available processor cores.
INFO: testing FFT radix tables...
Set affinity for the following 4 cores: 0.1.2.3.
User did not set LowMem in mlucas.ini ... allowing all test types.
User did not set CheckInterval in mlucas.ini ... using default.
NTHREADS = 4
Setting ITERS_BETWEEN_CHECKPOINTS = 10000.
 looking for worktodo.txt file...
 worktodo.txt file found...reading next assignment...
 worktodo.txt entry: PRP=1,2,700001,-1

INFO: Maximum recommended exponent for FFT length (36 Kdbl) = 759433; p[ = 700001]/pmax_rec = 0.9217416151.
Initial DWT-multipliers chain length = [long] in carry step.
INFO: primary restart file p700001 not found...looking for secondary...
INFO: no restart file found...starting run from scratch.
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
mers_mod_square: Init threadpool of 4 threads
Using 4 threads in carry step
At iter ITERS_BETWEEN_GCHECK_UPDATES = 1000: RES_SHIFT = 551631
M700001 is not prime. Program: E21.0.1. Final residue shift count = 102654.
If using the manual results submission form at mersenne.org, paste the following JSON-formatted results line:
{"status":"C", "exponent":700001, "worktype":"PRP-3", "res64":"32C007D4F98B0542", "residue-type":1, "fft-length":36864, "shift-count":102654, "error-code":"00000000", "program":{"name":"Mlucas", "version":"21.0.1"}, "timestamp":"2024-04-09 09:43:00 UTC"}

The bug appears to be related to using a larger-than-optimal FFT length. That should work, but in the meantime maybe Mlucas should not select FFT lengths more than some multiple of the optimal one, even if they benchmark faster.
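As a rough cross-check of how oversized the 96K FFT is for this exponent, the following C sketch recomputes the "bits per digit" and p/pmax_rec figures from the logs above. It assumes, based only on the logged numbers, that bits per digit is the exponent divided by the FFT length in doubles; the function names are hypothetical and not part of Mlucas.

```c
#include <assert.h>
#include <stdio.h>

/* Average number of exponent bits carried per FFT word (assumed to be
 * p / n_doubles, which matches the values printed in the .stat files). */
double bits_per_digit(unsigned long p, unsigned long n_doubles)
{
    assert(n_doubles > 0);
    return (double)p / (double)n_doubles;
}

/* Ratio of the exponent to the maximum recommended exponent for an FFT
 * length, as printed in the "p/pmax_rec" log lines. */
double exponent_ratio(unsigned long p, unsigned long pmax_rec)
{
    assert(pmax_rec > 0);
    return (double)p / (double)pmax_rec;
}

/* Print both figures for one FFT length. */
void print_fft_summary(unsigned long p, unsigned long n_doubles,
                       unsigned long pmax_rec)
{
    printf("FFT %luK: %.6f bits/digit, p/pmax_rec = %.10f\n",
           n_doubles / 1024, bits_per_digit(p, n_doubles),
           exponent_ratio(p, pmax_rec));
}
```

Calling print_fft_summary(700001, 98304, 1983260) and print_fft_summary(700001, 36864, 759433) covers the 96K and 36K cases from the logs: roughly 7.12 versus 18.99 bits per digit, and ratios of about 0.35 versus 0.92.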

@xanthe-cat
Collaborator

xanthe-cat commented Apr 27, 2024

I have a question about something which might be tangentially related to this problem: do you know how the code throttles the variable which decides how to chain multiplications together? In the Mlucas standard output there are lines such as:

Initial DWT-multipliers chain length = [long] in carry step.

If things are not going well, one of Ernst’s tricks is to change the chain length; your output above soon changes to:

Initial DWT-multipliers chain length = [hiacc] in carry step.

Usually [long] is the fastest mode, though Ernst has three further settings to more carefully multiply, [medium], [short], and dialling things up to eleven, [hiacc]. I presume that last setting is an abbreviation for “high accuracy”.
One of my problems (trying to use a teensy FFT for the Suyama test of $F_{13}$) is that it seems to do a whole lot of mod-square calculations (eight thousand or so) fine, and then as it tries to perform the final one it drops the ball with this carry error. Since the Suyama call is a separate part of the codebase, I would like to tell Mlucas to use the [hiacc] setting for that one mod-square operation, but I don’t see how that can even be specified.

@tdulcet
Member Author

tdulcet commented Apr 28, 2024

In my example, I do not believe it should be using [hiacc], as the ROE is already extremely low (MaxErr = 0.000000006) due to the excessively large FFT length, so I suspect that this is a separate issue caused by #21.

Anyway, to answer your question, the "chain length" is not something that can generally be specified on a per-iteration basis. When Mlucas detects that the ROE is too high, it first tries changing the chain length to a more accurate setting, and only then resorts to increasing the FFT length, which is of course much more costly in terms of performance. When it does change the chain length, it restarts the test from the last savefile, which means that it loses up to 10K iterations by default. Considering that the entire F13 test has fewer than 10K iterations, it would probably be easiest to force the test to use [hiacc] from the start. In that case, just adjust the logic as needed here:

Mlucas/src/Mlucas.c

Lines 1351 to 1360 in 1839858

// Set initial value of USE_SHORT_CY_CHAIN based on how close p/pmax is to 1.0, but only if current chain length is longer
// (e.g. if ROE-retry logic has led to a shorter-than-default chain length, don't revert to default):
if(exp_ratio > 0.99 && USE_SHORT_CY_CHAIN < 3)
    USE_SHORT_CY_CHAIN = 3;
else if(exp_ratio > 0.98 && USE_SHORT_CY_CHAIN < 2)
    USE_SHORT_CY_CHAIN = 2;
else if(exp_ratio > 0.97 && USE_SHORT_CY_CHAIN < 1)
    USE_SHORT_CY_CHAIN = 1;
const char*arr_sml[] = {"long","medium","short","hiacc"};
fprintf(stderr,"Initial DWT-multipliers chain length = [%s] in carry step.\n",arr_sml[USE_SHORT_CY_CHAIN]);

For example, you could add a USE_SHORT_CY_CHAIN = USE_SHORT_CY_CHAIN_MAX; line above the fprintf() function.
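A minimal standalone sketch of the same selection logic, with the suggested override expressed as a flag, might look like the following. The function names, the force_hiacc parameter, and the assumption that USE_SHORT_CY_CHAIN_MAX equals 3 (the index of "hiacc" in the array above) are illustrative only and are not taken from Mlucas.c.

```c
#include <assert.h>
#include <stdio.h>

/* Chain-length levels, in the order of the arr_sml[] names quoted above.
 * Assumption: USE_SHORT_CY_CHAIN_MAX corresponds to CHAIN_HIACC (3). */
enum { CHAIN_LONG = 0, CHAIN_MEDIUM = 1, CHAIN_SHORT = 2, CHAIN_HIACC = 3 };

static const char *const chain_names[] = {"long", "medium", "short", "hiacc"};

/* Return the printable name for a chain-length level. */
const char *chain_name(int level)
{
    assert(level >= CHAIN_LONG && level <= CHAIN_HIACC);
    return chain_names[level];
}

/* Mirror of the quoted if/else-if chain: move to a more accurate level the
 * closer exp_ratio = p/pmax is to 1.0, but never to a less accurate level
 * than the current one. force_hiacc (hypothetical) models the suggested
 * one-line override placed just before the fprintf(). */
int initial_chain_length(double exp_ratio, int current, int force_hiacc)
{
    assert(current >= CHAIN_LONG && current <= CHAIN_HIACC);
    if (exp_ratio > 0.99 && current < CHAIN_HIACC)
        current = CHAIN_HIACC;
    else if (exp_ratio > 0.98 && current < CHAIN_SHORT)
        current = CHAIN_SHORT;
    else if (exp_ratio > 0.97 && current < CHAIN_MEDIUM)
        current = CHAIN_MEDIUM;
    if (force_hiacc)
        current = CHAIN_HIACC;       /* override applied last */
    return current;
}
```

With the ratio from the 96K run (about 0.35) and a starting level of [long], this sketch leaves the level at [long]; if the level has already been raised to [hiacc] by a retry, it stays there, which is consistent with the repeated [hiacc] lines in the output above. Passing a nonzero force_hiacc always yields [hiacc].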

@tdulcet changed the title from "Error: Assertion failed: convert_res_bytewise_FP: Illegal combination of nonzero carry" to "Error: Assertion failed: convert_res_bytewise_FP: Illegal combination of nonzero carry = 1" on Nov 2, 2024