
PRP modular division bug #18

Closed
xanthe-cat opened this issue Apr 4, 2024 · 13 comments · Fixed by #23
Labels
bug Something isn't working enhancement New feature or request

Comments

@xanthe-cat
Collaborator

xanthe-cat commented Apr 4, 2024

As discovered in issue #15, a PRP test of a probable Mersenne prime returned an incorrect result, probably because of legacy code that was not correctly adapted from the Lucas–Lehmer and Pépin test cases.

To PRP test a Mersenne number efficiently with modular squaring, Mlucas evaluates b^(2^p) (mod 2^p − 1), and then as the final step undertakes a modular division by b^2 to obtain the usual form of Fermat’s little theorem, where b^(2^p − 2) ≡ 1 (mod 2^p − 1) indicates the Mersenne number is a probable prime to base b. Any result other than 1 indicates the Mersenne number is composite.

The typical base for this PRP test is 3; of the other small primes, 2 is unsuitable as a base because any Mersenne number with an odd prime exponent passes the 2-PRP test, while bases 5 and 7 present no theoretical problems. An assert at line 698 of Mlucas.c disables PRP bases other than 3 (and was probably put in place owing to this issue).

So for base 3, Mlucas performs a modular division by 9, taking the 3^(2^p) residue from the a[ ] array and storing the result of the division in arrtmp[ ]. (Pull request #17 fixes the primality-checking loop that examines arrtmp[ ] for any non-zero values other than the final 0x1 result.)

Bases 5 and 7 require modular division by 25 and 49 respectively as the final step of the test; in my testing, the 5-PRP results were correct (0x1), while the 7-PRP results were not (the residue 0x31 divided by 7^2 should give 0x1, but 0x0 was returned).

There is therefore a substantial bug in the modular-division code for atypical PRP bases such as 7 (and possibly others).

@xanthe-cat
Collaborator Author

The problem is solved by a small rewrite of mi64.c and the mi64_div_by_scalar64 function:

Mlucas/src/mi64.c

Lines 6893 to 6907 in 93b703d

	if(len == 1) {
		fquo = tmp*fqinv;
		itmp64 = (uint64)fquo;
		rem64 = tmp - q*itmp64;
		// May need a 2nd pass to clean up any ROE in 1st iteration, and
		// must account for ROE which leads to a borrow in the above subtraction to get rem64:
		if(rem64 > tmp) {	// Had a borrow
			fquo = -rem64*fqinv + 1;	// Add one to FP quotient to effect round-toward-zero
			itmp64 -= (uint64)fquo;
			rem64 = rem64 + q*(uint64)fquo;
		} else {
			fquo = rem64*fqinv;
			itmp64 += (uint64)fquo;
			rem64 = rem64 - q*(uint64)fquo;
		}

The problem is that fquo may carry a tiny round-off error, so that instead of 1.00000000000000000000 it equals 0.99999999999999988898, and the conversion itmp64 = (uint64)fquo yields 0 rather than 1. Ernst is aware of round-off errors, which is why he makes two passes, but the problem is in the else limb of the second pass: fquo again evaluates to just shy of an integer (0.99999999999999988898) and is truncated down to zero, so the modular division comes up one short and returns 0x0. One possible workaround (though I’m sure someone can think of something superior) is to nudge fquo slightly higher in the else limb, since (uint64)fquo will truncate it to an integer anyway:

fquo = rem64*fqinv + 0.00000000000001; // CXC: Give fquo a tiny push if we are in the 2nd pass, to ensure a 0.9999999999999 value actually gets to 1.0

@ldesnogu
Collaborator

Ouch. That kind of trick with floating-point values is dangerous.
There might be small differences between ISAs, and it might also change depending on how rounding is set on the machine.
Did you test on both x86-64 and Arm machines?

@xanthe-cat
Collaborator Author

xanthe-cat commented Apr 12, 2024

It’s a nasty hack, which is why I hope someone can think of a more elegant method of dealing with the float to integer conversion.
I have an AVX2 Intel Mac which produced the same rounding problems for the same bases:
3, 5, 6, 9, 10, 11, 12, 13, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 30, 31 — success (M44497 is a known MERSENNE PRIME.)
7, 14, 27, 28, 29 — fail (M44497 is not prime.)
My AVX2 build can only use 3K or 4K FFTs as the smallest workable lengths, so to convince Mlucas to run tests on M44497 (which would normally expect a 2K FFT), I faked the Mlucas.cfg radix data with an inferior timing speed so that it would automatically select 3K or 4K instead.

Edited to add: I nearly forgot. For some reason Mlucas really didn’t like trying the base-31 test of M44497 at 4K (so I substituted 3K, which completed the test successfully):

ERROR: at line 5661 of file ../src/Mlucas.c
Assertion failed: convert_res_bytewise_FP: Illegal combination of nonzero carry = 1, most sig. word =            -474.0000

@ldesnogu
Collaborator

I've had enough of implementing FP operations in our Arm simulator... I now hate FP :-)
Joke aside, time permitting, I'll take a look (but I'm leaving for holiday for a week, so don't hold your breath).

@xanthe-cat
Collaborator Author

Enjoy your holiday, away from devilish FP implementations! I figure there’s no rush to solve this, given that the line 698 assertion in Mlucas.c prevents PRP tests of these other bases anyway, which in turn would prevent the bug from manifesting in the first place.

@tdulcet
Member

tdulcet commented Apr 12, 2024

Thanks for debugging this and finding the source of the problem!

It’s a nasty hack, which is why I hope someone can think of a more elegant method of dealing with the float to integer conversion.

The typical solution to this problem in C is to add 0.5 before the integer conversion. You could instead use one of the round functions from math.h to handle this more robustly. Of course, we would need to be extremely careful that we do not break something else.

For some reason Mlucas really didn’t like trying the 31 base test of M44497 at 4K

That is the same error I got in #19. It looks like it affects multiple FFT lengths.

@xanthe-cat
Collaborator Author

On second thought, I think it is possible to make the fix much more targeted by making the “push” conditional on fquo lying between 0.999…999 (14 nines) and 1.0. Adding 0.5 might be a typical solution for roundoff in other contexts but it is excessive here:

		} else {
			fquo = rem64*fqinv;
			if(fquo >= 0.99999999999999 && fquo < 1.0) fquo += 0.00000000000001; // CXC: Give fquo a tiny push if we are in the 2nd pass, to ensure a 0.9999999999999 value actually gets to 1.0; see https://github.com/primesearch/Mlucas/issues/18
			itmp64 += (uint64)fquo;
			rem64 = rem64 - q*(uint64)fquo;
		}

@tdulcet
Member

tdulcet commented Apr 13, 2024

Hmm, unless we are certain the bad value is going to be greater than 0.99999999999999 on all possible hardware and for all exponent and PRP base combinations, I think adding 0.5 is what you would want to do. That way, if the value is anywhere between 0.5 and 1.499..., it would still be 1 as expected after the integer conversion. Note that the result would only be 1 for primes and when using a PRP residue type of 1, so your conditional assumes the bug only happens in that case.

@xanthe-cat
Collaborator Author

I think we can be confident that the quotient will be evaluated to a fair number of significant digits. The main reason we are looking at this bit of code is a rounding error in b^2 * b^(−2): for some values of b we do not get 1.0, but something very close to it; so I would be quite happy with ten nines (0.9999999999) as the lower bound. Conversely, I would not be happy with 0.5, since we know that value should not result from a multiplication by a modular inverse where we expect divisibility by an integer, and if that is happening we should be getting an error.

Mlucas seems happy using up to about 2^24 − 1 = 16777215 as a PRP base:

INFO: primary restart file p9941 not found...looking for secondary...
INFO: no restart file found...starting run from scratch.
M9941: using FFT length 1K = 1024 8-byte floats, initial residue shift count = 8068
This gives an average    9.708007812500000 bits per digit
The test will be done in form of a 16777215-PRP test.
Using complex FFT radices        32        16
At iter ITERS_BETWEEN_GCHECK_UPDATES = 1000: RES_SHIFT = 7106
[2024-04-13 20:11:39] M9941 Iter# = 9941 [100.00% complete] clocks = 00:00:04.156 [  0.4181 msec/iter] Res64: 0000FFFFFE000001. AvgMaxErr = 0.000000012. MaxErr = 0.250000000. Residue shift count = 6195.
M9941 is a known MERSENNE PRIME.

The rounding error for this base was evidently not problematic, since my debug printout showed itmp64 was evaluated as 1 on the first pass:

Floating-point mod in mi64_div_by_scalar64: fqinv = 0.00000000000000355271, tmp = 281474943156225, fquo = 0.00000000000000000000, itmp64 = 1, x = 281474943156225, q = 281474943156225: exact remainder = 0, FP gives 0.

@tdulcet
Member

tdulcet commented Apr 13, 2024

Thanks for the additional information. As you may know, this is a very common problem when using floating point numbers. The rounding error for the calculation is only going to get bigger as the base increases, especially as one gets near the limits for double precision floating point. Anyway, I will let @ldesnogu make the final decision on how we should fix this when they get back from their holiday.

If you are certain that the value is going to be "very close" to 1.0, you could try adding DBL_EPSILON to it or better yet, calling the nextafter function from math.h. However, I still believe adding 0.5 would be the most robust solution, as short of a fatal roundoff error (greater than 0.5), it should always produce the correct result.

@xanthe-cat
Collaborator Author

xanthe-cat commented Apr 14, 2024

Sorry for my inability to explain the problem and my concern. Generally the point of finding a modular inverse is that you don’t have to do division (which is slow) but instead a modular multiplication by the mod inverse should give the same result as a division by the number it inverts – the algorithm for obtaining a modular inverse is non-trivial but it’s still better than doing division.
If the starting number is an integer (it is), and you find the modular inverse of the divisor (which will also be an integer in Z_(2^p − 1)), then the quotient must also be an integer. The rest of mi64.c seems to be doing that part fine, judging by the cases where the residue is non-trivial for composite Mersennes – all of my tests had Res2048 residues matching mprime, even for the problematic base 7.
I am assuming that (somewhere else in the thousands of lines of code) there is a determination that the number tmp to be divided and the divisor q are both small (i.e., can fit into a uint64) and tmp % q == 0, so that we know we should have an integer quotient. (If we did not have this condition, we would need to use a different algorithm.) Then for speed it uses floating point to do the division on the (mistaken) assumption that rounding errors wouldn’t cause a near-integer result to be cancelled out.
What this means – if this section of code is returning an fquo that has a value that is not even close to an integer, then something is badly wrong that we definitely want to know about, rather than the algorithm sweeping the problem under the rug. Upthread I rather undersold the issue by saying “Adding 0.5 might be a typical solution for roundoff in other contexts but it is excessive here”, because the only case we are interested in are integer solutions (even though we have dined with the devil by doing the calculation in floating point). An occasional floating point error of 0.00000000001 is something I can excuse and live with myself. In this context an error of 0.5 cannot be tolerated.

@tdulcet
Member

tdulcet commented Apr 14, 2024

Thanks for the detailed explanation. My understanding is that all of the many algorithms implemented by Mlucas exclusively involve integers and ideally would be implemented entirely with integer datatypes. However, modern CPUs (and other hardware) are much faster when using floating-point datatypes, and the SIMD instructions needed for good performance on CPUs can only be fully utilized with floating point. That is why all GIMPS programs, including Mlucas, are forced to use floating-point datatypes, which requires constantly converting back and forth between floating-point and integer numbers. As you know, though, floating-point datatypes in general cannot perform integer calculations precisely, so this causes inevitable roundoff errors (ROEs). For example, the double precision (float64) datatype used here can only store integers exactly up to 53 bits. When performing any of the currently supported test types (LL/PRP/Pépin/P-1), Mlucas checks for ROEs after every iteration, and anything up to 0.4375 is accepted. ROEs above 0.2 and 0.3 are very common, as Mlucas typically pushes the limits to achieve the maximum possible performance.

I understand what you are saying that the result of this calculation should be very close to an integer. However, it is multiplying a potentially very large number by a very small number, which could easily result in a huge ROE, as the floating point datatypes are not designed for working with numbers of different magnitudes. By using the floating point datatypes, we have very little control over the ROEs, so we just have to hope that they are less than 0.5 (or 0.4375) so they can be reliably detected. Currently any value between 1.0 and 1.999... would be truncated to 1 after the integer conversion. By adding 0.5 (I am not saying this is the best solution), it would change that to any number between 0.5 and 1.499..., so the size of the range (1) would be the same, it is just shifted down 0.5, similar to how most people round numbers. The problem with adding an arbitrary value like 0.00000000000001 is that it may work most of the time, but it will also likely still fail for some users with different hardware, as @ldesnogu explained, or those using exponent and PRP base combinations that you did not explicitly test. If we are going to add an arbitrary value here, I would much rather it be many orders of magnitude larger than the largest possible ROE value you believe could possibly occur, for example 0.1, which would change the range to between 0.9 and 1.899....

Regardless of how we decide to fix this, feel free to add a warning or assert statement here for the case that the result is not close to an integer, if you believe it is needed.

@xanthe-cat
Collaborator Author

After a little bit of thought I added a warning using fprintf if a division falls in the very narrow range where a problem may appear, as to be honest the assert statements slightly terrify me.
