RaimoNiskanen
Contributor

This PR adds functions rand:shuffle/1 and rand:shuffle_s/2 due to a discussion on ErlangForums: https://erlangforums.com/t/random-sort-should-be-included-in-the-lists-module/5125

There are 4 algorithms in the first commit. The suggested winner is the one remaining in the second commit.

Documentation and test cases are still missing...

@RaimoNiskanen RaimoNiskanen added this to the OTP-29.0 milestone Oct 14, 2025
@RaimoNiskanen RaimoNiskanen requested a review from bjorng October 14, 2025 14:54
@RaimoNiskanen RaimoNiskanen self-assigned this Oct 14, 2025
@RaimoNiskanen RaimoNiskanen added the team:VM, team:PS, feature, in progress, and priority:medium labels Oct 14, 2025

github-actions bot commented Oct 14, 2025

CT Test Results

    2 files     97 suites   1h 5m 33s ⏱️
2 214 tests 2 162 ✅ 52 💤 0 ❌
2 600 runs  2 544 ✅ 56 💤 0 ❌

Results for commit 95b21d9.


Contributor

@bjorng bjorng left a comment

Nitpick: I don't know whether you intend to keep the first commit. In case you do, the last paragraph is missing a closing parenthesis, and the word "ridiculous" is misspelled.

@RaimoNiskanen RaimoNiskanen force-pushed the raimo/stdlib/rand-shuffle branch from 70efe49 to aba9094 Compare October 15, 2025 07:55
@RaimoNiskanen RaimoNiskanen force-pushed the raimo/stdlib/rand-shuffle branch from aba9094 to 0280798 Compare October 15, 2025 10:33
Write a few shuffle algorithms for comparison.

shuffle1: I have found no formal statement that this algorithm is bias free,
but have tried to reason about it.  The algorithm should be
equivalent to generating more random decimals to decide
the shuffle order for elements with the same random number.
It should make no difference whether the random decimals are always
generated and then ignored, or generated only when needed.

Speed: 1.2 s for 2^20 integers on my laptop.
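
To make the reasoning above concrete, here is a minimal Python sketch of the idea as described: tag every element with a random number, sort on the tag, and reshuffle only the groups that drew the same tag. It is an illustration, not the Erlang code from the commit; the function name and the small default `bits` value (chosen so that ties actually occur) are assumptions.

```python
import random
from itertools import groupby


def shuffle_tag_sort(items, bits=8, rng=random):
    """Shuffle by sorting on random tags; ties are reshuffled recursively."""
    items = list(items)
    if len(items) < 2:
        return items
    # Tag each element with `bits` random bits and sort on the tag only.
    tagged = sorted(((rng.getrandbits(bits), x) for x in items),
                    key=lambda pair: pair[0])
    out = []
    for _, group in groupby(tagged, key=lambda pair: pair[0]):
        tied = [x for _, x in group]
        # Elements sharing a tag get "more random decimals" only when needed.
        out.extend(shuffle_tag_sort(tied, bits, rng) if len(tied) > 1 else tied)
    return out
```

Reshuffling a tied group is the lazy counterpart of giving every element a longer random tag up front, which is the equivalence the commit message argues for.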

shuffle2: The classical textbook shuffle.

Speed: 5 s for 2^20 integers on my laptop.
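
For reference, the classical textbook shuffle (Fisher-Yates) can be sketched as below. This is a Python illustration on a mutable list, not the commit's Erlang code, which has to use some indexable container because Erlang lists are immutable; the function name is hypothetical.

```python
import random


def fisher_yates(items, rng=random):
    """Return a shuffled copy of `items` using the Fisher-Yates algorithm."""
    a = list(items)
    # Walk from the last index down to 1, swapping each position with a
    # uniformly chosen position at or below it.
    for i in range(len(a) - 1, 0, -1):
        j = rng.randrange(i + 1)  # uniform in 0..i inclusive
        a[i], a[j] = a[j], a[i]
    return a
```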

shuffle3: Quite a beautiful algorithm since `gb_trees` has all
the functionality in itself.

Speed: 5 s for 2^20 integers on my laptop.

shuffle4: The same as the `gb_trees` variant above, but with a map.  Uses
the map key order instead of the general term order,
which works just fine.

Speed: 2 s for 2^20 integers on my laptop.
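
The two keyed variants above can be pictured with the following Python sketch, which assumes they work by giving every element a unique random key and reading the elements back in key order. A `dict` stands in for the tree or map; the function name and the `key_bits` parameter are hypothetical.

```python
import random


def shuffle_unique_keys(items, key_bits=56, rng=random):
    """Shuffle by storing each element under a unique random key."""
    keyed = {}
    for x in items:
        k = rng.getrandbits(key_bits)
        while k in keyed:  # key already used: draw again
            k = rng.getrandbits(key_bits)
        keyed[k] = x
    # Reading the elements back in key order yields the shuffled list.
    return [keyed[k] for k in sorted(keyed)]
```

Like the variants it imitates, this sketch never terminates if the list has more elements than there are possible keys.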

shuffle5: Suggested by Richard A. O'Keefe on ErlangForums as
"a random variant of Quicksort".  Shall we name it Quickshuffle?

Really fast. Uses random numbers efficiently by looking at
individual bits for the random split.  Has no overhead
for tagging.  Just creates intermediate lists as garbage.

This algorithm appears to be equivalent to shuffle1,
using a random number generator with 1 bit.

Speed: 0.8 s for 2^20 integers on my laptop.
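
A minimal Python sketch of the "random variant of Quicksort" described above: one random bit per element decides which of two sublists it goes to, and each sublist is shuffled recursively. It leaves out the commit's optimizations, and the function name is an assumption.

```python
import random


def quickshuffle(items, rng=random):
    """Shuffle by random binary splits, one random bit per element."""
    items = list(items)
    if len(items) < 2:
        return items
    bits = rng.getrandbits(len(items))  # one bit for every element
    zeros, ones = [], []
    for i, x in enumerate(items):
        (ones if (bits >> i) & 1 else zeros).append(x)
    # If every element lands on one side, the recursion simply retries.
    return quickshuffle(zeros, rng) + quickshuffle(ones, rng)
```

No element is ever tagged; the only extra allocation is the intermediate lists, which matches the "no overhead for tagging" remark.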

shuffle6: The classical textbook shuffle.

Our standard `array` module here outperforms map, probably
because keys do not have to be stored; they are implicit.

Speed: 2 s for 2^20 integers on my laptop.

shuffle3 and shuffle4 have the theoretical limitation that
when the length of the list approaches the generator size,
it takes catastrophically longer to generate
a random number that has not already been used.

There is no check for the list length being larger than
the generator size, in which case it is impossible
to generate unique random numbers for all list elements,
and the algorithm will simply keep on failing forever.
This is for now a theoretical problem, since already for
a list length whose logarithm is half that of the generator size
(e.g. 2^28 with a generator size of 2^56), my laptop
runs out of memory with a VM of about 30 GB.
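
The slowdown can be quantified with a standard argument: once i keys are taken out of m possible values, a fresh draw is unused with probability (m - i)/m, so it takes m/(m - i) draws on average. The Python sketch below sums that over all n elements; the function name is hypothetical and the calculation assumes uniformly distributed keys.

```python
from fractions import Fraction


def expected_draws(n, m):
    """Expected number of draws to get n distinct keys out of m values."""
    if n > m:
        raise ValueError("impossible: more elements than possible keys")
    # Each term is the mean of a geometric distribution.
    return sum(Fraction(m, m - i) for i in range(n))
```

For a list that uses every key, `expected_draws(4, 4)` is 25/3, more than twice the list length, and the ratio keeps growing with m. For n far below m the sum stays close to n.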

shuffle1 and shuffle5 avoid that limitation:  shuffle1 by recursing
over the sublists of duplicates, so it is not affected much
by fairly long lists of duplicates; shuffle5 by using only
individual bits and ranges 2, 6, or 24.

The classical Fisher-Yates algorithm in shuffle2 and shuffle6
does not have this limitation either.  Generating random numbers
of unlimited length gets increasingly expensive, but should not be
any problem for 2 or even 4 times the generator length, that is,
list lengths of well over 2^200, which is well over ridiculous.
@RaimoNiskanen RaimoNiskanen force-pushed the raimo/stdlib/rand-shuffle branch from 0280798 to a536947 Compare October 16, 2025 14:17
@RaimoNiskanen
Contributor Author

New algorithm selected. "Quickshuffle"?

@RaimoNiskanen RaimoNiskanen force-pushed the raimo/stdlib/rand-shuffle branch from fb2cb14 to 8e991bf Compare October 17, 2025 08:45
@RaimoNiskanen
Contributor Author

I wrote a longer explanation of the algorithm.

@RaimoNiskanen RaimoNiskanen force-pushed the raimo/stdlib/rand-shuffle branch from 8e991bf to c72c71c Compare October 17, 2025 09:51
@RaimoNiskanen RaimoNiskanen force-pushed the raimo/stdlib/rand-shuffle branch from c72c71c to 3aeae41 Compare October 18, 2025 21:45
* Use the raw generator as a bitstream.
* Optimize the permutation of 3 and 4 elements by rejection sampling.
* Use `div` instead of `rem` for a simpler reject-and-retry test.
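
One way to read the last two bullets, sketched in Python under stated assumptions: to pick one of n equally likely outcomes from a fixed number of raw bits, divide the raw value by q = 2^bits div n and reject any quotient that is n or larger. Every accepted value then has exactly q raw inputs, so the only test needed is `v < n`. The function names and the 8-bit default are hypothetical, and the actual Erlang code may differ.

```python
import random
from itertools import permutations


def uniform_below(n, bits=8, rng=random):
    """Uniform integer in range(n) from `bits` raw bits, by rejection."""
    if not 0 < n <= (1 << bits):
        raise ValueError("n must be in 1..2**bits")
    q = (1 << bits) // n  # raw inputs per accepted value
    while True:
        v = rng.getrandbits(bits) // q
        if v < n:  # reject-and-retry test
            return v


def shuffle_small(items, rng=random):
    """Pick one permutation of a 3 or 4 element list uniformly."""
    perms = list(permutations(items))  # 6 or 24 candidates
    return list(perms[uniform_below(len(perms), rng=rng)])
```

With 8 bits and n = 6, q is 42, so raw values 252 to 255 are rejected and the rest map evenly onto 0..5.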
@RaimoNiskanen
Contributor Author

Pushed some optimizations

@RaimoNiskanen RaimoNiskanen force-pushed the raimo/stdlib/rand-shuffle branch from 3c0ceca to 6e4d1e8 Compare October 20, 2025 12:10
Shuffle a list.

From the specified `State` shuffles a list
so that every element in `List` has an equal probability

@gproskurin gproskurin Oct 20, 2025

This statement about "equal probability" is quite strong if taken literally. Are you sure it is mathematically correct?

Contributor Author

The goal is that our shuffle algorithm should be good enough for an application such as a poker site, so it should have no exploitable anomalies in itself. Then, of course, the chosen backend PRNG algorithm also has to be non-exploitable.

I hoped that "equal probability" would say precisely that.
Maybe it should also say that it depends on the backend PRNG...

There has been a discussion on ErlangForums for a while about which algorithm to choose, and I am convinced that this chosen one, if implemented correctly, should have no statistical flaws, and be among the fastest possible given that requirement.

If it can be shown that this algorithm does not have an equal probability for every possible permutation of the returned list, then I will change to one that has...
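
An empirical check of the "equal probability for every permutation" claim could look like the Python sketch below. It shuffles a short list many times, counts each permutation, and computes a chi-square statistic against the uniform expectation. It is not part of the PR and cannot prove uniformity, only reveal gross bias; all names are hypothetical, and `naive` is a deliberately biased shuffle included for contrast.

```python
import random
from collections import Counter
from itertools import permutations
from math import factorial


def chi_square_of_shuffle(shuffle, n=3, trials=60_000, rng=None):
    """Chi-square statistic of permutation counts against uniform."""
    rng = rng or random.Random()
    items = list(range(n))
    counts = Counter(tuple(shuffle(items, rng)) for _ in range(trials))
    expected = trials / factorial(n)
    return sum((counts[p] - expected) ** 2 / expected
               for p in permutations(items))


def fair(items, rng):
    """Fisher-Yates: every permutation is equally likely."""
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randrange(i + 1)
        a[i], a[j] = a[j], a[i]
    return a


def naive(items, rng):
    """Biased: swaps every position with one drawn from the full range."""
    a = list(items)
    for i in range(len(a)):
        j = rng.randrange(len(a))
        a[i], a[j] = a[j], a[i]
    return a
```

For three elements there are five degrees of freedom, so a fair shuffle should usually score in the single digits, while `naive` spreads 27 equally likely swap sequences over 6 permutations and scores in the hundreds at this trial count.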

Contributor Author

I have rephrased to clarify the relationship between the shuffle algorithm and the PRNG algorithm.

Comment on lines +1367 to +1375
%% Randomly split the list in two lists, and recursively shuffle
%% the two smaller lists.
Contributor

Suggested change
- %% Randomly split the list in two lists, and recursively shuffle
- %% the two smaller lists.
+ %% Randomly split the list in four lists, and recursively shuffle
+ %% the four smaller lists.

Contributor Author

This text describes the basic split-in-two algorithm that has pseudocode further down, and the explanation is centered on one binary bit at a time.

That the split in four is an optimization is stated in this comment's last paragraph, and in the comments where the split by four is done.
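
To illustrate why splitting in four is only an optimization of the binary description, here is a Python sketch that consumes two random bits per element and distributes the elements over four sublists. That equals two consecutive binary splits done in one pass. It is an assumption about the general shape, not the PR's Erlang code, and the function name is hypothetical.

```python
import random


def quickshuffle4(items, rng=random):
    """Shuffle by random four-way splits, two random bits per element."""
    items = list(items)
    if len(items) < 2:
        return items
    bits = rng.getrandbits(2 * len(items))  # two bits for every element
    buckets = ([], [], [], [])
    for i, x in enumerate(items):
        buckets[(bits >> (2 * i)) & 3].append(x)
    out = []
    for bucket in buckets:
        out.extend(quickshuffle4(bucket, rng))
    return out
```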

@RaimoNiskanen RaimoNiskanen force-pushed the raimo/stdlib/rand-shuffle branch from 5f73e08 to 95b21d9 Compare October 21, 2025 08:57

4 participants