Fishtest FAQ

How can I give computer time to the Stockfish community?

Even if you don't know programming yet, you can help the Stockfish developers improve the chess engine by connecting your computer to the Fishtest network. Your computer will then play chess games in the background to help the developers test ideas and improvements.

Instructions on how to connect your computer to the Fishtest network are given here:

Can I take my computer offline at any time without wasting work?

For SPRT tests, which are by far the most common type, the worker sends an update to Fishtest every eight games. So on average, you can expect to lose four games when quitting the worker. Four STC games on a 1-core worker represent about 2 minutes of work. Four LTC games (which are less common) represent about 12 minutes of work.
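The arithmetic above can be sketched as follows. This is only an illustration: the function names are hypothetical, the per-game durations (0.5 minutes for STC, 3 minutes for LTC on one core) are simply derived from the figures quoted above, and the model assumes the worker is stopped at a uniformly random point within a reporting batch.

```cpp
#include <cassert>

// Expected number of games lost when the worker is stopped at a uniformly
// random point within a batch of `batch_size` games (assumption: on average
// half of the batch has not been reported yet).
double expected_lost_games(int batch_size) {
    assert(batch_size > 0);
    return batch_size / 2.0;
}

// Expected minutes of lost work, given an assumed duration per game.
double expected_lost_minutes(int batch_size, double minutes_per_game) {
    assert(minutes_per_game >= 0.0);
    return expected_lost_games(batch_size) * minutes_per_game;
}
```

With a batch of eight games, `expected_lost_minutes(8, 0.5)` gives 2 minutes for STC and `expected_lost_minutes(8, 3.0)` gives 12 minutes for LTC, matching the estimates above.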

What is a "residual"?

The statistical models that Fishtest uses are based on the assumption that the pentanomial probabilities (a variation on the win, loss, draw probabilities) are the same for each worker. Therefore for each worker, a "residual" is shown on the overview page of every test. It is a measure of how far the worker deviates from the average. Small deviations are normally just due to statistical fluctuations and these will be colored green. However, if the deviation is exceptionally large then the residual will be colored yellow or even red. If this happens on a regular basis for a particular worker then this may be some cause for concern.

The following questions are more technical and aimed at potential Stockfish developers:

Can I program or run any test I want?

You should first check whether the test has already been run. You can look at the tests history by following the corresponding link on the left of Fishtest's main view.

What time-control/method should I use for my test?

Most tests should use the two-stage approach, starting with stage 1, and if that passes, using the reschedule button to create the stage 2 test.

Selecting the type of test according to the stage you are in will configure all the necessary options for you.

The workflow is: Stage 1, then Reschedule, then Stage 2.

What is SPRT?

SPRT stands for sequential probability ratio test. In SPRT, the null hypothesis is that the two engines are equal in strength, and the alternative hypothesis is that one of the engines is stronger. SPRT tests the hypothesis with the least expected number of games; that is, the number of games to be played is not fixed in advance. The parameters of the test control the Type 1 and Type 2 errors. Essentially, we run matches sequentially, and after each match we update a value computed from a likelihood function. The test is terminated when the value falls below a lower-bound threshold or rises above an upper-bound threshold. The thresholds are calculated from the two parameters given to the test (please read the paragraph "Testing methodology" on the page Creating my first test for details).
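The stopping rule can be illustrated with a minimal sketch. It uses Wald's classical approximation for the thresholds and assumes the two test parameters are the Type 1 error rate (alpha) and the Type 2 error rate (beta). The names `SprtBounds`, `sprt_bounds` and `sprt_decide` are hypothetical and are not taken from the Fishtest code, and the computation of the log-likelihood ratio itself is not shown.

```cpp
#include <cassert>
#include <cmath>

// Lower and upper thresholds for the log-likelihood ratio (LLR).
struct SprtBounds {
    double lower;
    double upper;
};

// Wald's approximate thresholds (assumption: alpha is the accepted Type 1
// error rate and beta the accepted Type 2 error rate).
SprtBounds sprt_bounds(double alpha, double beta) {
    assert(alpha > 0.0 && alpha < 1.0 && beta > 0.0 && beta < 1.0);
    return {std::log(beta / (1.0 - alpha)), std::log((1.0 - beta) / alpha)};
}

enum class SprtState { Continue, AcceptH0, AcceptH1 };

// Decide whether to stop, given the LLR accumulated over the games so far.
SprtState sprt_decide(double llr, const SprtBounds& bounds) {
    if (llr <= bounds.lower) return SprtState::AcceptH0;  // test fails
    if (llr >= bounds.upper) return SprtState::AcceptH1;  // test passes
    return SprtState::Continue;                           // play more games
}
```

For example, with alpha = beta = 0.05 (values chosen here for illustration), the thresholds are about -2.94 and +2.94, so a running LLR of 0 means the test continues.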

What if my test is tuning parameters?

You can use the NumGames stop rule, with 20000 games at TC 10+0.1, and schedule a few tests in the direction you want to tune. If you find a tuning that looks good, you can then schedule a two-stage SPRT test.

How many tries on an idea are too many?

Generally, four or five tries is the limit. It's a good balance between exploring the change and not giving lucky tries too much of a chance to pass.

Can I test a fork of SF?

No. For various reasons, please base your tests on the current SF master.

What is a union patch?

A union is a bundle of patches that each failed SPRT but with a positive or near-positive score. Sometimes the union, retested as a whole, passes SPRT. Due to the nature of the approach, and because each individual patch has already failed, a union is subject to some constraints:

Maximum 2 patches per union
Each patch shall be trivial, like a parameter tweak. Patches that add/remove a concept/idea/feature shall pass individually.

How can I test commits N-1, N-2, ... of a branch?

If your branch name is passed_pawn, you can enter passed_pawn^, passed_pawn^^, ... in the branch field of the test submission page at https://tests.stockfishchess.org/tests/run

How to disable NUMA?

Important

Note for patch authors: when testing patches with more than 8 threads, it is necessary to disable "thread binding" in engine.cpp. Not doing so would have a negative effect on contributors' Windows machines with multiple NUMA nodes (more than one physical CPU) and more than 8 cores, due to the parallelization of our test scripts for Fishtest. This would bias the statistical value of the test.

Diff:

diff --git a/src/engine.cpp b/src/engine.cpp
index 81bb260b..bf3ebc12 100644
--- a/src/engine.cpp
+++ b/src/engine.cpp
@@ -184,23 +184,23 @@ void Engine::set_position(const std::string& fen, const std::vector<std::string>
 // modifiers

 void Engine::set_numa_config_from_option(const std::string& o) {
-    if (o == "auto" || o == "system")
-    {
-        numaContext.set_numa_config(NumaConfig::from_system());
-    }
-    else if (o == "hardware")
-    {
-        // Don't respect affinity set in the system.
-        numaContext.set_numa_config(NumaConfig::from_system(false));
-    }
-    else if (o == "none")
-    {
-        numaContext.set_numa_config(NumaConfig{});
-    }
-    else
-    {
-        numaContext.set_numa_config(NumaConfig::from_string(o));
-    }
+    // if (o == "auto" || o == "system")
+    // {
+    //     numaContext.set_numa_config(NumaConfig::from_system());
+    // }
+    // else if (o == "hardware")
+    // {
+    //     // Don't respect affinity set in the system.
+    //     numaContext.set_numa_config(NumaConfig::from_system(false));
+    // }
+    // else if (o == "none")
+    // {
+    numaContext.set_numa_config(NumaConfig{}); // <-------------
+    // }
+    // else
+    // {
+    //     numaContext.set_numa_config(NumaConfig::from_string(o));
+    // }

     // Force reallocation of threads in case affinities need to change.
     resize_threads();

Why is the regression test bad?

First, note that regression tests are not actually run to detect regressions. SF quality control is very stringent and regressive patches are very unlikely to make it into master. No, they are run to get an idea of SF's progress over time, which is impressive. See

https://github.com/official-stockfish/Stockfish/wiki/Regression-Tests

But still... what if the Elo outcome of a regression test is disappointingly low? Usually, there is little reason to worry.

First of all: wait till the test is finished. Drawing conclusions from an unfinished test is statistically meaningless.
Look at the error bars. The previous test may have been a lucky run, and the current one is perhaps an unlucky one. Note that the error bar is for the Elo relative to the fixed release (base). Differences between two such Elo estimates have nearly double the statistical error (2-3 Elo).
SFdev vs SF11: NNUE vs classical evaluation is very sensitive to the hardware mix present at the time of testing. If a fleet of AVX512 workers is present, Elo will be larger; if it is absent, Elo will be smaller.
Error bars are designed to be right 95% of the time. So, conversely, 1 in 20 tests will be an outlier.
Selection bias is a book-related effect: patches are more likely to be selected if they perform well with the testing book. When they are retested with a different book, their Elo score may be adversely affected.
Elo estimates of single patches (SPRT runs) typically come with large error bars. Take this into account when adding Elo estimates. Furthermore, the Elo estimates of passing patches are biased. The SPRT Elo estimates are only unbiased if one takes all patches into account, both passed and non-passed ones. As a result, the Elo gain measured by a regression test will typically be less than the sum of the estimated Elo gains of the individual patches since the previous regression test.

How to compare opening books

If a book is new, first make a PR against the Stockfish book repo https://github.com/official-stockfish/books and wait for a maintainer to commit it.

Then use the books to run time odds tests of master vs itself with a fixed number of games and compare the normalized Elo estimates - taking into account the error bars. Don't make the time odds too large since the aim is to approximate standard testing conditions. On the other hand, you also cannot make them too small since in that case, you will need many games to separate the books. I have had good experiences with tests of 60000 games with 30% time odds. Using this procedure it has been shown that unbalanced books are definitely better than balanced books for engine tests.

Do not run SPRT tests. They are a waste of resources for this application.
Do not run tests of master vs an earlier version. This may give misleading results as it favors the current book. This effect (selection bias) has been shown to exist several times.
This procedure can also be used to evaluate other testing changes (e.g. contempt). For changes that affect the amount of resources used (e.g. TC) one should take the resources into account (the amount of resources used by a test is ~ (game duration)/(normalized Elo)^2).
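The resource relation in the last point can be turned into a small comparison helper. This is a sketch only: the function names are hypothetical, the proportionality constant is left out (so only ratios are meaningful), and any numbers plugged in are illustrative rather than measured.

```cpp
#include <cassert>

// Relative resource cost of a testing setup, up to a constant factor:
// cost ~ (game duration) / (normalized Elo)^2.
double relative_cost(double game_duration, double normalized_elo) {
    assert(game_duration > 0.0 && normalized_elo != 0.0);
    return game_duration / (normalized_elo * normalized_elo);
}

// Ratio of the cost of setup A to the cost of setup B.
// A value below 1 means setup A needs fewer resources than setup B.
double cost_ratio(double duration_a, double nelo_a,
                  double duration_b, double nelo_b) {
    return relative_cost(duration_a, nelo_a) / relative_cost(duration_b, nelo_b);
}
```

For instance, a setup whose games take twice as long but which yields twice the normalized Elo has a cost ratio of 0.5 relative to the original setup, so it would be the cheaper of the two.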
