Debunking the Chessboard: Confronting GPTs Against Chess Engines to Estimate Elo Ratings and Assess Legal Move Abilities
Source code supporting experiments presented in https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/ I've started with the GPT-3 chess experiment from https://github.com/clevcode/skynet-dev and explore the variability space of experiments (GPT variants, number of games, Stockfish skills, temperature, prompt, etc.)
pip install openai chess stockfish
for the analysis part, you'll also need libraries like pandas
and matplotlib
See https://stockfishchess.org/download/
I'm using the Linux stockfish-ubuntu-x86-64-avx2
binary (version 16).
export OPENAI_KEY=<your-openai-API-key>
python ./gptchess/gpt-experiments.py
or if you want to run multiple games, you can use a loop like:
for i in {1..20}; do python gptchess/gpt-experiments.py; done
Edit the source code to change the GPT version, the number of games, the number of moves per game, etc. Something like this:
chess_config = ChessEngineConfig(
skill_level=4,
engine_depth=15,
engine_time=None,
random_engine=False
)
gpt_config = GPTConfig(
model_gpt="gpt-3.5-turbo-instruct",
temperature=0.0,
max_tokens=5,
chat_gpt=False,
system_role_message=None
)
play_game(chess_config, gpt_config, base_pgn=BASE_PGN, nmove=1, white_piece=False)
The outcome is located in output
folder and is a subfolder, with the PGN file of the game, the log of the game, and the session with GPT.
You can then analyze the data with the Jupyter notebook analysis.ipynb
.
For convenience, we have released the data used as part of the experiment documented in the blog post.
It has been saved into the games.tar.gz
of this repo.
I'm usually using zenodo, but the size of the data is manageable.
Decompress the archive to get games
folder.
> ls
game30da9c70-31ed-4489-9e93-9ef179661da4 game6a246b94-4511-4b34-908a-6f1fd5b838ac game98a79424-b219-4764-a4d8-71f524454923 gamec9ba4fa3-b7a9-4953-b8d5-a25db0e10886 gameffa12232-8e93-4440-82d8-a0c8a6023c63
game3179c6e9-935e-4f59-b932-268fd10356fc game6a34daf4-4820-4eb0-a7ad-6a472d6506a2 game99371010-b5af-48d4-adbc-4fcfda798304 gameca445cfd-0076-4146-ba1b-c263c8cd8707
with plenty of folders. Each folder contains:
game.pgn
: the PGN file of the gamemetainformation.txt
about configuration of the experimentslog.txt
the log of the gamesession.txt
the session with GPT
games_db.csv
contains almost all information about all the games, in a structured way.
analysis.ipynb
is a Jupyter notebook to analyze the data.
- new experiments based on
Monsieur Phi
suggestion/experiments (X/Twitter thread in french: https://twitter.com/MonsieurPhi/status/1781260337754366265), as a follow-up of his excellent video (in french again!) https://www.youtube.com/watch?v=6D1XIbkm4JE where I was interviewed. The basic idea is to study the impact of the prompt on the GPTs' playing skill, on the very specific position1. e4 e5 2. Bc4 Nc6 3. Qh5
. I have to wrap-up, but the tldr is that the prompt has indeed a significant impact on the GPTs' playing skill (at least on this position!), and that we can identify intuitive patterns of prompt leading to either g6 or Nf6. Seegptchess/gpt-experiments-prompt-variations.py
andanalysis_prompt_variations.ipyng
andprompt_variations_phi.csv
. - In parallel, Yosha Iglesias has made a fantastic video: https://www.youtube.com/watch?v=FBx95bdJzh8 further exploring prompt sensitivity and surprising skills of GPT as well as many interesting ideas worth replicating in the large. Fascinating!
analysis_yosha.ipynb
is one ongoing/modest attempt to study the impact of prompt on the GPTs' playing skill (see alsogames_yosha.tar.gz
andgames_db_yosha.csv
). Stay tuned for more experiments and analysis!