More details from test suites

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

More details from test suites

Post by matthewlai »

I am working on the search component of a new engine, and have been testing it using tactical test suites (since I don't care about eval yet).

The problem is that, a lot of the time, scores on test suites do not correspond to playing strength.

For example, I introduced a severe bug into futility pruning at one point (forgetting to negate the material score), and the number of solved problems actually went way up, while playing strength obviously went down the toilet (at least -500 Elo).

It turned out the engine was just doing pretty much random pruning, and the increased depth solved more problems by luck than it lost to blunders.

Obviously, the most reliable way is to play a lot of games. But I have been thinking, maybe it's possible to do more useful testing using test suites.

The problem with test suites is that all wrong solutions are scored the same.

If a position has 5 possible moves, with scores
300, 5, 5, 0, -500

The best move would be the one that wins a knight. However, a typical test suite run would give the same score to an engine that picked one of the score=5 moves as to another engine that picked the move that loses a rook.

I am thinking about taking a bunch of random positions from lower-level games (since the quality of the games doesn't really matter, and at higher levels players resign earlier, which would not give the engine enough exposure to more one-sided positions).

Then I would have a strong engine analyze all possible moves from each position, and record all the scores.

Then, on a test suite run, we take the move the engine picks and compare its score to the score of the best move. The differences can be added together to give the final result of the test run (lower is better).

With the bug described above, the engine would get a very bad result due to its blunders, and the fact that it happens to pick more best moves would be inconsequential.

This can also be used to test risky pruning techniques, etc., to see whether they help on average.
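
To make the metric concrete, here is a rough sketch of it in Python (the move names and function names are placeholders, not from any existing tool):

Code: Select all

def position_penalty(move_scores, engine_move):
    # Penalty = reference score of the best move minus reference score of
    # the move the engine actually picked (scores in centipawns).
    best = max(move_scores.values())
    return best - move_scores[engine_move]

def suite_penalty(positions, engine_choices):
    # Sum of per-position penalties over the whole suite; lower is better.
    # positions: {fen: {move: score}}, engine_choices: {fen: move picked}.
    return sum(position_penalty(scores, engine_choices[fen])
               for fen, scores in positions.items())

# With the 5-move example above: picking one of the score=5 moves costs 295,
# picking the score=-500 move costs 800, and picking the best move costs 0.
example = {"a": 300, "b": 5, "c": 5, "d": 0, "e": -500}
assert position_penalty(example, "b") == 295
assert position_penalty(example, "e") == 800
assert position_penalty(example, "a") == 0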

Has anyone been doing similar things?
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: More details from test suites

Post by cdani »

I did a relatively similar test some time ago.

I created a file of FEN positions, each followed by its acceptable moves (analyzed by Stockfish), for example all the moves that were at most 20 cp worse than the best one.

Then I used it to tune the parameters of pruning, razoring, null move, etc., testing those parameters with random values, so as to increase speed while losing as little strength as possible.
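
A minimal sketch of that filtering step, assuming you already have a reference score for every legal move (the 20 cp margin is the one mentioned above; the move names are just examples):

Code: Select all

def acceptable_moves(move_scores, margin_cp=20):
    # Keep the moves whose reference score is within margin_cp centipawns
    # of the best move's score.
    best = max(move_scores.values())
    return [move for move, score in move_scores.items()
            if score >= best - margin_cp]

# Example: Nf3 and d4 are kept, h4 is rejected.
print(acceptable_moves({"Nf3": 30, "d4": 25, "h4": -80}))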
ymatioun
Posts: 64
Joined: Fri Oct 18, 2013 11:40 pm
Location: New York

Re: More details from test suites

Post by ymatioun »

I use "strategic test suite". It is big (1,000 positions), and moves are scored according to their value. I usually run it at 0.5 second/position, for 500 seconds total run time.

My results on that test correspond reasonably closely to playing strength (the correspondence is not perfect, so it does not work for small changes).
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: More details from test suites

Post by matthewlai »

ymatioun wrote:I use the "Strategic Test Suite". It is big (1,000 positions), and moves are scored according to their value. I usually run it at 0.5 seconds/position, for 500 seconds of total run time.

My results on that test correspond reasonably closely to playing strength (the correspondence is not perfect, so it does not work for small changes).
I didn't know something like that existed. That's awesome!

EDIT: I just took a quick look. The problem with using STS is that it doesn't have scores for bad moves. E.g. a passive move would still get the same score as throwing away a queen.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: More details from test suites

Post by jdart »

I used test suites for years, tuning for more solutions. That wasn't entirely futile but it was largely wasted effort because of the poor correlation with games.

Your method sounds plausible, but it still may do poorly unless you have a large collection of positions for tuning. I think one of the core issues with test suites is that billions of positions can be visited during a game, but most test suites are small, so you are tuning against only the tiniest sample of those possible positions. On top of that, many common test suites have uncommon positions where there is a hidden, deep solution move not found by shallow search.

--Jon
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: More details from test suites

Post by matthewlai »

jdart wrote:I used test suites for years, tuning for more solutions. That wasn't entirely futile but it was largely wasted effort because of the poor correlation with games.

Your method sounds plausible, but it still may do poorly unless you have a large collection of positions for tuning. I think one of the core issues with test suites is that billions of positions can be visited during a game, but most test suites are small, so you are tuning against only the tiniest sample of those possible positions. On top of that, many common test suites have uncommon positions where there is a hidden, deep solution move not found by shallow search.

--Jon
I think the biggest problem is that test suites don't punish blunders nearly enough, so they would strongly favour aggressive pruning.

For example, a change that makes an engine find 20 more solutions out of 200 but blunder 10 more times will be favoured by the test suite, while being detrimental to game play.

I guess the optimal combination would be mostly normal positions plus a few positions from test suites. That way, deep solutions are rewarded, but only if they don't significantly increase the number of blunders.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: More details from test suites

Post by matthewlai »

Code: Select all

  8446 matthew   20   0  231564 138404   1464 S 100.5  0.2   1:03.53 stockfish                                                      
  8458 matthew   20   0  231564 136376   1468 S 100.5  0.2   1:03.52 stockfish                                                      
  8468 matthew   20   0  231564 136360   1464 S 100.5  0.2   1:03.50 stockfish                                                      
  8469 matthew   20   0  231564 136356   1468 S 100.5  0.2   1:03.52 stockfish                                                      
  8470 matthew   20   0  231564 136364   1468 S 100.5  0.2   1:03.52 stockfish                                                      
  8456 matthew   20   0  231564 136372   1468 S 100.2  0.2   1:03.52 stockfish                                                      
  8460 matthew   20   0  231564 136376   1468 S 100.2  0.2   1:03.52 stockfish                                                      
  8461 matthew   20   0  231564 136364   1468 S 100.2  0.2   1:03.52 stockfish                                                      
  8462 matthew   20   0  231564 138340   1464 S 100.2  0.2   1:03.54 stockfish                                                      
  8465 matthew   20   0  231564 136376   1468 S 100.2  0.2   1:03.52 stockfish                                                      
  8466 matthew   20   0  231564 136364   1468 S 100.2  0.2   1:03.52 stockfish                                                      
  8467 matthew   20   0  231564 136364   1468 S 100.2  0.2   1:03.52 stockfish                                                      
  8471 matthew   20   0  231564 138408   1468 S 100.2  0.2   1:03.51 stockfish                                                      
  8472 matthew   20   0  231564 136360   1468 S 100.2  0.2   1:03.53 stockfish                                                      
  8473 matthew   20   0  231564 136364   1468 S 100.2  0.2   1:03.53 stockfish                                                      
  8474 matthew   20   0  231564 136360   1468 S 100.2  0.2   1:03.50 stockfish  
Bed time :).
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: More details from test suites

Post by matthewlai »

Aaaand it's done.

http://matthewlai.ca/random_positions.scores

These are 500 random positions from the gm2006 database. I decided not to bother with lower-level games, since there is surprisingly quite a bit of variety in these positions already.

At most one position is selected from each game to minimize correlation. Duplicate positions (from early moves in openings) are removed. Otherwise, all positions are completely randomly selected.

Each position begins with its FEN on its own line.

The next line is the number of legal moves from that position.

This is followed by one line for each move, with the score after making that move.

Scores are from Stockfish, with 30 seconds of analysis per move on a Xeon E5-2670. This was run on one of Amazon's older-generation cc2.8xlarge instances with 16 processes, so I had the whole physical machine and there shouldn't have been any other load on it.
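
In case it helps anyone, here is a rough parser sketch for this format. It assumes each move line is "<move> <score>" separated by whitespace and that scores are plain integers in centipawns; those details are assumptions, so check them against the actual file.

Code: Select all

def read_scores_file(path):
    # Returns {fen: {move: score}} for the format described above.
    positions = {}
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    i = 0
    while i < len(lines):
        fen = lines[i]                  # FEN on its own line
        num_moves = int(lines[i + 1])   # number of legal moves
        moves = {}
        for j in range(num_moves):      # one "<move> <score>" line per move
            fields = lines[i + 2 + j].split()
            moves[fields[0]] = int(fields[1])
        positions[fen] = moves
        i += 2 + num_moves
    return positions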

I initially thought about adding some positions from tactical test suites to reward deeper solutions, but decided against it, since I want it to be an accurate representation of positions from actual games, and what better sampling method is there than random sampling from actual games? Biasing it towards tactical positions would probably encourage risky pruning.

Next - writing a tool to run the test suite with any engine.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: More details from test suites

Post by matthewlai »

Some results with random engines I have around in my gauntlet:
Crafty
Average node count: 481728
Average plies reached: 12.04
Average NPS: 4.13285e+06
Average penalty: 31.314
Penalty bins:
[0, 0]: 263
(0, 60]: 189
(60, 140]: 28
(140, 450]: 14
(450, 750]: 0
(750, 1000]: 6

Stockfish
Average node count: 210545
Average plies reached: 13.15
Average NPS: 1.77024e+06
Average penalty: 25.266
Penalty bins:
[0, 0]: 305
(0, 60]: 162
(60, 140]: 19
(140, 450]: 8
(450, 750]: 0
(750, 1000]: 6

Greko
Average node count: 270613
Average plies reached: 9.30862
Average NPS: 2.31583e+06
Average penalty: 45.6934
Penalty bins:
[0, 0]: 245
(0, 60]: 196
(60, 140]: 26
(140, 450]: 19
(450, 750]: 3
(750, 1000]: 10

Diablo
Average node count: 140827
Average plies reached: 7.526
Average NPS: 1.31305e+06
Average penalty: 50.01
Penalty bins:
[0, 0]: 242
(0, 60]: 193
(60, 140]: 35
(140, 450]: 12
(450, 750]: 8
(750, 1000]: 10

RobboLito
Average node count: 278180
Average plies reached: 11.778
Average NPS: 2.38186e+06
Average penalty: 31.01
Penalty bins:
[0, 0]: 287
(0, 60]: 177
(60, 140]: 18
(140, 450]: 8
(450, 750]: 3
(750, 1000]: 7

Giraffe (my new engine, ~2000)
Average node count: 158418
Average plies reached: 6.836
Average NPS: 1.80894e+06
Average penalty: 64.528
Penalty bins:
[0, 0]: 198
(0, 60]: 204
(60, 140]: 52
(140, 450]: 26
(450, 750]: 8
(750, 1000]: 12

Brainless (my old engine, ~2200)
Average node count: 239766
Average plies reached: 7.518
Average NPS: 2.44139e+06
Average penalty: 61.034
Penalty bins:
[0, 0]: 212
(0, 60]: 204
(60, 140]: 37
(140, 450]: 27
(450, 750]: 10
(750, 1000]: 10
This is at 0.1 second per position.

Stockfish did the best, but that's not surprising, since Stockfish also generated the reference scores (though at 30 seconds per move).
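
For reference, the summary numbers above can be reproduced from the per-position penalties with something like this (bin edges copied from the output; this is a sketch, not the actual tool's code, and it ignores penalties above 1000 since none appear in the runs above):

Code: Select all

def summarize(penalties):
    # Average penalty plus the histogram shown above: an exact-zero bin
    # followed by half-open (lo, hi] bins.
    bins = [(0, 0), (0, 60), (60, 140), (140, 450), (450, 750), (750, 1000)]
    counts = []
    for lo, hi in bins:
        if lo == hi:
            counts.append(sum(1 for p in penalties if p == lo))
        else:
            counts.append(sum(1 for p in penalties if lo < p <= hi))
    return sum(penalties) / len(penalties), counts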

If you want to run it on your engine:
1. Make sure you have Mercurial and GCC 4.9 or LLVM installed. GCC 4.8 has horribly broken <regex>. This tool only works on Linux/OSX.
2. "hg clone https://bitbucket.org/waterreaction/chessenginetools"
3. Go into searchtestrun
4. "make"
5. "./searchtestrun random_positions.scores <engine directory> <max nodes> <max time (seconds)>"

<engine directory> is the absolute or relative path to your engine. If your engine uses the xboard protocol, create an "engine_def.txt" file in the directory with one line: "exec <executable>".

If your engine is UCI, copy the polyglot binary to the directory and just use "exec polyglot".
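
For example, an engine_def.txt for an xboard engine whose executable is called myengine (a hypothetical name) would contain just the single line:

Code: Select all

exec myengine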

The search stops at max nodes or max time, whichever is reached first.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.