Testing strategies for my engine's playing strength

Discussion of chess software programming and technical issues.


hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing strategies for my engine's playing strength

Post by hgm »

mvanthoor wrote: Mon Feb 01, 2021 8:47 pm You can't do this with most other terms, such as the value for mobility; ...
Oh, surely you can. I have made an engine that actually does this ('Inferno'). But, as you illustrate, this can become rather complex, and I am not sure that for orthodox Chess it would pay off. The engine is for Tenjiku Shogi, though, a game on a 16x16 board where each player starts with 78 pieces. So there it would take massive effort to recalculate the mobility of every piece from scratch, and incremental techniques become extremely competitive, even those that are rather cumbersome.

It is also true that incremental techniques can be highly cooperative, such that when you do everything incrementally, the individual tasks take much less effort by drawing on each other's results. Inferno keeps track of the distances between occupied squares along the orthogonals and diagonals ('view distances'), the captures to each square ('attack map'), and the mobility per piece (in the piece list), all updated incrementally. When the occupancy of a square changes, the attack map tells you directly which pieces were attacking it, and thus need a mobility change (which you apply to the total as well). You know how many moves they gain or lose (for sliders) by consulting the view distance in the opposite direction. You can also shift their attacks between the given square and the downstream target in the attack map.

When a piece gets captured, or moves away, you can subtract its stored mobility from the total, and delete its captures from the attack map. The view-distance table brings you directly to the squares it was attacking (if these were in the move's range), no matter how far away they were. The same goes for the moves of the moved piece in its new location. So you only have to generate captures for the moved piece from its new location (calculating its new mobility as a side effect), and for the piece in its old location and the capture victim (if any), to delete those. On UnMake you have to do that again (to apply the opposite change). So you have to generate moves for only 4 (or 6 on a capture) pieces, rather than for all. (And 'all' can be 156, in Inferno's case!) And the slider captures can be calculated without the intervening non-captures, as the view-distance table directly tells you where each capture goes, and how many empty squares it skips over (to add to the mobility).
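
To make that bookkeeping concrete, here is a minimal sketch (plain C, on an ordinary 0x88 board rather than Inferno's 16x16 one, with invented names throughout; this is not Inferno's code). It only shows the mobility update for a square that becomes occupied; maintaining the view distances and the attack map themselves is left out, and for brevity every attacker is assumed to be a slider.

Code: Select all

#include <stdint.h>

enum { NDIR = 8 };                              /* 4 orthogonal + 4 diagonal rays */
static const int step[NDIR] = { 1, -1, 16, -16, 17, -17, 15, -15 };  /* 0x88 offsets */

typedef struct {
    int      viewDist[128][NDIR];  /* steps to the nearest occupied square (or edge) per ray */
    uint32_t attackers[128];       /* bitmask of piece IDs attacking each square             */
    int      pieceSquare[32];      /* piece list: square of each piece ID, -1 if captured    */
    int      mobility[32];         /* per-piece mobility, kept incrementally                 */
    int      totalMobility;        /* the number the evaluation actually uses                */
} Board;

/* On which ray, if any, does 'to' lie as seen from 'from'? */
static int rayDirection(int from, int to)
{
    for (int d = 0; d < NDIR; d++)
        for (int s = from + step[d]; !(s & 0x88); s += step[d])
            if (s == to) return d;
    return -1;
}

/* Square 'sq' just became occupied.  Every slider that attacked it loses the
   moves that continued beyond it; the view distance of sq in that same
   direction says how many those are (counting the move onto the next occupied
   square as one move). */
static void blockSquare(Board *b, int sq)
{
    uint32_t att = b->attackers[sq];
    while (att) {
        int piece = __builtin_ctz(att);         /* GCC/Clang: extract lowest set bit */
        att &= att - 1;
        int dir = rayDirection(b->pieceSquare[piece], sq);
        if (dir >= 0) {
            int lost = b->viewDist[sq][dir];    /* reachable squares beyond sq on this ray */
            b->mobility[piece] -= lost;
            b->totalMobility   -= lost;         /* sign per side in a real evaluation */
        }
    }
}
Unblocking a square is the mirror image: the same attackers gain back exactly that many moves, so Make and UnMake stay symmetric.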

This entire structure is also used for move generation: by generating the captures from the attack map you can produce them in order of victim value, running through the piece list top-down to see where the victims are, looking in the attack map to see what attacks them, and then in the piece list again to see where that attacker is located. This saves a lot of work on move sorting, as well as on move generation: when the capture of the most-valuable victim produces a beta cutoff, you never even get to generate captures of any other victim. So there is no need to first generate all captures and then sort them MVV-wise.
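
As an illustration only (same toy structures as in the previous sketch, not Inferno's code, and assuming each side's piece list is ordered by descending piece value), capture generation straight from the attack map could look like this:

Code: Select all

typedef struct { int from, to; } Move;

/* Piece IDs 0..15 belong to White, 16..31 to Black.  Victims are visited in
   piece-list order, i.e. most valuable first, so the captures come out already
   sorted MVV-wise; a search would try each one as it is produced and stop at a
   beta cutoff without ever generating the rest. */
static int generateCapturesMVV(const Board *b, int blackToMove, Move *list)
{
    int n  = 0;
    int lo = blackToMove ? 0 : 16, hi = lo + 16;        /* the opponent's piece IDs */
    for (int victim = lo; victim < hi; victim++) {
        int vsq = b->pieceSquare[victim];
        if (vsq < 0) continue;                          /* already off the board */
        uint32_t att = b->attackers[vsq];
        while (att) {
            int attacker = __builtin_ctz(att);
            att &= att - 1;
            if (attacker >= lo && attacker < hi) continue;  /* a defender, not a capture */
            list[n].from = b->pieceSquare[attacker];
            list[n].to   = vsq;
            n++;
        }
    }
    return n;
}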
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Testing strategies for my engine's playing strength

Post by mvanthoor »

hgm wrote: Mon Feb 01, 2021 10:51 pm Oh, surely you can. I have made an engine that actually does this ('Inferno'). But, as you illustrate, this can become rather complex, and I am not sure that for orthodox Chess it would pay off. The engine is for Tenjiku Shogi, though, a game on a 16x16 board where each player starts with 78 pieces. So there it would take massive effort to recalculate the mobility of every piece from scratch, and incremental techniques become extremely competitive, even those that are rather cumbersome.
Granted; wrong wording. Of course you *can* do it, technically speaking... what I actually meant is that you probably shouldn't, because maintaining the value incrementally is probably more costly than just recalculating the mobility.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing strategies for my engine's playing strength

Post by hgm »

Well, that remains to be seen. Six is still significantly smaller than the 32 you have in orthodox chess. And in a mailbox engine, calculating the mobility of a slider usually involves looping over all distances in all directions to count the moves; Inferno just adds the four view distances in the relevant directions for a Rook or a Bishop.
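
In terms of the toy structures sketched earlier in this thread (again hypothetical code, with the same convention that the move onto the first occupied square counts), that really is just one addition per direction:

Code: Select all

/* Rook mobility from the incrementally maintained view distances;
   directions 0..3 of step[] are the four orthogonals. */
static int rookMobility(const Board *b, int sq)
{
    return b->viewDist[sq][0] + b->viewDist[sq][1]
         + b->viewDist[sq][2] + b->viewDist[sq][3];
}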

But it might only be a factor of 10 faster, rather than 50 to 100.
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Testing strategies for my engine's playing strength

Post by lithander »

lithander wrote: Fri Jan 29, 2021 12:57 am And to confirm that, I've implemented a 'bench' command which reads a file produced with 'make_bench' containing dozens or even hundreds of test positions. It basically searches each position to the same depth, compares the results, and generates output such as this:

Code: Select all

1 OK! acn 13815676 / 4192086432 = 0.003, acs 4.797 / 836.49 = 0.006.
2 OK! acn 1917864 / 486503398 = 0.004, acs 1.867 / 116.351 = 0.016.
3 OK! acn 4960790 / 1260872611 = 0.004, acs 1.649 / 243.975 = 0.007.
4 OK! acn 5389962 / 988286346 = 0.005, acs 4.349 / 243.634 = 0.018.
...
Test finished with 0 wrong results! Avg time ratio: 0.016, Avg node ratio: 0.007, Performance: 1831kN/sec
Since I posted the above, I have changed my search routine to output the principal variation for each of the best moves. (There can be many moves that lead to the same 'best' score, because my eval only counts material.) This allowed me to make the testing procedure a bit stricter: before I consider a test passed, I now also play out each principal variation as proof that the claimed score can actually be reached.

Code: Select all

PV c7b7 e2e1 c6d4 c2c8 d7c8 = -1.00, OK!
PV a2b4 c2d2 c6d4 b3d4 f6d4 = -1.00, OK!
PV c6d4 b3d4 c7c2 d4c2 c8c2 = -1.00, OK!
936 OK! acn 75105 / 658298 = 0.114, acs 0.105 / 0.79 = 0.133.
This example shows the output for position 936 of the test dataset at depth 5; the dataset record for it looks like this:

Code: Select all

2r5/2rk2pp/1pn1pb2/pN1p4/P2P4/1N2B3/nPR1KPPP/3R4 b - -;ce -100;bm a2b4 c6d4 c7b7;acd 5;acn 658298;acs 0.79
The test was generated with a simple AB search and claims that the best moves are 'a2b4 c6d4 c7b7' and that they lead to a position with an eval of -100 centipawns. The AB search evaluated 658298 nodes to find this result and took 0.79s.
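
A minimal sketch of reading one such record (hypothetical C, assuming exactly the layout shown above: a position followed by ';'-separated EPD-style opcodes ce, bm, acd, acn and acs):

Code: Select all

#include <string.h>
#include <stdlib.h>

typedef struct {
    char   fen[128];        /* the position field                     */
    char   bestMoves[64];   /* space-separated, e.g. "a2b4 c6d4 c7b7" */
    int    scoreCp;         /* ce:  claimed score in centipawns       */
    int    depth;           /* acd: analysis depth                    */
    long   nodes;           /* acn: analysis node count               */
    double seconds;         /* acs: analysis time in seconds          */
} TestRecord;

static int parseRecord(char *line, TestRecord *r)
{
    memset(r, 0, sizeof *r);
    char *tok = strtok(line, ";");
    if (!tok) return 0;
    strncpy(r->fen, tok, sizeof r->fen - 1);
    while ((tok = strtok(NULL, ";")) != NULL) {
        while (*tok == ' ') tok++;                    /* skip leading blanks */
        if      (!strncmp(tok, "ce ",  3)) r->scoreCp = atoi(tok + 3);
        else if (!strncmp(tok, "bm ",  3)) strncpy(r->bestMoves, tok + 3, sizeof r->bestMoves - 1);
        else if (!strncmp(tok, "acd ", 4)) r->depth   = atoi(tok + 4);
        else if (!strncmp(tok, "acn ", 4)) r->nodes   = atol(tok + 4);
        else if (!strncmp(tok, "acs ", 4)) r->seconds = atof(tok + 4);
    }
    return 1;
}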

My "improved" search finds the same best moves, the PVs lead indeed to positions with the expected score. But only 10% of the nodes where explored and so it was 8x faster. If all 1000 test positions are "okay" in the same way I can be quite sure that I improved my search without introducing new bugs.

Extracting the PV was trickier than I thought (I don't use hash tables yet), but a nice benefit I found is that you can use the intermediate PV to make the search faster. If, among all the legal moves in your position, one is already present in the candidate PV at the current depth, it is likely to be the best move again, and you should explore it first. I think this idea is related to what would be called a 'killer heuristic'?
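
One possible shape for that ordering trick, sketched with hypothetical names (pvTable holding the PV from the previous, shallower iteration; MovesEqual and the Move type as in the sketch above):

Code: Select all

int MovesEqual(const Move *a, const Move *b);   /* hypothetical helper */

/* Give the move that the previous iteration's PV proposes for this ply the
   best possible ordering score; every other move keeps its normal heuristic
   score (e.g. MVV-LVA for captures). */
int orderingScore(const Move *m, const Move *pvTable, int pvLength,
                  int ply, int heuristicScore)
{
    if (ply < pvLength && MovesEqual(m, &pvTable[ply]))
        return 1000000;                         /* search the old PV move first */
    return heuristicScore;
}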

The cool thing is: I don't need to be 100% sure. I can just make such a change on a hunch, then run the test positions and make sure that a) the search result didn't change on any position and b) the number of evaluated nodes got smaller and/or the search time improved. If both hold, the idea was likely a valid one. That's a really fun workflow!
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Testing strategies for my engine's playing strength

Post by lithander »

This is certainly not worth opening a new thread, but I've run into a weird issue...

When I set up Cute Chess' time control to "Time per move" and set it to 1s per move, Stockfish loses (almost) every game, and the termination reason given is "time forfeit". Monchester, Alouette, and my own engine are doing fine.

Code: Select all

[Event "TPM 1sec"]
[Site "?"]
[Date "2021.02.08"]
[Round "3"]
[White "Monchester"]
[Black "Stockfish ELO 1350"]
[Result "1-0"]
[ECO "A00"]
[GameDuration "00:00:03"]
[GameEndTime "2021-02-08T14:00:53.005 Mitteleuropäische Zeit"]
[GameStartTime "2021-02-08T14:00:49.850 Mitteleuropäische Zeit"]
[Opening "Dunst (Sleipner, Heinrichsen) Opening"]
[PlyCount "5"]
[Termination "time forfeit"]
[TimeControl "1/move"]

1. Nc3 {+0.45/4 0.014s} d5 {-0.05/17 1.0s} 2. e3 {+0.45/4 0.030s}
e5 {+0.33/16 1.0s} 3. Nf3 {+0.45/4 0.10s, Black loses on time} 1-0
Any idea what the cause for this is? Does Stockfish really not support fast time controls?
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
Guenther
Posts: 4610
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: Testing strategies for my engine's playing strength

Post by Guenther »

lithander wrote: Mon Feb 08, 2021 2:53 pm This is certainly not worth opening a new thread, but I've run into a weird issue...

When I set up Cute Chess' time control to "Time per move" and set it to 1s per move, Stockfish loses (almost) every game, and the termination reason given is "time forfeit". Monchester, Alouette, and my own engine are doing fine.

Code: Select all

[Event "TPM 1sec"]
[Site "?"]
[Date "2021.02.08"]
[Round "3"]
[White "Monchester"]
[Black "Stockfish ELO 1350"]
[Result "1-0"]
[ECO "A00"]
[GameDuration "00:00:03"]
[GameEndTime "2021-02-08T14:00:53.005 Mitteleuropäische Zeit"]
[GameStartTime "2021-02-08T14:00:49.850 Mitteleuropäische Zeit"]
[Opening "Dunst (Sleipner, Heinrichsen) Opening"]
[PlyCount "5"]
[Termination "time forfeit"]
[TimeControl "1/move"]

1. Nc3 {+0.45/4 0.014s} d5 {-0.05/17 1.0s} 2. e3 {+0.45/4 0.030s}
e5 {+0.33/16 1.0s} 3. Nf3 {+0.45/4 0.10s, Black loses on time} 1-0
Any idea what the cause for this is? Does Stockfish really not support fast time controls?
Stockfish supports fast and very fast time controls, but you are using an extreme edge case of time control here, which isn't usually tested.
The only test I remember that was done with a fixed time per single move was AlphaZero vs. SF 8 ;-)

More seriously, you can see from your PGN that SF tries to use exactly all of its time for every move, and this surely can fail.
(It is also more prone to overhead on your own system.)
You can also see that it makes no sense vs. Monchester, which does a fixed 4-ply search in its default compilation, so SF gets a time advantage of 35-100:1! I have no idea, though, whether your engine already has real time management, or also plays at a fixed depth.
So saying that program X supports very fast time controls, while in fact it just always plays in a fraction of a second, is a bit of a stretch.
(Wasting too much of the allotted time also means not really supporting a time control.)

If you really want to try this, you should set SF's Move Overhead option to a higher value (it is 10ms in SF12, at least),
or even adjust it for Cute Chess generally.
https://rwbc-chess.de

trollwatch:
Talkchess nowadays is a joke - it is full of trolls/idiots/people stuck in the pleistocene > 80% of the posts fall into this category...
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Testing strategies for my engine's playing strength

Post by jdart »

lithander wrote: Wed Jan 13, 2021 9:39 pm I've got a follow-up question:

How are opening books typically handled in tournaments?
Depends. The ICGA and others allow engines to use their own book, if they have one. There are generally rules about the use of existing books not authored by the engine team, however. TCEC uses fixed books of its choice.
Or are they configured to all use the same one? If so, and I wanted to have my engine take part in such tournaments, what file formats need to be supported? What are standard opening books and where do I find them?
With UCI engines, the book is generally handled by the interface program, not the engine. But there are several formats in use. Polyglot is widely used and supported; it is an open standard. ChessBase programs, including Fritz, have their own format (.ctg).
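
For reference, the Polyglot format is very simple: a .bin book is just a sequence of fixed-size, big-endian entries sorted by position key. A sketch of the entry layout (reader code and the Polyglot hashing itself omitted):

Code: Select all

#include <stdint.h>

/* One 16-byte entry of a Polyglot .bin book.  The file stores entries
   big-endian and sorted by key, so byte-swap on little-endian machines and
   binary-search for the Polyglot hash of the current position. */
typedef struct {
    uint64_t key;     /* Polyglot Zobrist hash of the position           */
    uint16_t move;    /* bit-packed from/to squares plus promotion piece */
    uint16_t weight;  /* relative weight of the move                     */
    uint32_t learn;   /* learning field, normally 0                      */
} PolyglotEntry;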
And if no opening books are allowed and an engine has some amount of opening lines encoded in its executable, is it banned or will this go undetected?
The organizations that want to use their own fixed books would not like that, I think.

Btw, you can also get an engine account on one of the chess servers (chessclub.com or freechess.org, for example). It can take some time and effort, because you need a dedicated account for the engine that is registered as a computer account. But playing there is also a good way to find bugs.
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Testing strategies for my engine's playing strength

Post by lithander »

Guenther wrote: Mon Feb 08, 2021 4:09 pm I have no idea, though, whether your engine already has real time management, or also plays at a fixed depth.
So saying that program X supports very fast time controls, while in fact it just always plays in a fraction of a second, is a bit of a stretch.
(Wasting too much of the allotted time also means not really supporting a time control.)
When MinimalChess 0.2 releases (soon), my ambition is that it will support proper time controls. Testing that was actually what this tournament was all about, and I included the fast movers to verify that my engine, making use of a full second, could win against them.

And I included Stockfish because... well, it's my reference engine for guesstimating the playing strength of the others, since you can configure its Elo rating. So I thought that if I won half the games against Stockfish, I could estimate my engine as playing at 1350 Elo under these time controls. And then I realized that Stockfish had trouble sending its best move in time, which I totally didn't expect.
Guenther wrote: Mon Feb 08, 2021 4:09 pm If you really want to try this, you should set SF's Move Overhead option to a higher value (it is 10ms in SF12, at least),
or even adjust it for Cute Chess generally.
I'm using Stockfish 12 and I've now tried setting "Move Overhead" to 100ms, but it didn't change a thing. It still tries to max out the 1s time window and wins only 5 games out of 50.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
unserializable
Posts: 64
Joined: Sat Oct 24, 2020 6:39 pm
Full name: Taimo Peelo

Re: Testing strategies for my engine's playing strength

Post by unserializable »

lithander wrote: Mon Feb 08, 2021 2:53 pm This is certainly not worth opening a new thread but I've run into weird issue...
Hey Thomas, certainly you must be wrong here; "Stockfish DEMOLISHED by MinimalChess, Alouette and Monchester" would have been a most shocking thread :)
lithander wrote: Mon Feb 08, 2021 2:53 pm When I set up Cute Chess' time control to "Time per move" and set it to 1s per move, Stockfish loses (almost) every game, and the termination reason given is "time forfeit". Monchester, Alouette, and my own engine are doing fine.
A quick test with SF9 (no Elo scaling; I do not even know how to do that) under Ubuntu 20.04.2 LTS with xboard 4.9.1 (-st/-searchTime defines the fixed search time per move): SF9 performed all moves cleanly within the 1s allocated per move (+100 -0 =0):

Code: Select all

xboard -xexit -xmovesound -saveSettingsOnExit false \
  -mg 100 -st 0:1 \
  -debugfile 1spm.log \
  -fcp  ../Stockfish/9/src/stockfish -fUCI \
  -scp ./monchester-1.0.1-14-g84db4ed \
  -sgf sf9-vs-monch-fixed1spm.pgn

xboard: Match Stockfish 9 64 POPCNT vs. Monchester 1.0.1-14-g84db4ed: final score 100-0-0
Guenther pointed out some possible reasons and remedies already. What is happening is weird, though, as SF is excellent at the 1s+0 time control (whole game in 1 second, no increment). Playing 1s+0 so well means SF does have robust, granular and reactive timekeeping at the ms and sub-ms level, which should not be stretched when it has a whopping 1s per move. Maybe the glitch originates from SF's strength-scaling code ("Stockfish ELO 1350" is reported as the opponent in the given log), or a bug has been introduced after SF8, which played AlphaZero, and the SF9 that I tested.
Monchester 1.0, chess engine playing at scholastic level: https://github.com/unserializable/monchester ("Daddy, it is gonna take your horsie!")
Tickle Monchester at: https://lichess.org/@/monchester
Guenther
Posts: 4610
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: Testing strategies for my engine's playing strength

Post by Guenther »

lithander wrote: Mon Feb 08, 2021 6:55 pm
Guenther wrote: Mon Feb 08, 2021 4:09 pm I have no idea, though, whether your engine already has real time management, or also plays at a fixed depth.
So saying that program X supports very fast time controls, while in fact it just always plays in a fraction of a second, is a bit of a stretch.
(Wasting too much of the allotted time also means not really supporting a time control.)
When MinimalChess 0.2 releases (soon), my ambition is that it will support proper time controls. Testing that was actually what this tournament was all about, and I included the fast movers to verify that my engine, making use of a full second, could win against them.

And I included Stockfish because... well, it's my reference engine for guesstimating the playing strength of the others, since you can configure its Elo rating. So I thought that if I won half the games against Stockfish, I could estimate my engine as playing at 1350 Elo under these time controls. And then I realized that Stockfish had trouble sending its best move in time, which I totally didn't expect.
Guenther wrote: Mon Feb 08, 2021 4:09 pm If you really want to try this, you should set SF's Move Overhead option to a higher value (it is 10ms in SF12, at least),
or even adjust it for Cute Chess generally.
I'm using Stockfish 12 and I've now tried setting "Move Overhead" to 100ms, but it didn't change a thing. It still tries to max out the 1s time window and wins only 5 games out of 50.
Oops, I totally overlooked the <elo setting> for SF in your post. You need to look at the code; usually such Elo-mimicking settings have some special rules and were included for playing against humans. It is possible that in this case it is not allowed to move faster - some programs even add a sleep of x ms to make them feel more like a weak human opponent.

Well, actually I don't believe in testing against 'downscaled' programs; I would always try to find engines for testing that really play at around the same strength. But that's just my opinion; others might disagree.
https://rwbc-chess.de

trollwatch:
Talkchess nowadays is a joke - it is full of trolls/idiots/people stuck in the pleistocene > 80% of the posts fall into this category...