statistics, testing and frustration

xr_a_y · Post by **xr_a_y** » Wed Sep 12, 2018 8:03 pm

While trying to manually tune Weini's search parameters, I ran a self-testing session over the past few weeks (15000 games were played) and got the following results.

The T column is the test name, V is the value of the parameter being used and E is the Elo difference.
Test "1" is the Weini 0.0.23 release. The Elo values are not centered on test "1"; this is just a cutechess tournament result table.

Code:


== razoring margin tuning ( parent node, val <= alpha - margin )
T   V   E
2  80  -8
3 100   7
4 130  13
5 150   6
6 180  -1
1 200   2   // value 200 from xiphos
7 220  14
8 240  -2
9 270  12

== static null move tuning ( parent node, val >= beta + margin*depth )
T   V   E
10  50  23
11  60   2
12  70   5
1   80   2   // value 80 from xiphos
13  90 -15
14 100   9
15 110  -5
16 130  -5
17 150  -9
18 170  -5
19 200  -1

== qsearch futility tuning (current node, stockfish-like) 
T   V   E
20  70 26
21  80  3
22  90 -1
23 100  3
24 120  6
25 120  6
1  128  2   // value 128 from stockfish
26 140  9
27 150  2
28 160 10
29 180 18

With this number of games, all Elo values come with an error margin of about +/-25 ...

It seems I cannot deduce anything from this ...
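
For reference, the size of such an error bar can be estimated from the win/draw/loss counts of a match. This is only a minimal sketch, assuming the usual logistic Elo model and a normal approximation for the score; the names eloFromScore, EloEstimate and eloWithMargin are made up for illustration and are not taken from Weini or cutechess.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Elo difference implied by an expected score s in (0, 1), logistic model.
double eloFromScore(double s) {
    assert(s > 0.0 && s < 1.0);
    return -400.0 * std::log10(1.0 / s - 1.0);
}

// Hypothetical result type: point estimate and the two ends of the interval.
struct EloEstimate { double elo; double low; double high; };

// Approximate 95% interval for the Elo difference of a match.
EloEstimate eloWithMargin(int wins, int draws, int losses) {
    const double n = wins + draws + losses;
    assert(n > 0);
    const double w = wins / n, d = draws / n, l = losses / n;
    const double score = w + 0.5 * d;
    // Per-game variance of the score (a game yields 1, 0.5 or 0 points).
    const double var = w * (1.0 - score) * (1.0 - score)
                     + d * (0.5 - score) * (0.5 - score)
                     + l * score * score;
    const double margin = 1.96 * std::sqrt(var / n);
    // Keep the interval ends strictly inside (0, 1) so the log stays finite.
    const double eps = 1e-6;
    auto clamp = [eps](double s) { return std::min(1.0 - eps, std::max(eps, s)); };
    return { eloFromScore(clamp(score)),
             eloFromScore(clamp(score - margin)),
             eloFromScore(clamp(score + margin)) };
}
```

Under these assumptions, eloWithMargin(500, 0, 500) gives roughly 0 +/- 21.5 Elo. The margin shrinks with the square root of the number of games, so resolving differences of 5 to 10 Elo takes many thousands of games per candidate value.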

Since I am tuning an engine as weak as Weini, I was hoping these search parameters would have a greater influence ...

What would you suggest:
- use other engines in this kind of test and stop self-testing
- use more games to be able to see a +5 or +10 Elo gain (I was expecting a lot more ...)
- get a better evaluation (but rofchade just proved that PSTs and a highly tuned search can do well, way better than Weini ... Xiphos is also very strong with quite simple code and evaluation)
- look for bugs

Thanks for your input.

Edsel Apostol · Post by **Edsel Apostol** » Wed Sep 12, 2018 8:10 pm

Look for bugs. Make sure that everything works. Write unit tests. Put a lot of ASSERTS just like in Fruit 2.1.

xr_a_y · Post by **xr_a_y** » Wed Sep 12, 2018 8:47 pm

Edsel Apostol wrote: ↑Wed Sep 12, 2018 8:10 pm Look for bugs. Make sure that everything works. Write unit tests. Put a lot of ASSERTS just like in Fruit 2.1.

Do you have any idea what kind of bug can lead to a weak engine without any crash / error / ... ?

Edsel Apostol · Post by **Edsel Apostol** » Wed Sep 12, 2018 9:12 pm

xr_a_y wrote: ↑Wed Sep 12, 2018 8:47 pm
Edsel Apostol wrote: ↑Wed Sep 12, 2018 8:10 pm Look for bugs. Make sure that everything works. Write unit tests. Put a lot of ASSERTS just like in Fruit 2.1.
Do you have any idea what kind of bug can lead to a weak engine without any crash / error / ... ?

What I've learned from computer chess programming over the years is that "the devil is in the detail". Take this code, for example:

Code:

      if ( score > alpha && score < beta){ // PVS fail (fail-soft framework)
        ++Stats::pvsRootfails;
        Line::ClearPV(pv_loc); // reset pv_loc (lmr, pvs, ...)
        score = -Negamax( -beta, -alpha, nextDepth , p2, color,iterativeDeepeningDepth,
                          true, pv_loc, allowNullMove,timeControl,forbidExtension); //can be a new pv!

        HANDLE_TC_WINDOW
      }

I think most strong engines don't check for && score < beta anymore.
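
The re-search condition can be illustrated on a toy tree. This is only a minimal sketch, assuming a fail-soft negamax with principal variation search; Node, pvs and INF are hypothetical names and this is not Weini's Negamax. The idea is that a null-window probe that fails high only returns a lower bound, so the full-window re-search is triggered on score > alpha alone.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int INF = 1000000;

// Hypothetical toy tree: a leaf holds a static score from the side to move's view.
struct Node {
    int value;
    std::vector<Node> children;
};

// Fail-soft negamax with PVS on the toy tree.
int pvs(const Node& n, int alpha, int beta) {
    assert(alpha < beta);
    if (n.children.empty()) return n.value;
    int best = -INF;
    bool first = true;
    for (const Node& c : n.children) {
        int score;
        if (first) {
            score = -pvs(c, -beta, -alpha);          // full window for the first move
            first = false;
        } else {
            score = -pvs(c, -alpha - 1, -alpha);     // null-window probe
            // Re-search on a fail high; there is no "score < beta" test here.
            // If the window is already a null window the probe was the full search.
            if (score > alpha && beta - alpha > 1)
                score = -pvs(c, -beta, -alpha);
        }
        best = std::max(best, score);
        alpha = std::max(alpha, best);
        if (alpha >= beta) break;                    // beta cutoff
    }
    return best;
}
```

Called with the full window (-INF, INF), this returns the plain minimax value of the tree, which makes it easy to compare against an unoptimised negamax in a unit test.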

Also this one:

Code:

    // this is because, even if no bestmove is found > alpha,
    // we insert something in TT with score alpha ...
    if ( isPV ) bestMove = *it;

Usually you don't store a bestmove in the TT if the score is equal to or below alpha.
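
A minimal sketch of that rule, assuming a conventional three-way bound classification; Bound, Move, NO_MOVE, TTEntry and makeEntry are hypothetical names and not Weini's actual TT code:

```cpp
#include <cassert>
#include <cstdint>

enum class Bound : std::uint8_t { Upper, Lower, Exact };

using Move = std::uint16_t;        // hypothetical move encoding
constexpr Move NO_MOVE = 0;        // hypothetical "no move" marker

struct TTEntry { int score; Bound bound; Move move; };

// Build the entry to store at the end of a node.
// alphaOrig is the value alpha had when the node was entered.
TTEntry makeEntry(int bestScore, int alphaOrig, int beta, Move bestMove) {
    assert(alphaOrig < beta);
    if (bestScore <= alphaOrig)    // fail low: no move was shown to be good
        return { bestScore, Bound::Upper, NO_MOVE };
    if (bestScore >= beta)         // fail high: the cutoff move is worth trying first
        return { bestScore, Bound::Lower, bestMove };
    return { bestScore, Bound::Exact, bestMove };
}
```

Some engines go one step further and keep whatever move the slot already held when the new result is a fail low, rather than overwriting it with no move.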

There is a lot of stuff like that. You'll probably learn it from reading old posts here, based on programmers' anecdotal experience, or by studying other strong open source engines.

xr_a_y · Post by **xr_a_y** » Wed Sep 12, 2018 9:16 pm

Ok, thanks a lot for those hints. Still a lot to learn.

sandermvdb · Post by **sandermvdb** » Wed Sep 12, 2018 9:47 pm

I assume this is without time forfeits or crashes?

You probably still have quite a few bugs which you need to get rid of. You could implement certain verifications, for instance for the transposition table score and bestmove. Or mirror your eval function and check that you get the same scores. Also add asserts to check that certain values are within expected boundaries. Another option is to disable certain features and check the engine's strength with and without them.
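
The mirrored-eval check could look something like this. It is a minimal sketch on a made-up board representation (a 64-entry array, piece codes 1 to 6 for pawn to king, positive for White and negative for Black) with a toy evaluation; none of these names come from Weini.

```cpp
#include <array>
#include <cassert>

// Hypothetical minimal board: squares a1 = 0 .. h8 = 63.
// > 0 is a white piece, < 0 a black piece, 0 an empty square.
using Board = std::array<int, 64>;

// Flip the board vertically and swap the colours.
Board mirror(const Board& b) {
    Board m{};
    for (int sq = 0; sq < 64; ++sq)
        m[sq ^ 56] = -b[sq];
    return m;
}

// Toy evaluation from White's point of view:
// material plus one point per rank a piece has advanced.
int evaluate(const Board& b) {
    static const int value[7] = {0, 100, 320, 330, 500, 900, 0};
    int score = 0;
    for (int sq = 0; sq < 64; ++sq) {
        const int p = b[sq];
        assert(p >= -6 && p <= 6);                 // piece code within bounds
        if (p > 0)      score += value[p] + sq / 8;
        else if (p < 0) score -= value[-p] + (7 - sq / 8);
    }
    return score;
}

// A colour-symmetric eval must change sign, and nothing else, when mirrored.
bool evalIsSymmetric(const Board& b) {
    return evaluate(b) == -evaluate(mirror(b));
}
```

Running evalIsSymmetric over a large set of test positions tends to expose terms that were implemented for one colour only or that index a table with the wrong square.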

Is your move-generation correct? Using certain positions and perft you can check this.

At what time control did you play these games? I wouldn't use super fast ones. My current coding/testing approach is as follows: every time I implement something new or change a particular value, I play 400 games against the previous version at 40/10sec. If the Elo difference is bigger than 20, I am quite confident that it improves the engine and I check the new code in. About once a week I run a small tournament with my current code, my previous version and 2 engines of about the same strength at a 40/1minute time control, with about 1000 games in total. This is to verify that all previous commits (combined) actually improved the engine.

If I were you I would also start with Texel tuning. Using this approach you can quite easily tune the evaluation parameters. If your current eval is really bad, improving the search won't make your engine stronger.
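
The quantity that Texel tuning minimises can be sketched as follows. This assumes the commonly described form of the method: each training position has a game result and an evaluation in centipawns from White's point of view, and a scaling constant K maps centipawns to an expected score. Sample, sigmoid and meanSquaredError are illustrative names only.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One training position: evaluation in centipawns (White's view)
// and the game result (1 = White win, 0.5 = draw, 0 = Black win).
struct Sample { double evalCp; double result; };

// Expected score predicted from an evaluation; K is a scaling constant.
double sigmoid(double evalCp, double K) {
    return 1.0 / (1.0 + std::pow(10.0, -K * evalCp / 400.0));
}

// Mean squared difference between game results and predicted scores.
// A tuner changes eval parameters, recomputes evalCp and keeps changes that lower this.
double meanSquaredError(const std::vector<Sample>& data, double K) {
    assert(!data.empty());
    double sum = 0.0;
    for (const Sample& s : data) {
        const double diff = s.result - sigmoid(s.evalCp, K);
        sum += diff * diff;
    }
    return sum / static_cast<double>(data.size());
}
```

A quick sanity check before tuning anything is that evaluations which agree with the game results produce a clearly lower error than evaluations which contradict them.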

Sometimes I let my engine play 500 games against much weaker engines. Afterwards I go through all the losses (probably <10) and check whether something strange is going on.

I see Weini 0.0.20 is ~2000 Elo. In this range I wouldn't even implement razoring or futility pruning. First get the basics correct!

sandermvdb · Post by **sandermvdb** » Wed Sep 12, 2018 10:15 pm

I just downloaded the games Weini 0.0.20 played in the CCRL 40/4. This is the game it lost in the fewest moves:
https://lichess.org/aoEmr338

It looks like a 3-fold repetition bug where it just throws away its queen and therefore loses the match.
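
Repetition detection is commonly built on a history of position hash keys. This is a minimal sketch under that assumption, with hypothetical names (countRepetitions, isThreefold); it is not Weini's code and does not claim to reproduce the bug in the game above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// keys[i] is the hash of the position after ply i; the last element is the
// current position. halfmoveClock is the number of plies since the last
// capture or pawn move, because no repetition can reach back past one of those.
int countRepetitions(const std::vector<std::uint64_t>& keys, int halfmoveClock) {
    assert(!keys.empty() && halfmoveClock >= 0);
    const int last = static_cast<int>(keys.size()) - 1;
    const int stop = std::max(0, last - halfmoveClock);
    int count = 1;                                   // the current position itself
    for (int i = last - 2; i >= stop; i -= 2)        // same side to move only
        if (keys[i] == keys[last]) ++count;
    return count;
}

bool isThreefold(const std::vector<std::uint64_t>& keys, int halfmoveClock) {
    return countRepetitions(keys, halfmoveClock) >= 3;
}
```

Typical places for such a check to go wrong are a history that is not updated with the moves played in the game (only with those made in search), a hash key that ignores side to move, castling rights or en passant, and a draw score stored in the TT and reused on a different path.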

xr_a_y · Post by **xr_a_y** » Wed Sep 12, 2018 10:33 pm

Weini 0.0.23 is around 2200. Move generation is fine (but slow). I think there are indeed still some 3-fold issues to investigate. I lack real unit tests but there are some good signs anyway: Weini scales well with increasing depth, it solves fine70 very fast, perft results are OK, it almost never loses on time anymore and it almost never crashes (the only crash I know of now is for games >512 plies), ... I also clear the TT, killers and history for each search.

There are still some bugs probably, and I definitely need to add more tests.

xr_a_y · Post by **xr_a_y** » Wed Sep 12, 2018 10:35 pm

I really appreciate you taking the time to check Weini's games. May I ask what analysis tool you are using on the PGN?

xr_a_y · Post by **xr_a_y** » Wed Sep 12, 2018 10:36 pm

In another post I talk about my attempt at Texel tuning, and I am still stuck with it. It is not even working for piece values. CLOP is not converging fast for piece values either.
