Re: Progress on Rustic
Posted: Sun Mar 07, 2021 11:52 pm
Do you score with distance from root? I.e. do you reward reaching good positions earlier, and bad ones later?
No; I just do search and evaluate. So I should subtract ply from the score, so that the same score will be lower when reached later. Same as adding ply to minus checkmate.
Yes, similar approach. I subtract the ply from eval if eval is positive, but don't reduce it to less than +1. If eval is negative, I add the ply, but not to more than -1 as result. I think you're right that some minor eval noise such as slightly better positioning at the end of the chain could be the cause. Not sure whether 2*ply would be better. That could be up to some experiments after you have added the basic distance handling.
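The adjustment described above can be sketched roughly as follows. This is only an illustration of the idea, not code from any of the engines in this thread; the function name `adjust_for_distance` and the assumption that scores are integers (e.g. centipawns) are mine.

```python
def adjust_for_distance(score: int, ply: int) -> int:
    """Shade an evaluation by its distance from the root (hypothetical helper).

    Positive scores shrink with ply but never drop below +1;
    negative scores grow with ply but never rise above -1.
    A score of exactly 0 is left alone.
    """
    if score > 0:
        return max(1, score - ply)
    if score < 0:
        return min(-1, score + ply)
    return 0


# The same +50 evaluation is worth slightly more when reached earlier.
print(adjust_for_distance(50, 2), adjust_for_distance(50, 8))
# A small advantage is clamped so that it never flips sign.
print(adjust_for_distance(3, 10), adjust_for_distance(-3, 10))
```

The clamp at +1 and -1 keeps a slightly better position from ever being reported as equal or worse just because it was found deep in the tree.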
Calculating deeper only leads to more strength if that actually translates into finding better moves. Take the extreme case of eval always returning the same value, then deeper calculations will not be useful.
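The extreme case can be shown with a toy search. The sketch below is not a chess engine: it assumes a made-up binary tree in which node `n` has children `2n+1` and `2n+2`, and an evaluation that always returns 0. All names are hypothetical.

```python
def negamax(node, depth, evaluate, children):
    """Plain negamax without pruning, over an abstract game tree."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    return max(-negamax(k, depth - 1, evaluate, children) for k in kids)


def best_move(root, depth, evaluate, children):
    """Return (index of best root move, list of root move scores)."""
    scores = [-negamax(k, depth - 1, evaluate, children) for k in children(root)]
    return scores.index(max(scores)), scores


def toy_children(n):
    # Hypothetical tree: every node has exactly two successors.
    return [2 * n + 1, 2 * n + 2]


def constant_eval(_node):
    # The extreme case: every position gets the same value.
    return 0


for depth in range(1, 6):
    print(depth, best_move(0, depth, constant_eval, toy_children))
```

At every depth the root scores are identical, so the chosen move is decided purely by move order. Searching deeper costs more time but cannot change the choice.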
Ras wrote: ↑Mon Mar 08, 2021 12:18 am Yes, similar approach. I subtract the ply from eval if eval is positive, but don't reduce it to less than +1. If eval is negative, I add the ply, but not to more than -1 as result. I think you're right that some minor eval noise such as slightly better positioning at the end of the chain could be the cause. Not sure whether 2*ply would be better. That could be up to some experiments after you have added the basic distance handling.
This didn't work when I tested it. The evaluation isn't differentiated enough; there are too many similar moves. Adding or subtracting the depth has too big an impact.
Code:
Rank Name                 Elo   +/-  Games  Score  Draws
   0 Rustic Alpha 1       -81    29    500  38.5%  16.2%
   1 Deepov 0.4           210   109     50  77.0%  14.0%
   2 Wukong JS 1.4        182   110     50  74.0%   8.0%
   3 Clueless 1.4         164   107     50  72.0%   8.0%
   4 CDrill Build 4        92   100     50  63.0%   6.0%
   5 Pigeon 1.5.1          85    82     50  62.0%  32.0%
   6 TSCP 1.81             78    98     50  61.0%   6.0%
   7 Shallow Blue 2.0      35    89     50  55.0%  18.0%
   8 Mizar 3               28    93     50  54.0%  12.0%
   9 Celestial 1.0          7    87     50  51.0%  22.0%
  10 FracTal 1.0          -28    78     50  46.0%  36.0%
Code:
Rank Name                 Elo   +/-  Games  Score  Draws
   0 Rustic Alpha 2 rc5    15    27    500  52.1%  20.6%
   1 Clueless 1.4         108    98     50  65.0%  10.0%
   2 Pigeon 1.5.1          42    79     50  56.0%  36.0%
   3 CDrill Build 4        28    90     50  54.0%  16.0%
   4 Wukong JS 1.4         21    87     50  53.0%  22.0%
   5 Celestial 1.0         -7    87     50  49.0%  22.0%
   6 Deepov 0.4           -14    90     50  48.0%  16.0%
   7 FracTal 1.0          -21    74     50  47.0%  42.0%
   8 TSCP 1.81            -63    93     50  41.0%  14.0%
   9 Shallow Blue 2.0     -78    94     50  39.0%  14.0%
  10 Mizar 3             -173   103     50  27.0%  14.0%
Code:
Rank Name                Elo   +   - games score oppo. draws
   1 Clueless 1.4       1882  62  59   100   69%  1729    9%
   2 Wukong JS 1.4      1837  58  56   100   64%  1729   15%
   3 Deepov 0.4         1825  58  56   100   63%  1729   15%
   4 CDrill Build 4     1799  58  57   100   59%  1729   11%
   5 Pigeon 1.5.1       1790  53  52   100   59%  1729   34%
   6 Rustic Alpha 2 rc5 1781  25  25   500   52%  1767   21%
   7 TSCP 1.81          1738  57  57   100   51%  1729   10%
   8 Celestial 1.0      1728  55  55   100   50%  1729   22%
   9 FracTal 1.0        1709  51  52   100   47%  1729   39%
  10 Shallow Blue 2.0   1708  56  56   100   47%  1729   16%
  11 Rustic Alpha 1     1677  25  26   500   39%  1767   16%
  12 Mizar 3            1657  56  58   100   41%  1729   13%
Code:
Score of MinimalChess 0.3 vs MinimalChess Dev: 10 - 7 - 39 [0.527] 56
... MinimalChess 0.3 playing White: 6 - 2 - 20 [0.571] 28
... MinimalChess 0.3 playing Black: 4 - 5 - 19 [0.482] 28
... White vs Black: 11 - 6 - 39 [0.545] 56
Elo difference: 18.6 +/- 50.3, LOS: 76.7 %, DrawRatio: 69.6 %
lithander wrote: ↑Tue Mar 09, 2021 8:28 pm I really like that you include so many different engines in the gauntlets! You've got your own little CCRL version here!
The reason I include a lot of engines is that every engine behaves differently. For example: I _know_ Alpha 1 plays badly against Mizar 3.0 and Celestial, but I also know it performs well against Shallow Blue (and even better against Pulse; so well that I decided not to include it, for fear of skewing the results). These engines are all in the 1650-1725 range, but Rustic performs in a 1600-1800 range against them.
lithander wrote: But when I run a match between two engines of 1000 games I still get only a result within an error window of +/- 15.5 ELO. How much bigger is your error window with only 50 games per engine? Could it explain the surprising results regarding Fractal?
Yes. I'm happy with 500 games in a gauntlet, because it puts the error bars at +/- 30. I don't know how many games I would need to play to get that down to, let's say, +/- 10 Elo. I'm neither a statistician nor a mathematician.
lithander wrote: The above versions should be identical. But one won 3 more games than the other. The calculated ELO could mislead me into thinking one is better than the other. But given that the error window is 100 ELO wide, the test really just shows that I need to run more tests before I conclude anything.
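On the question of how the error window depends on the number of games, here is a rough sketch. It assumes a normal approximation of the per-game score (win = 1, draw = 0.5, loss = 0) and the usual logistic Elo formula. The function names are hypothetical, and a match tool may compute its margins slightly differently, so the numbers should only be expected to land close to the ones quoted above.

```python
import math
from statistics import NormalDist


def elo_from_score(score: float) -> float:
    """Elo difference implied by an expected score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins: int, losses: int, draws: int, confidence: float = 0.95):
    """Return (elo_difference, error_margin) for a match result.

    Assumes the interval bounds stay strictly between 0 and 1;
    very lopsided or very short matches would need extra handling.
    """
    n = wins + losses + draws
    win_rate, loss_rate, draw_rate = wins / n, losses / n, draws / n
    mu = win_rate + draw_rate / 2.0
    variance = (win_rate * (1.0 - mu) ** 2
                + loss_rate * (0.0 - mu) ** 2
                + draw_rate * (0.5 - mu) ** 2)
    stdev = math.sqrt(variance / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    low, high = mu - z * stdev, mu + z * stdev
    margin = (elo_from_score(high) - elo_from_score(low)) / 2.0
    return elo_from_score(mu), margin


def games_for_margin(games: int, margin: float, target: float) -> int:
    """Estimate games needed for a target margin, assuming margin ~ 1/sqrt(games)."""
    return math.ceil(games * (margin / target) ** 2)


elo, margin = elo_with_margin(wins=10, losses=7, draws=39)
print(f"Elo difference: {elo:.1f} +/- {margin:.1f}")
print("games for +/- 10:", games_for_margin(500, 27, 10))
```

For the 10 - 7 - 39 result this gives about +18.6 Elo with a margin of roughly 50. Because the margin shrinks with the square root of the number of games, going from +/- 27 at 500 games to +/- 10 would take on the order of 3,600 games, assuming the draw rate and score stay similar.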