The latest Stockfish has a UCI_Elo feature, so it is now included in the test at 1500.
I also made some revisions to Deuterium: it is now limited to 300 nodes and randomizes the piece values of Q, R, B and N, at most 50% of the time, for all moves in a game.
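Roughly, the per-game randomization could be sketched like this; the 50% probability comes from the description above, while the base values and the jitter range are my own assumptions, not Deuterium's actual numbers:

```python
import random

# Illustrative only: the 50% probability is from the post; the base
# values and the +/-20% jitter range are assumptions for the sketch.
BASE = {"Q": 900, "R": 500, "B": 330, "N": 320}

def game_piece_values(randomize_prob=0.5, max_jitter=0.2, rng=random):
    """Pick the Q/R/B/N values to use for all moves of one game."""
    if rng.random() >= randomize_prob:
        return dict(BASE)                 # play this game with normal values
    return {p: round(v * (1 + rng.uniform(-max_jitter, max_jitter)))
            for p, v in BASE.items()}
```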
TC 40/2m
Code:
 # PLAYER                              :  RATING  ERROR  POINTS  PLAYED  (%)
 1 Cheng 4.39 ucielo 1500              :  2282.8  121.0   134.5     156   86
 2 Cheese 2.1 ucielo 1500              :  2258.0  114.7   132.0     156   85
 3 Fruit reloaded v3.21 ucielo 1500    :  2233.2  113.5   127.5     154   83
 4 Amyan 1.72 ucielo 1500              :  2220.8  113.6   128.0     156   82
 5 Ufim v8.02 ucielo 1500              :  2109.9  102.7   123.0     170   72
 6 Rhetoric 1.4.3 ucielo 1500          :  2070.8   99.8   107.5     154   70
 7 DanaSah 7.9 ucielo 1500             :  2039.2   89.6    74.0     124   60
 8 Houdini 3 ucielo 1500               :  2006.3  104.8   101.5     144   70
 9 MadChess 2.2 ucielo 1500            :  1994.0  112.5   104.5     170   61
10 Deuterium v2019.2.37.59 ucielo 1500 :  1862.7   94.7   117.5     256   46
11 Stockfish 2019.07.14 ucielo 1500    :  1842.8   96.9   112.5     256   44
12 Discocheck 5.2 ucielo 1500          :  1795.4  104.3    69.0     156   44
13 Iota 1.0 ccrl 1019                  :  1766.3  117.5    26.0      74   35
14 CT800 V1.34 ucielo 1500             :  1762.9   86.2    67.5     172   39
15 Arasan 21.3 ucielo 1500             :  1648.5  104.5    50.5     172   29
16 Hiarcs 14 ucielo 1500               :  1534.2   96.4    35.5     170   21
17 NSVChess v0.14 ccrl 946             :  1500.0   ----    25.5     228   11
18 DanaSah 7.9 human ucielo 1500       :  1295.1  132.4     9.5     224    4
Meanwhile, I created a test set for these UCI_Elo 1500 engines. The test positions are from human players with Elo 1450 to 1550. The main goal is to find which UCI engine gets the greatest number of matches in the test. A test EPD record looks something like this:
Code:
3r2k1/p5p1/1pR4p/4R3/3r4/8/PP4PP/6K1 b - - bm Rd2; ce 0; c0 "Rd1+"; c1 "154";
The bm Rd2 is the move played by a human with an Elo rating between 1450 and 1550 in an actual game.
The ce 0 is the centipawn score of the move Rd2 according to Stockfish dev at 1 sec of analysis on an i7 3.4 GHz PC.
The Rd1+ is the move preferred by Stockfish dev, with a score of 154 cp. I have collected around 60k test positions.
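For illustration, a minimal parser for these EPD records (handling just the bm, ce, c0 and c1 opcodes used here) might look like this; it is a sketch, not a full EPD implementation:

```python
# The four fields before the opcodes form the position (a FEN minus the
# move counters); the opcodes follow, separated by semicolons.
def parse_epd(line):
    fields = line.split(None, 4)          # board, stm, castling, ep, opcodes
    rec = {"fen": " ".join(fields[:4])}
    for op in fields[4].rstrip(";").split(";"):
        key, _, value = op.strip().partition(" ")
        rec[key] = value.strip().strip('"')
    return rec

epd = '3r2k1/p5p1/1pR4p/4R3/3r4/8/PP4PP/6K1 b - - bm Rd2; ce 0; c0 "Rd1+"; c1 "154";'
rec = parse_epd(epd)
print(rec["bm"], rec["ce"], rec["c0"], rec["c1"])   # -> Rd2 0 Rd1+ 154
```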
Now to test these UCI_Elo 1500 engines, each position is given to the engine, which is allowed to search for 1 sec per position. Whenever the engine's bestmove and the bm are the same, a match counter is incremented. Aside from the match counter, I also record the positions where the human bm and the engine bestmove differ. The engine move can be stronger or weaker than the human move: if stronger, I record it in a High counter; if weaker, in a Low counter. Other items are recorded as well, such as the average difference between the engine's move score and the human's move score when the two bestmoves are not the same.
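The bookkeeping described above can be sketched as follows; the per-position records here are made-up examples in the format (human bm, engine bestmove, human move score, engine move score):

```python
# Made-up records: (human bm, engine bestmove, human move score in cp,
# engine move score in cp), both scores from the reference engine
# (Stockfish in the actual test).
records = [
    ("Rd2", "Rd1+", 0, 154),   # moves differ, engine move stronger -> High
    ("Nf3", "Nf3", 25, 25),    # same move -> Match
    ("Qh5", "Be2", 80, -40),   # moves differ, engine move weaker -> Low
]

match = high = low = 0
high_diffs, low_diffs = [], []
for human_bm, engine_bm, human_cp, engine_cp in records:
    if engine_bm == human_bm:
        match += 1
        continue
    diff = engine_cp - human_cp
    if diff > 0:
        high += 1
        high_diffs.append(diff)
    else:                       # score ties are counted as Low here
        low += 1
        low_diffs.append(-diff)

hacd = sum(high_diffs) / len(high_diffs) if high_diffs else 0.0
lacd = sum(low_diffs) / len(low_diffs) if low_diffs else 0.0
print(match, high, low, hacd, lacd)   # -> 1 1 1 154.0 120.0
```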
Results on 1000 test positions.
Code:
UCI_Elo 1500 engine test results on FIDE Elo 1500
Test positions are taken from players with FIDE Elo 1450 to 1550
Engine                                 Total  Match  High  Low  HACD  LACD
Deuterium v2019.2.37.59 UCI_Elo 1500    1000    362   291  347   357   335
Arasan 21.3 UCI_Elo 1500                1000    305   266  429   494   313
Ufim v8.02 UCI_Elo 1500                 1000    428   280  292   475   332
CT800 V1.34 UCI_Elo 1500                1000    333   244  423   268   739
DanaSah 7.9 Human UCI_Elo 1500          1000    332   250  418   392   577
Stockfish 2019.07.14 UCI_Elo 1500       1000    254   263  483   493   356
Cheng 4.39 UCI_Elo 1500                 1000    408   325  267   442   329
Discocheck 5.2 UCI_Elo 1500             1000    368   276  356   250   422
Houdini 3 UCI_Elo 1500                  1000    360   263  377   393   217
Amyan 1.72 UCI_Elo 1500                 1000    348   239  413   399   738
Rhetoric 1.4.3 UCI_Elo 1500             1000    359   286  355   240   664
Hiarcs 14 UCI_Elo 1500                  1000    342   239  419   286   440
Cheese 2.1 UCI_Elo 1500                 1000    432   312  256   441   432
Code:
::Legend::
Total: Number of test positions from human games.
Match: Count of pos, where engine and human move are the same.
High : Count of pos, where engine move is stronger than human move.
Low : Count of pos, where engine move is weaker than human move.
HACD : High Average Centipawn Difference; when the engine move is stronger,
the average amount in cp by which it exceeds the human move score,
according to Stockfish 2019.04.16.
LACD : Low Average Centipawn Difference; when the engine move is weaker,
the average amount in cp by which it falls below the human move score,
according to Stockfish 2019.04.16.
Table interpretation:
Deuterium matched the human move 362 times, or 100*362/1000 = 36.2%. In comparison, the engine with the most matches is Cheese 2.1 at 43.2%. The HACD of Deuterium is 357 cp, or around three and a half pawns: when Deuterium's move is stronger than the human move, it is on average 357 cp above the human move's score. To simulate human play, HACD should be minimal; of the engines tested, Rhetoric has the best at 240 cp. This means that when you play against Rhetoric, it plays stronger moves by an average of 240 cp above the human move. LACD is the opposite of HACD: the engine move is weaker than the human move. For Deuterium, LACD is 335 cp, meaning that when Deuterium plays a bad move, it gives on average 335 cp of advantage to its opponent. Looking at the table, the engines that give away the most advantage are CT800 at 739 cp and Amyan at 738 cp. For humans in the lower rating range these engines are good to play against, but be aware of their HACD values too.
So how do we rank engines that play like humans based on human test positions?
I can list the following criteria:
1. Match (max is better)
2. High (min is better)
3. Low (max is better)
4. HACD (min is better)
5. LACD (max is better)
It seems this is an MCDA (Multi-Criteria Decision Analysis) problem, where alternatives are ranked against multiple criteria. One technique for ranking alternatives is TOPSIS.
TOPSIS ref.
https://en.wikipedia.org/wiki/TOPSIS
https://www.slideshare.net/pranavmishra ... g-approach
With that table I tried to rank those engines using TOPSIS, via the skcriteria Python module.
Here are the results, with the following weights applied to the criteria:
match, weight=0.6
High, weight=0.05
Low, weight=0.05
HACD, weight=0.2
LACD, weight=0.1
The total weight is 1.0. The table also indicates in each column whether min or max is preferable; that is my input as well.
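For reference, standard TOPSIS with vector normalization (the mnorm=vector setting) can be sketched in plain Python without skcriteria. The decision matrix and weights below are taken from the table above; exact ranks may differ slightly from skcriteria's output depending on implementation details:

```python
import math

# Decision matrix rows follow the results table:
# [Match, High, Low, HACD, LACD] per engine.
names = [
    "Deuterium v2019.2.37.59", "Arasan 21.3", "Ufim v8.02", "CT800 V1.34",
    "DanaSah 7.9 Human", "Stockfish 2019.07.14", "Cheng 4.39",
    "Discocheck 5.2", "Houdini 3", "Amyan 1.72", "Rhetoric 1.4.3",
    "Hiarcs 14", "Cheese 2.1",
]
matrix = [
    [362, 291, 347, 357, 335], [305, 266, 429, 494, 313],
    [428, 280, 292, 475, 332], [333, 244, 423, 268, 739],
    [332, 250, 418, 392, 577], [254, 263, 483, 493, 356],
    [408, 325, 267, 442, 329], [368, 276, 356, 250, 422],
    [360, 263, 377, 393, 217], [348, 239, 413, 399, 738],
    [359, 286, 355, 240, 664], [342, 239, 419, 286, 440],
    [432, 312, 256, 441, 432],
]
weights = [0.6, 0.05, 0.05, 0.2, 0.1]          # Match, High, Low, HACD, LACD
maximize = [True, False, True, False, True]    # max/min preference per criterion

# 1. Vector-normalize each column and apply the weights.
norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(5)]
v = [[weights[j] * row[j] / norms[j] for j in range(5)] for row in matrix]

# 2. Ideal best and worst points per criterion.
cols = list(zip(*v))
best = [max(c) if maximize[j] else min(c) for j, c in enumerate(cols)]
worst = [min(c) if maximize[j] else max(c) for j, c in enumerate(cols)]

# 3. Relative closeness to the ideal solution.
def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

closeness = [dist(r, worst) / (dist(r, best) + dist(r, worst)) for r in v]

# 4. Rank 1 = closest to the ideal.
order = sorted(range(len(names)), key=lambda i: closeness[i], reverse=True)
rank = {names[i]: pos + 1 for pos, i in enumerate(order)}
for name in sorted(rank, key=rank.get):
    print(rank[name], name)
```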
Code:
TOPSIS (mnorm=vector, wnorm=sum) - Solution:
ALT./CRIT.                            Match (max)  High (min)  Low (max)  HACD (min)  LACD (max)  Rank
                                            W.0.6      W.0.05     W.0.05       W.0.2       W.0.1
------------------------------------  -----------  ----------  ---------  ----------  ----------  ----
Deuterium v2019.2.37.59 UCI_Elo 1500          362         291        347         357         335     7
Arasan 21.3 UCI_Elo 1500                      305         266        429         494         313    12
Ufim v8.02 UCI_Elo 1500                       428         280        292         475         332     2
CT800 V1.34 UCI_Elo 1500                      333         244        423         268         739     6
DanaSah 7.9 Human UCI_Elo 1500                332         250        418         392         577    11
Stockfish 2019.07.14 UCI_Elo 1500             254         263        483         493         356    13
Cheng 4.39 UCI_Elo 1500                       408         325        267         442         329     5
Discocheck 5.2 UCI_Elo 1500                   368         276        356         250         422     4
Houdini 3 UCI_Elo 1500                        360         263        377         393         217    10
Amyan 1.72 UCI_Elo 1500                       348         239        413         399         738     8
Rhetoric 1.4.3 UCI_Elo 1500                   359         286        355         240         664     3
Hiarcs 14 UCI_Elo 1500                        342         239        419         286         440     9
Cheese 2.1 UCI_Elo 1500                       432         312        256         441         432     1
According to the weights assigned, Cheese 2.1 is the best at rank #1, followed by Ufim and Rhetoric.
If you have weight values that you would like to see run, post them and I will try them.
Next I will test these engines on 5000 positions.