Laskos wrote: ↑Wed Aug 05, 2020 9:03 am
They are veterans of 2000-2700 Elo ratings of top engines compared to humans, and the lower part of the table should be fairly accurate with respect to human ratings.
The problem is that everything engines beyond the 2900 level are doing may only be good against each other: exploiting each other's tactical and evaluation flaws, flaws exclusive to engines and unrelated to humans. So a human could find it more difficult to play a 3000 Elo engine than a 3500 Elo engine (those 500 Elo gained by exploiting weaker engines are irrelevant against humans). We don't know, but style could matter more when it comes to denying a human any draw.
If this is the case, an engine like Benjamin could be capable of performing better against humans than Stockfish NNUE (what if no human can ever draw Benjamin? Is that an infinite Elo superiority?), and the latter's 1000 Elo advantage would not matter.
To answer these questions we need to stop the handicap matches and instead get a GM to play an engine until they get a draw. The incentive can be that if they get a draw in the first 10 games they get big bucks, but every 10 games their prize is cut in half, until they quit because there is not much left to win.
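The halving prize schedule proposed above can be sketched in a few lines. The starting prize of 100,000 is a purely hypothetical figure (the post only says "big bucks"), and the function name is made up for illustration.

```python
def prize_for_first_draw(game_number: int, initial_prize: float = 100_000.0) -> float:
    """Prize paid if the GM's first draw comes in game `game_number` (1-based).

    Assumption: the prize is halved after every completed block of 10 games,
    so games 1-10 pay the full prize, games 11-20 pay half, and so on.
    `initial_prize` is a hypothetical placeholder value.
    """
    if game_number < 1:
        raise ValueError("game_number must be >= 1")
    completed_blocks = (game_number - 1) // 10
    return initial_prize / (2 ** completed_blocks)


if __name__ == "__main__":
    for game in (1, 10, 11, 25, 50):
        print(game, prize_for_first_draw(game))
```

After 50 games without a draw the prize has already dropped to 1/32 of the starting amount, which is the point at which a GM would plausibly give up.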
I'm of the opinion that chess959 GM vs engine tests would be the best way to ascertain raw chess skill (chess959 is chess960 without the standard start position).
The heavy memorization associated with the standard start position creates move quality aberrations, with a massive drop in move strength once the player is out of book. While this memorization is part of a human chess player's strength in normal chess, it creates severe distortions when attempting to compare ratings.
Obviously, GMs would score much worse in chess959 than in regular chess, but the resulting strength estimate would be a much fairer assessment of their abilities in general positions they aren't already familiar with. It's also more logical to have them play against a bookless engine, because otherwise why not fit the engine with a very strong/tricky book?
But indeed, I have been thinking about it for a while (restarting the FCP rating list), because I have two i9-10900K machines.
But I need the other one most of the time for other things.
I am interested in doing:
1. Once a year, an FCP Tourney with 41,000 games (needs around 5 months). For that reason it's important that the site for the first of these tourneys is perfect, so that the work is easy to reuse for the following tourneys.
2. In the other 7 months of the year I am interested in testing Wasp for John. I am not sure how long John will be working on Wasp, but at the moment John is having a lot of fun with his engine and I think he will keep at it for a long time. In other words, as long as John is working on Wasp, I have fun testing Wasp.
I have no interest in using two PCs for computer chess.
Energy is too expensive.
For watching on my TV ... ten matches running at the same time on the Intel i9-10900K are more than enough for me.
Best
Frank
PS: I am much more interested in working on Excel statistics for the FCP tourney database with Klaus Wlotzka (Excel expert). Working with Klaus is really a pleasure. The FCP Tourney is important because it gives me my own ratings for testing Wasp vs. the others. Better than using ratings from others, I think.
Current SSDF rating list looks interesting, and I think they use something similar to ELOStat. https://ssdf.bosjo.net/list.htm
They are veterans of 2000-2700 Elo ratings of top engines compared to humans, and the lower part of the table should be fairly accurate with respect to human ratings.
Here are some 165 engines listed using their database and ELOStat.
How can you tell from the list how many threads were used? Some say "MP", some don't. Does this mean that if "MP" isn't stated it is single-threaded? If it is "MP", does that always mean 4 threads? Do the processor numbers give a clue?
Laskos wrote: ↑Wed Aug 05, 2020 9:03 am
They are veterans of 2000-2700 Elo ratings of top engines compared to humans, and the lower part of the table should be fairly accurate with respect to human ratings.
The problem is that everything engines beyond the 2900 level are doing may only be good against each other: exploiting each other's tactical and evaluation flaws, flaws exclusive to engines and unrelated to humans. So a human could find it more difficult to play a 3000 Elo engine than a 3500 Elo engine (those 500 Elo gained by exploiting weaker engines are irrelevant against humans). We don't know, but style could matter more when it comes to denying a human any draw.
If this is the case, an engine like Benjamin could be capable of performing better against humans than Stockfish NNUE (what if no human can ever draw Benjamin? Is that an infinite Elo superiority?), and the latter's 1000 Elo advantage would not matter.
To answer these questions we need to stop the handicap matches and instead get a GM to play an engine until they get a draw. The incentive can be that if they get a draw in the first 10 games they get big bucks, but every 10 games their prize is cut in half, until they quit because there is not much left to win.
Measuring Elo differences of more than around 200 by direct matches isn't very accurate, for multiple reasons. Beyond a 191 Elo gap, even winning every game with White and drawing every game with Black (a 75% score) will lose Elo for the stronger player. So the stronger player has to play bad openings as Black to minimize the risk of White reaching an easy draw just from memory. For engines, it means that special opening books are needed to do this, and then we're rating the book, not the engine. Ideally chess competition and ratings should change to reflect this problem: all pairings could be two-game matches with only the winner of the match counting, or all games could be replayed until someone wins (at faster time controls, or with some total time and increment for the match). The point is that White should never benefit from a draw; it is contrary to the logic of chess. This is not a big issue when the players are reasonably close in strength, as the top humans are, but if we have engines playing humans without a handicap it becomes a huge issue.
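For anyone wondering where the 191 figure comes from, here is a minimal sketch assuming the usual logistic Elo expectation formula (rating lists that use a different curve or lookup table will give a slightly different number). The function names are just for illustration.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def elo_diff_for_score(score: float) -> float:
    """Rating gap that makes `score` (strictly between 0 and 1) the expected result."""
    return 400.0 * math.log10(score / (1.0 - score))


if __name__ == "__main__":
    # Winning every White game and drawing every Black game is a 75% score.
    break_even = elo_diff_for_score(0.75)
    print(round(break_even, 1))  # 190.8

    # Above that gap the model expects more than 75%, so a 75% score
    # costs the stronger player rating points.
    for gap in (100, 191, 300, 500):
        print(gap, round(expected_score(gap), 3))
```

So a 75% score is break-even at a gap of about 191; at any larger gap the expectation is higher and the "win with White, draw with Black" result loses points.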
Current SSDF rating list looks interesting, and I think they use something similar to ELOStat. https://ssdf.bosjo.net/list.htm
They are veterans of 2000-2700 Elo ratings of top engines compared to humans, and the lower part of the table should be fairly accurate with respect to human ratings.
Here are some 165 engines listed using their database and ELOStat.
How can you tell from the list how many threads were used? Some say "MP", some don't. Does this mean that if "MP" isn't stated it is single-threaded? If it is "MP", does that always mean 4 threads? Do the processor numbers give a clue?
As I understand it, all 1800X entries run on all 8 cores, all Q6600 entries on all 4 cores, and the rest on one core. AFAIK, they are very conservative, using only 2-hour games and a full CPU per engine with Ponder=ON (playing on 2 PCs). Remarkably so, in fact.
Alayan wrote: ↑Wed Aug 05, 2020 3:13 pm
I'm of the opinion that chess959 GM vs engine tests would be the best way to ascertain raw chess skill (chess959 is chess960 without the standard start position).
The heavy memorization associated with the standard start position creates move quality aberrations, with a massive drop in move strength once the player is out of book. While this memorization is part of a human chess player's strength in normal chess, it creates severe distortions when attempting to compare ratings.
Obviously, GMs would score much worse in chess959 than in regular chess, but the resulting strength estimate would be a much fairer assessment of their abilities in general positions they aren't already familiar with. It's also more logical to have them play against a bookless engine, because otherwise why not fit the engine with a very strong/tricky book?
We played four games of chess959 in June with Komodo 14 (both regular and MCTS) giving knight odds to GM Alex Lenderman (2642 FIDE Rapid) at 15' + 10". Lenderman scored two wins, one loss, and one draw (one win was vs. regular Komodo, the other games vs. MCTS). Knight odds is worth more than 1000 Elo at this level, so this gives some idea of "raw chess skill". At least at this rapid time control, a rating above 3500 looks justified.
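The arithmetic behind that estimate can be sketched roughly as follows. This assumes the logistic Elo formula, treats both Komodo versions as one opponent, and simply adds the knight-odds value on top of the handicapped performance; the 1000 figure is the lower bound quoted above. With only four games the error bars are enormous, so this is an illustration of the reasoning rather than a measurement.

```python
import math


def elo_diff_for_score(score: float) -> float:
    """Rating gap implied by a score fraction under the logistic Elo model."""
    return 400.0 * math.log10(score / (1.0 - score))


gm_rating = 2642                   # Lenderman's FIDE Rapid rating, from the post
gm_points = 2 * 1.0 + 1 * 0.5      # two wins and one draw (the loss adds 0)
games = 4

engine_score = 1.0 - gm_points / games                    # 0.375
handicapped_perf = gm_rating + elo_diff_for_score(engine_score)

knight_odds_elo = 1000             # assumed lower bound for knight odds
implied_rating = handicapped_perf + knight_odds_elo

if __name__ == "__main__":
    print(round(handicapped_perf))  # 2553
    print(round(implied_rating))    # 3553
```

Since knight odds is said to be worth more than 1000 at this level, the implied figure would be a floor under these assumptions, which is consistent with "above 3500".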