Repeating games with switched colors reduces Elo error. All matches should be done like this

Ovyron · Post by **Ovyron** » Tue Feb 25, 2020 11:44 pm

Alayan wrote: ↑Tue Feb 25, 2020 8:46 pm When an engine is rated, it should not be rated over a tiny subset of chess positions it likes and can go in if using an opening book meant to exclude most of the lines it does worse in.

I disagree. Why isn't the performance of 1.g4 or 1.f3 tested for the engine? It's a subset where most engines underperform so we prune those variations as bad. If an engine really underperforms in some variation of the Semi-Slav, then forcing this engine to play it makes as much sense as making it play from 1.f3.

It's like Margnus Carlsen the world champion and his bad performance on Chess960. Imagine he was only as good as his rating indicates for the openings he wants to play. Would you force him to play Chess960 from positions he's not good at? No, we let him maximize his performance with opening selection.

Why don't we allow engines to do the same? We have the technology. An engine has no reason to play from a position it doesn't like (just like they have no reason to play 1.f3), so their ratings are distorted when we make them play openings they wouldn't play if they had a choice.

mwyoung · Post by **mwyoung** » Tue Feb 25, 2020 11:51 pm

Ovyron wrote: ↑Tue Feb 25, 2020 11:44 pm
Alayan wrote: ↑Tue Feb 25, 2020 8:46 pm When an engine is rated, it should not be rated over a tiny subset of chess positions it likes and can go in if using an opening book meant to exclude most of the lines it does worse in.
I disagree. Why isn't the performance of 1.g4 or 1.f3 tested for the engine? It's a subset where most engines underperform so we prune those variations as bad. If an engine really underperforms in some variation of the Semi-Slav, then forcing this engine to play it makes as much sense as making it play from 1.f3.

It's like Margnus Carlsen the world champion and his bad performance on Chess960. Imagine he was only as good as his rating indicates for the openings he wants to play. Would you force him to play Chess960 from positions he's not good at? No, we let him maximize his performance with opening selection.

Why don't we allow engines to do the same? We have the technology. An engine has no reason to play from a position it doesn't like (just like they have no reason to play 1.f3), so their ratings are distorted when we make them play openings they wouldn't play if they had a choice.

I agree. I want to see a wide range of openings. And I want to expose weaknesses.

Alayan · Post by **Alayan** » Wed Feb 26, 2020 12:03 am

Ovyron wrote: ↑Tue Feb 25, 2020 11:44 pm I disagree. Why isn't the performance of 1.g4 or 1.f3 tested for the engine? It's a subset where most engines underperform so we prune those variations as bad. If an engine really underperforms in some variation of the Semi-Slav, then forcing this engine to play it makes as much sense as making it play from 1.f3.

1. g4 is bad for white, so of course as white all engines will get a worse score than with a "normal opening". But if you match engine A and B, both playing each side of the opening, and engine A wins with black while drawing with white, then A performs better than B in this position and this tells us something about the strength of both engines. 1. g4 is so one-sided that you'll get a rather low elo-spread compared to many normal openings, which makes it a rather poor choice for testing, but that's it.

Whether or not the engines would willingly go into the variation they are made to play is completely irrelevant.

Rating the start position performance of an engine is futile. Results will be significantly skewed by a few early preferences and by the interaction with the early preferences of the opposing engine ; the elo-spread will be poor, and you results will be off by dozens of elo up or down when it comes to the analysis strength of the engine over a wide range of positions.

mwyoung · Post by **mwyoung** » Wed Feb 26, 2020 12:18 am

Alayan wrote: ↑Wed Feb 26, 2020 12:03 am
Ovyron wrote: ↑Tue Feb 25, 2020 11:44 pm I disagree. Why isn't the performance of 1.g4 or 1.f3 tested for the engine? It's a subset where most engines underperform so we prune those variations as bad. If an engine really underperforms in some variation of the Semi-Slav, then forcing this engine to play it makes as much sense as making it play from 1.f3.
1. g4 is bad for white, so of course as white all engines will get a worse score than with a "normal opening". But if you match engine A and B, both playing each side of the opening, and engine A wins with black while drawing with white, then A performs better than B in this position and this tells us something about the strength of both engines. 1. g4 is so one-sided that you'll get a rather low elo-spread compared to many normal openings, which makes it a rather poor choice for testing, but that's it.

Whether or not the engines would willingly go into the variation they are made to play is completely irrelevant.

Rating the start position performance of an engine is futile. Results will be significantly skewed by a few early preferences and by the interaction with the early preferences of the opposing engine ; the elo-spread will be poor, and you results will be off by dozens of elo up or down when it comes to the analysis strength of the engine over a wide range of positions.

You are correct to a point. But the player decides how the engine plays. And I can force a engine to play any opening I want it to play. In testing the point is not to achieve the exact rating of the engine. For most people the Ranking is more important. What engines is going to give the best performance in the most positions. That is why I try to give a wide range of openings and many games. As no engine is perfect.

Ovyron · Post by **Ovyron** » Wed Feb 26, 2020 12:59 am

Alayan wrote: ↑Wed Feb 26, 2020 12:03 am 1. g4 is bad for white

And <insert variation here> is bad for white when this engine plays that variation. Forcing it to play it makes as much sense as making it play 1.g4.

If an engine plays really bad a line from a book we only need 1 game to see it and then we can remove this variation from the book of this engine so that its actual strength isn't affected by it, for this engine this line is as if it was forced to start with material handicap or low on time which wouldn't happen in a real game, so this engine wouldn't have to play this line in a real game.

What we'd want from ELO is to be able to predict superiority in a setting like the World Championship, what would be the expected result if both engines played at their absolute best, and that includes a tournament book that would maximize its performance, and that wouldn't have this line on there just because the opponent had it in its book and now has to be played with reversed colors.

Switching colors gives the expected average strength that the engine would show if it played random openings, what I propose would show the maximum strength that it could show if all participants played their strongest openings. And nothing stops an engine from displaying this strength, except from testers that insist on using generic openings, so switching colors is a kind of handicap when engines are forced to play openings they're bad at.

Alayan · Post by **Alayan** » Wed Feb 26, 2020 1:12 am

If an engine is bad at generalizing over many different type of openings, it wouldn't make it any weaker in your settings. But if it's bad at generalizing, it makes it weaker in commonly accepted computer chess rating methods. I prefer this method which values being good over many positions over being slightly stronger in a limited subset. I don't want the elo of rating lists to predict best result in a world championship setting, I want it to predict best result if taking an imperfect position from a game and playing it on.

Your method also create many practical issues. How is the tester supposed to make sure he makes the engine play the lines it's best at ? While once you've got some opening positions set for testing it's plug-and-play. Engines getting rid of their included book is one of the biggest improvement in the computer chess world from the point of view of chess engine authors. Nothing preventing CC players or such to add their book on top for their start position competitions. DragonMist, former ICCF world champion, told me that at this point he considers top-level CC dead because it's becoming near-impossible to get wins against strong well-prepared opponents.

We can agree to disagree I guess.

bob · Post by **bob** » Wed Feb 26, 2020 1:26 am

mwyoung wrote: ↑Tue Feb 25, 2020 7:57 pm
mmt wrote: ↑Tue Feb 25, 2020 1:42 pm This only applies to tests with the same opening book for both sides. It makes intuitive sense that the results will be more accurate if player A and player B play both sides of all openings. But I couldn't find any empirical results so I wrote a utility to test it out myself.

First, I've compared predictions that can be made after the first n games (multiple runs ordered randomly for higher accuracy) about the rest of the match. The results of matches with switched colors give more accurate predictions about the rest of the match.

Then I've used the bootstrap method of Elo error estimation. Playing with switched colors reduces Elo error. The tools like Ordo do not take the switched colors games into account and as a result, their error is too large. Instead of taking a match with 100 games (50 pairs) and picking individual games they should treat this match as 50 games, each having a result of 0, 0.5, 1, 1.5, or 2.

I can run this test for many matches if somebody wants to see the hard numbers. But it's clear that matches not using the switched color system are unnecessarily wasting CPU/GPU time by having to run more games to get the same accuracy as the matches with switched colors.
I find it better to use repeated opening with color switching. I do not use a set of positions. But a good opening book, but it is still a book of games. And with today's engines. They do find errors that are losing from the opening from time to time. Playing reverse colors cancels out that opening to a draw result.

How so? Two equal programs, lost or won positions will likely split between the two. But I've always used that as a test to check the positions and probably delete those that do this. But with two not-so-equal engines, the not-so-good engine might well lose both. Which just adds another loss that would come somewhere else anyway.

As far as Elo accuracy goes, I've done it both ways and found no difference between the two, so long as all the positions are randomly chosen, with random side-to-move, so that the same program does not always get white or get a winning position. About the only real advantage for playing both sides is picking out those 1-1 opening positions that are likely too imbalanced to be useful and also it reduces the number of positions you need by a factor of two.

mwyoung · Post by **mwyoung** » Wed Feb 26, 2020 2:26 am

bob wrote: ↑Wed Feb 26, 2020 1:26 am
mwyoung wrote: ↑Tue Feb 25, 2020 7:57 pm
mmt wrote: ↑Tue Feb 25, 2020 1:42 pm This only applies to tests with the same opening book for both sides. It makes intuitive sense that the results will be more accurate if player A and player B play both sides of all openings. But I couldn't find any empirical results so I wrote a utility to test it out myself.

First, I've compared predictions that can be made after the first n games (multiple runs ordered randomly for higher accuracy) about the rest of the match. The results of matches with switched colors give more accurate predictions about the rest of the match.

Then I've used the bootstrap method of Elo error estimation. Playing with switched colors reduces Elo error. The tools like Ordo do not take the switched colors games into account and as a result, their error is too large. Instead of taking a match with 100 games (50 pairs) and picking individual games they should treat this match as 50 games, each having a result of 0, 0.5, 1, 1.5, or 2.

I can run this test for many matches if somebody wants to see the hard numbers. But it's clear that matches not using the switched color system are unnecessarily wasting CPU/GPU time by having to run more games to get the same accuracy as the matches with switched colors.
I find it better to use repeated opening with color switching. I do not use a set of positions. But a good opening book, but it is still a book of games. And with today's engines. They do find errors that are losing from the opening from time to time. Playing reverse colors cancels out that opening to a draw result.
How so? Two equal programs, lost or won positions will likely split between the two. But I've always used that as a test to check the positions and probably delete those that do this. But with two not-so-equal engines, the not-so-good engine might well lose both. Which just adds another loss that would come somewhere else anyway.

As far as Elo accuracy goes, I've done it both ways and found no difference between the two, so long as all the positions are randomly chosen, with random side-to-move, so that the same program does not always get white or get a winning position. About the only real advantage for playing both sides is picking out those 1-1 opening positions that are likely too imbalanced to be useful and also it reduces the number of positions you need by a factor of two.

I do delete them, but I can not do so before I find them. And the testing will just keep improving the book.

Ovyron · Post by **Ovyron** » Wed Feb 26, 2020 3:22 am

Alayan wrote: ↑Wed Feb 26, 2020 1:12 am I don't want the elo of rating lists to predict best result in a world championship setting, I want it to predict best result if taking an imperfect position from a game and playing it on.

But what game is it? If it's some random position of some random game then Stockfish or Leela with 2GPUs are already going to play much better this position than this engine, so all your elo would say is how weaker than top engines it'd do.

If it's a specific line from a game the engine is playing, doing so the way I say could reveal a position that this engine likes so much that it plays it better than Stockfish or Leela, because they just have higher ELO in general over the subset of opening the tester chooses (which isn't bigger than than the subset the engine plays best; I know because I built a chess map and can compare sizes), and if only this line was played the engine would show a higher elo than them, unless the tester also did this for them and made them avoid the positions they're bad at (that aren't necessarily bad positions, just positions current top engines don't know how to play.)

Alayan wrote: ↑Wed Feb 26, 2020 1:12 am How is the tester supposed to make sure he makes the engine play the lines it's best at ?

You start generic and delete the bad lines from the custom books of engines. What if there's an engine that excels with 1.f4? You test that and if it's good you keep it, if not you delete it until no engine plays the bird, but you don't make those decisions a priori for all engines by choosing a generic set that supposedly generalizes chess openings and call it a day, if one engine doesn't like one of your choices you have no reason to force it on it.

Alayan wrote: ↑Wed Feb 26, 2020 1:12 amDragonMist, former ICCF world champion, told me that at this point he considers top-level CC dead because it's becoming near-impossible to get wins against strong well-prepared opponents.

I'm not convinced about it, I think that if the prize was 1000000 dollars we'd see top-level CC alive and kicking with some amazing chess we have yet to witness, and that time travelers with software and hardware from 2025 would destroy today's top-level CC, so those winning strings exist, but nobody's life have depended on finding them so they'd rather play an easy game they can draw than getting into a complex position with 5% drawing chances that they could lose, but win as well.

Alayan wrote: ↑Wed Feb 26, 2020 1:12 amWe can agree to disagree I guess.

That's fine. I'll just claim that my method is better, and the only culprit is the time requirement that it has. Nobody has the time to prepare the strongest possible opening book for every engine they test, but I don't know if it was done some engine like Houdini 6 would rise to the top (it'd just need to play from both sides the best openings that it plays instead of this generic mess that has the top engines static.)

jp · Post by jp » Wed Feb 26, 2020 4:08 am

Ovyron wrote: ↑Tue Feb 25, 2020 11:44 pm It's like Margnus Carlsen the world champion and his bad performance on Chess960. Imagine he was only as good as his rating indicates for the openings he wants to play. Would you force him to play Chess960 from positions he's not good at? No, we let him maximize his performance with opening selection.

He does not have bad performances in Chess960. He had one day (or 1 1/2) of bad performance in Chess960. He sometimes self-destructs when he's in a certain mood, both in traditional chess and Chess960. Outside of that one day, his performances have been very good. e.g. look at his match with Nakamura.

The Chess960 swings in performance we see now is partly due to unfamiliarity with how players should approach the game. That will probably diminish in the future. Players who now have more experience in Chess960 will definitely have an advantage. This worked in Wesley So's favor. In qualifying, he lost and had to go through the long route (the system used did not eliminate him immediately), but that gave him more experience.

Repeating games with switched colors reduces Elo error. All matches should be done like this

Re: Repeating games with switched colors reduces Elo error. All matches should be done like this

Re: Repeating games with switched colors reduces Elo error. All matches should be done like this

Re: Repeating games with switched colors reduces Elo error. All matches should be done like this

Re: Repeating games with switched colors reduces Elo error. All matches should be done like this

Re: Repeating games with switched colors reduces Elo error. All matches should be done like this

Re: Repeating games with switched colors reduces Elo error. All matches should be done like this

Re: Repeating games with switched colors reduces Elo error. All matches should be done like this

Re: Repeating games with switched colors reduces Elo error. All matches should be done like this

Re: Repeating games with switched colors reduces Elo error. All matches should be done like this

Re: Repeating games with switched colors reduces Elo error. All matches should be done like this