Could we use partial scoring for draws to improve evaluation of engine vs engine matches and tournaments?

Alayan · Post by **Alayan** » Thu Feb 06, 2020 5:05 am

Forcing a stalemate, or an ending with a minor piece against a bare king, against an opponent that tries to avoid them is not much easier than checkmating.

Ovyron · Post by **Ovyron** » Thu Feb 06, 2020 6:17 am

lkaufman wrote: ↑Thu Feb 06, 2020 4:55 am This is not a relevant argument today.

But the relevant argument today is that top engines have become so good that winning games has become very hard, so actually managing to win games should be rewarded more, instead of draws suddenly being good enough (with engines that win and lose scoring less than ones that almost win their games).

If this is going to be implemented, I suggest that we first decrease the points that draws give to, say, 0.20. Above that floor, engines that are stronger and outplay their opponents, but not enough to win, get rewarded up to the normal 0.50 (but never over this, so engines that do win aren't overshadowed). Soccer's 3-1-0 system is equivalent to draws being worth 0.333, so most engines would end up scored close to that system: the ones that are outplayed but save their games get 0.20 for draws, and the ones that outplay their opponents but can't win get 0.50 for draws.
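As a rough sketch of the scheme described above, a drawn game could be scored per side like this. The `outplay` input is a hypothetical measure in [0, 1] of how much one side outplayed the other; the post does not say how it would be computed, and the linear interpolation between 0.20 and 0.50 is likewise only an assumption.

```python
def draw_points(outplay: float, floor: float = 0.20, cap: float = 0.50) -> float:
    """Points one side receives for a drawn game.

    outplay is a hypothetical value in [0, 1]: 0 for the side that was
    outplayed (or an even game), 1 for the side that came closest to winning.
    Linear interpolation between floor and cap is an assumption.
    """
    if not 0.0 <= outplay <= 1.0:
        raise ValueError("outplay must be between 0 and 1")
    return floor + (cap - floor) * outplay


def draw_fraction_of_win(win_points: float = 3.0, draw_points_value: float = 1.0) -> float:
    """Value of a draw as a fraction of a win, e.g. 1/3 for soccer's 3-1-0."""
    return draw_points_value / win_points


outplayed_side = draw_points(0.0)     # the side that merely saved the game
outplaying_side = draw_points(1.0)    # the side that pressed but could not win
soccer_equivalent = draw_fraction_of_win()
```

Note that under these numbers the two sides of a draw no longer split a full point between them, so the scoring is not zero-sum.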

mmt · Post by **mmt** » Thu Feb 06, 2020 6:40 am

Ovyron wrote: ↑Thu Feb 06, 2020 6:17 am But the relevant argument today is that top engines have become so good that winning games has become very hard, so actually managing to win games should be rewarded more, instead of draws suddenly being good enough (with engines that win and lose scoring less than ones that almost win their games).

Just a note: this is not what I'm proposing. I just want to get more info out of the engine games, especially draws. Rule changes are a different matter.

Ovyron · Post by **Ovyron** » Thu Feb 06, 2020 6:45 am

But it's the same thing. If you look at what happened in the games to extract information about the participants and modify their ELO, it's as if you inserted fake games into the programs that calculate ELO to get the same modification that you'd get if you scored draws differently (programs that get outplayed in draws would get less ELO, while programs that outplay others would get more). And if this isn't done carefully, engines that actually win their games are going to end up with less ELO than those that draw a lot but outplay their opponents frequently.
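To make the concern concrete, here is a minimal sketch using the standard logistic Elo relation between score fraction and rating difference. The game counts are invented for illustration, and the 0.20/0.50 draw values are borrowed from the earlier suggestion only as placeholders; they are not the scoring mmt is proposing.

```python
import math


def elo_diff(score_fraction: float) -> float:
    """Rating difference implied by a score fraction (logistic Elo model)."""
    return -400.0 * math.log10(1.0 / score_fraction - 1.0)


def score_fraction(wins, losses, good_draws, bad_draws, good=0.5, bad=0.5):
    """Score fraction when 'good' and 'bad' draws may be worth different points.

    good_draws: draws in which this engine outplayed its opponent.
    bad_draws:  draws in which this engine was outplayed.
    With good == bad == 0.5 this is ordinary 1-0.5-0 scoring.
    """
    games = wins + losses + good_draws + bad_draws
    return (wins + good * good_draws + bad * bad_draws) / games


# Hypothetical engines: one wins and loses, the other only draws but presses.
winner = dict(wins=20, losses=10, good_draws=0, bad_draws=70)
drawer = dict(wins=0, losses=0, good_draws=100, bad_draws=0)

standard = (score_fraction(**winner), score_fraction(**drawer))
partial = (score_fraction(**winner, good=0.5, bad=0.2),
           score_fraction(**drawer, good=0.5, bad=0.2))

standard_elo = tuple(elo_diff(s) for s in standard)
partial_elo = tuple(elo_diff(s) for s in partial)
```

With these made-up numbers the engine that wins games is ahead under ordinary scoring (0.55 against 0.50) but falls behind once its draws are discounted (about 0.34 against 0.50), which is the reversal being warned about.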

mmt · Post by **mmt** » Thu Feb 06, 2020 7:08 am

Unless the engines game the Maz score somehow, we can get additional info. I wrote before how the basic way to game it won't work. Yes, engines that win more games might end up with worse Maz ELO using this scoring system on rare occasions. But what's the problem with that? I think it will be more indicative of engine strength than normal ELO and that's what matters. The data from real games will answer whether this is a better way to score or not. Normal ELO scoring will still work unchanged and I doubt Maz will become popular enough for the engines to change anything.

I will try to find time to get some solid data and think about how to do this right.

jp · Post by **jp** » Thu Feb 06, 2020 9:03 am

lkaufman wrote: ↑Thu Feb 06, 2020 4:55 am Every decent engine knows what is mating material and what is not. The typical situation that ends up with (for example) king and knight vs king is that one side gets a favorable ending, neither engine can read out to the end whether Black can reach the drawn king vs king and knight or not, but they both know that otherwise White will win. So they score it something like plus 1.00. If it turns out that Black can eliminate the last pawn and reach the draw, it doesn't change the fact that both engines assessed that White had outplayed Black, and Black was just lucky to have enough resources (unknown to both engines) to reach the draw. White should be considered the stronger engine in this example if there is no other information available, though by a lesser margin than if White had actually won.

I just confirmed that for KNN vs K, Komodo 13.2's eval is +0.06 to +0.07.

But other standard endgame draws with few pieces are not known by Komodo 13.2. e.g. +2.70, depth=45.

Ovyron · Post by **Ovyron** » Thu Feb 06, 2020 3:40 pm

mmt wrote: ↑Thu Feb 06, 2020 7:08 am I think it will be more indicative of engine strength than normal ELO and that's what matters.

If this is implemented naively, engines that fail to win won games because they go for high-scoring positions that are drawn, and engines that misevaluate drawn positions and give them high scores, will rise to the top. Hopefully this is implemented right.

But ELO is a system that is supposed to help us predict the results of engines, and I wouldn't find it useful to get predictions of a high probability that an engine reaches a high-scoring position but fails to win it. I'd rather see the one that wins those, even if it loses more of the others.

lkaufman · Post by **lkaufman** » Thu Feb 06, 2020 5:49 pm

jp wrote: ↑Thu Feb 06, 2020 9:03 am
lkaufman wrote: ↑Thu Feb 06, 2020 4:55 am Every decent engine knows what is mating material and what is not. The typical situation that ends up with (for example) king and knight vs king is that one side gets a favorable ending, neither engine can read out to the end whether Black can reach the drawn king vs king and knight or not, but they both know that otherwise White will win. So they score it something like plus 1.00. If it turns out that Black can eliminate the last pawn and reach the draw, it doesn't change the fact that both engines assessed that White had outplayed Black, and Black was just lucky to have enough resources (unknown to both engines) to reach the draw. White should be considered the stronger engine in this example if there is no other information available, though by a lesser margin than if White had actually won.
I just confirmed that for KNN vs K, Komodo 13.2's eval is +0.06 to +0.07.

But other standard endgame draws with few pieces are not known by Komodo 13.2. e.g. +2.70, depth=45.

Yes, we chose a small positive value for KNN vs K since the KNN side can still win on time or by a blunder, and also because a spectator will assume that Komodo was close to winning and just couldn't quite convert. I think for KN or KB vs K we use a really tiny score since only the second reason applies. Regarding more complex drawn simple endgames, we use numbers that represent practical chances if TBs are not used; we didn't devote a lot of time to trying to program myriad simple endgames that are normally covered by TBs.

mmt · Post by **mmt** » Thu Feb 06, 2020 7:20 pm

Ovyron wrote: ↑Thu Feb 06, 2020 3:40 pm But ELO is a system that is supposed to help us predict the results of engines, and I wouldn't find it useful to get predictions of a high probability that an engine reaches a high-scoring position but fails to win it. I'd rather see the one that wins those, even if it loses more of the others.

I mean, if it predicts actual 1-0.5-0 scores better, then it doesn't really matter how we arrived at the score. If it works, I think it'd be especially useful for developers because they wouldn't have to run as many games between different versions of their engine to see if a change made it better or worse.
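One way such a check could look, as a sketch only: take a rating difference estimated by each method, and measure how well each one predicts held-out game results scored the ordinary 1-0.5-0 way. The expected-score formula is the standard logistic Elo curve; the rating differences and results below are invented, and mean squared error is just one possible yardstick.

```python
def expected_score(rating_diff: float) -> float:
    """Expected score for the higher-rated side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def mean_squared_error(rating_diff: float, results: list[float]) -> float:
    """Average squared gap between predicted and actual 1-0.5-0 results."""
    predicted = expected_score(rating_diff)
    return sum((r - predicted) ** 2 for r in results) / len(results)


# Hypothetical held-out results from engine A's point of view.
held_out = [1.0, 0.5, 0.5, 0.5, 0.5, 1.0, 0.5, 0.0, 0.5, 0.5]

# Hypothetical rating differences estimated from earlier games.
diff_from_normal_scoring = 10.0
diff_from_partial_scoring = 35.0

error_normal = mean_squared_error(diff_from_normal_scoring, held_out)
error_partial = mean_squared_error(diff_from_partial_scoring, held_out)
# The method with the lower error on held-out games predicts them better.
```

If the partial scoring reached a given error level with fewer training games than normal scoring, that would support the claim about needing fewer games.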

Ovyron · Post by **Ovyron** » Thu Feb 06, 2020 9:54 pm

mmt wrote: ↑Thu Feb 06, 2020 7:20 pm If it works, I think it'd be especially useful for developers because they wouldn't have to run as many games between different versions of their engine to see if a change made it better or worse.

Not if it produces a lot more games where the engine shows a high eval in a drawn game but fewer wins in general.
