Issue with self play testing

Discussion of chess software programming and technical issues.


Issue with self play testing

Post by CRoberson » Fri May 18, 2018 2:31 am

I have been testing a new version of Ares based on an issue that came up in a game between Ares and Myrddin on Graham's site.
The issue pertains to king safety: the change makes Ares more aware of the potential for a certain type of king attack/defense.
After playing Ares-old vs Ares-new, I saw that the new version made the attacks the old version wasn't aware of, and the rating
gain was 28 Elo. Upon reflection, I see that the Elo gain is possibly 2x that: the old Ares never made such attacks, so the
ability to defend against them went untested and unmeasured.

Thus, self-play testing can lead to insufficient test cases, resulting in an underestimate of the rating gain.
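As a side note on the numbers: under the standard logistic Elo model, a 28 Elo gain corresponds to a match score of roughly 54%. A minimal sketch of the conversion (plain Python for illustration, not from any engine's code; whether the true gain is really 2x depends on the opponents, as discussed below):

```python
import math

def elo_diff(score):
    """Elo difference implied by a match score fraction (0 < score < 1),
    under the standard logistic Elo model."""
    return 400.0 * math.log10(score / (1.0 - score))

def expected_score(elo):
    """Inverse mapping: expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

# A 54% match score works out to roughly +28 Elo:
print(round(elo_diff(0.54)))            # ~28
print(round(expected_score(28.0), 3))   # ~0.54
```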


Re: Issue with self play testing

Post by MikeB » Fri May 18, 2018 3:43 am

Interesting. Typically, self-play (or very-similar-engine) testing overestimates the rating gain.


Re: Issue with self play testing

Post by Evert » Fri May 18, 2018 5:56 am

CRoberson wrote:
Fri May 18, 2018 2:31 am
Upon reflection, I see that the Elo gain is possibly 2x that: the old Ares never made such attacks and thus
the ability to defend against them went untested and unmeasured.
Yes, and that's why the gain is typically less than it is in self-play: the other opponent may not have been so blind, so you gain less by playing against them.
Of course this very much depends on the opponent you measure against and the gaps in their evaluation.
Thus, self-play testing can lead to insufficient test cases, resulting in an underestimate of the rating gain.
Yes. Self-testing can make you blind to gaps in the evaluation function. It works fairly well for optimising the weights of evaluation features you already have, but you need to test against other engines to find out what your weaknesses are.
Or you need to try adding loads of different terms and see what sticks (which is sort of what SF does), or you need to extract evaluation features in addition to evaluation weights (neural nets).


Re: Issue with self play testing

Post by cdani » Fri May 18, 2018 12:08 pm

For the last month I've been testing every change vs the previous Andscacs version and vs Stockfish. It often happens that a change is good against one and bad against the other.
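Conflicting verdicts like that are often within noise. A quick sanity check is to put an error bar on each match score before reading anything into the Elo difference. A minimal sketch (plain Python; the W/D/L counts below are made up for illustration):

```python
import math

def elo(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return 400.0 * math.log10(score / (1.0 - score))

def score_with_stderr(wins, draws, losses):
    """Match score and its standard error from raw W/D/L counts."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    # sample variance of the per-game outcomes (1, 0.5, 0)
    var = (wins * (1.0 - s) ** 2
           + draws * (0.5 - s) ** 2
           + losses * (0.0 - s) ** 2) / n
    return s, math.sqrt(var / n)

# Hypothetical 1000-game match: is a ~+10 Elo result real or noise?
s, se = score_with_stderr(wins=300, draws=428, losses=272)
low, high = elo(s - 2.0 * se), elo(s + 2.0 * se)
print(f"score {s:.3f}, ~95% Elo interval [{low:+.1f}, {high:+.1f}]")
```

With these hypothetical counts the interval spans roughly -7 to +26 Elo, so a +10 Elo point estimate against one opponent and a small negative one against another need not actually contradict each other.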


Re: Issue with self play testing

Post by CRoberson » Fri May 18, 2018 8:14 pm

Thanks, Evert. I do know all that; I was just posting about an interesting issue in self-play testing.
I've used various other engines for gauntlets and such, and my published research from the mid-1990s is in neural nets.
Of course, most don't know that. I should apologize; I am sure you are trying to help.
I see you live in the Netherlands - neat. I was there for two weeks in 2002: Amsterdam, Utrecht, then Maastricht. A very nice country; I rather liked it.
Continuing to like the US is getting more difficult with all the Republican __((**E&&#@___
If you are up to date on the fairest opening books or positions to use for testing, I would be very interested in hearing about that.


Re: Issue with self play testing

Post by Greg Strong » Fri May 18, 2018 11:16 pm

Nice to hear that you are working on a new version of Ares :)

I test almost exclusively against eight other engines. I rotate them from time to time, but Ares is one of the engines that I have used a lot.
