Issue with self play testing

Discussion of chess software programming and technical issues.


Issue with self play testing

Post by CRoberson » Fri May 18, 2018 2:31 am

I have been testing a new version of Ares based on an issue that came up in a game between Ares and Myrddin on Graham's site.
The issue pertains to king safety: the change makes Ares more aware of the potential for a certain type of king attack/defense.
After playing Ares-old vs Ares-new, I saw that the new version made the attacks the old version wasn't aware of, and the rating
gain was 28 Elo. Upon reflection, I see that the Elo gain is possibly 2x that: the old Ares never made such attacks, so the
ability to defend against them went untested and unmeasured.

Thus, self-play testing can lead to insufficient test cases, resulting in an underestimate of the rating gain.
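As a side note on the numbers: under the standard logistic Elo model, a 28 Elo gain corresponds to a match score of roughly 54%. A minimal sketch of the conversion (plain Python for illustration, not from any engine's code; whether the true gain is really 2x depends on the opponents, as discussed below):

```python
import math

def elo_diff(score):
    """Elo difference implied by a match score fraction (0 < score < 1),
    under the standard logistic Elo model."""
    return 400.0 * math.log10(score / (1.0 - score))

def expected_score(elo):
    """Inverse mapping: expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

# A 54% match score works out to roughly +28 Elo:
print(round(elo_diff(0.54)))            # ~28
print(round(expected_score(28.0), 3))   # ~0.54
```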


Re: Issue with self play testing

Post by MikeB » Fri May 18, 2018 3:43 am

Interesting. Typically, self-play (or very-similar-engine) testing overestimates the rating gain.


Re: Issue with self play testing

Post by Evert » Fri May 18, 2018 5:56 am

CRoberson wrote:
Fri May 18, 2018 2:31 am
Upon reflection, I see that the Elo gain is possibly 2x that: the old Ares never made such attacks and thus
the ability to defend against them went untested and unmeasured.
Yes, and that's why the gain is typically less than it is in self-play: the other opponent may not have been so blind, so you gain less by playing against them.
Of course this very much depends on the opponent you measure against and the gaps in their evaluation.
Thus, self-play testing can lead to insufficient test cases, resulting in an underestimate of the rating gain.
Yes. Self-testing can make you blind to gaps in the evaluation function. It works fairly well for optimising the weights of evaluation features you already have, but you need to test against other engines to find out what your weaknesses are.
Or you need to try adding loads of different terms and see what sticks (which is sort of what SF does), or you need to extract evaluation features in addition to evaluation weights (neural nets).


Re: Issue with self play testing

Post by cdani » Fri May 18, 2018 12:08 pm

For the last month I've been testing every change vs the previous Andscacs version and vs Stockfish. It often happens that a change is good against one and bad against the other.
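Conflicting verdicts like that are often within noise. A quick sanity check is to put an error bar on each match score before reading anything into the Elo difference. A minimal sketch (plain Python; the W/D/L counts below are made up for illustration):

```python
import math

def elo(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return 400.0 * math.log10(score / (1.0 - score))

def score_with_stderr(wins, draws, losses):
    """Match score and its standard error from raw W/D/L counts."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    # sample variance of the per-game outcomes (1, 0.5, 0)
    var = (wins * (1.0 - s) ** 2
           + draws * (0.5 - s) ** 2
           + losses * (0.0 - s) ** 2) / n
    return s, math.sqrt(var / n)

# Hypothetical 1000-game match: is a ~+10 Elo result real or noise?
s, se = score_with_stderr(wins=300, draws=428, losses=272)
low, high = elo(s - 2.0 * se), elo(s + 2.0 * se)
print(f"score {s:.3f}, ~95% Elo interval [{low:+.1f}, {high:+.1f}]")
```

With these hypothetical counts the interval spans roughly -7 to +26 Elo, so a +10 Elo point estimate against one opponent and a small negative one against another need not actually contradict each other.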


Re: Issue with self play testing

Post by CRoberson » Fri May 18, 2018 8:14 pm

Thanks, Evert. I do know all that; I was just posting about an interesting issue in self-play testing.
I've used various other engines for gauntlets and such, and my published research from the mid-1990s is in neural nets.
Of course, most don't know that. I should apologize; I am sure you are trying to help.
I see you live in the Netherlands - neat. I was there for two weeks in 2002: Amsterdam, Utrecht, then Maastricht. A very nice country; I rather liked it.
Continuing to like the US is getting more difficult with all the Republican __((**E&&#@___
If you are up to date on the fairest opening books or positions to use for testing, I would be very interested in hearing about that.


Re: Issue with self play testing

Post by Greg Strong » Fri May 18, 2018 11:16 pm

Nice to hear that you are working on a new version of Ares :)

I test almost exclusively against eight other engines. I rotate them from time to time, but Ares is one of the engines that I have used a lot.
