search extensions

Post by **bob** » Wed Nov 12, 2014 2:57 am

Henk wrote:
bob wrote:
Henk wrote:
bob wrote:
Henk wrote:
bob wrote:
Henk wrote:
jdart wrote: I have been using gauntlet testing almost exclusively for the past couple of years. I think with self-play you are not likely to find some weaknesses because your engine is not tuned to exploit them, while another engine may do that.

--Jon
The same argument could be used to say that engine Elo ratings cannot be trusted or compared to human Elo ratings, because engines almost never play against grandmasters.
Why would one care how you do against WEAKER opposition, so long as you play better against stronger opposition? However, many of us DO test against GMs all the time, just not in the same controlled environment that cluster testing provides.
Because there is no stronger human opposition. It might be that the best engines are only 150 Elo stronger than the top five grandmasters rather than 400, unless they often play against humans as well.

Self-play also applies to a group. There are only two members: the engine group and the human group. Humans against humans is self-play. Engines against engines is also self-play.

Perhaps there are more groups, for instance brute-force-like engines.
Humans against humans is anything but "self-play" unless you play yourself. I don't follow that. There is a big difference between two identical opponents playing and a group of "similar" opponents playing. I agree that a larger group is better, but with humans you introduce a lot of noise into the measurement: computers are stable in their playing level, but humans can be wildly variable.
Since there are groups of similar opponents, a test set should contain members that represent each group. For instance, engines are mostly tactical players, so you should have positional players too.
It's "good in theory, but impossible in reality." You need consistent results. Computers provide that quickly and easily. Humans, never. Over 30K games, a computer's Elo varies by +/- 3 Elo or so. A human, +/- 200 when you take the edge cases of being tired, depressed, sick, hungry, angry, etc...
Then the test set should contain engines that represent each group. For instance a test set should not only contain tactical playing engines but also strategic/positional playing engines.

Isn't this just a bit on the obvious side??? Most of us include engines stronger and weaker than ours also... for that same reason.
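
For anyone wondering where a figure like "+/- 3 Elo over 30K games" can come from, here is a rough sketch of the usual error-bar arithmetic. It is only an illustration, not the actual cluster tooling: the function names `elo_from_score` and `elo_margin`, the 95% z-value, and the win/draw rates in the usage note are all assumptions, and it treats games as independent trials scored with the standard logistic Elo formula.

```python
import math

def elo_from_score(score):
    """Convert an expected score (strictly between 0 and 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_margin(games, win_rate, draw_rate, z=1.96):
    """Approximate half-width of the Elo confidence interval.

    Assumes independent games; z=1.96 corresponds to roughly 95% confidence.
    """
    loss_rate = 1.0 - win_rate - draw_rate
    score = win_rate + 0.5 * draw_rate
    # Per-game variance of the result, where a win=1, draw=0.5, loss=0.
    variance = (win_rate * (1.0 - score) ** 2
                + draw_rate * (0.5 - score) ** 2
                + loss_rate * (0.0 - score) ** 2)
    stderr = math.sqrt(variance / games)
    low = elo_from_score(score - z * stderr)
    high = elo_from_score(score + z * stderr)
    return (high - low) / 2.0

if __name__ == "__main__":
    # Hypothetical even match: 35% wins, 30% draws, 35% losses.
    print(round(elo_margin(30000, 0.35, 0.30), 2))
```

With an evenly matched pairing and a moderate draw rate, 30,000 games puts the margin in the neighborhood of 3 Elo, and the margin shrinks only with the square root of the game count. The human +/- 200 figure is not a sampling error of this kind, so the sketch says nothing about it.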
