Hardware vs Software

Discussion of chess software programming and technical issues.

Moderator: Ras

BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Hardware vs Software - test results

Post by BubbaTough »

bob wrote: It's not just bad, it is no good. You can't compare the ratings whatsoever if you play group A and group B. The statistics are no good there, so you could not compare eval to no eval and have any idea at all about how much was lost by stripping the eval out.
Of course you can compare them, as long as you have some faith that you already know the ratings in groups A and B. That is why all the players outside of your cluster (human and computer) are able to have ratings. Granted, no one has the accuracy you do using your system for measuring small changes...but as this situation shows, your system is awful at measuring large changes. Remember, the discussion is not about whether an Elo change is +-5 points; Uri was guessing things were off by about 600 Elo. It does not take 32,000 games to gather an opinion on that.


That said, I agree with what you said in another post...if you don't really care about measuring the effect of no eval, it's not worth bothering to set up.

-Sam
User avatar
hgm
Posts: 28351
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hardware vs Software - test results

Post by hgm »

Uri is right: you cannot get a meaningful rating by one-sided testing (i.e. only against much stronger or much weaker opponents). In such a case the result becomes critically dependent on the width of the score-vs-opponent-rating curve. The rating model assumes some standard width for this, but in practice different engines can have vastly different widths.

In my experience, engines without knowledge in particular can have a very wide distribution of their results (i.e. be prone to surprises). Against engines that are 500 Elo stronger they win much more frequently than the Elo model predicts, but they also lose much more frequently against engines that are 500 Elo weaker. So basing their rating only on results against much stronger opponents tends to overestimate it. At the same time, it is also very bad for the statistics: even if this systematic error did not exist, scoring 10% out of 32,000 games results in about the same statistical error in the rating as 3,200 games from which you score 50%.
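
As a quick back-of-the-envelope check on that last claim (an editorial sketch, not code from the thread): ignoring draws, the score fraction p carries a standard error of sqrt(p(1-p)/N), and propagating that through the logistic Elo curve gives the error in the measured rating. The small program below, with illustrative names, works out both scenarios; under these assumptions the two errors come out within about a factor of two of each other.

Code: Select all

/* Editorial sketch: statistical error of a rating measured from a
   match score, ignoring draws.  Illustrative only. */
#include <math.h>
#include <stdio.h>

/* Elo difference implied by score fraction p (logistic model). */
static double elo_from_score(double p) {
    return 400.0 * log10(p / (1.0 - p));
}

/* Standard error of the rating: sqrt(p(1-p)/n) propagated through
   dElo/dp = 400 / (ln(10) * p * (1-p)). */
static double elo_stderr(double p, double n) {
    return (400.0 / (log(10.0) * p * (1.0 - p))) * sqrt(p * (1.0 - p) / n);
}

int main(void) {
    printf("10%% of 32000 games: %+5.0f Elo, SE ~%.1f\n",
           elo_from_score(0.10), elo_stderr(0.10, 32000.0));
    printf("50%% of  3200 games: %+5.0f Elo, SE ~%.1f\n",
           elo_from_score(0.50), elo_stderr(0.50, 3200.0));
    return 0;
}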

So the rating of the evaluation-less Crafty is likely much, much lower than a mere 400 Elo drop would suggest. But there is indeed no reason for Bob to waste time on this. Any of us could get a pretty accurate estimate of the difference by simply measuring the position of the weakened Crafty on a known rating scale, with just a couple of hundred games. Who cares if the difference is 630 or 650?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software - test results

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
BubbaTough wrote:
My point is not about the best Crafty but about the rating of material-only Crafty, which is supposed to be 400 Elo weaker; I believe the difference is bigger.
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't really care enough personally about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).

-Sam
Yes
Not making progress is part of the problem.

Another problem is simply getting an inferior position and being killed positionally, where the fact that you see more than your opponent does not help, because you only see that you lose faster.

It happens in only some of the games, but when it happens you can lose even against engines that are 1000 Elo weaker.

The point is that sometimes you may win or draw against relatively stronger engines because you see a big material win by search, and sometimes you can even lose against weak engines because you get a bad position and search only helps you see that you are losing faster than your opponent.

Uri
However, which of those four opponents would you believe Crafty is tactically out-searching? I believe they are all pretty equal from a search perspective, having played tens of millions of games against the group. So there is no "weaker opponent" it could stumble into draws against and do worse, and there is no stronger opponent that it can tactically out-search. So I guess I do not get the "point". See my post to Sam for further reasons why I believe any test will show a bias, except for one so large it is intractable for me to deal with.
Material-only Crafty may out-search Fruit 2.1.
Remember that a material-only evaluation is faster than the normal evaluation, so material-only Crafty may search deeper than normal Crafty.

I agree with you that testing against significantly stronger opponents is generally not a good way to estimate a rating difference, but I read that material-only Crafty got 9% against these opponents, and I believe that normal programs of similar strength (to material-only Crafty) would score less than that.

Uri
I still don't follow. If I rip out the eval, I rip out something like 50% of the total time being used. So does a crippled program have to just give up that 50% of its time rather than using it in its search instead??? Playing enough opponents to guarantee that both Crafty ratings are bracketed is too much effort for zero return.
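
For concreteness, the change being discussed (making Evaluate() return the material score only, as bob describes elsewhere in the thread) could look something like the sketch below. This is a hypothetical illustration: the POSITION layout, piece values, and names are assumptions, not Crafty's actual source.

Code: Select all

/* Hypothetical material-only evaluation, in the spirit of "make
   Evaluate() return the material score only".  Structure and names
   are illustrative, not Crafty's real code. */
enum { PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING };

typedef struct {
    int piece_count[2][6];   /* [white=0 / black=1][piece type] */
    int side_to_move;        /* 0 = white, 1 = black */
} POSITION;

int Evaluate(const POSITION *pos) {
    static const int value[6] = { 100, 300, 300, 500, 900, 0 };
    int score = 0;
    for (int piece = PAWN; piece <= QUEEN; piece++)
        score += value[piece] * (pos->piece_count[0][piece] -
                                 pos->piece_count[1][piece]);
    /* Return the score from the side to move's point of view,
       as the search expects. */
    return pos->side_to_move == 0 ? score : -score;
}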
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software - test results

Post by bob »

BubbaTough wrote:
bob wrote: It's not just bad, it is no good. You can't compare the ratings whatsoever if you play group A and group B. The statistics are no good there, so you could not compare eval to no eval and have any idea at all about how much was lost by stripping the eval out.
Of course you can compare them, as long as you have some faith that you already know the ratings in groups A and B. That is why all the players outside of your cluster (human and computer) are able to have ratings. Granted, no one has the accuracy you do using your system for measuring small changes...but as this situation shows, your system is awful at measuring large changes. Remember, the discussion is not about whether an Elo change is +-5 points; Uri was guessing things were off by about 600 Elo. It does not take 32,000 games to gather an opinion on that.


That said, I agree with what you said in another post...if you don't really care about measuring the effect of no eval, it's not worth bothering to set up.

-Sam
You know the math. I know the math. I know that you know the math. If you form two groups A and B, you can _not_ compare ratings for any player in group A with any player in group B, _unless_ there is cross-contamination produced by players in one group playing against players in the other. If you do not have this, the ratings are _absolutely_ incomparable. And I _do_ mean _absolutely_. Elo is not an absolute scale. It is a relative scale within a pool of players, which predicts how players within that pool would score against each other.

This isn't that difficult to understand. And this kind of testing is less than useless. The current ratings are at least comparable, even though we know that the "edge" ratings for a pool are inaccurate. But at least the "scale" of the ratings is somewhat reasonable, which two distinct pools would absolutely not have.
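
A tiny numerical illustration of this point (an editorial sketch, not from the thread): the logistic Elo model constrains only rating *differences*, so adding any constant to every rating in one pool leaves all of that pool's predicted scores unchanged. Nothing in the game results pins down the offset between two pools that never play each other.

Code: Select all

/* Editorial sketch: Elo predictions are invariant under a constant
   shift of a whole pool, so the offset between two disjoint pools
   is undetermined by their games. */
#include <math.h>
#include <stdio.h>

/* Expected score of A vs B under the logistic Elo model. */
static double expected(double ra, double rb) {
    return 1.0 / (1.0 + pow(10.0, (rb - ra) / 400.0));
}

int main(void) {
    double c = 600.0;   /* arbitrary shift applied to a whole pool */
    printf("E(2650 vs 2450)     = %.4f\n", expected(2650.0, 2450.0));
    printf("E(2650+c vs 2450+c) = %.4f\n", expected(2650.0 + c, 2450.0 + c));
    return 0;    /* both lines print 0.7597 */
}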
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Hardware vs Software - test results

Post by BubbaTough »

bob wrote:
You know the math. I know the math. I know that you know the math. If you form two groups A and B, you can _not_ compare ratings for any player in group A with any player in group B, _unless_ there is cross-contamination produced by players in one group playing against players in the other. If you do not have this, the ratings are _absolutely_ incomparable. And I _do_ mean _absolutely_. Elo is not an absolute scale. It is a relative scale within a pool of players, which predicts how players within that pool would score against each other.

This isn't that difficult to understand. And this kind of testing is less than useless. The current ratings are at least comparable, even though we know that the "edge" ratings for a pool are inaccurate. But at least the "scale" of the ratings is somewhat reasonable, which two distinct pools would absolutely not have.
I agree with everything you said. But if you accept some sort of mildly wrong premise (like the relative ratings from some testing group or another not being off by over 100 Elo), then something can be done, because presumably that testing group has some cross-over. Otherwise, it is as you say.

Sigh...I have probably spent 15 minutes typing on the message board over the last few days...I could have tweaked a number and started a test for my program with that time. Where are my priorities :oops:.

-Sam
Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: Hardware vs Software

Post by Karlo Bala »

bob wrote:
Karlo Bala wrote:
bob wrote:
CRoberson wrote:
bob wrote:I ran this overnight. I simply made Evaluate() return the material score only. It was almost exactly a 400 point drop in Elo from the version with the most recent evaluation.

Code: Select all

                    Elo    +    - games score oppo. draws
Crafty-22.9R01     2650    5    5 31128   51%  2644   21% 
Crafty-22.9R02     2261    5    6 31128    9%  2644    7% 
Ok, here is another test. No book. Combine that with the full Crafty
and the raw material Crafty.
I never use a book. I always start from about 4,000 unique starting positions and play the games out from both sides (hence the 8,000 games per opponent).
Do you use a contempt factor in the stronger Crafty? I think that with a bigger contempt factor the result would be different
I use a contempt factor of 0.00 all the time. I see no reason to tune one program differently than another...
...to avoid draws caused by repetition, which are undesirable for the stronger Crafty
Best Regards,
Karlo Balla Jr.
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Hardware vs Software

Post by BubbaTough »

Contempt for no-eval Crafty seems like it would have a significant effect, since no-eval Crafty considers so many positions dead even. That said, I would consider it somewhat cheating to add contempt for it.

-Sam
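
For reference, a contempt factor is usually implemented by scoring drawn outcomes (repetitions, fifty-move draws, and so on) slightly below zero from the engine's own point of view, so it avoids them against weaker opposition. A minimal hypothetical sketch, with illustrative names rather than Crafty's actual code:

Code: Select all

/* Hypothetical contempt sketch.  Names are illustrative. */
static int contempt = 30;    /* centipawns; 0 disables it, as bob uses */
static int root_side;        /* side the engine plays; set at the root */

/* Score returned by the search for drawn positions: a draw costs
   the engine `contempt` and gains it for the opponent. */
int DrawScore(int side_to_move) {
    return side_to_move == root_side ? -contempt : contempt;
}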
Uri Blass
Posts: 10787
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Hardware vs Software

Post by Uri Blass »

Karlo Bala wrote:
bob wrote:
Karlo Bala wrote:
bob wrote:
CRoberson wrote:
bob wrote:I ran this overnight. I simply made Evaluate() return the material score only. It was almost exactly a 400 point drop in Elo from the version with the most recent evaluation.

Code: Select all

                    Elo    +    - games score oppo. draws
Crafty-22.9R01     2650    5    5 31128   51%  2644   21% 
Crafty-22.9R02     2261    5    6 31128    9%  2644    7% 
Ok, here is another test. No book. Combine that with the full Crafty
and the raw material Crafty.
I never use a book. I always start from about 4,000 unique starting positions and play the games out from both sides (hence the 8,000 games per opponent).
Do you use a contempt factor in the stronger Crafty? I think that with a bigger contempt factor the result would be different
I use a contempt factor of 0.00 all the time. I see no reason to tune one program differently than another...
...to avoid draws caused by repetition, which are undesirable for the stronger Crafty
My experience with no contempt factor is that draws by repetition are simply rare (assuming a very big rating difference).

Movei has no contempt factor, but I almost never see draws if I play it against significantly weaker opponents.

Uri
Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: Hardware vs Software

Post by Karlo Bala »

BubbaTough wrote: Contempt for no-eval Crafty seems like it would have a significant effect, since no-eval Crafty considers so many positions dead even. That said, I would consider it somewhat cheating to add contempt for it.

-Sam
The bigger contempt is for the stronger Crafty (the one with eval), to avoid unnecessary draws. The stronger engine should always be dissatisfied with a draw. Without a bigger contempt, if the stronger Crafty has the weaker position it will be satisfied with a draw, even though the weaker Crafty doesn't know that it has the better position. The idea is that the stronger Crafty should play for the win and give the weaker Crafty a chance to make a bad move (because it doesn't know anything about the position) :wink:

This is just a theory; it doesn't mean it works...
Best Regards,
Karlo Balla Jr.
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Hardware vs Software

Post by BubbaTough »

Uri Blass wrote:
My experience with no contempt factor is that draws by repetition are simply rare (assuming a very big rating difference).

Movei has no contempt factor, but I almost never see draws if I play it against significantly weaker opponents.

Uri
I bet this would not be the case if you had a material-only eval function :).

-Sam