Hardware vs Software

Discussion of chess software programming and technical issues.

Moderator: Ras

Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: Hardware vs Software

Post by Karlo Bala »

bob wrote:
CRoberson wrote:
bob wrote:I ran this overnight. I simply made Evaluate() return the material score only. It was almost exactly a 400-point drop in Elo from the version with the most recent evaluation.

Code: Select all

                    Elo    +    - games score oppo. draws
Crafty-22.9R01     2650    5    5 31128   51%  2644   21%
Crafty-22.9R02     2261    5    6 31128    9%  2644    7%
OK, here is another test: no book. Combine that with the full Crafty and the raw-material Crafty.
I never use a book. I always start from about 4,000 unique starting positions and play the games out from both sides (hence the 8,000 games per opponent).
Do you use a contempt factor in the stronger Crafty? I think that with a bigger contempt factor the result would be different.
Best Regards,
Karlo Balla Jr.
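As an aside on what "return the material score only" amounts to in practice: a minimal sketch of such an Evaluate() in C, with hypothetical pieceCount arrays and conventional centipawn values (Crafty's real data structures and constants differ):

Code: Select all

/* Material-only evaluation sketch (hypothetical, not Crafty's actual code).
   Assumes pieceCount[side][piece] arrays; returns centipawns from the
   side-to-move's point of view. */
enum { PAWN, KNIGHT, BISHOP, ROOK, QUEEN, NPIECES };
enum { WHITE, BLACK };
static const int value[NPIECES] = { 100, 300, 300, 500, 900 };

int Evaluate(const int pieceCount[2][NPIECES], int sideToMove)
{
    int p, score = 0;

    for (p = PAWN; p < NPIECES; p++)
        score += value[p] * (pieceCount[WHITE][p] - pieceCount[BLACK][p]);
    return sideToMove == WHITE ? score : -score;
}

Every positional term disappears and only piece counts matter, which is why the change is essentially a one-line experiment in most engines.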
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software - test results

Post by bob »

BubbaTough wrote:
My point is not about the best Crafty but about the rating of the material-only Crafty that is supposed to be 400 Elo weaker; I believe the difference is bigger.
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't personally care enough about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).

-Sam
I disagree, for one simple reason. Elo estimates are unreliable at the edges, and the math is quite easy to follow. For example, if I play opponents that are over 400 Elo below me, I will likely win every game, and my estimated Elo can only be around their average + 400 or so. So if I play against very strong opponents, my Elo will be wildly off: if the opponents are 2800 and I lose every game, I could be 2200 or 1000, and there is no way to tell. The same is true on the other end: if I play much weaker programs, my rating will likely be well understated.

So, for comparison, what do we choose? No matter what, one of the two tests is going to be way off. I can't use two different groups of programs and then compare the Elos for Crafty with and without evaluation, since Elo numbers from two disparate rating pools are not comparable.

In short, I can either play the weak Crafty against weak programs and then run the strong Crafty against the same group, getting an estimated Elo for the strong version that will be understated, or else I can play against strong programs, where the weak Crafty's Elo will be overstated. There is no scientific solution other than having a truly large rating pool to play both against, and that is computationally intractable...
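To see concretely why the estimate degenerates at the edges, here is a small illustrative program (not from the thread) that inverts the usual logistic Elo model, D = -400 * log10(1/s - 1) for score fraction s; the inputs are chosen to match the table above:

Code: Select all

/* Illustrative only: Elo difference implied by a score fraction s under
   the logistic model E = 1 / (1 + 10^(-D/400)), inverted to
   D = -400 * log10(1/s - 1). */
#include <math.h>
#include <stdio.h>

static double elo_diff(double s)
{
    return -400.0 * log10(1.0 / s - 1.0);
}

int main(void)
{
    printf("51%% vs 2644 avg -> %.0f\n", 2644.0 + elo_diff(0.51)); /* ~2651 */
    printf(" 9%% vs 2644 avg -> %.0f\n", 2644.0 + elo_diff(0.09)); /* ~2242 */
    printf("99.9%% score     -> %+.0f\n", elo_diff(0.999));        /* ~+1200 */
    return 0;
}

As s approaches 1 (or 0) the implied difference diverges, which is why an all-wins (or all-losses) match pins a rating down to nothing better than "at least roughly 400 away."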
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software - test results

Post by bob »

Uri Blass wrote:
BubbaTough wrote:
My point is not about the best Crafty but about the rating of the material-only Crafty that is supposed to be 400 Elo weaker; I believe the difference is bigger.
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't personally care enough about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).

-Sam
Yes, not making progress is part of the problem.

Another problem is simply getting an inferior position and being killed positionally, where the fact that you see more than your opponent does not help because you only see sooner that you are losing.

It happens in only some of the games, but when it happens you can lose even against engines that are 1000 Elo weaker.

The point is that sometimes you may win or draw against relatively stronger engines because your search spots a big material win, and sometimes you may lose even against weak engines because you get a bad position and the search only helps you see that you are losing before your opponent does.

Uri
However, which of those four opponents would you believe Crafty is tactically out-searching? I believe they are all pretty equal from a search perspective, having played tens of millions of games against the group. So there is no "weaker opponent" it could stumble into draws against and do worse, and there is no stronger opponent that it can tactically out-search. So I guess I do not get the "point". See my post to Sam for further reasons why I believe any such test will show a bias, except for one so large it is intractable for me to run.
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Hardware vs Software - test results

Post by BubbaTough »

bob wrote:
BubbaTough wrote:
My point is not about the best Crafty but about the rating of the material-only Crafty that is supposed to be 400 Elo weaker; I believe the difference is bigger.
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't personally care enough about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).

-Sam
I disagree, for one simple reason. Elo estimates are unreliable at the edges, and the math is quite easy to follow. For example, if I play opponents that are over 400 Elo below me, I will likely win every game, and my estimated Elo can only be around their average + 400 or so. So if I play against very strong opponents, my Elo will be wildly off: if the opponents are 2800 and I lose every game, I could be 2200 or 1000, and there is no way to tell. The same is true on the other end: if I play much weaker programs, my rating will likely be well understated.

So, for comparison, what do we choose? No matter what, one of the two tests is going to be way off. I can't use two different groups of programs and then compare the Elos for Crafty with and without evaluation, since Elo numbers from two disparate rating pools are not comparable.

In short, I can either play the weak Crafty against weak programs and then run the strong Crafty against the same group, getting an estimated Elo for the strong version that will be understated, or else I can play against strong programs, where the weak Crafty's Elo will be overstated. There is no scientific solution other than having a truly large rating pool to play both against, and that is computationally intractable...
I think you have to suck it up and play the two against different pools of opponents. Bad, I know... but probably not as bad as the alternative. If I were trying to figure out how good a human was whom I had seen beat masters in blitz, and also how good his little brother was who had just learned how to move the pieces, I would suggest they enter different tournaments. There just are no other options. We are not really trying to measure 1-Elo differences in this case anyway; we are trying to measure within hundreds, so perfection is not required.

Regarding how "scientific" this is... well, in wishy-washy sciences (like physics, psychology, and medicine) you often have to do much, much worse studies in order to draw conclusions using the same statistical tool set. They seem to muddle through.


-Sam
Uri Blass
Posts: 10788
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Hardware vs Software - test results

Post by Uri Blass »

bob wrote:
BubbaTough wrote:
My point is not about the best Crafty but about the rating of the material-only Crafty that is supposed to be 400 Elo weaker; I believe the difference is bigger.
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't personally care enough about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).

-Sam
I disagree, for one simple reason. Elo estimates are unreliable at the edges, and the math is quite easy to follow. For example, if I play opponents that are over 400 Elo below me, I will likely win every game, and my estimated Elo can only be around their average + 400 or so. So if I play against very strong opponents, my Elo will be wildly off: if the opponents are 2800 and I lose every game, I could be 2200 or 1000, and there is no way to tell. The same is true on the other end: if I play much weaker programs, my rating will likely be well understated.

So, for comparison, what do we choose? No matter what, one of the two tests is going to be way off. I can't use two different groups of programs and then compare the Elos for Crafty with and without evaluation, since Elo numbers from two disparate rating pools are not comparable.

In short, I can either play the weak Crafty against weak programs and then run the strong Crafty against the same group, getting an estimated Elo for the strong version that will be understated, or else I can play against strong programs, where the weak Crafty's Elo will be overstated. There is no scientific solution other than having a truly large rating pool to play both against, and that is computationally intractable...
If A is significantly stronger than B, then the only way to get an accurate rating for B is to hold different tournaments in which everybody plays indirectly against the other players.

I do not play directly against GMs, but I play against players who are 100 Elo better, who in turn play against players 100 Elo better than them, and we continue in this way to establish ratings.

The same idea works here.

Material-only Crafty may play against opponents that score 60-70% against it.
Those opponents may play against opponents that score 60-70% against them.

You can continue in this way until the opponents can score 30-40% against default Crafty.

This is the only way to get a good estimate of a big rating difference.

Uri
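Numerically, this ladder amounts to summing per-link differences, each measured where the score is moderate and the logistic estimate is trustworthy. A sketch under that assumption, reusing the elo_diff() formula from above with invented 65% link scores:

Code: Select all

/* Sketch of the ladder idea: estimate a large gap as the sum of
   per-link differences, each measured where scores stay moderate
   (60-70%) and the logistic estimate is reliable.  The 65% link
   scores below are invented for illustration. */
#include <math.h>
#include <stdio.h>

static double elo_diff(double s)
{
    return -400.0 * log10(1.0 / s - 1.0);
}

int main(void)
{
    double link_score[4] = { 0.65, 0.65, 0.65, 0.65 };
    double total = 0.0;
    int i;

    for (i = 0; i < 4; i++)
        total += elo_diff(link_score[i]);
    printf("estimated gap across 4 links: %.0f Elo\n", total); /* ~430 */
    return 0;
}

Each link stays in the region where a few thousand games give a tight estimate; the cost is that every intermediate pool also has to be played out, which is bob's tractability objection.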
Uri Blass
Posts: 10788
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Hardware vs Software - test results

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
BubbaTough wrote:
My point is not about the best Crafty but about the rating of the material-only Crafty that is supposed to be 400 Elo weaker; I believe the difference is bigger.
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't personally care enough about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).

-Sam
Yes, not making progress is part of the problem.

Another problem is simply getting an inferior position and being killed positionally, where the fact that you see more than your opponent does not help because you only see sooner that you are losing.

It happens in only some of the games, but when it happens you can lose even against engines that are 1000 Elo weaker.

The point is that sometimes you may win or draw against relatively stronger engines because your search spots a big material win, and sometimes you may lose even against weak engines because you get a bad position and the search only helps you see that you are losing before your opponent does.

Uri
However, which of those four opponents would you believe Crafty is tactically out-searching? I believe they are all pretty equal from a search perspective, having played tens of millions of games against the group. So there is no "weaker opponent" it could stumble into draws against and do worse, and there is no stronger opponent that it can tactically out-search. So I guess I do not get the "point". See my post to Sam for further reasons why I believe any such test will show a bias, except for one so large it is intractable for me to run.
Material-only Crafty may out-search Fruit 2.1.
Remember that a material-only evaluation is faster than the normal evaluation, so material-only Crafty may search deeper than normal Crafty.

I agree with you that testing against significantly stronger opponents is generally not a good way to estimate a rating difference, but I read that material-only Crafty scored 9% against the opponents, while I believe that normal programs of similar strength (to material-only Crafty) would score less than that.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software

Post by bob »

CRoberson wrote:
bob wrote:
CRoberson wrote:
bob wrote:I ran this overnight. I simply made Evaluate() return the material score only. It was almost exactly a 400-point drop in Elo from the version with the most recent evaluation.

Code: Select all

                    Elo    +    - games score oppo. draws
Crafty-22.9R01     2650    5    5 31128   51%  2644   21%
Crafty-22.9R02     2261    5    6 31128    9%  2644    7%
OK, here is another test: no book. Combine that with the full Crafty and the raw-material Crafty.
I never use a book. I always start from about 4,000 unique starting positions and play the games out from both sides (hence the 8,000 games per opponent).
Yes, I forgot about that. What about position learning? Is that on or off?
If on, how about turning it off - I think learning would skew the results.
No learning of any kind is enabled... and the engines are restarted for each individual game, so no internal learning or hash state is carried over either...
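The scheme described here (each opening played from both sides, engines restarted per game) boils down to a loop like the following sketch; play_game() and restart_engines() are hypothetical stand-ins for the real test harness:

Code: Select all

/* Sketch of the test scheme described above: every opening is played
   twice with colors swapped, and the engines are restarted before each
   game so no learning or hash state carries over.  play_game() and
   restart_engines() are hypothetical. */
#define N_OPENINGS 4000

extern double play_game(int opening, int crafty_is_white);  /* 1, 0.5, 0 */
extern void   restart_engines(void);

double run_match(void)
{
    double score = 0.0;
    int i;

    for (i = 0; i < N_OPENINGS; i++) {      /* 2 * 4000 = 8000 games */
        restart_engines();
        score += play_game(i, 1);           /* Crafty as white */
        restart_engines();
        score += play_game(i, 0);           /* Crafty as black */
    }
    return score / (2.0 * N_OPENINGS);      /* score fraction, 0..1 */
}

Playing each opening from both colors cancels any built-in bias in the starting positions, which is why the game count is exactly twice the position count.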
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software

Post by bob »

Karlo Bala wrote:
bob wrote:
CRoberson wrote:
bob wrote:I ran this overnight. I simply made Evaluate() return the material score only. It was almost exactly a 400-point drop in Elo from the version with the most recent evaluation.

Code: Select all

                    Elo    +    - games score oppo. draws
Crafty-22.9R01     2650    5    5 31128   51%  2644   21%
Crafty-22.9R02     2261    5    6 31128    9%  2644    7%
OK, here is another test: no book. Combine that with the full Crafty and the raw-material Crafty.
I never use a book. I always start from about 4,000 unique starting positions and play the games out from both sides (hence the 8,000 games per opponent).
Do you use a contempt factor in the stronger Crafty? I think that with a bigger contempt factor the result would be different.
I use a contempt factor of 0.00 all the time. I see no reason to tune one program differently from another...
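For readers unfamiliar with the term: contempt usually just shifts the score an engine returns for drawn positions (repetitions, the fifty-move rule), so a positive value makes it avoid draws it would otherwise accept. A generic sketch, not Crafty's implementation:

Code: Select all

/* Generic contempt sketch (not Crafty's code).  The score returned for
   drawn positions is shifted so that, with contempt > 0, a draw looks
   slightly bad to the engine and it plays on against "weaker" opponents.
   bob's setting of 0.00 leaves draws scored as exactly 0. */
static int contempt = 0;   /* centipawns */

int DrawScore(int root_side_to_move)
{
    return root_side_to_move ? -contempt : contempt;
}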
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software - test results

Post by bob »

BubbaTough wrote:
bob wrote:
BubbaTough wrote:
My point is not about the best Crafty but about the rating of the material-only Crafty that is supposed to be 400 Elo weaker; I believe the difference is bigger.
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't personally care enough about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).

-Sam
I disagree, for one simple reason. Elo estimates are unreliable at the edges, and the math is quite easy to follow. For example, if I play opponents that are over 400 Elo below me, I will likely win every game, and my estimated Elo can only be around their average + 400 or so. So if I play against very strong opponents, my Elo will be wildly off: if the opponents are 2800 and I lose every game, I could be 2200 or 1000, and there is no way to tell. The same is true on the other end: if I play much weaker programs, my rating will likely be well understated.

So, for comparison, what do we choose? No matter what, one of the two tests is going to be way off. I can't use two different groups of programs and then compare the Elos for Crafty with and without evaluation, since Elo numbers from two disparate rating pools are not comparable.

In short, I can either play the weak Crafty against weak programs and then run the strong Crafty against the same group, getting an estimated Elo for the strong version that will be understated, or else I can play against strong programs, where the weak Crafty's Elo will be overstated. There is no scientific solution other than having a truly large rating pool to play both against, and that is computationally intractable...
I think you have to suck it up and play the two against different pools of opponents. Bad, I know... but probably not as bad as the alternative. If I were trying to figure out how good a human was whom I had seen beat masters in blitz, and also how good his little brother was who had just learned how to move the pieces, I would suggest they enter different tournaments. There just are no other options. We are not really trying to measure 1-Elo differences in this case anyway; we are trying to measure within hundreds, so perfection is not required.

Regarding how "scientific" this is... well, in wishy-washy sciences (like physics, psychology, and medicine) you often have to do much, much worse studies in order to draw conclusions using the same statistical tool set. They seem to muddle through.


-Sam
It's not just bad, it is no good at all. You can't compare the ratings whatsoever if one version plays group A and the other plays group B. The statistics are no good there, so you could not compare eval to no-eval and have any idea how much was lost by stripping the eval out.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software - test results

Post by bob »

Uri Blass wrote:
bob wrote:
BubbaTough wrote:
My point is not about the best Crafty but about the rating of only material Crafty that is supposed to be 400 elo weaker and I believe that the difference is bigger.
I think it is quite plausible that Uri is right. With no eval, crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't really care enough personally about measuring this that I would actually spend a bunch of cycles testing it, just writing to tip my hat to a good catch by Uri (I think).

-Sam
I disagree, for one simple reason. Elo estimates are unreliable at the edges, and the math is quite easy to follow. For example, if I play opponents that are over 400 Elo below me, I will likely win every game, and my estimated Elo can only be around their average + 400 or so. So if I play against very strong opponents, my Elo will be wildly off: if the opponents are 2800 and I lose every game, I could be 2200 or 1000, and there is no way to tell. The same is true on the other end: if I play much weaker programs, my rating will likely be well understated.

So, for comparison, what do we choose? No matter what, one of the two tests is going to be way off. I can't use two different groups of programs and then compare the Elos for Crafty with and without evaluation, since Elo numbers from two disparate rating pools are not comparable.

In short, I can either play the weak Crafty against weak programs and then run the strong Crafty against the same group, getting an estimated Elo for the strong version that will be understated, or else I can play against strong programs, where the weak Crafty's Elo will be overstated. There is no scientific solution other than having a truly large rating pool to play both against, and that is computationally intractable...
If A is significantly stronger than B, then the only way to get an accurate rating for B is to hold different tournaments in which everybody plays indirectly against the other players.

I do not play directly against GMs, but I play against players who are 100 Elo better, who in turn play against players 100 Elo better than them, and we continue in this way to establish ratings.

The same idea works here.

Material-only Crafty may play against opponents that score 60-70% against it.
Those opponents may play against opponents that score 60-70% against them.

You can continue in this way until the opponents can score 30-40% against default Crafty.

This is the only way to get a good estimate of a big rating difference.

Uri
I understand exactly how the math works here. And I am not going to undertake an intractable problem to answer a question that is hardly that important in the first place. I really don't care how much the eval influences the rating. My first chess program in 1968 had an evaluation, as does the Crafty of today, so that is a highly uninteresting point to me. I did it because Charles asked me to. But I would need a range of players in this mix, from significantly above Crafty to significantly below Crafty, and that makes the test take too long for such a small return...