Do you use a contempt factor in the stronger Crafty? I think that with a bigger contempt the result will be different.

bob wrote:
I never use a book. I always start from about 4,000 unique starting positions and play the games out from both sides (hence the 8,000 games per opponent).

CRoberson wrote:
Ok, here is another test. No book. Combine that with the full Crafty and the raw material Crafty.

bob wrote:
I ran this overnight. I simply made Evaluate() return the material score only. It was almost exactly a 400 point drop in Elo from the version with the most recent evaluation.

Code:
Crafty-22.9R01   2650    5    5   31128   51%   2644   21%
Crafty-22.9R02   2261    5    6   31128    9%   2644    7%
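For readers who want to reproduce the experiment: reducing an engine to material-only evaluation is typically a one-line change at the top of the evaluation function. Here is a minimal sketch of the idea, not Crafty's actual code (the piece values and the per-side piece-count arrays are assumptions):

Code:
/* Material-only evaluation sketch -- NOT Crafty's actual code.
   Assumes centipawn values and per-side piece-count arrays
   indexed 0..4 = pawn, knight, bishop, rook, queen. */
int Evaluate(const int white_counts[5], const int black_counts[5]) {
    static const int value[5] = { 100, 300, 300, 500, 900 };
    int score = 0;
    for (int i = 0; i < 5; i++)
        score += value[i] * (white_counts[i] - black_counts[i]);
    return score;  /* positive = white is ahead; all positional terms gone */
}

With every positional term gone, the function is also far cheaper than a full evaluation, which is relevant to Uri's later point that the material-only version may search deeper.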
Hardware vs Software
- Posts: 373
- Joined: Wed Mar 22, 2006 10:17 am
- Location: Novi Sad, Serbia
- Full name: Karlo Balla
Re: Hardware vs Software
Best Regards,
Karlo Balla Jr.
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Hardware vs Software - test results
I disagree, for one simple reason: Elo estimates are bad at the edges, and the math is quite easy to follow. For example, if I play opponents that are over 400 Elo below me, I will likely win every game, and my estimated Elo can only be around their average + 400 or so. So if I play against very strong opponents, my Elo will be wildly off: if the opponents are 2800 and I lose every game, I could be 2200 or 1000, and there is no way to tell. The same is true on the other end: if I play much weaker programs, then my rating will likely be well understated.

BubbaTough wrote:
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't care enough personally about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).
-Sam

Uri Blass wrote:
My point is not about the best Crafty but about the rating of the material-only Crafty, which is supposed to be 400 Elo weaker; I believe the difference is bigger.
So, for comparison, what to choose? No matter what, one of the two tests is going to be way off. I can't use two different groups of programs and then compare the Elos for Crafty with and without evaluation, as Elo between two disparate rating pools is meaningless.
In short, I can either play the weak Crafty against weak programs and then use the strong Crafty against the same group, getting an estimated Elo for the strong version that will be understated, or I can play against strong programs, where the weak Crafty's Elo will be overstated. There is no scientific solution other than to have a truly large rating pool to play both against, and that is computationally intractable...
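The edge effect bob describes falls directly out of the logistic Elo model: the rating difference implied by a match score saturates as the score approaches 100%. A small self-contained illustration (the standard Elo formula, not tied to any particular rating tool):

Code:
#include <math.h>
#include <stdio.h>

/* Implied Elo difference from an observed score fraction s (0 < s < 1),
   from the logistic model s = 1 / (1 + 10^(-d/400)). */
double elo_diff_from_score(double s) {
    return -400.0 * log10(1.0 / s - 1.0);
}

int main(void) {
    /* 90% already implies about +382; past that, the estimate explodes,
       and an all-win (100%) result gives no finite estimate at all. */
    const double scores[] = { 0.75, 0.90, 0.99, 0.999 };
    for (int i = 0; i < 4; i++)
        printf("score %5.1f%% -> %+.0f Elo\n", 100.0 * scores[i],
               elo_diff_from_score(scores[i]));
    return 0;
}

This is why a pool whose members sit within a couple hundred Elo of both versions is needed for a trustworthy comparison.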
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Hardware vs Software - test results
However, which of those four opponents would you believe Crafty is tactically out-searching? I believe they are all pretty equal from a search perspective, having played tens of millions of games against the group. So there is no "weaker opponent" it could stumble into a draw against and do worse, and there is no stronger opponent it can tactically out-search. So I guess I do not get the "point". See my post to Sam for further reasons why I believe any test will show a bias, except for one so large it is intractable for me to deal with.

Uri Blass wrote:
Yes.

BubbaTough wrote:
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't care enough personally about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).
-Sam

Uri Blass wrote:
My point is not about the best Crafty but about the rating of the material-only Crafty, which is supposed to be 400 Elo weaker; I believe the difference is bigger.

Not making progress is part of the problem.
Another problem is simply getting an inferior position and being killed positionally, where the fact that you see more than your opponent does not help, because you only see that you lose faster.
It happens in only part of the games, but when it does happen you can lose even against engines that are 1000 Elo weaker.
The point is that sometimes you may win or draw against relatively stronger engines because you see a big material win by search, and sometimes you can even lose against weak engines because you get a bad position and search only helps you see that you lose faster than your opponent.
Uri
- Posts: 1154
- Joined: Fri Jun 23, 2006 5:18 am
Re: Hardware vs Software - test results
I think you have to suck it up and play the two against different pools of opponents. Bad, I know... but probably not as bad as the alternative. If I were trying to rate a human I had seen beat masters at blitz, and also his little brother who had just learned how to move the pieces, I would suggest they enter different tournaments. There just are no other options. We are not really trying to measure 1-Elo differences in this case anyway; we are trying to measure within hundreds, so perfection is not required.

bob wrote:
I disagree, for one simple reason: Elo estimates are bad at the edges, and the math is quite easy to follow. For example, if I play opponents that are over 400 Elo below me, I will likely win every game, and my estimated Elo can only be around their average + 400 or so. So if I play against very strong opponents, my Elo will be wildly off: if the opponents are 2800 and I lose every game, I could be 2200 or 1000, and there is no way to tell. The same is true on the other end: if I play much weaker programs, then my rating will likely be well understated.

BubbaTough wrote:
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't care enough personally about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).
-Sam

Uri Blass wrote:
My point is not about the best Crafty but about the rating of the material-only Crafty, which is supposed to be 400 Elo weaker; I believe the difference is bigger.

So, for comparison, what to choose? No matter what, one of the two tests is going to be way off. I can't use two different groups of programs and then compare the Elos for Crafty with and without evaluation, as Elo between two disparate rating pools is meaningless.
In short, I can either play the weak Crafty against weak programs and then use the strong Crafty against the same group, getting an estimated Elo for the strong version that will be understated, or I can play against strong programs, where the weak Crafty's Elo will be overstated. There is no scientific solution other than to have a truly large rating pool to play both against, and that is computationally intractable...

Regarding how "scientific" this is... well, in wishy-washy sciences (like physics, psychology, and medicine) you often have to do much, much worse studies in order to draw conclusions using the same statistical tool set. They seem to muddle through.
-Sam
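Sam's "hundreds, not single digits" point can be made concrete: the statistical error on a measured score shrinks with the square root of the number of games, and it converts to an Elo error bar through the same logistic formula. A rough sketch (treats game results as independent and ignores the draw model, so it somewhat overstates the error):

Code:
#include <math.h>
#include <stdio.h>

/* Approximate 95% confidence half-width, in Elo, for an observed
   score s over n independent games, via the delta method applied to
   d(s) = -400 * log10(1/s - 1). */
double elo_halfwidth(double s, int n) {
    double se = sqrt(s * (1.0 - s) / n);                 /* std. error of s */
    double dd_ds = 400.0 / (log(10.0) * s * (1.0 - s));  /* |d'(s)| */
    return 1.96 * se * dd_ds;
}

int main(void) {
    /* Near 50%, a few hundred games pin the difference to tens of Elo;
       near the edges the same sample is much less informative. */
    printf("s=0.50, n=400: +/- %.0f Elo\n", elo_halfwidth(0.50, 400));
    printf("s=0.90, n=400: +/- %.0f Elo\n", elo_halfwidth(0.90, 400));
    return 0;
}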
- Posts: 10788
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Hardware vs Software - test results
If A is significantly stronger than B, then the only way to get an accurate rating for B is to run different tournaments in which everybody plays indirectly against the other players.

bob wrote:
I disagree, for one simple reason: Elo estimates are bad at the edges, and the math is quite easy to follow. For example, if I play opponents that are over 400 Elo below me, I will likely win every game, and my estimated Elo can only be around their average + 400 or so. So if I play against very strong opponents, my Elo will be wildly off: if the opponents are 2800 and I lose every game, I could be 2200 or 1000, and there is no way to tell. The same is true on the other end: if I play much weaker programs, then my rating will likely be well understated.

BubbaTough wrote:
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't care enough personally about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).
-Sam

Uri Blass wrote:
My point is not about the best Crafty but about the rating of the material-only Crafty, which is supposed to be 400 Elo weaker; I believe the difference is bigger.

So, for comparison, what to choose? No matter what, one of the two tests is going to be way off. I can't use two different groups of programs and then compare the Elos for Crafty with and without evaluation, as Elo between two disparate rating pools is meaningless.
In short, I can either play the weak Crafty against weak programs and then use the strong Crafty against the same group, getting an estimated Elo for the strong version that will be understated, or I can play against strong programs, where the weak Crafty's Elo will be overstated. There is no scientific solution other than to have a truly large rating pool to play both against, and that is computationally intractable...

I do not play directly against GMs; I play against players who are 100 Elo better, who in turn play against players who are 100 Elo better than them, and we continue in this way to get ratings.
The same idea works here.
Material-only Crafty may play against opponents that score 60-70% against it.
Those opponents may play against opponents that score 60-70% against them.
You continue in this way until the opponents can score 30-40% against default Crafty.
This is the only way to get a good estimate of a big rating difference.
Uri
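In code terms, Uri's proposal is to measure a chain of pairwise differences and add them up: each link is measured in the 60-70% region where Elo estimates behave well, and the links together bridge a gap no single match could measure reliably. A sketch with hypothetical per-link scores (the four-link chain and its numbers are invented for illustration):

Code:
#include <math.h>
#include <stdio.h>

/* Standard logistic conversion from a score fraction to an Elo gap. */
double elo_diff_from_score(double s) {
    return -400.0 * log10(1.0 / s - 1.0);
}

int main(void) {
    /* Hypothetical scores of the stronger side in each adjacent match:
       material-only Crafty -> A -> B -> C -> full Crafty. */
    const double link_scores[] = { 0.65, 0.68, 0.62, 0.66 };
    double total = 0.0;
    for (int i = 0; i < 4; i++)
        total += elo_diff_from_score(link_scores[i]);
    printf("estimated material-only vs. full gap: %.0f Elo\n", total);
    return 0;
}

The per-link sampling errors accumulate, so the chain needs many games per link, which is exactly the computational cost bob objects to below.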
- Posts: 10788
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Hardware vs Software - test results
Material-only Crafty may out-search Fruit 2.1.

bob wrote:
However, which of those four opponents would you believe Crafty is tactically out-searching? I believe they are all pretty equal from a search perspective, having played tens of millions of games against the group. So there is no "weaker opponent" it could stumble into a draw against and do worse, and there is no stronger opponent it can tactically out-search. So I guess I do not get the "point". See my post to Sam for further reasons why I believe any test will show a bias, except for one so large it is intractable for me to deal with.

Uri Blass wrote:
Yes.

BubbaTough wrote:
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't care enough personally about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).
-Sam

Uri Blass wrote:
My point is not about the best Crafty but about the rating of the material-only Crafty, which is supposed to be 400 Elo weaker; I believe the difference is bigger.

Not making progress is part of the problem.
Another problem is simply getting an inferior position and being killed positionally, where the fact that you see more than your opponent does not help, because you only see that you lose faster.
It happens in only part of the games, but when it does happen you can lose even against engines that are 1000 Elo weaker.
The point is that sometimes you may win or draw against relatively stronger engines because you see a big material win by search, and sometimes you can even lose against weak engines because you get a bad position and search only helps you see that you lose faster than your opponent.
Uri

Remember that a material-only evaluation is faster than the normal evaluation, so material-only Crafty may search deeper than normal Crafty.
I agree with you that testing against significantly stronger opponents is generally not a good way to estimate a rating difference, but I read that material-only Crafty scored 9% against the opponents, and I believe that normal programs of similar strength (to material-only Crafty) score less than that.
Uri
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Hardware vs Software
No learning of any kind is enabled... and the engines are restarted for each individual game, so no internal learning or hash data is carried over either...

CRoberson wrote:
Yes, I forgot about that. What about position learning? Is that on or off? If on, how about turning it off - I think learning would skew the results.

bob wrote:
I never use a book. I always start from about 4,000 unique starting positions and play the games out from both sides (hence the 8,000 games per opponent).

CRoberson wrote:
Ok, here is another test. No book. Combine that with the full Crafty and the raw material Crafty.

bob wrote:
I ran this overnight. I simply made Evaluate() return the material score only. It was almost exactly a 400 point drop in Elo from the version with the most recent evaluation.

Code:
Crafty-22.9R01   2650    5    5   31128   51%   2644   21%
Crafty-22.9R02   2261    5    6   31128    9%   2644    7%
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Hardware vs Software
I use a contempt factor of 0.00 all the time. I see no reason to tune one program differently than another...

Karlo Bala wrote:
Do you use a contempt factor in the stronger Crafty? I think that with a bigger contempt the result will be different.

bob wrote:
I never use a book. I always start from about 4,000 unique starting positions and play the games out from both sides (hence the 8,000 games per opponent).

CRoberson wrote:
Ok, here is another test. No book. Combine that with the full Crafty and the raw material Crafty.

bob wrote:
I ran this overnight. I simply made Evaluate() return the material score only. It was almost exactly a 400 point drop in Elo from the version with the most recent evaluation.

Code:
Crafty-22.9R01   2650    5    5   31128   51%   2644   21%
Crafty-22.9R02   2261    5    6   31128    9%   2644    7%
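For readers unfamiliar with the term: a contempt factor shifts the score the engine assigns to a drawn position, so positive contempt makes it avoid draws against opponents it presumes weaker. A generic sketch of the usual mechanism, not Crafty's actual implementation (with bob's 0.00 setting it is a no-op):

Code:
/* Generic contempt sketch -- NOT Crafty's actual code.
   With contempt > 0 the engine scores a draw as slightly bad for
   itself and plays on; with contempt = 0 (bob's setting) a draw
   is scored as exactly equal. Units: centipawns, side to move. */
int DrawScore(int contempt) {
    return -contempt;
}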
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Hardware vs Software - test results
It's not just bad, it is no good. You can't compare the ratings whatsoever if you play group A and group B. The statistics are no good there, so you could not compare eval to no-eval and have any idea at all of how much was lost by stripping the eval out.

BubbaTough wrote:
I think you have to suck it up and play the two against different pools of opponents. Bad, I know... but probably not as bad as the alternative. If I were trying to rate a human I had seen beat masters at blitz, and also his little brother who had just learned how to move the pieces, I would suggest they enter different tournaments. There just are no other options. We are not really trying to measure 1-Elo differences in this case anyway; we are trying to measure within hundreds, so perfection is not required.

bob wrote:
I disagree, for one simple reason: Elo estimates are bad at the edges, and the math is quite easy to follow. For example, if I play opponents that are over 400 Elo below me, I will likely win every game, and my estimated Elo can only be around their average + 400 or so. So if I play against very strong opponents, my Elo will be wildly off: if the opponents are 2800 and I lose every game, I could be 2200 or 1000, and there is no way to tell. The same is true on the other end: if I play much weaker programs, then my rating will likely be well understated.

BubbaTough wrote:
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't care enough personally about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).
-Sam

Uri Blass wrote:
My point is not about the best Crafty but about the rating of the material-only Crafty, which is supposed to be 400 Elo weaker; I believe the difference is bigger.

So, for comparison, what to choose? No matter what, one of the two tests is going to be way off. I can't use two different groups of programs and then compare the Elos for Crafty with and without evaluation, as Elo between two disparate rating pools is meaningless.
In short, I can either play the weak Crafty against weak programs and then use the strong Crafty against the same group, getting an estimated Elo for the strong version that will be understated, or I can play against strong programs, where the weak Crafty's Elo will be overstated. There is no scientific solution other than to have a truly large rating pool to play both against, and that is computationally intractable...

Regarding how "scientific" this is... well, in wishy-washy sciences (like physics, psychology, and medicine) you often have to do much, much worse studies in order to draw conclusions using the same statistical tool set. They seem to muddle through.
-Sam
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Hardware vs Software - test results
I understand exactly how the math works here. And I am not going to undertake an intractable problem to answer a question that is hardly that important in the first place. I really don't care how much the eval influences the rating; my first chess program in 1968 had an evaluation, as does today's Crafty, so that is a highly uninteresting point to me. I did it because Charles asked me to. But I would need a range of players in this mix from significantly above Crafty to significantly below Crafty, and that makes the test take too long for such a small return...

Uri Blass wrote:
If A is significantly stronger than B, then the only way to get an accurate rating for B is to run different tournaments in which everybody plays indirectly against the other players.

bob wrote:
I disagree, for one simple reason: Elo estimates are bad at the edges, and the math is quite easy to follow. For example, if I play opponents that are over 400 Elo below me, I will likely win every game, and my estimated Elo can only be around their average + 400 or so. So if I play against very strong opponents, my Elo will be wildly off: if the opponents are 2800 and I lose every game, I could be 2200 or 1000, and there is no way to tell. The same is true on the other end: if I play much weaker programs, then my rating will likely be well understated.

BubbaTough wrote:
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't care enough personally about measuring this to actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).
-Sam

Uri Blass wrote:
My point is not about the best Crafty but about the rating of the material-only Crafty, which is supposed to be 400 Elo weaker; I believe the difference is bigger.

So, for comparison, what to choose? No matter what, one of the two tests is going to be way off. I can't use two different groups of programs and then compare the Elos for Crafty with and without evaluation, as Elo between two disparate rating pools is meaningless.
In short, I can either play the weak Crafty against weak programs and then use the strong Crafty against the same group, getting an estimated Elo for the strong version that will be understated, or I can play against strong programs, where the weak Crafty's Elo will be overstated. There is no scientific solution other than to have a truly large rating pool to play both against, and that is computationally intractable...

I do not play directly against GMs; I play against players who are 100 Elo better, who in turn play against players who are 100 Elo better than them, and we continue in this way to get ratings.
The same idea works here.
Material-only Crafty may play against opponents that score 60-70% against it.
Those opponents may play against opponents that score 60-70% against them.
You continue in this way until the opponents can score 30-40% against default Crafty.
This is the only way to get a good estimate of a big rating difference.
Uri