Hardware vs Software


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software

Post by bob »

Karlo Bala wrote:
bob wrote:
Karlo Bala wrote:
bob wrote:
CRoberson wrote:
bob wrote: I ran this overnight. I simply made Evaluate() return the material score only. It was almost exactly a 400-point drop in Elo from the version with the most recent evaluation.

Code: Select all

Crafty-22.9R01     2650    5    5 31128   51%  2644   21% 
Crafty-22.9R02     2261    5    6 31128    9%  2644    7% 
Ok, here is another test. No book. Combine that with the full Crafty
and the raw material Crafty.
I never use a book. I always start from about 4,000 unique starting positions and play the games out from both sides (hence the 8,000 games per opponent).
Do you use a contempt factor in the stronger Crafty? I think that with a bigger contempt factor the result would be different
I use a contempt factor of 0.00 all the time. I see no reason to tune one program differently than another...
...to avoid draws caused by repetition, which are undesirable for the stronger Crafty
The more you change from one run to another, the more likely you will draw wrong conclusions. I'm not playing any opponents so weak I consider a draw to be bad, nor any opponents so strong that I consider a draw some sort of moral victory. Hence draw = 0.00
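
For concreteness, a material-only Evaluate() of the kind tested above might look roughly like the sketch below. The piece values and the piece-count layout are illustrative assumptions, not Crafty's actual code:

Code: Select all

/* Material-only evaluation: counts[side][piece] with piece indices
   1=pawn ... 5=queen.  Values and layout are illustrative only. */
static const int piece_value[7] = { 0, 100, 300, 300, 500, 900, 0 };

int EvaluateMaterialOnly(const int counts[2][7], int side_to_move)
{
  int score = 0, piece;

  for (piece = 1; piece <= 5; piece++)          /* pawns through queens */
    score += piece_value[piece] *
             (counts[0][piece] - counts[1][piece]);

  /* score from the side-to-move's point of view (0 = white) */
  return (side_to_move == 0) ? score : -score;
}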
Pablo Vazquez
Posts: 154
Joined: Thu May 31, 2007 9:05 pm
Location: Madrid, Spain

Re: Hardware vs Software

Post by Pablo Vazquez »

Hi Bob,

If you have the time, could you please test Crafty with an evaluation consisting of material + (PST and/or mobility) + king centralization in the endgame?

Thanks
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software

Post by bob »

Pablo Vazquez wrote: Hi Bob,

If you have the time, could you please test Crafty with an evaluation consisting of material + (PST and/or mobility) + king centralization in the endgame?

Thanks
I will try, but that is considerably harder, since it would involve commenting out code all over the place...
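
For what it's worth, the king-centralization part of such a test could be sketched as below. The position layout, the endgame test, and the weights are assumptions made up for illustration (a PST or mobility term would be added in the same additive style); this is not Crafty's code:

Code: Select all

#include <stdlib.h>

/* Hypothetical, simplified position: piece counts per side plus king
   squares.  This is NOT Crafty's data layout, just enough for the idea. */
typedef struct {
  int counts[2][7];     /* [side][piece], 1=P .. 5=Q */
  int king_sq[2];       /* 0..63, a1 = 0 */
  int side_to_move;     /* 0 = white, 1 = black */
} SimplePos;

static const int value[7] = { 0, 100, 300, 300, 500, 900, 0 };

/* crude centralization bonus: larger the closer the king is to the center */
static int Centralization(int sq)
{
  int file = sq & 7, rank = sq >> 3;
  int fd = abs(2 * file - 7), rd = abs(2 * rank - 7);   /* each 1..7 */
  return 14 - fd - rd;                                   /* 0 (corner) .. 12 (center) */
}

int EvaluateSimple(const SimplePos *p)
{
  int score = 0, pc;

  for (pc = 1; pc <= 5; pc++)                  /* material term */
    score += value[pc] * (p->counts[0][pc] - p->counts[1][pc]);

  /* "endgame" here simply means both queens are off (an assumption) */
  if (p->counts[0][5] == 0 && p->counts[1][5] == 0) {
    score += 3 * Centralization(p->king_sq[0]);
    score -= 3 * Centralization(p->king_sq[1]);
  }
  return p->side_to_move == 0 ? score : -score;
}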
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Hardware vs Software

Post by Don »

bob wrote:
Uri Blass wrote:
CRoberson wrote:While at the 2008 ACCA Pan American Computer Chess Championships,
Bob claimed he didn't believe software played a serious role in all the
rating improvements we've seen. He thought hardware deserved the
credit (assuming I understood the statement correctly. We were jumping
across several subjects and back that night.).

I believe software has had much to do with it for several reasons.
I will start with one. The EBF with only MiniMax is 40. With Alpha-Beta
pruning, it drops to 6. In the early 1990's, the EBF was 4. Now, it is 2.

Dropping the EBF from 4 to 2 is huge. Let's look at a 20-ply search.
The speedup of EBF=2 vs EBF=4 is:
4^20/2^20 = 2^20 = 1,048,576

So that is over a 1,000,000x speedup. Has hardware produced that much since 1992?

Also, I believe eval improvements have caused an improvement in
rating scores.

An example of non-hardware improvement is on the SSDF rating list:
Rybka 1.0 beta scored 2775 on a 450 MHz AMD.

Branching factor proves nothing, because programs that do more pruning play weaker at fixed depth. But I can say, not based on branching factor, that the improvement in software in the last years is very big and bigger than the improvement in hardware (not sure about improvement since 1992, because it is not clear how we define it, but sure about the improvement from 2005 to 2008).

Note that the tests of Bob can show only that hardware helped more than software for Crafty.

Tests of the SSDF showed the following results.

Rybka 3 A1200 - Deep Shredder 11 Q6600  20-19
Rybka 3 A1200 - Zappa Mexico II Q6600  20-20


A1200 = 1 x 1.2 GHz
Q6600 = 4 x 2.4 GHz

Note that both Zappa and Shredder are clearly stronger than Fruit, which was the leading program for single-processor machines in 2005.

I think that we can safely say that the software improvement in the last 3 years was more than 10:1, and I do not see a hardware improvement of 10:1 in the last 3 years.

Uri
Can we stop with the amateurish comparisons? Why pick different programs to compare? We had better and worse programs in 1995 as well. The question was, what have the software advances actually produced?

Feel free to name significant ones. Most would put null-move at the top, and LMR right behind it. Then we could factor in razoring, futility and extended futility from Heinz, but I already know those are very small improvements, from testing done by turning each off a month or so back.

What else is _significant_? Evaluation is not so interesting. I was doing passed pawn races in the 1970's, and outside passed pawns in 1995 in Crafty. So what _new_ thing since 1995 is such a big contributor? No hand-waving, no talking about ideas that _might_ have been developed. Actual documented techniques...
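
For readers unfamiliar with the first technique named above, null-move pruning in an alpha-beta search looks roughly like the sketch below. The Board type, helper functions and the R = 3 reduction are placeholders for illustration, not Crafty's implementation:

Code: Select all

/* Bare-bones null-move pruning in an alpha-beta search (sketch only). */
typedef struct Board Board;           /* opaque; a real engine defines this */
extern int  Quiesce(Board *b, int alpha, int beta);
extern int  InCheck(const Board *b);
extern int  SideToMoveHasPieces(const Board *b);   /* avoid pawn-only zugzwang */
extern void MakeNullMove(Board *b);
extern void UnmakeNullMove(Board *b);

#define R 3                           /* null-move depth reduction */

int Search(Board *b, int alpha, int beta, int depth)
{
  if (depth <= 0)
    return Quiesce(b, alpha, beta);

  /* Give the opponent a free move.  If a reduced-depth search still
     fails high, the real position is almost certainly >= beta too. */
  if (!InCheck(b) && SideToMoveHasPieces(b)) {
    int score;
    MakeNullMove(b);
    score = -Search(b, -beta, -beta + 1, depth - 1 - R);
    UnmakeNullMove(b);
    if (score >= beta)
      return beta;                    /* fail-hard cutoff */
  }

  /* ... normal move generation and alpha-beta loop would follow ... */
  return alpha;
}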
I have not read this whole thread, it is a large one. But I don't even see how this is arguable. If someone wants to argue it, just bring back the old machines and see how modern chess programs do on them. You could use emulators, but you would have to throttle down their speed and memory to match what they used to be. I think everyone would see that their programs are hundreds of Elo weaker, even if the software is a couple of hundred Elo stronger. The hardware is by far going to be what caused the most gain.

I don't think Bob claimed that software hasn't advanced, only that it's not responsible for most of the gain. This is not even arguable.

I would not be surprised if software put on ancient hardware was not even stronger (or not much stronger), because things like null move and LMR require some depth to really show a benefit.

I assume hash tables are not part of this discussion - I think that was a pretty big advance too, and that clearly didn't become possible until hardware (memory) was there to support it. I think these other enhancements are like that, not possible or at least not very effective until the hardware was up to it. That gives us much to argue about - was it the hardware or the software?

But go back in a time machine with what you know now, and I would bet that you could not write a significantly better program. Better, perhaps, but nothing you couldn't get by waiting a couple of years for computers to double in speed.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Hardware vs Software - test results

Post by Don »

bob wrote: I have even run without the check extension, the only one that is left. Many think the purpose of this extension is to find deep mates. That's wrong. The purpose is to try to expose horizon-effect moves and avoid them when possible. I also want to test a restricted check extension, where it is only applied in the last N plies, where the horizon effect is most notable; right now I apply it everywhere with no limit, always adding +1 ply to the depth.
I'm pretty interested in this. Please let us know what you find and I'll probably do some tests of my own. Checks are real speed killers.

I doubt this is anything new but here is what I am looking at:

Right now I am considering the possibility of throwing out losing checks except on the last ply. At least that is an attempt at something a bit more intelligent and I believe in the middle game most checks are losing moves like [Q|B]xf7+ followed by Kxf7. This doesn't address endless series of non-losing checks in the endgame or otherwise so I'm still thinking about how to handle this. Perhaps not allow more than 1 check in a row by the same piece?
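
A rough sketch of the ideas in play here, extending checks only near the horizon and throwing out "losing" checks (as judged by a static exchange evaluator, SEE) except on the last ply, might look like this. Every name and threshold below is an assumption for illustration, not anyone's actual code:

Code: Select all

typedef struct Board Board;
extern int SEE(const Board *b, int move);        /* < 0: move loses material */
extern int GivesCheck(const Board *b, int move);

#define PRUNE  (-1)
#define KEEP     0
#define EXTEND   1

#define CHECK_EXT_HORIZON 3     /* extend checks only in the last N plies */

int ClassifyCheckingMove(const Board *b, int move, int depth)
{
  if (!GivesCheck(b, move))
    return KEEP;
  if (SEE(b, move) < 0 && depth > 1)
    return PRUNE;               /* losing check, not on the last ply */
  if (depth <= CHECK_EXT_HORIZON)
    return EXTEND;              /* near the horizon: extend by one ply */
  return KEEP;                  /* deep in the tree: search normally */
}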

Deep mates do not have a huge impact on strength, because if you have a deep mate, you already have a won game - the main issue is whether you are throwing away a win because you were not opportunistic enough. The other side of the coin is similar: if your opponent has a deep mate against you, he probably has a win even if he misses the mate. Not always, of course, but usually. You get extra depth as a trade-off, so missing these deep mates is certainly not all downside.

Over the years I have tended to be more about pruning and less about extending. It simplifies things in a lot of ways. It may not be that horrible to NEVER extend checks - they effectively get extended anyway with heavy LMR and pruning because they never get considered for reductions.

Extending anything has a minor side-effect which is bad: it is used by the search as a device to opportunistically conceal something, even if just positionally. It is used to change the "parity" of the search - and you are returning scores from odd and even depths, which are not 100% compatible. So sometimes a program will throw in a check in order to get to make the last move, for instance. If your having-the-move bonus is too high, it might do just the opposite: throw in the check in order to get the bonus.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Hardware vs Software

Post by Don »

Tord Romstad wrote: By the way, it's amusing to see that LMR is now generally accepted as effective. Back when I started advocating it, the technique had been largely abandoned for many years, and those few programmers I managed to convince to give it a try mostly reported that it didn't work for them.
Tord
I think a LOT of techniques that are probably workable have been tried and then abandoned. If I have learned anything it is that if you THINK you have a good idea, don't give up on it easily.

Many years ago, I stumbled upon the basic principle even though it wasn't called LMR or anything back then. I made some versions and experimented and tested them. The speedup was amazing and it could solve problems and I sensed that it could be very promising. But I eventually gave up on it because I only spent a couple of days on it and finally concluded that you cannot cheat the Gods. I'm still kicking myself for that one. Back then you couldn't test with thousands or even hundreds of games easily.

I have another anecdote that is pretty funny. Years before I knew anything about null move pruning, I was explaining to my father how the chess program worked. He is a weak chess player and knows nothing about programming, but in the context of a discussion about threats, he said, "why not try 2 moves in a row just to see if there is a threat?" I dismissed this idea right away. I'm kicking myself for that one too; he had just invented the null move idea and I didn't recognize it.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Hardware vs Software - test results

Post by Don »

bob wrote: I disagree, for one simple reason. Elo ranges are bad on the edges, and the math is quite easy to follow. For example, if I play opponents that are over 400 Elo below me, I will likely win every game, and my estimated Elo can only be around their average + 400 or so. So if I play against very strong opponents, my Elo will be wildly off: if the opponents are 2800 and I lose every game, I could be 2200 or 1000, and there is no way to tell. The same is true on the other end. If I play much weaker programs, then my rating will likely be well understated.

So, for comparison, what to choose? No matter what, one of the two tests is going to be way off. I can't use two different groups of programs and then compare the Elos for Crafty with and without evaluation, as the Elo between two disparate rating pools is meaningless.

In short, I can either play weak crafty against weak programs, and then use strong crafty vs the same group and get an estimated Elo for the strong version that will be understated, or else I can play against strong programs, where the weak crafty will be overstated in its Elo. There is no scientific solution other than to have a truly large rating pool to play both against, and that is computationally intractable...
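
The edge effect described above follows from the standard logistic Elo expected-score formula. A small, self-contained illustration (nothing here is specific to any particular rating tool or tester):

Code: Select all

#include <stdio.h>
#include <math.h>

/* expected score of player A against player B under the logistic Elo model */
static double expected_score(double r_a, double r_b)
{
  return 1.0 / (1.0 + pow(10.0, (r_b - r_a) / 400.0));
}

int main(void)
{
  /* At +400 the expected score is already ~0.91 and it flattens out
     quickly beyond that, so results against much weaker (or much
     stronger) opponents carry almost no rating information. */
  printf("+200: %.3f\n", expected_score(2600, 2400));   /* ~0.760 */
  printf("+400: %.3f\n", expected_score(2600, 2200));   /* ~0.909 */
  printf("+600: %.3f\n", expected_score(2600, 2000));   /* ~0.969 */
  printf("+800: %.3f\n", expected_score(2600, 1800));   /* ~0.990 */
  return 0;
}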
What I do is to have "bridge" players so that everyone has opponents to play that are not too badly mismatched. My own tester supports this by favoring matches between similarly rated players - but probabilistically any match is still possible. This is much more efficient (for the reasons you stated) when testing a wide variety of strengths.

My tester ensures that any 2 players get a new opening and both play the white side of it. The openings are shallow and I have about 8000 of them so I can play about 16,000 games between any 2 players, because I still want to test positions close to the opening.
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: Hardware vs Software

Post by mhull »

Don wrote:
I have not read this whole thread, it is a large one. But I don't even see how this is arguable. If someone wants to argue it, just bring back the old machines and see how modern chess programs do on them. You could use emulators, but you would have to throttle down their speed and memory to match what they used to be. I think everyone would see that their programs are hundreds of Elo weaker, even if the software is a couple of hundred Elo stronger. The hardware is by far going to be what caused the most gain.

I don't think Bob claimed that software hasn't advanced, only that it's not responsible for most of the gain. This is not even arguable.

I would not be surprised if software put on ancient hardware was not even stronger (or not much stronger), because things like null move and LMR require some depth to really show a benefit.

I assume hash tables are not part of this discussion - I think that was a pretty big advance too, and that clearly didn't become possible until hardware (memory) was there to support it. I think these other enhancements are like that, not possible or at least not very effective until the hardware was up to it. That gives us much to argue about - was it the hardware or the software?

But go back in a time machine with what you know now, and I would bet that you could not write a significantly better program. Better, perhaps, but nothing you couldn't get by waiting a couple of years for computers to double in speed.
I recently junked a working 90 MHz Pentium (586) with 40 MB of RAM (a circa-1995 system), with Red Hat 7.x installed. That would have been a nice platform for testing Linux-based programs.

And maybe one could have built a new Gentoo system for that hardware that would even support the latest gcc compiler, then run period Crafty versions against the newest versions, or against current Fruit, Glaurung, etc.
Matthew Hull
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software - test results

Post by bob »

Don wrote:
What I do is to have "bridge" players so that everyone has opponents to play that are not too badly mismatched. My own tester supports this by favoring matches between similarly rated players - but probabilistically any match is still possible. This is much more efficient (for the reasons you stated) when testing a wide variety of strengths.

My tester ensures that any 2 players get a new opening and both play the white side of it. The openings are shallow and I have about 8000 of them so I can play about 16,000 games between any 2 players, because I still want to test positions close to the opening.
My only problem is that I have no reason to have opponents that are over 400 points below Crafty, as the data from those games is worthless. So in the test I was asked to do, the "dumb Crafty" rating won't be accurate, and I don't really have (a) any good way to pick opponents that are that weak, or (b) the time or interest in doing so, since it would only be useful for this specific experiment.

Otherwise my testing scheme is pretty well known, playing 2 games per position per opponent so that any asymmetric positions will "even out" if they favor one side more than the other.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software

Post by bob »

Don wrote:
I think a LOT of techniques that are probably workable have been tried and then abandoned. If I have learned anything it is that if you THINK you have a good idea, don't give up on it easily.

Many years ago, I stumbled upon the basic principle even though it wasn't called LMR or anything back then. I made some versions and experimented and tested them. The speedup was amazing and it could solve problems and I sensed that it could be very promising. But I eventually gave up on it because I only spent a couple of days on it and finally concluded that you cannot cheat the Gods. I'm still kicking myself for that one. Back then you couldn't test with thousands or even hundreds of games easily.

I have another anecdote that is pretty funny. Years before I knew anything about null move pruning, I was explaining to my father how the chess program worked. He is a weak chess player and knows nothing about programming, but in the context of a discussion about threats, he said, "why not try 2 moves in a row just to see if there is a threat?" I dismissed this idea right away. I'm kicking myself for that one too; he had just invented the null move idea and I didn't recognize it.
Bruce Moreland and I played with the concept of LMR in 1996, prior to the WMCCC event that year. We both liked the speed, but we did not spend the necessary time to impose any limits on what was reduced. As a result, predictably, it looked good but played badly. Of course the depths back then were half of what they are today, which makes a big difference in everything. But we eventually discarded the idea, and I never thought to experiment further until everyone started talking about "history pruning", as it was known in Fruit...
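
For context, the restrictions that later made LMR work in practice are usually along these lines. The exact conditions and reduction amounts vary from program to program; everything in this sketch is an illustrative assumption, not Crafty's or Fruit's code:

Code: Select all

typedef struct Board Board;
extern int InCheck(const Board *b);
extern int IsCaptureOrPromotion(const Board *b, int move);
extern int GivesCheck(const Board *b, int move);
extern int IsKillerMove(const Board *b, int move);

/* how many plies to reduce the depth for this move (0 = no reduction) */
int LMRReduction(const Board *b, int move, int move_number, int depth)
{
  /* never reduce the first few moves, tactical moves, or shallow searches */
  if (move_number < 4 || depth < 3)
    return 0;
  if (InCheck(b) || GivesCheck(b, move) || IsCaptureOrPromotion(b, move))
    return 0;
  if (IsKillerMove(b, move))
    return 0;

  /* reduce later, quieter moves more aggressively */
  return (move_number < 12) ? 1 : 2;
}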