CRoberson wrote:While at the 2008 ACCA Pan American Computer Chess Championships, Bob claimed he didn't believe software played a serious role in all the rating improvements we've seen. He thought hardware deserved the credit (assuming I understood the statement correctly; we were jumping back and forth across several subjects that night).
I believe software has had much to do with it, for several reasons. I will start with one. The EBF with only minimax is about 40. With alpha-beta pruning, it drops to 6. In the early 1990s, the EBF was 4. Now it is 2.
Dropping the EBF from 4 to 2 is huge. Let's look at a 20-ply search. The speedup of EBF=2 vs. EBF=4 is:
4^20 / 2^20 = 2^20 = 1,048,576
So that is a more than one-million-fold speedup. Has hardware produced that much since 1992?
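To make that arithmetic concrete, here is a small back-of-the-envelope check. This is a generic sketch, assuming only that node counts grow roughly as EBF^depth; nothing engine-specific is involved:

#include <math.h>
#include <stdio.h>

/* Node counts grow roughly as EBF^depth, so the equivalent "speedup" of
   lowering the EBF at a fixed depth is the ratio of the two node counts. */
int main(void) {
    int depth = 20;
    double ratio = pow(4.0, depth) / pow(2.0, depth);
    printf("EBF 4 vs EBF 2 at depth %d: %.0fx\n", depth, ratio); /* 1048576x */
    return 0;
}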
Also, I believe eval improvements have contributed to the rating gains.
An example of non-hardware improvement is on the SSDF rating list: Rybka 1.0 Beta scored 2775 on a 450 MHz AMD.
Branching factor proves nothing, because programs that do more pruning play weaker at fixed depth. But I can say, not based on branching factor, that the improvement in software in the last years is very big and bigger than the improvement in hardware (I am not sure about the improvement since 1992, because it is not clear how we define it, but I am sure about the improvement from 2005 to 2008).
Note that Bob's tests can only show that hardware helped more than software for Crafty.
Tests by the SSDF showed the following results:
Rybka 3 A1200 - Deep Shredder 11 Q6600: 20-19
Rybka 3 A1200 - Zappa Mexico II Q6600: 20-20
A1200 = 1 x 1.2 GHz
Q6600 = 4 x 2.4 GHz
Note that both Zappa and Shredder are clearly stronger than Fruit, which was the leading program for single-processor machines in 2005.
I think we can safely say that the software improvement in the last 3 years was more than 10:1, and I do not see a 10:1 hardware improvement in the last 3 years.
Uri
Can we stop with the amateurish comparisons? Why pick different programs to compare? We had better and worse programs in 1995 as well. The question was: what have the software advances actually produced?
Feel free to name significant ones. Most would put null-move at the top, and LMR right behind it. Then we could factor in razoring, futility and extended futility from Heinz, but I already know those are very small improvements, from testing done by turning each off a month or so back.
What else is _significant_? Evaluation is not so interesting. I was doing passed pawn races in the 1970s, and outside passed pawns in 1995 in Crafty. So what _new_ thing since 1995 is such a big contributor? No hand-waving, no talking about ideas that _might_ have been developed. Actual documented techniques...
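For readers who have not seen them, here is a minimal sketch of the two techniques just named, null-move pruning and LMR, grafted onto a bare negamax skeleton. The evaluation and move generator are stubs purely so the control flow compiles and runs; none of this is Crafty's actual code, and the constants (R=3, reducing after the first three moves) are only illustrative.

#include <stdio.h>

#define INF 100000

/* Stub "engine": a real program supplies these. */
static int evaluate(void)  { return 0; }   /* static eval of current position */
static int num_moves(void) { return 0; }   /* legal moves in current position */
static void make_move(int i)   { (void)i; }
static void unmake_move(int i) { (void)i; }
static void make_null(void)    { }
static void unmake_null(void)  { }

static int search(int depth, int alpha, int beta, int null_ok) {
    if (depth <= 0)
        return evaluate();                 /* stand-in for a real qsearch */

    /* Null-move pruning: hand the opponent a free move and search reduced
       by R=3; if we still fail high, assume this node is a cutoff. */
    if (null_ok && depth > 3) {
        make_null();
        int score = -search(depth - 1 - 3, -beta, -beta + 1, 0);
        unmake_null();
        if (score >= beta)
            return beta;
    }

    int n = num_moves();
    if (n == 0)
        return evaluate();                 /* stub: no mate/stalemate logic */

    for (int i = 0; i < n; i++) {
        make_move(i);
        int new_depth = depth - 1;
        /* LMR: late, presumably poor, moves are searched one ply shallower. */
        if (i >= 3 && depth >= 3)
            new_depth--;
        int score = -search(new_depth, -beta, -alpha, 1);
        /* If a reduced move unexpectedly beats alpha, re-search at full depth. */
        if (score > alpha && new_depth < depth - 1)
            score = -search(depth - 1, -beta, -alpha, 1);
        unmake_move(i);
        if (score >= beta)
            return beta;
        if (score > alpha)
            alpha = score;
    }
    return alpha;
}

int main(void) {
    printf("score = %d\n", search(10, -INF, INF, 1));
    return 0;
}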
I have not read this whole thread, it is a large one. But I don't even see how this is arguable. If someone wants to argue it, just bring back the old machines and see how modern chess programs do on them. You could use emulators, but you would have to throttle down their speed and memory to match what they used to be. I think everyone would see that their programs are hundreds of Elo weaker, even if the software is a couple of hundred Elo stronger. The hardware is by far going to be what caused the most gain.
I don't think Bob claimed that software hasn't advanced, only that it's not responsible for most of the gain. This is not even arguable.
I would not be surprised if modern software put on ancient hardware were not even stronger (or much stronger), because things like null move and LMR require some depth to really show a benefit.
I assume hash tables are not part of this discussion - I think that was a pretty big advance too, and it clearly didn't become possible until the hardware (memory) was there to support it. I think these other enhancements are like that: not possible, or at least not very effective, until the hardware was up to it. That gives us much to argue about - was it the hardware or the software?
Neither! That is a synergistic effect.
Miguel
But go back in a time machine with what you know now, and I would bet that you could not write a significantly better program. Better, perhaps, but nothing you couldn't get by waiting a couple of years for computers to double in speed.
bob wrote:
I have even run without the check extension, the only one that is left. Many think the purpose of this extension is to find deep mates. That's wrong. The purpose is to try to expose horizon-effect moves and avoid them when possible. I also want to test a restricted check extension, where it is only applied in the last N plies, where the horizon effect is most notable. Right now I apply it everywhere with no limit, always adding one ply to the depth.
I'm pretty interested in this. Please let us know what you find and I'll probably do some tests of my own. Checks are real speed killers.
I doubt this is anything new but here is what I am looking at:
Right now I am considering the possibility of throwing out losing checks except on the last ply. At least that is an attempt at something a bit more intelligent, and I believe that in the middlegame most checks are losing moves, like [Q|B]xf7+ followed by Kxf7. This doesn't address an endless series of non-losing checks in the endgame or otherwise, so I'm still thinking about how to handle that. Perhaps not allowing more than one check in a row by the same piece?
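A sketch of how that condition might look. This is just my reading of the idea above, not anyone's actual engine code: SEE decides whether a checking move "loses", and the last ply before the horizon is exempt.

#include <stdio.h>

/* Hypothetical filter: keep a checking move if we are on the last ply
   before the horizon, otherwise only if SEE says it does not lose material.
   Qxf7+ answered by Kxf7 comes back from SEE as a queen-for-pawn loss and
   would be thrown out everywhere except that last ply. */
static int keep_checking_move(int see_score, int remaining_depth) {
    if (remaining_depth <= 1)
        return 1;
    return see_score >= 0;
}

int main(void) {
    printf("%d\n", keep_checking_move(-800, 5));  /* losing check, pruned: 0 */
    printf("%d\n", keep_checking_move(-800, 1));  /* last ply, kept: 1 */
    return 0;
}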
Deep mates do not have a huge impact on strength, because if you have a deep mate, you already have a won game - the main issue is whether you are throwing away a win because you were not opportunistic enough. The other side of the coin is similar: if your opponent has a deep mate against you, he probably has a win even if he misses the mate. Not always, of course, but usually. You get extra depth as a trade-off, so missing these deep mates is certainly not all downside.
Over the years I have tended to be more about pruning and less about extending. It simplifies things in a lot of ways. It may not be that horrible to NEVER extend checks - they effectively get extended anyway with heavy LMR and pruning because they never get considered for reductions.
Extending anything has a minor side effect which is bad: it is used by the search as a device to opportunistically conceal something, even if just positionally. It is used to change the "parity" of the search - you end up comparing scores returned from odd and even depths, which are not 100% compatible. So sometimes a program will throw in a check in order to get to make the last move, for instance. If your having-the-move bonus is too high, it might do just the opposite and throw in the check in order to get the bonus.
I am not sure it is bad. Both sides have the tools to play this trick. Whichever side "wins" the fight for that bonus demonstrates that the potential resources of the position are in its favor. I think that is related to what we chess players call initiative: the ability to dictate the pace and direction of the game.
bob wrote:
I disagree, for one simple reason. Elo ranges are bad on the edges, and the math is quite easy to follow. For example, if I play opponents that are over 400 Elo below me, I will likely win every game, and my estimated Elo can only be around their average + 400 or so. So if I play against very strong opponents, my Elo will be wildly off: if the opponents are 2800 and I lose every game, I could be 2200 or 1000, and there is no way to tell. The same is true on the other end: if I play much weaker programs, then my rating will likely be well understated.
So, for comparison, what do we choose? No matter what, one of the two tests is going to be way off. I can't use two different groups of programs and then compare the Elos for Crafty with and without evaluation, as Elo values from two disparate rating pools are meaningless to compare.
In short, I can either play the weak Crafty against weak programs and then use the strong Crafty vs. the same group, getting an estimated Elo for the strong version that will be understated, or else I can play against strong programs, where the weak Crafty's Elo will be overstated. There is no scientific solution other than to have a truly large rating pool to play both against, and that is computationally intractable...
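For reference, the standard Elo expectation formula makes the saturation explicit. With a rating difference D, the expected score is

E = 1 / (1 + 10^(-D/400))

so D = 400 already predicts about a 91% score and D = 800 about 99%. A 100% result in a finite match therefore only tells you the gap is "at least roughly 400", and nothing more.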
What I do is to have "bridge" players so that everyone has opponents to play that are not too badly mismatched. My own tester supports this by favoring matches between similarly rated players - but probabilistically any match is still possible. This is much more efficient (for the reasons you stated) when testing a wide variety of strengths.
Exactly - that is how big "energy level" differences between molecules are calculated in chemistry and biochemistry. The Elo scale is like an energy scale: you need players in between to increase accuracy. If you do not have a ruler long enough to measure the height of a building, measure the height of each story and add them up.
Miguel
My tester ensures that any 2 players get a new opening and both play the white side of it. The openings are shallow and I have about 8000 of them so I can play about 16,000 games between any 2 players, because I still want to test positions close to the opening.
I recently added a single layer of checks to the q-search. I _never_ enter the qsearch if the side to move is in check, because I extend when I give check, rather than extending when I have to get out of check. So the first ply of qsearch knows it is not in check to begin with. I now search the normal non-losing capture moves, and then use a special move generator that just generates checks that are non-captures (to avoid duplication). From these, I use SEE as I try them and toss the losers out immediately. That has helped a bit, in that it prevents a null-move blindness where you play a null in a position where you are about to be mated, and now the mate is hidden because you drop right into the q-search. With this, I discover the check that leads to mate or win of material and avoid that specific kind of null-move failure. It also lets me use R=3 everywhere, rather than the adaptive approach where R varies from 3 down to 2 near the tips, which was another gain.
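To make the shape concrete, here is a rough sketch of that quiescence layout as I read the description above: captures first, then one pass of non-capturing checks with SEE throwing out the losers. All helpers are stubs so it compiles and runs; it is not Crafty's source, and replies to the generated checks are handled naively here rather than with a proper evasion search.

#include <stdio.h>

typedef struct { int from, to; } Move;

/* Stub engine hooks; a real program supplies these. */
static int  evaluate(void)                 { return 0; }
static int  gen_captures(Move *m)          { (void)m; return 0; }
static int  gen_noncapture_checks(Move *m) { (void)m; return 0; }
static int  see(Move m)                    { (void)m; return 0; }
static void make(Move m)                   { (void)m; }
static void unmake(Move m)                 { (void)m; }

/* The side to move is assumed NOT to be in check here, because checks were
   extended one ply earlier ("extend when giving check"). A real engine
   would answer the checks below with an evasion search; this sketch just
   recurses for brevity. */
static int qsearch(int alpha, int beta) {
    int stand_pat = evaluate();
    if (stand_pat >= beta) return beta;
    if (stand_pat > alpha) alpha = stand_pat;

    Move moves[256];

    /* Pass 1: the normal non-losing captures. */
    int n = gen_captures(moves);
    for (int i = 0; i < n; i++) {
        if (see(moves[i]) < 0) continue;          /* skip losing captures */
        make(moves[i]);
        int score = -qsearch(-beta, -alpha);
        unmake(moves[i]);
        if (score >= beta) return beta;
        if (score > alpha) alpha = score;
    }

    /* Pass 2: one layer of non-capturing checks, SEE-filtered, to expose
       mates or material wins that a null move would otherwise hide. */
    n = gen_noncapture_checks(moves);
    for (int i = 0; i < n; i++) {
        if (see(moves[i]) < 0) continue;          /* toss losing checks */
        make(moves[i]);
        int score = -qsearch(-beta, -alpha);
        unmake(moves[i]);
        if (score >= beta) return beta;
        if (score > alpha) alpha = score;
    }
    return alpha;
}

int main(void) {
    printf("qsearch score = %d\n", qsearch(-100000, 100000));
    return 0;
}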
Finally, I want to test further restricting check extensions, because there are obvious positions where they are futile. It takes some care to avoid missing a rook sac that leads to mate, where you find yourself down a rook and the position might appear hopeless but is not. Still, something tells me there is something to be gained. BTW, the check extension is not a _huge_ addition anyway. We are talking about 20-30 Elo or so, although I don't remember the exact number. Fortunately, with the two clusters I have here, I can answer most any question quickly and with enough games to be fairly accurate...
This has really made a difference with Crafty's development, as many things that "sounded good" turned out to hurt performance overall, even though it was not obvious with quick tests. I can now change something and get a very accurate good/no-good evaluation back in an hour.
bob wrote:BTW the check extension is not a _huge_ addition anyway. We are talking about 20-30 Elo or so although I don't remember the exact number.
Hi Bob,
Can you please clarify what you meant by the post above? Did you mean that the check extension itself is only worth that much, or that it is your restrictions to the check extension that are worth that much in Elo?
By the way, have you tried the idea in Toga, where checks are extended at PV nodes, and at non-PV nodes only when SEE >= -[value of a pawn]?
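If I read that rule correctly, it amounts to something like the condition below. This is my paraphrase of the description, not Toga's actual source, and PAWN_VALUE stands for whatever the engine uses.

#include <stdio.h>

#define PAWN_VALUE 100

/* Extend a checking move at PV nodes unconditionally; at non-PV nodes only
   when SEE says the check loses no more than a pawn. */
static int extend_check(int is_pv_node, int see_score) {
    if (is_pv_node)
        return 1;
    return see_score >= -PAWN_VALUE;
}

int main(void) {
    printf("%d\n", extend_check(0, -300));  /* non-PV, loses a piece: 0 */
    printf("%d\n", extend_check(0,  -50));  /* non-PV, loses under a pawn: 1 */
    return 0;
}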
Well, the thing is that we aren't talking about just hardware vs. software here. The way this discussion has been going, we have three categories:
1. Hardware
2. Software techniques
3. All other software improvements
When it is limited to the first two, as Bob is doing, it is clear that hardware is the dominating factor. What isn't so clear is whether or not 2 and 3 combined keep up with 1. It's not uncommon to see a top-level program get updated on the same hardware with a significant gain in strength.
Null move and LMR work at all time controls, not only at tournament time control, so I am sure that they would work at least at 120/40 on ancient hardware.
Improvement in software comes both from better search and from better evaluation, and I think that software is responsible for most of the gain in this century.
Rybka 3 can clearly run on hardware from 2000, so it is easy to test: give Rybka 3 hardware from 2000, and give a program of your choice the hardware of today (but that software needs to be from 2000 or earlier).
Again, that is why I have programs both above and below Crafty in my test scheme. But I don't have any programs that are 400-500 Elo below Crafty, because those results would be useless for normal testing.
Removing the check extension only drops Crafty's Elo by 20-30. I have tried that and other ideas. The SEE test actually hurt, with the one exception that for qsearch checks I do limit them that way, and there it was better. In the case of Crafty, that is. YMMV.
I don't agree with the category definitions. I would use:
(1) hardware
(2) software
(3) software features only possible because of hardware improvements
(1) is clear in what it means.
(2) is also clear in what it means.
(3) is the interesting case. For example, null move with R=3 would fail miserably in a program on a 60 MHz Pentium, because the depth would be so shallow that null-move errors would be rampant. But this category is _still_ a direct result of hardware improvements.
The first version of "Blitz" searched around 1 node per second in 1970. By 1980, on a Cray, we were doing 1K nodes per second. Today I am seeing 20M on a simple and readily available 8-core box. That is 20M times faster than 1970, and some 20,000 times faster than the Cray Blitz that had a 2250 USCF rating. Assuming a branching factor of 6 for the 1980 version of Cray Blitz, today's hardware would add about 6 plies to the depth. What would 6 plies add in terms of rating? Hardware is and has been the dominant factor in computer chess strength, and it will continue to be.
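The last calculation, in the same back-of-the-envelope style as earlier in the thread: with an effective branching factor b, a speedup S buys roughly log_b(S) extra plies at a fixed time control. Using only the nodes-per-second figures quoted above (1K then, 20M now):

#include <math.h>
#include <stdio.h>

int main(void) {
    double ebf = 6.0;                    /* assumed EBF of the 1980 program */
    double speedup = 20.0e6 / 1.0e3;     /* 20M nps today vs 1K nps in 1980 */
    printf("extra plies ~ %.1f\n", log(speedup) / log(ebf));  /* prints ~5.5 */
    return 0;
}

That lands in the same ballpark as the roughly six extra plies mentioned above.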