hardware vs software advances

bob · Post by **bob** » Sat Sep 11, 2010 9:08 pm

Don wrote:
bob wrote:The alpha was not particularly superior to Intel at the time, if you exclude the obvious 64 bit advantage it had for bitboards. ...
I meant to comment on this particular quote. I wrote a program for MIT that was to work on both the Pentium and Alpha and it did run on both. It was immediately apparent that there was no comparison. I wish I had real numbers to report but it was a dog on the Pentium compared to the Alpha. The 64 bit issue gives you some, but it's not even a doubling. But I remember that if you assumed the 64 bit on 32 bit Pentium crippled it (even) by 2 to 1, we were something like 2 or 3 times slower. The alpha was ridiculously superior.

But what do you think? Should we base the superiority of a 64 bit chip on a program written for a 32 bit program? You have already given your answer so it's a rhetorical question.

By the way, I DO agree with your reasoning that hardware speed measurements should take into consideration the hardware the program was designed on. But it's VERY difficult to separate that from real program improvements. Any change should be solely in the interest of making the program compatible with new hardware, NOT an actual improvement to the program. This is for hardware comparisons only of course. A reasonable Litmus test for this would be to ask if that change could reasonably be applied to the old hardware.

However, I don't believe that is a huge issue. I don't think it's more than 3/2 but I would be generous and allow 2 to 1 for this. The single biggest thing you could do is recompile the old source code on modern hardware. That will probably make up the biggest difference because it's only fair when comparing hardware with software benchmarks that the software is compiled to the target machine. I agree with you on this one.

I have done that, exactly. I have the 10.x version of Crafty, which is the exact version I was running on the P5, until I got my P6/200 in 1996. Unfortunately, my P5 was the 133 MHz version. The nps for this exact version was 30K. This same version is somewhat faster (NPS) than the current version, using identical hardware. And using parallel search, it really is about 1500x faster than the old P5/90 (using 90 / 133 * 30K ≈ 20K.)

So I have a very accurate measurement for Crafty, that shows at least a factor of 1000x on hardware from the P5/90 to my 8-core E5345 (which is not quite as fast as a single 6 core i7, but it is all I have at the moment to test on). So I have the hardware factor specifically for Crafty. Whether it is the same for others or not, I don't know. I've been optimizing code for 40+ years now and tend to do pretty well in this endeavour, meaning that each Crafty version was pretty well optimized for what I had at the time. Hardly perfectly optimized, but "pretty well optimized". I am a bit surprised that the old program runs so well today, since it doesn't have some of the cache optimizations I have done, but apparently whatever I did back then works like magic today, still.

Now I am comparing 1995 crafty (software) to today's crafty (software) running both on the same modern hardware on my cluster. Won't be long until I have a "software improvement" number for Crafty covering 1995-2010. Will have to add another 200-300 for Rybka, will figure out that exact number later.

All that is left is to measure the Elo gain for that huge hardware performance gain. I am not quite sure how to do that yet. I suppose I could use a fixed node search and when Crafty (old) calculates a target time, it then multiplies that by 22K nodes per second and stops after searching that many nodes. Going to kill new Crafty to do that but it seems necessary to run new and old on old hardware, and on new hardware, to close the loop on the comparisons...
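A minimal sketch of the fixed-node idea described above, assuming the engine already produces a target time per move. The names `node_budget`, `NodeLimiter`, and `EMULATED_NPS` are hypothetical and are not taken from Crafty's source; the 22K figure is the P5/90 speed quoted in the post.

```python
EMULATED_NPS = 22_000  # assumed P5/90 speed for 1995 Crafty, per the post


def node_budget(target_time_s, emulated_nps=EMULATED_NPS):
    """Convert the engine's target time into a node count at the emulated speed."""
    return int(target_time_s * emulated_nps)


class NodeLimiter:
    """Hypothetical helper: the search calls visit() once per node
    and stops as soon as visit() returns True."""

    def __init__(self, budget):
        self.budget = budget
        self.nodes = 0

    def visit(self):
        self.nodes += 1
        return self.nodes >= self.budget


# Example: a 3-minute target time becomes a fixed node budget.
budget = node_budget(180.0)
limiter = NodeLimiter(budget)
```

Stopping on a node count rather than on the clock makes the emulated speed independent of how fast the host machine actually is, which is the point of the approach.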

That is next, once I get the 1995 version on new hardware Elo results later today...

Right now it is hovering at -370 Elo compared to 1995 Crafty. Significantly less than I thought. But in 1995 we had null-move and futility pruning, just no LMR or more significant forward pruning. My eval is really not all that different except for king safety however.

Don · Post by **Don** » Sat Sep 11, 2010 10:10 pm

bob wrote: I do not quite see the point for all the tangents. It would seem to me we have a pretty good idea of what/how to test.

Says you.

Rebel is only 100 times faster on modern hardware. You add a zero to this and consider it a reasonable test. You are rounding every possible variable in YOUR direction and calling it "a pretty good idea of what to test."

Your first rounding error is that you compare Crafty on 32 bit to Crafty on 64 bit. Crafty is not optimal on 32 bit hardware so you are breaking your own rule.

Your second "rounding" error is to compare the best possible hardware that a high end hobbyist might own today to another platform that was considered INFERIOR in its day compared to other (not just Alpha) workstations that were available.

You justify this by claiming that nobody cares about anything other than Intel hardware which is probably true, but has nothing to do with MY contention of hardware advancement. Your test won't prove anything I said is wrong - but of course if you redefine what people say in order to suit your own purposes, then you can "pretend" you are proving them wrong in politician fashion.

Your next rounding error is to run time odds games based on the Nodes per second difference when everyone knows that 6 cores is not better than a machine that is 6 times faster.

Your 1000 to 1 test is just not a reasonable test. You can FIX the last "rounding error" by calculating the time difference based on just 1 core instead of the quad. Then run the single processor OLD program against an MP version of the new program. But just saying that it does 1000X more nodes per second, therefore it should get 1000 to 1 time odds is asinine. If you like doing that then let's run a match between Komodo and Crafty. You can run Crafty on 4 cores as long as you let Komodo run on a single-core machine that is 4x faster than each of your cores. We will both be doing the same number of (hardware adjusted) nodes per second so it must be fair, right?

But that would not cure the other rounding errors.

I have a 1995 version of Crafty that seems to be running correctly on my cluster, and seems to be running well over 1000x faster than it did on 1995 P90. I am getting close to being able to announce just how strong (or weak) that 1995 program is on today's hardware. We can subtract that from Crafty's rating, add whatever fudge factor is needed to raise the standard to Rybka, and voilà, we have the software gain from 1995.

Good for you. Run a fair test and I might be interested.

If you want to salvage this, allow for the test to be verifiable. It should be possible for anyone who is willing to independently verify your results, run their own tests under their own conditions and do any experiment they need to in order to be satisfied. If you don't do that, then this is meaningless. If you make it possible then I am still interested in seeing this test done under fair conditions.

Make all the source code available including the original source code and your fixed source code. If possible make it so that we can run both programs under win-board protocol and provide a way for us to verify that it's the correct source code, perhaps an old web site where the sources of old crafty versions are posted.

This is NOT an accusation of dishonesty or anything like that, but it's just common sense. Anyone can run a test and make a mistake that gives unintended results. I have done it myself where I almost came to the wrong conclusion because one program was crashing and racking up losses for instance. It's very easy to overlook some testing issue that you didn't think of and it's part of the wisdom of independent verification.

Don · Post by **Don** » Sat Sep 11, 2010 10:42 pm

bob wrote:
Don wrote:
bob wrote:The alpha was not particularly superior to Intel at the time, if you exclude the obvious 64 bit advantage it had for bitboards. ...
I meant to comment on this particular quote. I wrote a program for MIT that was to work on both the Pentium and Alpha and it did run on both. It was immediately apparent that there was no comparison. I wish I had real numbers to report but it was a dog on the Pentium compared to the Alpha. The 64 bit issue gives you some, but it's not even a doubling. But I remember that if you assumed the 64 bit on 32 bit Pentium crippled it (even) by 2 to 1, we were something like 2 or 3 times slower. The alpha was ridiculously superior.

But what do you think? Should we base the superiority of a 64 bit chip on a program written for a 32 bit program? You have already given your answer so it's a rhetorical question.

By the way, I DO agree with your reasoning that hardware speed measurements should take into consideration the hardware the program was designed on. But it's VERY difficult to separate that from real program improvements. Any change should be solely in the interest of making the program compatible with new hardware, NOT an actual improvement to the program. This is for hardware comparisons only of course. A reasonable Litmus test for this would be to ask if that change could reasonably be applied to the old hardware.

However, I don't believe that is a huge issue. I don't think it's more than 3/2 but I would be generous and allow 2 to 1 for this. The single biggest thing you could do is recompile the old source code on modern hardware. That will probably make up the biggest difference because it's only fair when comparing hardware with software benchmarks that the software is compiled to the target machine. I agree with you on this one.
I have done that, exactly. I have the 10.x version of Crafty, which is the exact version I was running on the P5, until I got my P6/200 in 1996. Unfortunately, my P5 was the 133 MHz version. The nps for this exact version was 30K. This same version is somewhat faster (NPS) than the current version, using identical hardware. And using parallel search, it really is about 1500x faster than the old P5/90 (using 90 / 133 * 30K ≈ 20K.)

So I have a very accurate measurement for Crafty, that shows at least a factor of 1000x on hardware from the P5/90 to my 8-core E5345 (which is not quite as fast as a single 6 core i7, but it is all I have at the moment to test on). So I have the hardware factor specifically for Crafty. Whether it is the same for others or not, I don't know. I've been optimizing code for 40+ years now and tend to do pretty well in this endeavour, meaning that each Crafty version was pretty well optimized for what I had at the time. Hardly perfectly optimized, but "pretty well optimized". I am a bit surprised that the old program runs so well today, since it doesn't have some of the cache optimizations I have done, but apparently whatever I did back then works like magic today, still.

Now I am comparing 1995 crafty (software) to today's crafty (software) running both on the same modern hardware on my cluster. Won't be long until I have a "software improvement" number for Crafty covering 1995-2010. Will have to add another 200-300 for Rybka, will figure out that exact number later.

All that is left is to measure the Elo gain for that huge hardware performance gain. I am not quite sure how to do that yet. I suppose I could use a fixed node search and when Crafty (old) calculates a target time, it then multiplies that by 22K nodes per second and stops after searching that many nodes. Going to kill new Crafty to do that but it seems necessary to run new and old on old hardware, and on new hardware, to close the loop on the comparisons...

That is next, once I get the 1995 version on new hardware Elo results later today...

Right now it is hovering at -370 Elo compared to 1995 Crafty. Significantly less than I thought. But in 1995 we had null-move and futility pruning, just no LMR or more significant forward pruning. My eval is really not all that different except for king safety however.

Bob,

Can you make the source code for everything available?

Here are some observations that I think would "fix" your test because I think it is off considerably but can be repaired:

Crafty IS a 64 bit program. No way to get around that. It ran well on 32 bit machines but I believe it's far from optimal on a 32 bit machine so when you compare nodes per second you are comparing something somewhat crippled to something running on the hardware it was designed for. I think 3/2 is a reasonable adjustment for this.

8 processors is too much. It's true that you can get an 8 processor machine but almost nobody has one unless they have a lot of disposable income. If you want to be "reasonable" then you should compare only to a quad core. It was YOU that said nobody cares about non Intel workstations so by that same reasoning nobody cares about how well Crafty runs on a machine that almost nobody has.

So 4 processors is a "reasonable" comparison if you want to just consider machines that the average family can have without getting a loan from the bank. Otherwise I should also be able to consider more expensive workstations too such as the powerful Alpha and others. (There was also the HP which blew away the pentium and perhaps the SGI machines too.) Stanback ran on an HP in those days and won tournaments partly because this was superior to the Pentiums everyone else was using.

Also, it is really unreasonable to notice that Crafty does 4x more nodes per second on 4 processors, and consider the full 4 to 1 a reasonable time handicap. You should either run on 4 actual cores in your test, or reduce the advantage of 4 cores (4 cores is NOT 1 core running 4x faster.)

We have solid numbers for 4 cores on the rating lists. Crafty is almost exactly 100 ELO stronger on 4 cores and that seems to be more than Rybka so it's generous. I think Crafty running 4x faster is much more than 100 ELO stronger. We can actually find out if you want to make the adjustment fairly. Just find out exactly what kind of odds Crafty needs in order to get 100 ELO improvement. In fact I can run that test on Komodo and I'll bet the numbers work out the same.
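To put the 100 Elo figure in match terms, here is a small sketch using the standard logistic Elo model (the usual rating-list formula, not anything specific to Crafty or Komodo). Under that model a 100 Elo edge corresponds to scoring roughly 64% of the points. The function names are illustrative.

```python
import math


def expected_score(elo_diff):
    """Expected score (0..1) for the side that is elo_diff points stronger."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def elo_from_score(score):
    """Inverse of expected_score: Elo difference implied by a match score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


# What a 4-core advantage of about 100 Elo would look like as a match score.
score_for_100_elo = expected_score(100.0)
```

With these two helpers, the time-odds experiment proposed above reduces to finding the time multiplier at which the 1-core version's measured score against itself converts back to about +100.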

Another way to do this test is to run Crafty on 4 cores against Crafty running on 1 core but with 4x more CPU time.

If you make those adjustments then the test is definitely reasonable. Of course we have the Rybka factor too. We can compare Crafty in 1995 to the best program THEN if the rating lists are available and Crafty today to the best program and make that our adjustment for this issue.

If you don't do those things, the test in my opinion is not really testing something that is very interesting.

I would also like to know more about how you intend to run these tests. What time controls are you using? Do you intend to do this with heavy time odds games?

And finally I just want to be able to duplicate this test myself.

mhull · Post by **mhull** » Sat Sep 11, 2010 10:52 pm

Don wrote:
bob wrote:I do not quite see the point for all the tangents. It would seem to me we have a pretty good idea of what/how to test.
Says you.

Rebel is only 100 times faster on modern hardware. You add a zero to this and consider it a reasonable test.

I don't think it persuasive to introduce an orange in a comparison of apples. Crafty is apples to apples. It's ok to compare oranges to oranges if you've got the resources to do that. Use the available P90 and DOSEMU on modern hardware to compare Rebel and Genius for then and now. It will be good if you run with that as an independent test. The difference will be that crafty has a larger hardware increase potential. But that shouldn't matter as the functions for both tests should map to a similar curve (in theory).

bob · Post by **bob** » Sat Sep 11, 2010 10:53 pm

Don wrote:
bob wrote: I do not quite see the point for all the tangents. It would seem to me we have a pretty good idea of what/how to test.
Says you.

Rebel is only 100 times faster on modern hardware. You add a zero to this and consider it a reasonable test. You are rounding every possible variable in YOUR direction and calling it "a pretty good idea of what to test."

Bullshit. I have not rounded _anything_ in my favor. 1995 Crafty searched 30K on P5/133. 20K on P5/90. 1995 Crafty searches 30M on my current box, and has another 10% once I get PGO going.

So how can you say 30M / 22K = 1000x is "rounding every possible variable in MY direction?"
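For anyone who wants to check the arithmetic, a short sketch using only the figures quoted in this thread (30K NPS on the P5/133, 30M NPS on the current box). The linear scaling of NPS with clock speed is an assumption made for illustration, and the helper name is hypothetical.

```python
def scale_nps_by_clock(nps, mhz_measured, mhz_target):
    """Assumption: nodes per second scales linearly with clock frequency."""
    return nps * mhz_target / mhz_measured


P5_133_NPS = 30_000      # measured in 1995, per the post
MODERN_NPS = 30_000_000  # 1995 Crafty on the current 8-core box, per the post

# Estimated speed on a P5/90, scaled down from the P5/133 measurement.
p5_90_nps = scale_nps_by_clock(P5_133_NPS, 133, 90)

# Hardware ratio using the scaled estimate, and using the 22K figure instead.
speedup_scaled = MODERN_NPS / p5_90_nps
speedup_22k = MODERN_NPS / 22_000

print(f"P5/90 estimate: {p5_90_nps:.0f} NPS")
print(f"ratio vs scaled estimate: {speedup_scaled:.0f}x")
print(f"ratio vs 22K: {speedup_22k:.0f}x")
```

Either baseline puts the raw NPS ratio well above 1000x, so 1000x is the conservative end of the range.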

If you want to be snotty, let's use the exact number which is 1500x based on the above actual data, not on guessing or extrapolation. Once I make sure everything works correctly, I will be more than happy to make this source available, although it is only going to run on 64 bits. One can always download 10.18 from my ftp box to get the original 32-bit only version. Only problem is that the original version will not work with current winboard/xboard as the protocol has changed dramatically since 1995 when Tim Mann and I were whacking it everywhere.

Your first rounding error is that you compare Crafty on 32 bit to Crafty on 64 bit. Crafty is not optimal on 32 bit hardware so you are breaking your own rule.

Don, give it a rest. We started the discussion concerning two issues:

(1) what part of today's strength came from hardware advances between 1995 and 2010?

(2) what part of today's strength came from programming advances between 1995 and 2010?

For hardware, we had 32 bit P5/90 in 1995. For software, we had Crafty. Which was highly competitive with anyone in 1995. And Crafty ran on 32 bit hardware, just as surely as Slate used 60 bit hardware for his 64 bit chess engines in the 70's. Bitboards are not optimal on 32 bit hardware. But they are certainly the equal of non-bitboard programs on that hardware. So what?

I have simply taken 1995 crafty on 1995 hardware, and then run 1995 crafty on 2006 hardware and compared the search speeds. If you use actual numbers, I see 1500x. I've rounded that _down_ in your favor to 1000x. That's not a guesstimated number, it is not a contrived number. It just compares 1995 HW to 2006 HW for Crafty.

The 1500x or 1000x is a real number. You apparently want to run every program you can find and use the worst ratio. I simply want to run my program because I have the numbers from 1995 and today (now).

Software improvements for my program are getting computed as I post this. If we can trust some big rating list, we can then figure out how much more Rybka has produced via software. And that gives us the software improvement.

I'm not trying to make up any numbers. I started off thinking new crafty might be 2x faster, as I mentioned. And was prepared to accept whatever number popped out as the "crafty hardware performance improvement" number. The old version is doing better than I expected. It is as fast in NPS, and appears to be within 350 Elo in terms of rating.

Nothing is rounded. Nothing is estimated. Nothing is cherry-picked.

Your second "rounding" error is to compare the best possible hardware that a high end hobbyist might own today to another platform that was considered INFERIOR in its day compared to other (not just Alpha) workstations that were available.

Sorry, but your math sucks. The best hardware a high-end hobbyist might have today is at least a 24 core box. I have at least 3 students with those that I could have send you an email for confirmation. We have limited this to a decent single-chip i7, which is _not_ the high-end of the platforms, the duals and quads (chips, not cores) are the higher-end, and there are platforms beyond that.

P5/90 was not "inferior". A quad alpha would toast it. A normal alpha running a 32 bit program would not toast the thing at all. How many 64 bit chess engines do you think there were in 1995? I can count 'em on one hand and two were mine (Cray Blitz and Crafty). No micro-bitboarders in 1995 besides yours-truly. So the standard program was 32 bits. Mine needed 64. And ran pretty fast anyway.

Ask around and identify some "hobbyists" that had alphas in 1995. I doubt you can find a single computer chess person, other than those at a university or lab, that could even put hands-on an alpha.

You justify this by claiming that nobody cares about anything other than Intel hardware which is probably true, but has nothing to do with MY contention of hardware advancement. Your test won't prove anything I said is wrong - but of course if you redefine what people say in order to suit your own purposes, then you can "pretend" you are proving them wrong in politician fashion.

Then you define the rule. But you can _not_ ignore deep thought and deep blue if you are going to venture out beyond Intel. That ruins the discussion immediately.

I am proceeding. I am currently measuring old and new software on current hardware. I am going to figure out how to slow the hardware down and measure old and new on old hardware. I'll report the numbers. If you aren't happy with 'em, feel free to compute whatever you want. This seems to be about the fairest way I can think of. The 1000x is not even important in my tests, because I _know_ how fast Crafty searched in 1995, and I am pretty sure I can make it search at that _same_ speed in 2010 to see what one of those 1995 p5/90's would do today. Then I will know _exactly_ what hardware and software offers for Crafty, and can extrapolate with pretty good accuracy to get to Rybka's level for the true software-only improvement. At present, it seems that we might be at +600 for software. May end up that hardware is about the same. Don't know yet. But I am going to find out without getting bogged down in alphas, rs6000's, sparcs and ultra-sparcs, MIPS, Crays, Fujitsus, Hitachis, deep blues, belles, and you name it other hardware platforms. Nobody cares. Everybody has been using PCs since 1995. Except for 2-3-4 of us, and since 1994 I have used absolutely nothing but the x86/amd64 processors for chess tournaments. I used to bring the biggest hammer of all, but even I moved to the PC exclusively.

Your next rounding error is to run time odds games based on the Nodes per second difference when everyone knows that 6 cores is not better than a machine that is 6 times faster.

So what? I am not using 6 cores in any of these tests. So I fail to see your point. I used the 6-core observation purely as a method to compute hardware speedup since 1995. That 1000x number will not influence my test results at all, I am simply going to make old crafty search at p5/90 speed and see how much weaker it gets. That's not so hard to understand, is it?

Your 1000 to 1 test is just not a reasonable test. You can FIX the last "rounding error" by calculating the time difference based on just 1 core instead of the quad. Then run the single processor OLD program against an MP version of the new program.

Why would I do that? The old program had a parallel search. I can't use that today? We had parallel machines in 1995. We had parallel 386 boxes in 1986. I guess I therefore miss your point. I claim that _raw hardware_ is 1500x faster in running Crafty. I doubt you can refute that since anyone can compute the numbers. But I am not using that in my testing anywhere. I just wanted a number. That's a major Elo boost, however. But I am not going to turn 1500x into an estimated Elo gain. I'm going to test and produce an _actual_ elo difference between old at 22k and old at 4M (which is just using 1 core). I may (later) try old with 8 cores, but right now I am giving you every possible edge I can. One core. Not even the fastest current processor. And yet you are complaining about things that have no bearing on the results whatsoever.

But just saying that it does 1000X more nodes per second, therefore it should get 1000 to 1 time odds is asinine. If you like doing that then let's run a match between Komodo and Crafty. You can run Crafty on 4 cores as long as you let Komodo run on a single-core machine that is 4x faster than each of your cores. We will both be doing the same number of (hardware adjusted) nodes per second so it must be fair, right?

You can make up whatever stuff you want. You've not seen _me_ talk about 1000:1 time odds. You have seen me talk about 22K in 1995 and that is the speed I am going to eventually make old crafty run at on my new hardware, to see what slowing it from 4M to 22K is going to do. I may well then come back and run old up to 32M using 8 cores to get a _better_ estimate of hardware improving the rating. But I have not done that yet.

But that would not cure the other rounding errors.

I have a 1995 version of Crafty that seems to be running correctly on my cluster, and seems to be running well over 1000x faster than it did on 1995 P90. I am getting close to being able to announce just how strong (or weak) that 1995 program is on today's hardware. We can subtract that from Crafty's rating, add whatever fudge factor is needed to raise the standard to Rybka, and voilà, we have the software gain from 1995.
Good for you. Run a fair test and I might be interested.

If you want to salvage this, allow for the test to be verifiable. It should be possible for anyone who is willing to independently verify your results, run their own tests under their own conditions and do any experiment they need to in order to be satisfied. If you don't do that, then this is meaningless. If you make it possible then I am still interested in seeing this test done under fair conditions.

I'm not quite sure what you are implying, but chew on this. Of the two of us, which one makes _all_ of their source code publicly available so that any claims they make can be verified easily?

Think about it.

You live in secrecy, and then accuse me of being dishonest in my test results? I always make everything available. You should try it. Means unexpected errors don't slip into a crack either.

Make all the source code available including the original source code and your fixed source code. If possible make it so that we can run both programs under win-board protocol and provide a way for us to verify that it's the correct source code, perhaps an old web site where the sources of old crafty versions are posted.

This is NOT an accusation of dishonesty or anything like that, but it's just common sense.

Sorry, but get real. It _is_ an accusation. But it is a point I always address anyway. I'm not about to claim my old program will work with winboard/xboard. It might or might not. I simply made it work with my referee, which doesn't need the ping/pong/done=0/done=1 crap. But I will make the source available with the proviso that (a) it is 64 bit only, and (b) it is minimally compliant with new xboard/winboard stuff...

Anyone can run a test and make a mistake that gives unintended results. I have done it myself where I almost came to the wrong conclusion because one program was crashing and racking up losses for instance. It's very easy to overlook some testing issue that you didn't think of and it's part of the wisdom of independent verification.

Don · Post by **Don** » Sat Sep 11, 2010 11:02 pm

mhull wrote:
Don wrote:
bob wrote:I do not quite see the point for all the tangents. It would seem to me we have a pretty good idea of what/how to test.
Says you.

Rebel is only 100 times faster on modern hardware. You add a zero to this and consider it a reasonable test.
I don't think it persuasive to introduce an orange in a comparison of apples. Crafty is apples to apples.

Rebel vs Rebel - apples to apples.

Crafty 1995 vs Crafty 2010 - apples vs oranges.

Also, note the hardware is 32 bit vs 64 bit. 32 bit programs run fine on 64 bit hardware (top 5 programs today are still 32 bit) but 64 bit programs take a hit - so that is more apples vs oranges.

It's ok to compare oranges to oranges if you've got the resources to do that. Use the available P90 and DOSEMU on modern hardware to compare Rebel and Genius for then and now.

I found dosemu no good for this. I actually went to the trouble to make a DOS install USB stick.

In my opinion it would be more or less a fair comparison to recompile Rebel, then run a match on modern hardware against Rybka 4 and see how much ELO difference there is. But the difference in ELO means it would take a huge number of games to get within 100 ELO with any confidence.
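A rough sketch of why a lopsided match needs so many games. It assumes independent games, wins and losses only (no draws), a normal approximation to the binomial, and the standard logistic Elo model; all of these are simplifications, and the function names are illustrative.

```python
import math


def elo_from_score(score):
    """Elo difference implied by a match score in (0, 1), logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_interval(score, games, z=1.96):
    """Approximate 95% interval on the Elo difference for a given match score."""
    se = math.sqrt(score * (1.0 - score) / games)
    eps = 1e-9  # keep the bounds strictly inside (0, 1) so the log is defined
    lo = min(max(score - z * se, eps), 1.0 - eps)
    hi = min(max(score + z * se, eps), 1.0 - eps)
    return elo_from_score(lo), elo_from_score(hi)


def interval_width(score, games):
    """Width in Elo of the approximate 95% interval."""
    lo, hi = elo_interval(score, games)
    return hi - lo


# Compare an even match with a lopsided one at the same number of games.
for score in (0.50, 0.95):
    print(score, round(interval_width(score, 1000)), "Elo wide at 1000 games")
```

Near 50% the score-to-Elo curve is shallow, so a small uncertainty in score is a small uncertainty in Elo. Near 95% the curve is steep, so the same number of games gives a much wider Elo interval.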

It will be good if you run with that as an independent test. The difference will be that crafty has a larger hardware increase potential. But that shouldn't matter as the functions for both tests should map to a similar curve (in theory).

rbarreira · Post by **rbarreira** » Sat Sep 11, 2010 11:02 pm

If I may interject something, I was a bit surprised by the implied claim that multi-CPU boxes were not available in 1995. I was pretty young at that time, but even then I already seem to recall advertisements for multi-CPU servers. So it might be fair to say that we should consider a multi-CPU Pentium Pro or something like that.

bob · Post by **bob** » Sat Sep 11, 2010 11:16 pm

Don wrote:
bob wrote:
Don wrote:
bob wrote:The alpha was not particularly superior to Intel at the time, if you exclude the obvious 64 bit advantage it had for bitboards. ...
I meant to comment on this particular quote. I wrote a program for MIT that was to work on both the Pentium and Alpha and it did run on both. It was immediately apparent that there was no comparison. I wish I had real numbers to report but it was a dog on the Pentium compared to the Alpha. The 64 bit issue gives you some, but it's not even a doubling. But I remember that if you assumed the 64 bit on 32 bit Pentium crippled it (even) by 2 to 1, we were something like 2 or 3 times slower. The alpha was ridiculously superior.

But what do you think? Should we base the superiority of a 64 bit chip on a program written for a 32 bit program? You have already given your answer so it's a rhetorical question.

By the way, I DO agree with your reasoning that hardware speed measurements should take into consideration the hardware the program was designed on. But it's VERY difficult to separate that from real program improvements. Any change should be solely in the interest of making the program compatible with new hardware, NOT an actual improvement to the program. This is for hardware comparisons only of course. A reasonable Litmus test for this would be to ask if that change could reasonably be applied to the old hardware.

However, I don't believe that is a huge issue. I don't think it's more than 3/2 but I would be generous and allow 2 to 1 for this. The single biggest thing you could do is recompile the old source code on modern hardware. That will probably make up the biggest difference because it's only fair when comparing hardware with software benchmarks that the software is compiled to the target machine. I agree with you on this one.
I have done that, exactly. I have the 10.x version of Crafty, which is the exact version I was running on the P5, until I got my P6/200 in 1996. Unfortunately, my P5 was the 133 MHz version. The nps for this exact version was 30K. This same version is somewhat faster (NPS) than the current version, using identical hardware. And using parallel search, it really is about 1500x faster than the old P5/90 (using 90 / 133 * 30K ≈ 20K.)

So I have a very accurate measurement for Crafty, that shows at least a factor of 1000x on hardware from the P5/90 to my 8-core E5345 (which is not quite as fast as a single 6 core i7, but it is all I have at the moment to test on). So I have the hardware factor specifically for Crafty. Whether it is the same for others or not, I don't know. I've been optimizing code for 40+ years now and tend to do pretty well in this endeavour, meaning that each Crafty version was pretty well optimized for what I had at the time. Hardly perfectly optimized, but "pretty well optimized". I am a bit surprised that the old program runs so well today, since it doesn't have some of the cache optimizations I have done, but apparently whatever I did back then works like magic today, still.

Now I am comparing 1995 crafty (software) to today's crafty (software) running both on the same modern hardware on my cluster. Won't be long until I have a "software improvement" number for Crafty covering 1995-2010. Will have to add another 200-300 for Rybka, will figure out that exact number later.

All that is left is to measure the Elo gain for that huge hardware performance gain. I am not quite sure how to do that yet. I suppose I could use a fixed node search and when Crafty (old) calculates a target time, it then multiplies that by 22K nodes per second and stops after searching that many nodes. Going to kill new Crafty to do that but it seems necessary to run new and old on old hardware, and on new hardware, to close the loop on the comparisons...
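
The fixed-node idea above could be sketched roughly like this. It is only an illustration of converting a target time into a node budget. `node_budget` and `NodeLimiter` are hypothetical names, not Crafty's actual time-control code, and the 22K figure is the P5/90 speed quoted in the post.

```python
# Emulate old-hardware speed by turning the target search time into a
# node budget. Hypothetical helper names; not taken from Crafty's source.
OLD_HW_NPS = 22_000  # P5/90 nodes per second, as quoted in the post


def node_budget(target_time_s: float, nps: int = OLD_HW_NPS) -> int:
    """Number of nodes the old hardware would search in target_time_s."""
    return int(target_time_s * nps)


class NodeLimiter:
    """Counts visited nodes and reports when the budget is used up."""

    def __init__(self, budget: int) -> None:
        self.budget = budget
        self.nodes = 0

    def visit(self) -> bool:
        """Record one node; return True when the search should stop."""
        self.nodes += 1
        return self.nodes >= self.budget


# A 3-second target time on the emulated P5/90 becomes a 66,000-node search.
limiter = NodeLimiter(node_budget(3.0))
while not limiter.visit():
    pass  # a real engine would expand one search node here
print(limiter.nodes)
```

The search then stops on node count rather than on the clock, so the result does not depend on how fast the host machine really is.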

That is next, once I get the 1995 version on new hardware Elo results later today...

Right now it is hovering at -370 Elo compared to 1995 Crafty. Significantly less than I thought. But in 1995 we had null-move and futility pruning, just no LMR or more significant forward pruning. My eval is really not all that different except for king safety however.
Bob,

Can you make the source code for everything available?

Here are some observations that I think would "fix" your test because I think it is off considerably but can be repaired:

Crafty IS a 64 bit program. No way to get around that. It ran well on 32 bit machines but I believe it's far from optimal on a 32 bit machine so when you compare nodes per second you are comparing something somewhat crippled to something running on the hardware it was designed for. I think 3/2 is a reasonable adjustment for this.

The point is, "IT DOESN'T MATTER" (emphasis intended.) This hardware speedup for crafty is what it is. You would not take two completely different cars (say a diesel truck and a Honda S-2000) and compare tach speeds at 70mph. One would be 1500, other would be 4000+. So what? I'm not using that 1000x factor _anywhere_ in my testing. I simply wanted to know about how much faster I am today than in 1995. Answer: 1500x assuming P5/90 in 1995, 8xE5345 (which is a bit slower than 6 x fastest I7 core speed). However, what I am going to compute is the Elo loss in slowing the program from the 1-cpu speed of 4M nodes per second to the P5/90 speed of 22K. And I am then going to try to run the test on our 8-core cluster to see what the extra 7 cores do to old crafty's Elo to bring it up to near 2010 NPS speeds. Right now I am using just one core on the tests, which is nowhere near current real hardware speeds. But it is a good _starting_ point since I can play 8x more games in a given amount of time using 1 core per game rather than 8.
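
To put the single-cpu slowdown in perspective, here is a small worked calculation using only the two NPS figures quoted above. Expressing the ratio as a number of speed doublings is an added convenience, not something measured in this thread, and no Elo-per-doubling value is assumed.

```python
import math

# Single-cpu speeds quoted in the post.
FAST_NPS = 4_000_000  # one E5345 core
SLOW_NPS = 22_000     # P5/90 estimate

slowdown = FAST_NPS / SLOW_NPS
doublings = math.log2(slowdown)

print(f"Slowdown factor: {slowdown:.0f}x")
print(f"Equivalent speed doublings: {doublings:.1f}")
```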

8 processors is too much. It's true that you can get an 8 processor machine but almost nobody has one unless they have a lot of disposable income. If you want to be "reasonable" then you should compare only to a quad core. It was YOU that said nobody cares about non Intel workstations so by that same reasoning nobody cares about how well Crafty runs on a machine that almost nobody has.

It has become _very_ difficult to buy a non-multicore box. I just bought a new machine for my wife, under $700 bucks with LCD display, quad-cores. I have a bunch of students with dual quads, dual 6-cores, and 3 I know of with 4x6 AMD boxes. 8 on my E5345 box is slower than a 6-core i7 box. But I have a bunch of those 8-core boxes and may soon have a cluster with a bunch of dual 6-core i7's as well... However, at the moment, 8 is the closest I can get to a current i7.

So 4 processors is a "reasonable" comparison if you want to just consider machines that the average family can have without getting a loan from the bank. Otherwise I should also be able to consider more expensive workstations too such as the powerful Alpha and others. (There was also the HP which blew away the Pentium and perhaps the SGI machines too.) Stanback ran on an HP in those days and won tournaments partly because this was superior to the Pentiums everyone else was using.

Not quite. John always ran on an experimental box that was grossly overclocked most of the time, not something you could walk in and buy.

Also, it is really unreasonable to notice that Crafty does 4x more nodes per second on 4 processors, and consider the full 4 to 1 a reasonable time handicap. You should either run on 4 actual cores in your test, or reduce the advantage of 4 cores (4 cores is NOT 1 core running 4x faster.)

Can't begin to follow that reasoning. I am not using any "time handicap" in my testing. The current test is single-cpu E5345 with old and new programs running at almost exactly the same speed. Then I am going to slow them both down to 22K (which is a real 1995 number) and play against the same gauntlet running full-speed. I'm about as certain as one can get that this is going to drop _WAY_ more than the current +370 Elo I am seeing when comparing 2010 crafty to 1995 crafty. But I am going to run the test to see.

We have solid numbers for 4 cores on the rating lists. Crafty is almost exactly 100 ELO stronger on 4 cores and that seems to be more than Rybka so it's generous. I think Crafty running 4x faster is much more than 100 ELO stronger. We can actually find out if you want to make the adjustment fairly. Just find out exactly what kind of odds Crafty needs in order to get 100 ELO improvement. In fact I can run that test on Komodo and I'll bet the numbers work out the same.

Another way to do this test is to run Crafty on 4 cores against Crafty running on 1 core but with 4x more CPU time.
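
For reference, the Elo differences being discussed can be translated into expected scores with the standard logistic Elo formula. This is a generic sketch and not the method any rating list or either poster necessarily uses; `expected_score` is an illustrative name.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score (0..1) for the side that is elo_diff points stronger,
    using the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Differences mentioned in this thread: the 4-core gain and the
# 2010-vs-1995 Crafty gap.
for diff in (100, 370):
    print(f"+{diff} Elo -> expected score {expected_score(diff):.2f}")
```

Under this model, +100 Elo corresponds to scoring about 64% and +370 Elo to about 89%.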

I don't need to run that test. That's been a "decided issue" for years. Crafty on 4 cores is between 3.1 and 3.3 times faster than on 1 core. For 1-16 cores, a _very_ good estimation is:

speedup = 1 + (Ncpus -1) * 0.7

Which gives 3.1 for 4, where I ran a huge test a few years ago and someone here went thru the logs and computed 3.3. So pretty close.
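
The rule of thumb above is easy to tabulate. The sketch below simply evaluates the quoted formula for a few core counts. The function name is illustrative, and the formula is only claimed in the post to hold for 1 to 16 cores.

```python
def estimated_speedup(ncpus: int, per_cpu_gain: float = 0.7) -> float:
    """Rule-of-thumb parallel search speedup: 1 + (Ncpus - 1) * 0.7."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return 1.0 + (ncpus - 1) * per_cpu_gain


for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} cpus -> {estimated_speedup(n):.1f}x")
```

For 8 cores this gives about 5.9x, so 8 cores are not equivalent to one core running 8 times faster.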

If you make those adjustments then the test is definitely reasonable. Of course we have the Rybka factor too. We can compare Crafty in 1995 to the best program THEN if the rating lists are available and Crafty today to the best program and make that our adjustment for this issue.

If you don't do those things, the test in my opinion is not really testing something that is very interesting.

I would also like to know more about how you intend to run these tests. What time controls are you using? Do you intend to do this with heavy time odds games?

And finally I just want to be able to duplicate this test myself.

I am first using my usual fast time control to make sure everything is working. 10s + 0.1s. Once I am convinced all is well, I will likely run just 10.x and 23.x against the gauntlet at something longer. I'd sort of like to use 60+60, but need to see how long that will take since we are only using 1/2 the cluster. That would keep the 1995 searches from being so shallow as to be ugly. More once I get some preliminary numbers and see how things look.

bob · Post by **bob** » Sat Sep 11, 2010 11:21 pm

Don wrote:
mhull wrote:
Don wrote:
bob wrote:I do not quite see the point for all the tangents. It would seem to me we have a pretty good idea of what/how to test.
Says you.

Rebel is only 100 times faster on modern hardware. You add a zero to this and consider it a reasonable test.
I don't think it persuasive to introduce an orange in a comparison of apples. Crafty is apples to apples.
Rebel vs Rebel - apples to apples.

Crafty 1995 vs Crafty 2010 - apples vs oranges.

only problem is you apparently don't know squat about fruit.

Nobody is comparing crafty 1995 to crafty 2010 for hardware speed. I have speed numbers for 1995. Running same program on 2010 is most definitely as "apples to apples" as the rebel vs rebel. They are identical, in fact.

Also, note the hardware is 32 bit vs 64 bit. 32 bit programs run fine on 64 bit hardware (the top 5 programs today are still 32 bit) but 64 bit programs take a hit - so that is more apples vs oranges.

It was, however, a hit I _chose_ to take in 1995. Same hit Slate chose to take when running on 60 bit CDC hardware, in fact. No question you get more performance if you have 64 bit data density. Running a 32 bit program on 64 bits is a good idea? You lose 8 extra registers. Don't the amd64 extensions count for anything in hardware improvements? If they were short-sighted enough to continue using 32 bit approaches, with the alpha already out and available (along with others doing 64 bits) it seemed to me that the writing was on the wall and 64 bits was going to be the way to go. Someone makes a poor choice, that doesn't mean I have to be penalized for it... Just means they had little foresight.

It's ok to compare oranges to oranges if you've got the resources to do that. Use the available P90 and DOSEMU on modern hardware to compare Rebel and Genius for then and now. It will be good if you run with that as an independent test. The difference will be that crafty has a larger hardware increase potential. But that shouldn't matter as the functions for both tests should map to a similar curve (in theory).

bob · Post by **bob** » Sat Sep 11, 2010 11:26 pm

rbarreira wrote:If I may interject something, I was a bit surprised by the implied claim that multi-CPU boxes were not available in 1995. I was pretty young at that time, but even then I already seem to recall advertisements for multi-CPU servers. So it might be fair to say that we should consider a multi-CPU Pentium Pro or something like that.

Pentium Pro wasn't out in 1995. But Sequent had a 30 cpu 386 box in 1986. So they were available. But that box cost over 1/2 million bucks so it was not a "normal chess machine." I had argued from the beginning that parallel search programming improvements were not a post-1995 issue, parallel search was done by several of us in the late 70's. We won the 1983 WCCC with a 2-cpu Cray, and 1986 with an 8-cpu YMP. But for the PC world there was not a single SMP chess program until Crafty. I worked off and on on the SMP stuff from early 1995 until I finally released a version in early 1997 that had SMP support.
