Crafty tests show that Software has advanced more.

bob · Post by **bob** » Sun Sep 12, 2010 11:03 pm

Don wrote:
bob wrote:
Don wrote:I didn't really expect that Bob's test would show this as I consider his test rather biased in favor of hardware. Nevertheless, it is still showing that software is a bigger contributor to computer chess advancement over the years than hardware is.

Here are some of his intermediate results:
   Name                Elo    +    - games score oppo. draws
   Crafty-23.4        2703    4    4 30000   66%  2579   22%
   Crafty-23.3        2693    4    4 30000   65%  2579   22%
   Crafty-23.1        2622    4    4 30000   55%  2579   23%
   Glaurung 2.2       2606    3    3 60277   46%  2636   22%
   Toga2              2599    3    3 60275   45%  2636   23%
   Fruit 2.1          2501    3    3 60248   32%  2636   21%
   Glaurung 1.1 SMP   2444    3    3 60267   26%  2636   17%
   Crafty-10.18       2326   19   19  1327   20%  2580   14%
Here is the calculation to show that software is the bigger contributor:

It's well known that each hardware doubling is worth about 60 ELO of rating improvement. (For example Crafty running on a quad is almost exactly 100 ELO stronger than the single processor equivalent program.)

Bob's test shows that Crafty gained 377 ELO with small error margins. Bob agreed that we should add about 300 ELO to represent true Software advancement because Rybka 4 represents the state of the art in 2010 and it's over 300 ELO stronger than Crafty.

So this test estimates that we have gained 377 + 300 = 677 ELO over a 15 year period.

So the question is how much speed do we need in order to gain 677 ELO if a doubling is worth 60 ELO?

677 / 60 = 11.3 doublings. 11.3 doublings is a factor of 2521. We need a computer well over 2,500 times faster to get 677 ELO.

Bob estimated that hardware increased only 1500 times. Therefore more of the improvement has come from software than hardware using his estimates of hardware improvements.
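
The arithmetic in this estimate can be checked with a short sketch. It assumes only the numbers quoted above (60 Elo per doubling, 377 + 300 Elo of software gain, a 1500x hardware figure); the variable names are illustrative and not taken from anyone's test scripts.

```python
import math

ELO_PER_DOUBLING = 60        # rule of thumb quoted in the post
software_elo = 377 + 300     # Crafty's measured gain plus the assumed gap to Rybka 4
hardware_speedup = 1500      # Bob's nodes-per-second figure

# Doublings of speed needed to buy the same Elo with hardware alone.
doublings_needed = software_elo / ELO_PER_DOUBLING
speedup_needed = 2 ** doublings_needed

# Going the other way: what a 1500x speedup is worth at 60 Elo per doubling.
hardware_elo = math.log2(hardware_speedup) * ELO_PER_DOUBLING

print(f"doublings needed: {doublings_needed:.1f}")
print(f"speedup needed:   {speedup_needed:.0f}x")
print(f"1500x is worth:   {hardware_elo:.0f} Elo")
```

Using the unrounded 677 / 60 gives a factor a little under 2,500; the 2521 in the post comes from rounding to 11.3 doublings first. Either way the conclusion drawn in the post is the same.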

I would like to mention that I believe Bob's numbers are flawed for several reasons I will briefly outline here and in fact the software is even MORE than Bob estimates.

The first reason is that his numbers do not reconcile with a test I did using Rebel. We compared Rebel on old and new hardware. The 1 processor speedup for Rebel is about 100 to 1. Allowing for running on an octal, you could multiply this by 8 to get 800 to 1. For chess, an octal does NOT give you a true 800 to 1 speedup but Bob is using the nodes per second calculation anyway. This number still disagrees with Bob's number by about 2 to 1.

Another reason Bob's numbers are distorted is that he decided arbitrarily which machines should be compared. It's a question of defining something to remain a constant such as price, form factor, etc. For example, we could pick anything you can purchase for less than 1000 bucks, or anything that is called a "workstation" and that you can easily move around. Of all the possible things to remain constant, and with much hand waving, he decided the constant should be that it must be Intel hardware. Of all the possible things to compare, this is the one that exaggerates the difference the most. In 1995 more powerful machines were available than the P90, so calling the P90 state of the art is a joke. But calling the i7 state of the art is not.
Fine. For state of the art I choose Deep Blue. 1,000,000,000 nodes per second. Pick _any_ 1995 platform you want and let's compare speedup. Or turn it around and I pick a Cray T932 which is more computer than any single chip PC today in any type of measurement. So we have 0 hardware improvement.

Or we use the machine that _everybody_ was using in 1995, which was Intel/Windows, and we use the machine that _everybody_ is using today, which is the i3/i5/i7.

Personally, I have no problem determining which is the test to run. Everybody that has run on a T932, raise your hand. Looking around I see _one_ hand up. Everyone that has run on the big SP cluster with special-purpose chess processors, raise your hand. Again, I see _one_ hand up.

The point is that if you ask _anyone_ here what they were using in 1995, from the SSDF list, to ICC, to WMCCC/WCCC events, the most common answer, by probably 30-1 is going to be an Intel PC. That's the machine class almost everyone today is using as well. So the noise about the alphas and such is pure nonsense. Because if we include alphas we have to include every other rarely used box, of which there are many, and they were/are extremely fast. And extremely expensive...
I don't give a hoot about what everyone was running - I wasn't running on a P90 back then, I was running on an Alpha. The issue is about software advancement and hardware advancement. If you want to know how much hardware has advanced since a certain date you have to pick hardware that represent the state of the art. What's so difficult to understand about that?

If you want to compare hardware that doesn't represent the state of the art in 1995 to hardware that does in 2010, then go ahead. Just don't pretend that it's a correct comparison.

If it will make you happy then we can stop saying that we are talking about hardware advancement and say that this is about Intel advances.

In order to measure the hardware difference Bob chose to use 2 different versions of Crafty, both of which are optimized to run on 64 bit systems.

Jeez, Don, can't you at least read and get this right? My speed comparison was with crafty 10.x from 1995. I had numbers for the P5/133, I ran it on my hardware to get the speedup today. Not two different versions. The _exact_ same version. With about two dozen changes to add the xboard protocol changes to make it work on my cluster. Not a single change to the engine itself.
The main problem is the 64 bit vs 32 bit difference. It was just a few hours ago that I found out that you succeeded in recompiling the old version and I was still going by your earlier statements.

However that is a very MINOR issue compared to the 32 vs 64 bit issue.

Why do you insist on continuing to make such a stupid statement (two different versions.) It is _clearly_ false. I doubt a single person here (perhaps excepting yourself) has somehow misunderstood that specific detail, which has been explained enough for anyone to finally see the light.

To emphasize: version 10.x was run in 1995 on 1995 hardware. I had a few test positions that backed up my 30K recollection. I ran the same few positions using that same version, but used my E5345 (single cpu) machine. It ran at 4M nodes per second. I ran crafty 23.4 on the same positions, same processor. Almost exactly the same NPS. I posted the numbers yesterday.

Again: 10.x on P5/133 searches 30K. On a P5/90, 20K. On an E5345 single CPU, 4M. It will take a little work to get the smp search to work, because the pthread library changed from way back and it doesn't compile cleanly. In addition, the lock stuff (xchg lock) has to be modified to work with 64 bit stuff as it uses the wrong register names. Those versions scaled perfectly with NPS, as the current version does. So 30M+ is the expected number. I will verify this once I get the pthread stuff working over the next couple of days.

Should I repeat it one more time? Not two different versions. _same_ version.
No need to repeat - but you need to explain how breaking your own rule is fair all of a sudden. The P133 hardware is 32 bit. The i7 is 64 bit. The chess program is 64 bit. YOU are the one that insists that to be fair the programs being compared should be optimized to run on the hardware they were designed for. The program you are using was NOT optimized for a P133, It was optimized for a future 64 bit machine.

What rule am I breaking? You want to run a program compiled for 1995 hardware (Rebel) and run that _same_ binary on 2010 hardware, and use that number? That _really_ shows what hardware has done? It completely negates the last 8+ years of hardware advances, from 64 bits (I think the first opteron was released in 2003) to multiple cores, to improved CPU design with 8 additional registers. You are completely removing all of that from the equation, and then claiming that _your_ number is representative. That's nonsense.

You don't like my using a program from 1995, that was designed to run on 1995 hardware and be fast, and now because it happens to be able to use the last 8 years of hardware improvements, that is a no-no. It is breaking some mythical rule that you attribute to me but I do not recall ever making such.

You also apparently would not use _any_ 64 bit program from today to do this analysis, you are stuck on using a very old, out-of-date, architecturally inefficient program (rebel) without even trying to get Ed to re-compile it to at least use part of the new hardware.

And you say _I_ am biasing things in my favor. My test is about as good as it gets. Pick a program from 1995 that is still being worked on today. Might be 2-3 of them perhaps. Can you get the source from their 1995 version to compile today for modern hardware? I can, for mine.

I've chosen the most accurate test I can think of. And in every post I have _always_ made the note that this is "Crafty's hardware speedup" or "Crafty's programming improvements" from 1995 to 2010. There is no single magic number that fits all. There is a very precise number that fits me.

I am currently running some 1/2 speed and then 1/4 speed runs to see what 2 doublings do to 10.18. I'm expecting something in the range of 70. Just for fun, here are the current results after 3600 games:
   Crafty-10.18-1     2417    4    4 30000   22%  2655   14%
   Crafty-10.18-2     2416    4    4 30000   22%  2655   14%
   Crafty-10.18       2340   12   12  3613   15%  2655   11%

If you subtract, it looks like -75 Elo if I run 10.18 at 1/2 the normal speed, while leaving all the opponents running at normal speed. Given that, if it holds up, Crafty got +360 from software and 750+ from hardware (using 10 doublings, although it is actually more than 10 by some fraction, something like 10.25). So you might want to re-think what you say my data has proven with respect to which has given the greatest gain. This seems to support the common 2x faster = +70 Elo we have been measuring for years. More once these finish and I get the 1/4 speed results.

As coincidence would have it, Crafty runs on 64 bit hardware and looks especially good on 64 bit hardware. So he looks at some log files and eventually produces the number 1500 as the value for how much hardware has advanced over the last 15 years and claims he is being generous to do that. The log files show the speed of a 1995 Crafty running on 32 bit hardware. But even back then Crafty was designed to run on a 64 bit machine.
That is a false statement. Crafty was designed to use 64 bit values for the bitboards. It was _designed_ to run on 32 bit hardware, which was what we had in the PC world back then. You only have to look at the move generation stuff (COMPACT_ATTACKS, USE_SPLIT_SHIFTS, etc) that was explicitly designed to work efficiently on 32 bit boxes. Yes it gains some on 64 bit hardware. But in 1995 it was most certainly designed to run well on 32 bit hardware.
You designed it for the future, not the present in 1995. You were all over the forum with that 15 years ago.
Nope, sorry. I designed it in 1995 because I wanted to give bitboards a try. After having talked to Slate for years, I set out to see what they could do. I spent a ton of time optimizing things for 32 bit hardware. Yes, I knew 64 bit hardware would one day become the norm. But I did _not_ try to write a program that would be crappy until that happened. Reasonable 64 bit stuff arrived with the AMD opteron in the 2003 time frame somewhere. You _really_ think I wrote something that was not effective, and used it for 8 years, just waiting? Wonder why Slate did 64 bit stuff on a 60 bit CPU and had similar performance issues? Stupidity???

Your program is like Stockfish, it was designed specifically for 64 bit architectures but you did what you could to make it run as well as possible on 32 bit machines. But a bitboard program will never run as well on a 32 bit machine as a mailbox program.
Absolutely false. And easily proven. Name your benchmark and let's go. Move generation speed? The most common requirement in a chess program is to generate captures only, for the q-search that represents 90% of the nodes until we reach simple endgames. Tell me how you efficiently generate just captures, compared to how I do it in bitboards. I generate all pawn moves in one gulp. Tell me how you do that in a mailbox more efficiently. I'm not going to go thru this ridiculous argument. Bitboards are no worse than mailbox on 32 bit hardware. That's pure urban legend. Crafty has been searching as fast as any program around, from 1995 to date. This kind of misinformation doesn't fly and certainly can't be substantiated.
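
As an illustration of the "all pawn moves in one gulp" point, here is a minimal sketch of set-wise pawn move generation. It is not Crafty's code; it assumes a conventional layout (bit 0 = a1, bit 63 = h8, white moving up the board), and the function names are made up for the example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
FILE_A = 0x0101010101010101
FILE_H = 0x8080808080808080

def white_pawn_pushes(pawns, empty):
    """Single pushes for every white pawn at once: shift up one rank, keep empty targets."""
    return (pawns << 8) & empty & MASK64

def white_pawn_captures(pawns, enemy):
    """Capture targets for every white pawn at once, as (toward a-file, toward h-file).

    Pawns on the edge files are masked out first so the shift cannot wrap
    around to the other side of the board.
    """
    left = ((pawns & ~FILE_A) << 7) & enemy & MASK64
    right = ((pawns & ~FILE_H) << 9) & enemy & MASK64
    return left, right

# White pawns on a2 and e4; black pieces on b3, d5 and h2.
pawns = (1 << 8) | (1 << 28)
enemy = (1 << 17) | (1 << 35) | (1 << 15)
left, right = white_pawn_captures(pawns, enemy)
print(f"captures toward a-file: {left:#018x}")
print(f"captures toward h-file: {right:#018x}")
```

A captures-only generator for the q-search just skips the push step. A mailbox generator would typically loop over the pawns and test each target square individually; which approach is faster on a given 32 bit machine is exactly what is being argued here, and this sketch does not settle it.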

For example which 64 bit programs could come close to Fritz and Nimzo in nodes per second 15 year ago on 32 bit hardware?
Crafty certainly did. And Crafty was C and not assembly like Frans used for years.

Hell, I'd bet you dollars to donuts you had 64 bit stuff in your 1995 code. Hashing, anyone? I've always used 64 bit hashing. Back then I did it as two 32 bit chunks, but it would clearly fit 64 bit hardware better. So was _your_ stuff designed for 64 bits only? Didn't think so.

I had 32 bit "mailbox" style program and 64 bit programs. I've done it all, been there done that as they say.

But the 64 bit programs have always run like a dog on 32 bit hardware. You can do some things to improve that situation but you can never quite get the full speed of a true 32 bit program on a 32 bit platform.
Maybe _your_ 64 bit programs "ran like a dog" on 32 bit hardware. Mine did not. Bruce was about as good as it gets when it comes to speed, and we had discussions and measured things all the time. Bitboards were not particularly superior (until you factor in the generate-only-captures issue, or some easy-to-do-using-bitboard evaluation tricks that turn a mailbox loop into a single AND instruction (is this pawn passed?) and such). But they most certainly were _not_ inferior, otherwise Slate would have been just as handicapped, but we know he wasn't. Duchess was a bitboard program, on a 32 bit IBM mainframe. We weren't all idiots back in the 70's and 80's.
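
The "is this pawn passed?" trick mentioned above can be sketched as follows. This is an illustrative reconstruction, not code from Crafty: it assumes bit 0 = a1 and bit 63 = h8, and the table and function names are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
FILE_A = 0x0101010101010101

def passed_mask_white(sq):
    """Squares where an enemy pawn would stop a white pawn on sq from being passed."""
    file, rank = sq % 8, sq // 8
    files = FILE_A << file
    if file > 0:
        files |= FILE_A << (file - 1)
    if file < 7:
        files |= FILE_A << (file + 1)
    # Keep only the ranks strictly in front of the pawn.
    ahead = (MASK64 << (8 * (rank + 1))) & MASK64
    return files & ahead

# Computed once at startup; the per-pawn test is then a lookup plus one AND.
PASSED_WHITE = [passed_mask_white(sq) for sq in range(64)]

def is_passed_white(sq, black_pawns):
    return (PASSED_WHITE[sq] & black_pawns) == 0

# White pawn on e5 (square 36) against a black pawn on d6 (square 43).
print(is_passed_white(36, 1 << 43))
```

The mailbox equivalent would walk up to three files square by square looking for an enemy pawn, which is the loop the post says collapses into a single AND.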

Moore's law is a much perverted, misquoted and reformulated statement of how quickly transistor density changes over the years. I think Moore said that density doubles every 18 months and then way back in 1975 modified his own "law" to every 2 years. It has often been loosely translated as performance doubling every 18 months. This was actually a reformulation based on observation by an Intel colleague of Moore's. In fact, performance on average does NOT double every 18 months, it takes longer. (I have NEVER seen a doubling in performance when I upgrade even every 2 or 3 years although sometimes it's close.)

So Bob's estimate is not in harmony with this (admittedly crude) rule of thumb that nevertheless is widely accepted. Over 15 years even if you assume a full doubling every 18 months you would get a 1024x improvement. I think almost everyone thinks 18 months is on the very generous side.
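
The rule-of-thumb figure is easy to reproduce. The sketch below only restates the arithmetic in the paragraph above; the function name is made up, and the 24-month case is included because the post mentions Moore's later 2-year revision.

```python
def moore_factor(years, months_per_doubling):
    """Performance factor if speed doubled once every months_per_doubling months."""
    doublings = years * 12 / months_per_doubling
    return 2 ** doublings

print(f"18-month doubling over 15 years: {moore_factor(15, 18):.0f}x")
print(f"24-month doubling over 15 years: {moore_factor(15, 24):.0f}x")
```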
Keep saying that to yourself enough, and perhaps you will believe it. But I have no "estimate". I have an absolute measured value. Taking the P5/90 on one end, and a 6-core i7 on the other, the speed increase for Crafty is 1500x. I did not claim that was the speed gain for any other program. I don't care about any other program. I did not 'estimate' a thing, I simply took out my "ruler" and measured both as accurately as possible. What you are talking about is something that might have been typed by a roomful of monkeys, because it is valid words, and somewhat valid grammatical constructions, but the meaning is missing.

So get off the "estimation" and "fabrication" and "exaggeration" bandwagon and offer something useful and logical. I've explained my numbers. Feel free to shoot either the 22K or the 30M numbers down. We can certainly get a 3rd party to verify the 10.x on current hardware. We've already had confirmation by someone running crafty and seeing 22K nps on I think a P5/100mhz. Which is right in line with 20K at 90mhz and 30K at 133mhz.
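
A quick consistency check on the nodes-per-second figures quoted in this thread. This is only a sketch: the dictionary holds the three Pentium numbers given above, the 30M value is the expected (not yet verified) 6-core figure mentioned earlier, and the names are illustrative.

```python
# Clock speed in MHz -> Crafty 10.x nodes per second, as quoted in the thread.
nps_by_mhz = {90: 20_000, 100: 22_000, 133: 30_000}

# If NPS scales with clock speed, nodes per MHz should be roughly constant.
nps_per_mhz = {mhz: nps / mhz for mhz, nps in nps_by_mhz.items()}
spread = max(nps_per_mhz.values()) / min(nps_per_mhz.values())

modern_nps = 30_000_000  # expected 6-core figure from the earlier post
speedup_vs_p90 = modern_nps / nps_by_mhz[90]

for mhz, value in nps_per_mhz.items():
    print(f"P5/{mhz}: {value:.1f} nodes per MHz")
print(f"max/min spread: {spread:.3f}")
print(f"raw NPS speedup vs P5/90: {speedup_vs_p90:.0f}x")
```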

So shoot at what you think is wrong, but don't try to restate what I have done, I have been precise in what I have measured. And it is nothing at all related to what you are claiming I have measured.

It's not your numbers that are off, it's your methodology for the reasons I have stated. You picked a very specific thing to measure, and no doubt measured it accurately, but my contention is that you just measured the wrong thing.
We are measuring hardware speed improvement from 1995 to present. We are measuring software improvement from 1995 to present. What should I measure? Lines of code? Number of conditional jumps? Number of months with "r" in them? Why would I measure anything except what we are talking about?

I am also not yet ready to grant Rybka another +300 Elo on software improvements.

I was expecting that you would try to back out of this sooner or later.
I would try to back out of _anything_ that has no scientific basis. We have no idea about Rybka's background. We know it came from Fruit. But we can't trace it back to 1995. It might be that Rybka would get more from the hardware than I did, had it been started in 1995, and perhaps it might have gotten more (or less) from software improvements. Who knows? Who can measure this? I can at least accurately state what I have gotten, once I get my tests run. And notice that I am running tests, not just posting contradictory argument after contradictory argument...

Look at the numbers on all the ratings list, Crafty is more than 300 ELO weaker than Rybka 4. I'm interested to hear about how you will also find that invalid.
Simple concept. Read carefully. Then re-read before responding. What if Crafty has some significant design flaw that I don't know about? And otherwise could be better (or worse) than Rybka. Do you _know_ where the +300 for Rybka comes from? Bugs in Crafty? Improvements in Rybka? Better use of hardware in Rybka? If you don't know, what does that 300 point gap mean, other than "Rybka is 300 points better, but we don't know whether it is better software, better use of hardware, advances in rybka, bugs in Crafty, etc."

I'm not prepared to just assume facts not in evidence.

It may well be that Crafty has a serious flaw somewhere.
That doesn't affect the +300 figure as you have never been within 300 of Rybka 3 or Rybka 4.
Jeez, put brain in gear before putting fingers in motion. A major bug in Crafty doesn't affect that +300 at all? A major bug could be the _majority_ of that +300 for all anybody knows. Only way to deal with Rybka would be to use 1995 Rybka and 2010 Rybka just as I am doing. Unfortunately, there is no 1995 Rybka. And you would be complaining bitterly anyway because Rybka is a bitboard program and that is grossly unfair to run it on 1995 hardware, according to you.

However it could affect the relative difference in the modern vs the old Crafty if the old one is broken.
Or if the new one is broken. How can anyone say which?

Nevertheless, you have been proved wrong anyway. Even if you are off by 100 ELO the point has already been made that software and hardware are roughly equal in their contributions to the success of computer chess to any reasonable degree of measurement.
I have not been proved wrong at all. At present, it appears that from 1995 to 2010, 2/3 of the Elo came from hardware, 1/3 from software. I've posted the numbers to support this. I will support more accurate numbers later after the tests complete.

By the way, your 1500x figure should be taken as a figure that is too high.
Here is what you said:

Taking the P5/90 on one end, and a 6-core i7 on the other, the speed increase for Crafty is 1500x.
This is a nodes per second increase in speed based on using a modern 6 core machine and comparing it to a single P90. Like I say, this is probably an ACCURATE figure for estimating the nodes per second increase but it's not an accurate figure for measuring how much ELO you should gain which is the relevant point.

Nevertheless, that actually doesn't change the number that much but it does some.

For Crafty, it appears that 2 doublings due to more cores (going from 1 to 4) is worth 50 ELO per doubling. Of course with additional cores it's worth even less.

Ideal would be to run rybka thru the same test I am doing. But that's not an option for me since there is no source available.
If you'd read everything before posting, you will find that I scaled the 1500 back to 1200. The 1500 number is 250x per core for 6 cores. I scaled that by the fairly accurate speedup = 1 + (ncpus -1) * .7. For 6 cores, that turns into 4.5x. 4.5x times 250x = what? About 1200x? Certainly over 1000x?
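
The scaling described above, written out as a sketch. The formula and the 0.7 coefficient are the ones given in the post; the function name is made up for the example.

```python
import math

def smp_speedup(ncpus, efficiency=0.7):
    """Effective search speedup from ncpus cores: 1 + (ncpus - 1) * 0.7."""
    return 1 + (ncpus - 1) * efficiency

per_core_speedup = 250   # single-core NPS gain, i.e. 1500x / 6 cores
effective_speedup = smp_speedup(6) * per_core_speedup

print(f"6-core SMP speedup:  {smp_speedup(6):.1f}x")
print(f"effective speedup:   {effective_speedup:.0f}x")
print(f"as doublings:        {math.log2(effective_speedup):.2f}")
```

The product comes out at 1125x, which the post rounds up to "about 1200x"; both are a little over 10 doublings.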

bob · Post by **bob** » Sun Sep 12, 2010 11:21 pm

Don wrote:
mhull wrote:
Don wrote:The main problem is the 64 bit vs 32 bit difference.
If you believe that, then why are you testing 32-bit Rebel on 64-bit modern hardware? That's rendering Rebel as cripple-ware, because it's not optimized for 64-bit. So it's unfair by your definition.
A 32 bit program is not crippled on a 64 bit machine. Run a 32 bit program on a 32 bit machine and then time it on a 64 bit machine and you will see it runs just as well.

Does not. 32 bit has 4 general registers, eax-edx. 64 bit has 12 general registers, rax-rdx + r8-r15. That provides a significant advantage even for pure 32 bit programs. Dump some gcc output to see how many times it runs out of registers and has to spill 'em to memory or the stack. Then look at the 64 bit code. Significantly faster even without taking advantage of the 2x data density that is possible.

Then do the same experiment with a 64 bit program and your eyes will be opened.

There is this argument that 32 bit is not the right way to write a program that runs on a 64 bit machine. But I don't think anyone has actually proved that. It's difficult to prove because it's a whole different way of writing a program so you cannot just compare 2 programs.

The primary argument in favor of 64 bit is Rybka, the best program happens to be 64 bit. But I have no doubt whatsoever that had Rybka chosen the 32 bit way it would still be the strongest program.

My personal belief? I think 64 bit is probably a slight advantage on 64 bit hardware, but 64 bit programs are mostly a fad inspired by the fact that Rybka is 64 bit. There is no proof either way. If something came out much stronger than Rybka and it was written as a 32 bit program, you would almost certainly see a bunch of new 32 bit programs.

That's about the most uninformed statement I have read here in years. So Rybka started the 64 bit stuff?

You do realize that we have had _dozens_ of bitboard programs, all done _way_ before Rybka went to bitboards? All of the rotated bitboard stuff I did, Pradu's magic move generator stuff, Gerd's stuff, all done before Vas could spell bitboards. I think that a few dozen of us will be laughing at that statement for a couple of weeks, at least... All the way back to Slate, Donskoy (Kaissa) and everywhere in between...

Most of the stuff in computer chess is inspired by a combination of fad and what works. When it's not clear authors go with what "fruit" or some other program does.

No matter what program you choose, the same hardware boundary becomes an issue according to your argument.

I agree. The Rebel comparison is not fair and the Crafty comparison is not fair either.

Why can't you just measure speedup, regardless of hardware, in terms of 10x, 20x, 100x, etc. and ELO at those speedups for the same version? Isn't this what it all really boils down to?

Or do you just like to argue for its own sake?
We can measure speedup on any individual program and get an accurate number FOR THAT PROGRAM but it is no good for measuring how much the hardware has improved in general.

Apparently the light is _slowly_ coming on for you. We've been saying this all along. Hence my continual reference, explicitly, to "Crafty's hardware speedup" and "Crafty's software improvement" as measured using a 1995 version of "Crafty" and a 2010 version of "Crafty". And hence my statement that my numbers are not guesswork, but are highly accurate, "for Crafty".

As you already can clearly see the Rebel speedup comes out different than the Crafty speedup.

Architecture 101 question: Take two machines separated in time by 15 years. Run a program that was compiled for the 1995 hardware. Run it on the 2010 hardware without re-compiling, so that the old object code runs without any of the new features: CMOV to remove branches, cache and pre-fetch hints, extra registers, and instructions that are much faster today but could not be used in 1995 because they were so blasted slow (bsf/bsr). What is the likely result? Will the new hardware look great or like a piece of crap? This is what you have done.

Alternative: take an efficient/fast program for 1995, recompile it for 2010 hardware. How will the 2010 hardware look? Great or crappy?

Extra credit: what if the 1995 program jumps thru a few hoops to use 64 bit values, and now that we have 64 bit processors, those extra "hoops" go away, should that be considered in comparing the two machines or not?

I know that any of my students could answer the above reasonably, provide rational explanations, and come to the natural conclusions that most of us have reached.

Unfortunately, we have been struggling with this because Rebel isn't representative because it was not recompiled for modern hardware and Crafty is not representative because it does not represent the best way to write a 32 bit program. Bob's rule here is that you should use the best representative program for the hardware.

I believe that in 1995, the bitboard approach was _still_ the best way to write a chess engine. I stated that _many_ times. You want to frame this as "Bob said he was designing for the future." In reality, "Bob designed using what he considered the best approach available at the time, and also knew that once 64 bit stuff came along, there would be an additional performance gain." Cray Blitz was mailbox. I know how to do mailbox fast, because we did it fast. Bitboards were better, even on 32 bit hardware. Saying anything else is simply making an uninformed comment that is incorrect. Bitboards might not have been wildly superior, but they _were_ superior in 1995. Today they are even more superior, although mailbox programs still work just fine and can be amazingly strong. But better is better.

This point is so clear I don't see how you are not getting it. What we SHOULD be arguing is whether Crafty is representative or not. Bob is now trying to make a claim that it is. Even though I disagree on this, at least it's an appropriate question.

I am not trying to claim that Crafty is anything but "Crafty". I can answer the two questions posed (hardware vs software gain) quite accurately. Tough to find any other program that existed in 1995 and still exists today. Even tougher to find source for that 1995 version. I just happened to have one.

Update:

   Crafty-10.18-2       2416    4    4 30000   22%  2655   14% 
   Crafty-10.18         2340    5    5 18109   15%  2655   11%

So far, a doubling (or halving since the current 10.18 is using 1/2 the time that the previous runs used) is costing me 76 Elo. If this holds up and the /4 and /8 runs maintain this, my 10 doublings in speed are going to be worth 760 Elo...

Hardware is twice as important as software for Crafty. If we throw in the suspect +300 number for Rybka, hardware is still +100 ahead, if this holds up. More later..
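
The bookkeeping behind that claim, as a sketch. It assumes the provisional 76 Elo per doubling holds across all 10 doublings (the /4 and /8 runs had not finished at this point), and it uses the +360 software figure for Crafty and the disputed +300 Rybka gap quoted earlier in the thread. Variable names are illustrative.

```python
elo_per_doubling = 76       # provisional, from the half-speed 10.18 run
doublings = 10              # roughly 1000x+ in speed
hardware_elo = elo_per_doubling * doublings

crafty_software_elo = 360   # 10.18 -> 23.4 on equal hardware
rybka_gap_elo = 300         # the "suspect" extra software credit

ratio = hardware_elo / crafty_software_elo
margin_with_rybka = hardware_elo - (crafty_software_elo + rybka_gap_elo)

print(f"hardware: {hardware_elo} Elo")
print(f"hardware / software (Crafty only): {ratio:.2f}")
print(f"hardware lead with Rybka credit:   {margin_with_rybka} Elo")
```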

mhull · Post by **mhull** » Sun Sep 12, 2010 11:47 pm

Don wrote:
mhull wrote:
Don wrote:The main problem is the 64 bit vs 32 bit difference.
If you believe that, then why are you testing 32-bit Rebel on 64-bit modern hardware? That's rendering Rebel as cripple-ware, because it's not optimized for 64-bit. So it's unfair by your definition.
A 32 bit program is not crippled on a 64 bit machine. Run a 32 bit program on a 32 bit machine and then time it on a 64 bit machine and you will see it runs just as well.

Then do the same experiment with a 64 bit program and your eyes will be opened.

Don wrote: There is this argument that 32 bit is not the right way to write a program that runs on a 64 bit machine. But I don't think anyone has actually proved that. It's difficult to prove because it's a whole different way of writing a program so you cannot just compare 2 programs.

Usually, sowing a little FUD is a resource for a not-very-strong argument. I'll sprinkle some agent orange on it by saying most people think crafty is one of the faster searchers. Sure, it's not proof, just like your doubts.

Don wrote:The primary argument in favor of 64 bit is Rybka,

It could hardly have been so in the 1990s when crafty was born.

Don wrote: the best program happens to be 64 bit. But I have no doubt whatsoever that had Rybka chosen the 32 bit way it would still be the strongest program.

My personal belief? I think 64 bit is probably a slight advantage on 64 bit hardware, but 64 bit programs are mostly a fad inspired by the fact that Rybka is 64 bit. There is no proof either way. If something came out much stronger than Rybka and it was written as a 32 bit program, you would almost certainly see a bunch of new 32 bit programs.

Most of the stuff in computer chess is inspired by a combination of fad and what works. When it's not clear authors go with what "fruit" or some other program does.

Don wrote:

No matter what program you choose, the same hardware boundary becomes an issue according to your argument.

I agree. The Rebel comparison is not fair and the Crafty comparison is not fair either.

Why can't you just measure speedup, regardless of hardware, in terms of 10x, 20x, 100x, etc. and ELO at those speedups for the same version? Isn't this what it all really boils down to?

Or do you just like to argue for its own sake?
We can measure speedup on any individual program and get an accurate number FOR THAT PROGRAM but it is no good for measuring how much the hardware has improved in general.

As you already can clearly see the Rebel speedup comes out different than the Crafty speedup.

Unfortunately, we have been struggling with this because Rebel isn't representative because it was not recompiled for modern hardware and Crafty is not representative because it does not represent the best way to write a 32 bit program. Bob's rule here is that you should use the best representative program for the hardware.

This point is so clear I don't see how you are not getting it. What we SHOULD be arguing is whether Crafty is representative or not. Bob is now trying to make a claim that it is. Even though I disagree on this, at least it's an appropriate question.

Trust me, I grok your point loud and clear. I just don't agree that it's relevant. The goal (as I understand it) is to gauge what portion of a chess project's gain in strength can be assigned to faster speed and what portion can be assigned to smarter software.

IMO, your argument goes astray by focusing on hardware speed rather than chess engine speed (NPS, time-to-ply, etc.). It's that speed we should be using to make a correspondence to ELO. For that version of the software, there should be a curve that maps the function of engine speed to ELO. That curve might actually be complex, but the only speed that matters is the software speed. That one project sped up more over time than another because of sub-optimal design in the past compared to the present isn't important. All that matters is the function curve of speed to ELO.

What's important is the resulting speed (NPS, time-to-ply, whatever.) and the ELO that results from that "speed". Those are the data points that should be mapped for any particular version of a project. Then you can compare the curves of two versions of a project (like crafty) and decide how much of the stronger version's ELO belongs to software improvement.

Indexing those speed points to a particular year of technology isn't even important. All you want to know is how much ELO belongs to speed and how much to software improvements.

All this will be different for each chess project. So for Rebel, it doesn't matter that you can't speed it up as much as you can Crafty. All that matters is to plot the speed points you're able to measure and see the resulting curve on a graph. Do this for multiple versions of a project and compare. That's it.

I hope that clears up my position on the topic.
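
The comparison described above can be sketched in a few lines. This is only an illustration under assumptions the post does not make: it fits a straight line of ELO against log2(speed), although the post notes the real curve may be more complex; the function names are hypothetical; and the data points are invented placeholders, not measurements from any engine.

```python
import math

def fit_elo_curve(points):
    """Least-squares fit of elo = a + b * log2(speed) to (speed, elo) pairs.

    Assumption: a straight line in log2(speed) is good enough for a sketch.
    """
    xs = [math.log2(speed) for speed, _ in points]
    ys = [elo for _, elo in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx              # ELO gained per doubling of engine speed
    intercept = mean_y - slope * mean_x
    return intercept, slope

def elo_at(curve, speed):
    """ELO predicted by a fitted curve at a given engine speed multiplier."""
    intercept, slope = curve
    return intercept + slope * math.log2(speed)

def software_gain(old_curve, new_curve, speed):
    """ELO gap between two versions of one project at the same engine speed."""
    return elo_at(new_curve, speed) - elo_at(old_curve, speed)

# Placeholder (speed multiplier, ELO) points for two versions of one project.
old_points = [(1, 2000), (2, 2070), (4, 2140)]
new_points = [(1, 2300), (2, 2360), (4, 2420)]

old_curve = fit_elo_curve(old_points)
new_curve = fit_elo_curve(new_points)
print("ELO per doubling, old version:", old_curve[1])
print("ELO per doubling, new version:", new_curve[1])
print("software gain at 4x speed:", software_gain(old_curve, new_curve, 4))
```

Whatever part of the newer version's rating is not explained by the gap at equal speed is what would be credited to speed.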

Uri Blass · Post by **Uri Blass** » Mon Sep 13, 2010 12:06 am

Some comments:
1) Doubling and halving are not the same because of diminishing returns.

If you make Crafty 1024 times slower, I expect a bigger Elo difference than if you make it 1024 times faster.

Suppose you use the time control from your cluster, which cannot be a very slow time control.

If you start with Crafty-10.18 and make it 1024 times slower, I expect it to lose more than 750 Elo, and if you make it 1024 times faster, I expect it to gain less than 750 Elo (I would not be surprised by numbers like 850 Elo and 650 Elo).

2) I am afraid that you cannot get reliable numbers if you test against significantly stronger opponents.

Crafty-10.18 scores 22% against the opponents, while the version that is twice as slow scores 15%.

You can get some Elo estimate for the difference, but I think that it is better to use weaker opponents to get a better Elo estimate.
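
To make point 2 concrete, here is a small sketch that converts score percentages into rating differences. It assumes the plain logistic Elo formula, which may differ from the model used by the rating tool in these tests (draws, for instance, may be handled differently), and the function names are hypothetical.

```python
import math

def elo_diff_from_score(score):
    """Rating difference (own minus opponents) implied by a score fraction.

    Assumption: plain logistic Elo model, valid for 0 < score < 1.
    """
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_per_score_point(score):
    """Elo change per one percentage point of score, at the given score level."""
    return 400.0 / (math.log(10) * score * (1.0 - score)) / 100.0

normal = elo_diff_from_score(0.22)       # the 22% result quoted above
half_speed = elo_diff_from_score(0.15)   # the 15% result quoted above
print(f"22% -> {normal:.0f} Elo, 15% -> {half_speed:.0f} Elo, gap {normal - half_speed:.0f} Elo")
print(f"Elo per score point at 50%: {elo_per_score_point(0.50):.1f}")
print(f"Elo per score point at 15%: {elo_per_score_point(0.15):.1f}")
```

Under this model one percentage point of score is worth roughly twice as many Elo at a 15% score as at 50%, so the same sampling noise in the score turns into a larger Elo error when the opponents are much stronger.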

bob · Post by **bob** » Mon Sep 13, 2010 12:23 am

Uri Blass wrote:Some comments:
1) Doubling and halving are not the same because of diminishing returns.

You do realize that if we start at 1995 speed and double every time it is appropriate, we get to 2010 speed? And if we start at 2010 speed and halve, we eventually get back to 1995, _exactly_ back to 1995. I just don't want to try games at 1995 speed if I can avoid it, because the times shrink to sub-millisecond units and I can't deal with that. I am running the current test simply to get an approximate idea of what happens. I am then going to have to play a much slower match and start halving the time for Crafty, to get perhaps a more accurate idea. In 1995 I would certainly have tested with 10s + 0.1s time controls, but I would not have tested with 1/1000th of that, which is what I would have to do to compare to 10+.1 today... I intend to change, but wanted to see if the doubling Elo for old Crafty was still +70 or so...

current results:

   Crafty-10.18-1       2440    4    4 30000   22%  2678   14% 
   Crafty-10.18-2       2440    4    4 30000   22%  2678   14% 
   Crafty-10.18         2366   10   10  5060   16%  2679   11% 
   Crafty-10.18-3       2364    4    4 30000   16%  2678   11%

-1 and -2 are normal-time games. -3 (and the one without a suffix, i.e. -4) are 1/2-time games. So the first step down to 1/2 of current speed cost me about 76 Elo. I really can't go very far down, because 10.18 at those slow speeds will become too weak to measure accurately with my current testing group...
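
As a rough worked example of where one measured halving leads, the sketch below extends the 76 Elo step from the table above to the roughly 1000x speed gap mentioned in this post. It assumes every halving costs the same amount, which is exactly the assumption this thread doubts, so the result is a flat-rate illustration and not a prediction. The function names are hypothetical.

```python
import math

def doublings(speed_factor):
    """Number of doublings (or halvings) that make up a given speed factor."""
    return math.log2(speed_factor)

def flat_elo_estimate(speed_factor, elo_per_doubling):
    """Elo change for a speed factor, assuming a constant gain per doubling."""
    return doublings(speed_factor) * elo_per_doubling

# Ratings taken from the table above: normal-time vs. half-time Crafty-10.18.
measured_halving_cost = 2440 - 2364

print("halvings in a 1000x slowdown:", round(doublings(1000), 2))
print("flat-rate Elo cost of 1000x:", round(flat_elo_estimate(1000, measured_halving_cost)))
```

If each halving costs more as the engine gets slower, the true figure would be larger than this flat-rate number.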

If you make Crafty 1024 times slower, I expect a bigger Elo difference than if you make it 1024 times faster.

I can't make it 1024 times faster. We are talking about hardware 1000x slower, so there is little choice in which way to go...

Suppose you use the time control from your cluster, which cannot be a very slow time control.

If you start with Crafty-10.18 and make it 1024 times slower, I expect it to lose more than 750 Elo, and if you make it 1024 times faster, I expect it to gain less than 750 Elo (I would not be surprised by numbers like 850 Elo and 650 Elo).

Slower is all that makes sense. We are talking about 1995 hardware which is way slower than today. Not 2025 hardware that might be way faster.

2) I am afraid that you cannot get reliable numbers if you test against significantly stronger opponents.

There I agree. However, if I can get 2-3 numbers I can do a pretty good estimate, because I generally believe that each doubling gives a little less than the previous doubling. Going backward, the doubling numbers I actually measure will probably be lower than the ones I can't simulate, but at least it is a reasonable estimate. I suspect that the first doubling in 1995 probably would have given +100, and I remember that number being used.
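
One way to picture "each doubling gives a little less than the previous doubling" is the sketch below. The geometric shape is purely an assumption made for illustration; the only inputs taken from this thread are the recalled +100 for the first doubling, the 76 Elo measured for the most recent one, and about 10 doublings for a 1000x speedup. The function name is hypothetical.

```python
def geometric_doubling_gains(first_gain, last_gain, n_doublings):
    """Per-doubling Elo gains shrinking by a constant ratio from first to last.

    Assumption: geometric decay. The thread only says each doubling is worth
    "a little less" than the one before, not what shape the decay has.
    """
    if n_doublings < 2:
        raise ValueError("need at least two doublings")
    ratio = (last_gain / first_gain) ** (1.0 / (n_doublings - 1))
    return [first_gain * ratio ** i for i in range(n_doublings)]

gains = geometric_doubling_gains(first_gain=100.0, last_gain=76.0, n_doublings=10)
total = sum(gains)
print("per-doubling gains:", [round(g) for g in gains])
print("total over 10 doublings:", round(total))
```

The total necessarily lands between the flat extremes of 10 x 76 and 10 x 100; two or three more measured halvings would show whether a decay like this is even roughly the right shape.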

Crafty-10.18 scores 22% against the opponents, while the version that is twice as slow scores 15%.

You can get some Elo estimate for the difference, but I think that it is better to use weaker opponents to get a better Elo estimate.

Tried. Not easy to find programs that make cleanly on the cluster, and then actually do a decent job with winboard protocol and don't hang or crash...

michiguel · Post by **michiguel** » Mon Sep 13, 2010 12:46 am

mhull wrote:
Don wrote:
mhull wrote:
Don wrote:The main problem is the 64 bit vs 32 bit difference.
If you believe that, then why are you testing 32-bit Rebel on 64-bit modern hardware? That's rendering Rebel as cripple-ware, because it's not optimized for 64-bit. So it's unfair by your definition.
A 32 bit program is not crippled on a 64 bit machine. Run a 32 bit program on a 32 bit machine and then time it on a 64 bit machine and you will see it runs just as well.

Then do the same experiment with a 64 bit program and your eyes will be opened.

Don wrote: There is this argument that 32 bit is not the right way to write a program that runs on a 64 bit machine. But I don't think anyone has actually proved that. It's difficult to prove because it's a whole different way of writing a program so you cannot just compare 2 programs.
Usually, sowing a little FUD is a resource for a not-very-strong argument. I'll sprinkle some agent orange on it by saying most people think crafty is one of the faster searchers. Sure, it's not proof, just like your doubts.

Don wrote:The primary argument in favor of 64 bit is Rybka,
It could hardly have been so in the 1990s when crafty was born.

Because there was none in the 90's!

Bob was heavily criticized for this design choice, and day after day we had to hear the argument "How many bitboard programs are in the top 10?" blah, blah, blah. Few believed in bitboards, or at least, it was a common strategy to disregard the technique as not practical. Many times I suspected it was a commercial strategy to disregard what you were not doing, but that was my impression. I followed this argument very closely over the years because I started to program my move generator by 1997 in bitboards. I never did anything different and I was very interested to hear the comments (now bob extrapolates that I criticize his choice).

Rybka was finally the break that FINALLY convinced people that the technique could lead a program to the top. If the #1 is a bitboard program, of course, the previous statement was proven.

Miguel

Don wrote: the best program happens to be 64 bit. But I have no doubt whatsoever that had Rybka chosen the 32 bit way it would still be the strongest program.

My personal belief? I think 64 bit is probably a slight advantage on 64 bit hardware, but 64 bit programs are mostly a fad inspired by the fact that Rybka is 64 bit. There is no proof either way. If something came out much stronger than Rybka and it was written as a 32 bit program, you would almost certainly see a bunch of new 32 bit programs.

Most of the stuff in computer chess is inspired by a combination of fad and what works. When it's not clear authors go with what "fruit" or some other program does.

Don wrote:

No matter what program you choose, the same hardware boundary becomes an issue according to your argument.

I agree. The Rebel comparison is not fair and the Crafty comparison is not fair either.

Why can't you just measure speedup, regardless of hardware, in terms of 10x, 20x, 100x, etc. and ELO at those speedups for the same version? Isn't this what it all really boils down to?

Or do you just like to argue for its own sake?
We can measure speedup on any individual program and get an accurate number FOR THAT PROGRAM but it is no good for measuring how much the hardware has improved in general.

As you already can clearly see the Rebel speedup comes out different than the Crafty speedup.

Unfortunately, we have been struggling with this because Rebel isn't representative because it was not recompiled for modern hardware and Crafty is not representative because it does not represent the best way to write a 32 bit program. Bob's rule here is that you should use the best representative program for the hardware.

This point is so clear I don't see how you are not getting it. What we SHOULD be arguing is whether Crafty is representative or not. Bob is now trying to make a claim that it is. Even though I disagree on this, at least it's an appropriate question.
Trust me, I grok your point loud and clear. I just don't agree that it's relevant. The goal (as I understand it) is to gauge what portion of a chess project's gain in strength can be assigned to faster speed and what portion can be assigned to smarter software.

IMO, your argument goes astray by focusing on hardware speed rather than chess engine speed (NPS, time-to-ply, etc.). It's that speed we should be using to make a correspondence to ELO. For that version of the software, there should be a curve that maps engine speed to ELO. That curve might actually be complex, but the only speed that matters is the software speed. That one project sped up more over time than another because of sub-optimal design in the past compared to the present isn't important. All that matters is the curve of speed to ELO.

What's important is the resulting speed (NPS, time-to-ply, whatever.) and the ELO that results from that "speed". Those are the data points that should be mapped for any particular version of a project. Then you can compare the curves of two versions of a project (like crafty) and decide how much of the stronger version's ELO belongs to software improvement.

Indexing those speed points to a particular year of technology isn't even important. All you want to know is how much ELO belongs to speed and how much to software improvements.

All this will be different for each chess project. So for Rebel, it doesn't matter that you can't speed it up as much as you can Crafty. All that matters is to plot the speed points you're able to measure and see the resulting curve on a graph. Do this for multiple versions of a project and compare. That's it.

I hope that clears up my position on the topic.

rbarreira · Post by **rbarreira** » Mon Sep 13, 2010 1:46 am

You know... regardless of some details on how the tests are conducted, and regardless of the (if I understood correctly) issue that Crafty was nearer to the state of the art in 1995 than now, I think a lot of us have already learned something from these tests... that software did contribute many hundreds of elo points in improvements.

At least to me it was a bit surprising, since I've often seen claims that there weren't many improvements in the last 1-2 decades other than null move pruning and late move reductions (and only the latter was after 1995). I don't think hundreds of elo points can be attributed to the latter, so there must have been many other important improvements, or else a ton of small incremental improvements.

Is that a fair interpretation of the results so far?

bob · Post by **bob** » Mon Sep 13, 2010 2:05 am

michiguel wrote:
mhull wrote:
Don wrote:
mhull wrote:
Don wrote:The main problem is the 64 bit vs 32 bit difference.
If you believe that, then why are you testing 32-bit Rebel on 64-bit modern hardware? That's rendering Rebel as cripple-ware, because it's not optimized for 64-bit. So it's unfair by your definition.
A 32 bit program is not crippled on a 64 bit machine. Run a 32 bit program on a 32 bit machine and then time it on a 64 bit machine and you will see it runs just as well.

Then do the same experiment with a 64 bit program and your eyes will be opened.

Don wrote: There is this argument that 32 bit is not the right way to write a program that runs on a 64 bit machine. But I don't think anyone has actually proved that. It's difficult to prove because it's a whole different way of writing a program so you cannot just compare 2 programs.
Usually, sowing a little FUD is a resource for a not-very-strong argument. I'll sprinkle some agent orange on it by saying most people think crafty is one of the faster searchers. Sure, it's not proof, just like your doubts.

Don wrote:The primary argument in favor of 64 bit is Rybka,
It could hardly have been so in the 1990s when crafty was born.

Because there was none in the 90's!

Bob was heavily criticized for this design choice, and day after day we had to hear the argument "How many bitboard programs are in the top 10?" blah, blah, blah. Few believed in bitboards, or at least, it was a common strategy to disregard the technique as not practical. Many times I suspected it was a commercial strategy to disregard what you were not doing, but that was my impression. I followed this argument very closely over the years because I started to program my move generator by 1997 in bitboards. I never did anything different and I was very interested to hear the comments (now bob extrapolates that I criticize his choice).

Rybka was finally the break that FINALLY convinced people that the technique could lead a program to the top. If the #1 is a bitboard program, of course, the previous statement was proven.

Miguel

Sorry, but this is wrong. Chess 4.x was bitboard in 1974. As were the Russians with Kaissa. As was Tom Truscott with Duchess in 1977. Bitboards have been around since the beginning of computer chess. We had dozens of bitboard programs well before Rybka became yet another bitboard program. Rybka was not "the beginning". It was well after "the end" of the debate.

bob · Post by **bob** » Mon Sep 13, 2010 2:13 am

rbarreira wrote:You know... regardless of some details on how the tests are conducted, and regardless of the (if I understood correctly) issue that Crafty was nearer to the state of the art in 1995 than now, I think a lot of us have already learned something from these tests... that software did contribute many hundreds of elo points in improvements.

At least to me it was a bit surprising, since I've often seen claims that there weren't many improvements in the last 1-2 decades other than null move pruning and late move reductions (and only the latter was after 1995). I don't think hundreds of elo points can be attributed to the latter, so there must have been many other important improvements, or else a ton of small incremental improvements.

Is that a fair interpretation of the results so far?

Absolutely, if you keep your eyes open. For example, SMP search is a significant source of Elo. But it was a pre-1995 advance. Ditto for null-move. We were using null-move R=2 in 1995 and I had started experimenting with R=2 near the tips, R=3 near the root (adaptive NM). We had futility pruning although I believe this specific version was not using it. If you look at the comments in main.c for current versions, futility was in and out starting in version 6.0 or so... Even eval ideas like mobility, passed pawn races (square of the king) and such were being used in the 70's...
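
For readers unfamiliar with the adaptive null-move idea mentioned above (R=2 near the tips, R=3 near the root), here is a minimal sketch. It is not Crafty's code: the function names and the `switch_depth` threshold are hypothetical and chosen only to illustrate the shape of the rule.

```python
def null_move_reduction(remaining_depth, switch_depth=6):
    """Null-move depth reduction R for a node with the given remaining depth.

    Large remaining depth means the node is near the root, so R=3 is used;
    small remaining depth means it is near the tips, so R=2 is used.
    Assumption: switch_depth=6 is an arbitrary illustrative threshold.
    """
    return 3 if remaining_depth > switch_depth else 2

def null_move_search_depth(remaining_depth, switch_depth=6):
    """Depth of the reduced search that follows the null move.

    One ply is spent on the null move itself, then R more plies are skipped.
    """
    return remaining_depth - 1 - null_move_reduction(remaining_depth, switch_depth)

for depth in (3, 6, 7, 12):
    print(depth, null_move_reduction(depth), null_move_search_depth(depth))
```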

So the biggies seem to be the reduction stuff (LMR). More aggressive forward-pruning in the last few plies. Maybe an eval trick here and there. I'll try to compile a more complete list, but when you look at 1995 to present, the changes are hardly "huge"...

Hardware, on the other hand, was. Or at least _I_ consider a 1000x speedup "significant".

michiguel · Post by **michiguel** » Mon Sep 13, 2010 3:48 am

bob wrote:
michiguel wrote:
mhull wrote:
Don wrote:
mhull wrote:
Don wrote:The main problem is the 64 bit vs 32 bit difference.
If you believe that, then why are you testing 32-bit Rebel on 64-bit modern hardware? That's rendering Rebel as cripple-ware, because it's not optimized for 64-bit. So it's unfair by your definition.
A 32 bit program is not crippled on a 64 bit machine. Run a 32 bit program on a 32 bit machine and then time it on a 64 bit machine and you will see it runs just as well.

Then do the same experiment with a 64 bit program and your eyes will be opened.

Don wrote: There is this argument that 32 bit is not the right way to write a program that runs on a 64 bit machine. But I don't think anyone has actually proved that. It's difficult to prove because it's a whole different way of writing a program so you cannot just compare 2 programs.
Usually, sowing a little FUD is a resource for a not-very-strong argument. I'll sprinkle some agent orange on it by saying most people think crafty is one of the faster searchers. Sure, it's not proof, just like your doubts.

Don wrote:The primary argument in favor of 64 bit is Rybka,
It could hardly have been so in the 1990s when crafty was born.

Because there was none in the 90's!

Bob was heavily criticized for this design choice, and day after day we had to hear the argument "How many bitboard programs are in the top 10?" blah, blah, blah. Few believed in bitboards, or at least, it was a common strategy to disregard the technique as not practical. Many times I suspected it was a commercial strategy to disregard what you were not doing, but that was my impression. I followed this argument very closely over the years because I started to program my move generator by 1997 in bitboards. I never did anything different and I was very interested to hear the comments (now bob extrapolates that I criticize his choice).

Rybka was finally the break that FINALLY convinced people that the technique could lead a program to the top. If the #1 is a bitboard program, of course, the previous statement was proven.

Miguel

Sorry, but this is wrong. Chess 4.x was bitboard in 1974. As were the Russians with Kaissa. As was Tom Truscott with Duchess in 1977. Bitboards have been around since the beginning of computer chess. We had dozens of bitboard programs well before Rybka became yet another bitboard program. Rybka was not "the beginning". It was well after "the end" of the debate.

Very well known facts, but irrelevant to what I wrote.

Rybka was not after the end of the debate, Rybka was the end of the debate.

Miguel
