Deep Blue vs Rybka

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Don wrote:
mhull wrote:
Don wrote:
bob wrote:
Don wrote:Bob,

I'm not following this too closely any longer. I don't know to what extent you have taken these 2 things into consideration - maybe you already have but if not, here goes:

Crafty gets 100 ELO going from 1 to 4 processors. That is 2 doublings, which means you get 50 ELO per doubling. If you go with MORE processors you get even less ELO per doubling. So the point is that you cannot mix and match any way you want to and call it science. I'm not saying you are doing that, as I am only quickly skimming these discussions. So if you talk about nodes per second, number of cores, or speedup per core, you have to separate them and make sure you are being scientifically rigorous, at least as much as tests like this can permit.
That is not "two doublings". This is, once again, apples and oranges. SMP overhead comes in and this changes things.
I didn't say you were doing anything wrong here, I'm only making the point that we must use great care. And since I don't have time to carefully parse all the posts flooding in right now and answer every point, I just wanted to remind everyone involved of this.

For example, it would be wrong to test a 1 cpu program against a 4 cpu program and say, "even after 2 doublings we only get 100 ELO", but then later say a doubling in hardware is worth a full 60 or 70 ELO without distinguishing what KIND of doubling it was.

At the very beginning I stressed that we should not even be considering MP programs in all of this - it can be worked out later. Keep it simple, stupid: the KISS principle. Otherwise it gets terribly confusing. What you should have done is estimate the 1 cpu hardware improvements over 15 years, then the 1 cpu software improvement over 15 years, and left it at that. But I feel that you changed the point of reference in each case to suit whatever point you happened to be trying to make at the time. Whether you did or did not, you made it really confusing.

So let's please keep this real simple and leave out MP completely. Do the 1 core calculation only for old and new programs. THEN we can see how much we get for 6 more cores in 2010.

I cannot help but feel that your argument is really weak when you now find it necessary to talk about the inferiority of various SMP machines and push to invalidate the results of the rating agencies because of this.
So, in 1997, we (in principle) should have been comparing Deep Blue on one node against the competition?
Look at what I wrote. What I advocate is not ignoring MP, but breaking the problem into more easily resolvable issues first. It's a lot easier to ignore MP and then factor it in later. We get a firm number we all agree on (yeah, right) and then we can attribute 50 ELO per SMP doubling to hardware (after arguing about it for a couple of days first).

<cynical talk>The only problem with doing that is that it is too simple, and it makes it more difficult to construct biased experiments.</cynical talk>

Then simply don't construct them. If you look at my results, as posted, there is no SMP used at all. I did add a factor of 4.5x, which is a _real_ number and not a guess, to extrapolate from P5/90 to the best single-chip box today (and there are much better options than single-chip boxes available; both of our clusters have dual-chip nodes, which are not exactly ritzy in price for a single node).

So please precisely show me where my experiment was biased in any way. And let's stop jumping around from thread to thread. The best place to discuss this specific topic is in the thread I started where I posted the software improvement, and the doubling results used to extrapolate hardware improvement.

Waiting for the bias finger to precisely pinpoint a flaw...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Don wrote:Bob,

I tried the link you gave me and it's not working. Is it working for anyone else or is there something wrong with my connection?
I just did this:

ftp ftp.cis.uab.edu
login: anonymous (or you can use "ftp" without quotes)
password: hyatt@cis.uab.edu (enter your email)
cd pub/hyatt/source
ls crafty*x.tar
227 Entering Passive Mode (138,26,66,6,74,169)
150 Here comes the directory listing.
-rw-r--r-- 1 500 30 7260160 Sep 14 10:52 Crafty-10.x.tar

So it works for me...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Milos wrote:
bob wrote:I can absolutely guarantee you that for what I have done so far, which stops at 64 CPUs, my linear formula is very accurate. For 1-8 cpus it is actually pessimistic. You can find a discussion about 1-8 cpu speedups a few years ago when I was running on an 8-cpu opteron box. I ran the test positions Vincent asked for, and posted the results on my ftp box. Several people looked at them, and Martin F. took the time to go thru several hundred positions to compute the speedup; he found that my estimation formula was a bit off. For example, for 4 CPUs the real speedup was 3.3 while predicted was 3.1; for 8, real was 6.3 (I believe) and predicted was 5.9. Since then, I have had the chance to run on 12, 24, 32 and 64 cpu boxes, and found the formula to fit. When I tested on 32 and 64, it took some significant tweaking to get reasonable numbers. That was 2+ years ago, however.
Sure you ran test positions.
Then why don't you just use test positions to optimize the strength of your engine, instead of running test matches???
I say test positions are nonsense and you can't measure program strength with them.
You have never shown test-match data.
If you are so convinced, why don't you run the test I proposed up there? It is very easy to run. ;)
If you knew anything about parallel programming you would know the answer. The concept of "speedup" applies to a position, not to a game. How do you compute the "speedup" for a game? I know how to compute the speedup for a position. And, just to keep this on a sane level, this is the way _everybody_ has been reporting speedup for 40 years now.

I do not run test positions for the results I show here, however. I run 3,000 starting positions, taken 12 moves into GM games, all duplicates removed, and play the complete _game_ out from there. So what are you talking about? Other than clearly something you don't have a clue about.

All the Elo results I show here are from nothing but games. All of my speedup data is from nothing but single positions, since the concept of speedup for a game is meaningless.

Jeez, it is hard to discuss something when you are so far out in left field that you are completely out of the stadium, and even beyond the parking lot.
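For illustration, the per-position speedup described above is just time-to-depth on one CPU divided by time-to-depth on N CPUs, averaged over the test set. A minimal C sketch follows; the timing numbers are invented placeholders, not measurements from any of these tests.

#include <stdio.h>

int main(void) {
    /* hypothetical time-to-depth pairs in seconds: {1 CPU, 8 CPUs} per position */
    double t[][2] = { {120.0, 19.5}, {300.0, 46.0}, {80.0, 14.2} };
    int n = (int)(sizeof t / sizeof t[0]);
    double sum = 0.0;

    for (int i = 0; i < n; i++) {
        double s = t[i][0] / t[i][1];    /* per-position speedup */
        printf("position %d: speedup %.2f\n", i + 1, s);
        sum += s;
    }
    printf("mean speedup over %d positions: %.2f\n", n, sum / n);
    return 0;
}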
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

mhull wrote:
bob wrote:...the only solution was to find weaker opponents, which is a gigantic waste of time for anything except to answer this question.
If someone had the time to waste, perhaps Gnuchess 5.17, Gnuchess 4.x (completely different program), Phalanx XXII. Dann Corbit might have more suggestions.
The ultimate problem is getting them to compile cleanly on modern linux. #include files have changed. It took a ton of work just to get my old 10.x to compile and run. If there are any known to run, I could grab one or two, but again, this is effort with no expected return, since it is not going to help any program today get any better.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

michiguel wrote:
bob wrote:
michiguel wrote:
bob wrote:
Milos wrote:
rbarreira wrote:Are you really quibbling about 22-50 elo differences on ratings with ±16 error margins? That just doesn't make any sense.
Sure, 70 elo is much more realistic (add twice the error margin and you'll still be far from 70 elo). Give me a break. Bob is a big authority in the field, but he is enormously biased when something of his own is in question!
I simply reported the doubling (really halving) Elo change as part of the hardware vs software debate. Anyone can reproduce that test if they have the time and the interest. Put Crafty in a pool of players, and play heads-up for 30,000 games. Then re-run, but only give Crafty 1/2 the time (the others get the original time). Elo dropped by 70. Do it again so that we are now at 1/4 the time, or 4x slower. Elo dropped by 150 total, or 80 for this second "halving".
Just for the record, because I see that mentioning this will become a truth if it is not contested. This second "halving" was obtained with an engine that scored ~10% of the points in that particular pool. There are three problems. First, the error in Elo increases a lot because, in Elo terms, the difference between 10 and 11% is much bigger than the difference between 50 and 51%. Second, the error bar at 10% is bigger than at 50%. Third, the Elo approximation at the tails of the curve may not be accurate anymore.

The difference between 70 and 80 may be just error.
I believe I mentioned that. And that the only solution was to find weaker opponents, which is a gigantic waste of time for anything except to answer this question.
Yes, but I am making sure this caveat and the reasons for it are noticed. There were no comments about it before.
Uri made similar comments a day or two back, in fact...


Miguel

You have two reasonable alternatives:

(1) run the test and post your results. If they are different from mine, then we can try to figure out why.

(2) Be quiet. Guessing, thinking, supposing and such have no place in a discussion about _real_ data. And "real" data is all I have ever provided here. As in my previous data about SF 1.8 vs Crafty 23.4...
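To make the tail-of-the-curve caveat concrete: under the standard logistic model, Elo(s) = -400*log10(1/s - 1), one percentage point of score is worth about 7 Elo near a 50% score but roughly 19 Elo near 10%, so the same score uncertainty produces a much wider Elo error bar at the tail. A small C sketch of that arithmetic:

#include <stdio.h>
#include <math.h>

/* Elo difference implied by a score fraction s under the logistic model */
static double elo_from_score(double s) {
    return -400.0 * log10(1.0 / s - 1.0);
}

int main(void) {
    printf("50%% -> 51%%: %.1f Elo\n", elo_from_score(0.51) - elo_from_score(0.50));
    printf("10%% -> 11%%: %.1f Elo\n", elo_from_score(0.11) - elo_from_score(0.10));
    return 0;
}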
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Ralph Stoesser wrote:
Milos wrote:
mhull wrote:Bob's tests reduce the error margin by playing more games with fewer unknown variables.

I'm not saying you're wrong by definition, I'm just saying yours is an opinion based on weaker data.
Bob's tens of thousands of games have nothing to do with accuracy. His testing methodology is simply faulty. He could play millions of games and his results would still be inaccurate.
I understand some people get easily impressed by this, but when different test methodologies, opening books etc. all agree to within 10-15 elo (with 15-20 elo error margins), while Bob's results are off by almost 100 elo with his 4 elo margin, everything you say is just grasping at straws.
Bob has a systematic error in his testing methodology which he (and some other ppl) are not willing to admit.
I hope you remember The Emperor's New Clothes tale...
Show us how Crafty performs against Fruit and Glaurung family engines in "official" rating lists. If you are right, you should find numbers similar to Mr. Hyatt's cluster test results.
Nah, because apparently all of my data is made up and will have no correlation to the blessed-by-the-Pope lists anywhere. Just one of those ultimate truths that can't be escaped...

I take great pride in sitting in my office and generating random test result numbers day after day, all the while Crafty is getting no better or is getting weaker as time marches on...
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Deep Blue vs Rybka

Post by rbarreira »

bob wrote:But one could argue that since Crafty's NPS would actually be 6x or 12x faster on the above hardware, that is a software shortcoming, and the hardware should get full credit. If we can't use it effectively, is that the engineer's fault?

But for my summary of results, I certainly used the effective speedup number, which was about 1024x, as opposed to the theoretical max, which was 1500x.
The hardware shouldn't get full credit, if the question being asked is "how much does hardware contribute to chess strength". If the algorithms being used can't take full advantage of the hardware, that means the hardware is contributing less. What matters are the facts of what the hardware is contributing in reality, not some pipe dream of what could be.

If alpha-beta was impossible to parallelize (fortunately it isn't), parallel hardware wouldn't contribute to chess strength, so it wouldn't get credit. That's clear as day to me...

But as you said, your results use this way of calculating, so all is well.
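One way to see what is at stake in this accounting choice is to convert both speedup figures from the quoted post into Elo at a fixed value per doubling; the 70 Elo per doubling used below is an assumption, one of the figures debated elsewhere in this thread, not a measured constant. A short C sketch:

#include <stdio.h>
#include <math.h>

int main(void) {
    double theoretical = 1500.0;    /* nominal maximum speedup from the post */
    double effective   = 1024.0;    /* speedup the engine actually realizes */
    double elo_per_doubling = 70.0; /* assumed figure, debated in this thread */

    printf("parallel efficiency: %.0f%%\n", 100.0 * effective / theoretical);
    printf("Elo credit, theoretical: %.0f\n", log2(theoretical) * elo_per_doubling);
    printf("Elo credit, effective:   %.0f\n", log2(effective) * elo_per_doubling);
    return 0;
}

The gap between the two Elo lines, about 38 Elo here, is the credit that hinges on whether the theoretical or the effective figure is used.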
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Deep Blue vs Rybka

Post by Milos »

bob wrote:If you knew anything about parallel programming you would know the answer. The concept of "speedup" applies to a position, not to a game. How do you compute the "speedup" for a game? I know how to compute the speedup for a position. And, just to keep this on a sane level, this is the way _everybody_ has been reporting speedup for 40 years now.

I do not run test positions for the results I show here, however. I run 3,000 starting positions, taken 12 moves into GM games, all duplicates removed, and play the complete _game_ out from there. So what are you talking about? Other than clearly something you don't have a clue about.

All the Elo results I show here are from nothing but games. All of my speedup data is from nothing but single positions, since the concept of speedup for a game is meaningless.

Jeez, it is hard to discuss something when you are so far out in left field that you are completely out of the stadium, and even beyond the parking lot.
No, the problem is that you don't (want to) understand a word of what I'm saying.
So let me summarize it for you so that you can understand it this time.
You claim the following: if you take Crafty and run it on a 32-node machine and then run it on a 64-node machine, you will get a 1.7x speedup, in the sense that the average time to reach a fixed depth over several hundred positions is 1.7 times shorter on the 64-node machine than on the 32-node machine.
You further claim that this 1.7x speedup is equivalent to log(1.7)/log(2)*70 = 54 elo difference in strength (I even used the more conservative 70 here instead of 80 elo per doubling).

What I claim is that if you run Crafty on 32 nodes against your standard set of opponents in a 30k-game match at any reasonable TC, and then repeat the same for Crafty on 64 nodes, the difference in elo you obtain will never reach 54 elo. As a matter of fact, it will be smaller by a large margin.

This kind of test is very simple to run, but you refuse to do it. My guess is that you already know the result but are too stubborn to admit it.
Last edited by Milos on Tue Sep 14, 2010 10:46 pm, edited 1 time in total.
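Milos's conversion, written out, is Elo gain = log2(speedup) * Elo-per-doubling. A minimal C check of the 54-elo figure, using the numbers from his post:

#include <stdio.h>
#include <math.h>

int main(void) {
    double speedup = 1.7;            /* claimed 32 -> 64 node speedup */
    double elo_per_doubling = 70.0;  /* his conservative per-doubling figure */
    printf("%.1fx speedup -> %.0f elo\n",
           speedup, log2(speedup) * elo_per_doubling);  /* prints 54 */
    return 0;
}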
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Deep Blue vs Rybka

Post by Don »

rbarreira wrote:
bob wrote:But one could argue that since Crafty's NPS would actually be 6x or 12x faster on the above hardware, that is a software shortcoming, and the hardware should get full credit. If we can't use it effectively, is that the engineer's fault?

But for my summary of results, I certainly used the effective speedup number, which was about 1024x, as opposed to the theoretical max, which was 1500x.
The hardware shouldn't get full credit, if the question being asked is "how much does hardware contribute to chess strength". If the algorithms being used can't take full advantage of the hardware, that means the hardware is contributing less. What matters are the facts of what the hardware is contributing in reality, not some pipe dream of what could be.

If alpha-beta was impossible to parallelize (fortunately it isn't), parallel hardware wouldn't contribute to chess strength, so it wouldn't get credit. That's clear as day to me...

But as you said, your results use this way of calculating, so all is well.
I'll say "fools errand" again. You hit the nail on the head here. There is this implicit assumption that it's not fair to bring an old program into the modern world without "reworking" it to be used on modern hardware.

But it's really quite impossible to separate hardware from software. From Bob's point of view everything is hardware, and it's possible to take that position and in some sense be absolutely correct. After all, EVERYTHING you do to improve a chess program makes it work better on whatever hardware you are using.

It's quite impossible to make a clean separation. I am now of the opinion that we should just not try. We should just see how well an old program runs on new hardware. If the old program is not MP, then too bad.

We won't be answering the exact question we set out to answer, but it's a fool's errand to try, because in Bob's eyes the hardware is always going to be more important and be the thing that counts, even if the software was needed.

If we go back far enough in time, Bob would say that hash tables are not a software improvement. Hash tables require memory, and memory is hardware. I know that if it were 20 years ago this is probably what we would be arguing about. The problem is that once something becomes ubiquitous, such as hash tables or the PC platform, it becomes the defining thing, but that doesn't address what MY original statement, the one that started all this, was about. In my view Bob just defined for himself what I meant and then set about to disprove it.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Deep Blue vs Rybka

Post by Milos »

bob wrote:I have actually run on machines with up to 64 CPUs (not a cluster, a real shared-memory SMP box). And thru 64, the speedup pretty well matched my linear approximation of

speedup = 1 + (NCPUS - 1) * 0.7

Or, for the 64 CPU case, about 45x faster than one CPU. Not great, but seriously faster, still.
One more fishy thing related to your formula.

Speedup from 1 to 2 cores is 1.7x. The speedup from 2 to 4 cores should then be the same (1.7x), and likewise for the speedup from 4 to 8 cores.
So the speedup from 1 to 8 cores is 1.7^3 = 4.9x.
Your formula gives 1 + 7*0.7 = 5.9x. And somehow you get 5.9x (or even 6.3x) in your results.

So something is terribly wrong.
Last edited by Milos on Tue Sep 14, 2010 11:12 pm, edited 1 time in total.
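The two models in this exchange can be put side by side: bob's linear approximation speedup(n) = 1 + (n - 1)*0.7 versus the constant 1.7x-per-doubling assumption applied above. A short C sketch showing where they diverge:

#include <stdio.h>
#include <math.h>

int main(void) {
    for (int n = 1; n <= 64; n *= 2) {
        double linear   = 1.0 + (n - 1) * 0.7;        /* bob's formula */
        double constant = pow(1.7, log2((double)n));  /* 1.7x per doubling */
        printf("%2d cpus: linear %5.1fx, constant-ratio %5.1fx\n",
               n, linear, constant);
    }
    return 0;
}

The linear formula implies a per-doubling ratio that grows toward 2x as the core count rises, which is exactly the discrepancy being flagged: 5.9x versus 4.9x at 8 cores.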