hardware advances - a different perspective

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

hardware advances - a different perspective

Post by bob »

In light of the other discussion, let's think of 3 ways to compare what hardware has done over the past 15 years.

(1) Take a 1995 program, use its speed on 1995 hardware, and then run it on today's hardware and measure the performance improvement. This is flawed, because the old program was optimized for the old hardware. There are tons of new instructions and new mechanisms (register renaming, out-of-order execution, multi-level caches, etc.) that the old program will likely use inefficiently, since they didn't exist back then. This "compresses" the speed difference significantly.

(2) Take a 2010 program and run it on 1995 hardware. Same problem. We do things today (magic move generation is one example) that depend on today's hardware. Running this on 1995 hardware is a performance issue, since 64-bit multiplies have to be done in pieces. Ditto for our hash probing, which is laid out around cache block/line sizes that did not exist back in 1995. This will also "compress" the speed difference artificially.

(3) Take a 1995 program and run it on the 1995 hardware it was optimized for. Then take a successor of that program from today and run it on the best hardware of today. Whether this is Crafty, Fritz, or whatever is not that important. The main thing is to take a program that was optimized for 1995 hardware in 1995, and compare it to the same program optimized for 2010 hardware today. This gives a real comparison.
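The magic move generation mentioned in (2) is a good example of a hardware dependency: the core of the lookup is a single 64-bit multiply and shift. A minimal sketch of that step (the mask, magic, and shift below are illustrative placeholders, not real magic numbers):

```c
#include <stdint.h>

/* Core of a magic bitboard lookup: one 64-bit multiply and shift maps
   the relevant occupancy bits to a dense attack-table index.  The
   mask/magic/shift values a real engine uses are precomputed per square;
   here they are just parameters for illustration. */
static uint64_t magic_index(uint64_t occupancy, uint64_t mask,
                            uint64_t magic, int shift) {
    return ((occupancy & mask) * magic) >> shift;
}
```

On a 1995-era 32-bit CPU, that one multiply decomposes into several 32-bit multiplies and adds, which is exactly the performance penalty described in (2).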

Of course, there is the question of what to compare. Depths are not equal, thanks to today's reductions and pruning, so comparing time to depth is no good. Perhaps time-to-solution for a reasonable set of positions. But this is going to take some effort, since something that takes a few minutes in 1995 might take a second today. One could compare NPS, which for the same program is pretty constant. But I think the idea of a few tactical positions with a very concrete solution offers the best comparison. It factors in SMP without over-counting it, since the SMP loss (extra nodes searched) will figure into the time-to-solution.

I think this latter idea is the way to compare; however, I am not yet having much luck getting a 32-bit 1995 version of Crafty to run on my 64-bit cluster. The old tricky rotated-bitboard stuff, with compact-attacks and such, does not like 64-bit registers at all. Still struggling with this. Which means option 3 is probably the best one. The only problem would be to find some good positions (say the Nolot positions) that were run against Crafty (or whatever program is used) back in 1995, since finding 1995 hardware is not exactly an easy task today...

And as I write this, I am still not sure exactly how to measure performance. Do we believe our searches are more efficient today, in terms of finding things in fewer nodes? So I suppose we could factor in both time to solution and nodes required, to try to factor out the search improvements.

What a confusing issue. :)
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: hardware advances - a different perspective

Post by Edmund »

How about this: let a person with a good understanding of up-to-date search techniques write two programs, one optimized for 1995 hardware, the other optimized for today's hardware. Then you could do a direct comparison of the effect of hardware improvement on computer chess. The quality of the programs themselves is actually not that important. Rather, one has to aim at getting the most out of the given hardware. In other words, instead of spending a lot of time on eval tuning, one would rather tune the perfect hash-table size, the amount of caching done (eval, movegen, etc.), and so on.
Uri Blass
Posts: 10267
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: hardware advances - a different perspective

Post by Uri Blass »

I suggest the following to estimate hardware improvement.

Take a commercial program from 1998, which should work well on both 1995 and 1998 hardware (because the programmer had to think about both types of customers), and find the speed improvement it gets on 1998 hardware relative to 1995, based on nodes per second.

Do the same with 2001 and 1998 and a commercial program from 2001.
Do the same with 2004 and 2001.
Do the same with 2007 and 2004.
Do the same with 2010 and 2007.

You get 5 numbers.

If you multiply these numbers you get an estimate for the speed improvement of hardware from 1995 to 2010.
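Uri's estimate is just the product of the five interval ratios. A sketch, with the ratio values understood to be hypothetical measurements, not real data:

```c
/* Chain per-interval NPS ratios (1995->1998, 1998->2001, ..., 2007->2010)
   into one overall hardware speedup estimate by multiplying them. */
static double chained_speedup(const double *ratios, int n) {
    double total = 1.0;
    for (int i = 0; i < n; i++)
        total *= ratios[i];
    return total;
}
```

If each three-year interval gave, say, a 2x NPS gain, the chained 1995-to-2010 estimate would be 2^5 = 32x.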
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: hardware advances - a different perspective

Post by rbarreira »

I agree that it's misleading to use nodes per second without adjusting for loss of efficiency from more threads. This makes the contribution of hardware smaller than it seems.

What's the efficiency of current programs with 6 threads? (meaning, the actual speedup for a suite of test positions divided by 6)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: hardware advances - a different perspective

Post by bob »

rbarreira wrote:I agree that it's misleading to use nodes per second without adjusting for loss of efficiency from more threads. This makes the contribution of hardware smaller than it seems.

What's the efficiency of current programs with 6 threads?
Define "efficiency"? If you mean the NPS with 6 threads compared to the NPS with 1, for Crafty it is about 6x. However, using anything else for comparison is even harder. Clearly there are some hardware advances that most are not taking advantage of, say the MMX/SSE/etc. stuff, which is barely used at all. Yet it is there.

I still think the best we can hope for is to take an old SMP-capable program, run it on 1995 hardware and see how strong it is. Then run the same program on today's hardware and see how strong it is. Then take one of today's programs and run it on today's hardware to see how strong it is. The difference between old-prog+new-hardware and new-prog+new-hardware would have to be software. The difference between old-prog+old-hardware and old-prog+new-hardware would clearly be hardware. Of course, the old program on new hardware is not going to take advantage of all the hardware improvements, since it was written before they came along. But it would get us into the right ballpark, even if it is not anywhere near perfect.

I am still trying to get something old to work, but it is a challenge.
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: hardware advances - a different perspective

Post by rbarreira »

bob wrote:
rbarreira wrote:I agree that it's misleading to use nodes per second without adjusting for loss of efficiency from more threads. This makes the contribution of hardware smaller than it seems.

What's the efficiency of current programs with 6 threads?
Define "efficiency"? If you mean what is the NPS with 6 compared to the NPS with 1, for Crafty it is about 6x faster. However, using anything else for comparison is even harder. Clearly there are some hardware advances that most are not taking advantage of, say the MMX/SSE/etc stuff that is barely used at all. Yet it is there.

I still think about the best we can hope for is to take an old SMP-capable program, run it on 1995 hardware and see how strong it is. Then run the same program on today's hardware and see how strong it is. Then take one of today's programs and run it on today's hardware to see how strong it is. The difference between old-prog+new-hardware and new-prog+new-hardware would have to be software. The difference between old-prog+old-hardware and new-prog+new-hardware would clearly be hardware. Clearly the old program on new hardware is not going to take advantage of all the hardware improvements, since it was written before they came along. But it would get us into the right ballpark, even if it is not anywhere near perfect.

I am still trying to get something old to work, but it is a challenge.
Imagine a single-threaded search takes 10 seconds on average to reach a given depth, and a 4-thread search takes 5 seconds on average to reach the same depth. That would be an efficiency of 50%, since the ideal case would be 2.5 seconds for 4 threads if adding threads introduced no waste or overhead (i.e. 10 seconds divided by the number of threads).

It's just a simple way of measuring the penalty of having more threads. I guess you could measure it with a test suite of positions, or by playing actual games with fixed depth searches.
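The definition above can be written down directly; with the numbers in the example (10 s on one thread, 5 s on four) it comes out to 0.5:

```c
/* Parallel efficiency: actual speedup divided by the thread count.
   1.0 means perfect linear scaling; 0.5 means half the ideal speedup. */
static double smp_efficiency(double t_single, double t_parallel, int threads) {
    return (t_single / t_parallel) / threads;
}
```

So smp_efficiency(10.0, 5.0, 4) gives 0.5, matching the 50% figure in the example.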
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: hardware advances - a different perspective

Post by bob »

rbarreira wrote:
bob wrote:
rbarreira wrote:I agree that it's misleading to use nodes per second without adjusting for loss of efficiency from more threads. This makes the contribution of hardware smaller than it seems.

What's the efficiency of current programs with 6 threads?
Define "efficiency"? If you mean what is the NPS with 6 compared to the NPS with 1, for Crafty it is about 6x faster. However, using anything else for comparison is even harder. Clearly there are some hardware advances that most are not taking advantage of, say the MMX/SSE/etc stuff that is barely used at all. Yet it is there.

I still think about the best we can hope for is to take an old SMP-capable program, run it on 1995 hardware and see how strong it is. Then run the same program on today's hardware and see how strong it is. Then take one of today's programs and run it on today's hardware to see how strong it is. The difference between old-prog+new-hardware and new-prog+new-hardware would have to be software. The difference between old-prog+old-hardware and new-prog+new-hardware would clearly be hardware. Clearly the old program on new hardware is not going to take advantage of all the hardware improvements, since it was written before they came along. But it would get us into the right ballpark, even if it is not anywhere near perfect.

I am still trying to get something old to work, but it is a challenge.
Imagine a single-threaded search takes 10 seconds on average to reach a given depth. Imagine that a 4-thread search takes 5 seconds on average to reach the same depth. This would be an efficiency of 50%, since the theoretic average case would be 2.5 seconds for 4 threads, if adding threads didn't add any waste or overhead (i.e. 10 seconds divided by the number of threads).

It's just a simple way of measuring the penalty of having more threads. I guess you could measure it with a test suite of positions, or by playing actual games with fixed depth searches.
I have _plenty_ of applications that will run 8x faster on 8 cores, 16x faster on 16 cores. Just not chess. So is that a hardware issue or a software issue? That's one reason I don't see an easy way to measure what hardware gives vs what software gives. Don's test completely ignores 5/6 of a modern microprocessor's hardware improvements. Measuring NPS overstates the hardware gain in some sense, although one could say that NPS represents the actual hardware gain, while search efficiency shows how far the software lags behind in obtaining that potential gain.

On 6 cores, I'd be surprised if any decent SMP search produces less than a 4x real speedup (not NPS, but time to solution or time to depth). But the hardware gain is there, if we can just figure out how to get the rest of it. For the question being asked here, it is not so much "how much has the hardware progressed" but rather "how much of the difference between a 1995 program and a 2010 program is due to hardware, and how much is due to software?"

My thought is that "software" should mean improvements to search, evaluation, and chess-related stuff, not software changes needed just to take advantage of new hardware, when they provide no new functionality with respect to chess. Organizing variables to give them spatial locality when they have temporal locality is not a software improvement. It is modifying the software to take advantage of how cache works today vs. in 1995.
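The locality point can be made concrete: fields the search reads on every node can be grouped so they share one cache line, while cold fields follow. This is purely a hardware-driven layout change, not a chess improvement. The struct and field names below are hypothetical, not Crafty's actual data structures:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-node layout: fields touched on every node are packed
   together at the front so they fit in one 64-byte cache line (today's
   common line size).  Same program logic, different memory layout. */
struct node_hot {
    uint64_t hash_key;      /* probed every node        */
    int32_t  alpha, beta;   /* search window, hot       */
    int32_t  ply, depth;    /* hot                      */
    /* ... cold fields (PV storage, statistics) would go after this ... */
};
```

The check that the hot fields all land inside one line is a compile-time layout property, visible through offsetof.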
User avatar
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: hardware advances - a different perspective

Post by mhull »

bob wrote:My thought is that software needs to be applied to search, evaluation, and chess-related stuff. Not to software changes needed just to take advantage of new hardware, when it provides no new functionality with respect to chess. Organizing variables to give them spatial locality when they have temporal locality is not a software improvement. It is modifying the software to take advantage of how cache works today vs in 1995.
How difficult would it be to re-work a 1995 Crafty version to take full advantage of modern hardware, replacing compact-attacks and the inline functions, for example?

Either that, or "dumbing down" the modern version to use only the software search and eval techniques of the 1995 version?
Matthew Hull
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: hardware advances - a different perspective

Post by rbarreira »

bob wrote:
rbarreira wrote:
bob wrote:
rbarreira wrote:I agree that it's misleading to use nodes per second without adjusting for loss of efficiency from more threads. This makes the contribution of hardware smaller than it seems.

What's the efficiency of current programs with 6 threads?
Define "efficiency"? If you mean what is the NPS with 6 compared to the NPS with 1, for Crafty it is about 6x faster. However, using anything else for comparison is even harder. Clearly there are some hardware advances that most are not taking advantage of, say the MMX/SSE/etc stuff that is barely used at all. Yet it is there.

I still think about the best we can hope for is to take an old SMP-capable program, run it on 1995 hardware and see how strong it is. Then run the same program on today's hardware and see how strong it is. Then take one of today's programs and run it on today's hardware to see how strong it is. The difference between old-prog+new-hardware and new-prog+new-hardware would have to be software. The difference between old-prog+old-hardware and new-prog+new-hardware would clearly be hardware. Clearly the old program on new hardware is not going to take advantage of all the hardware improvements, since it was written before they came along. But it would get us into the right ballpark, even if it is not anywhere near perfect.

I am still trying to get something old to work, but it is a challenge.
Imagine a single-threaded search takes 10 seconds on average to reach a given depth. Imagine that a 4-thread search takes 5 seconds on average to reach the same depth. This would be an efficiency of 50%, since the theoretic average case would be 2.5 seconds for 4 threads, if adding threads didn't add any waste or overhead (i.e. 10 seconds divided by the number of threads).

It's just a simple way of measuring the penalty of having more threads. I guess you could measure it with a test suite of positions, or by playing actual games with fixed depth searches.
I have _plenty_ of applications that will run 8x faster on 8 cores. 16x faster on 16 cores. Just not chess. So is that a hardware issue or a software issue? That's one reason I don't see an easy way to measure what hardware gives vs what software gives. Don's test is completely ignoring 5/6 of a modern microprocessor's hardware improvements. Measuring NPS overstates the hardware gain in some sense, although one could say the NPS represents actual hardware gain, the search efficiency shows how far behind the software lags to obtain that potential gain.

On 6 cores, I'd be surprised if any decent SMP search produces less than a 4x real speedup (not NPS but time to solution or time to depth). But the hardware gain is there if we can just figure out how to get the rest of it. For the question being asked here, it is not so much "how much has the hardware progressed" but rather "how much of the difference between a 1995 program and a 2010 program is due to hardware, and how much is due to software.

My thought is that software needs to be applied to search, evaluation, and chess-related stuff. Not to software changes needed just to take advantage of new hardware, when it provides no new functionality with respect to chess. Organizing variables to give them spatial locality when they have temporal locality is not a software improvement. It is modifying the software to take advantage of how cache works today vs in 1995.
I thought the question was "how much did hardware contribute to chess strength", in which case the efficiency of SMP software definitely affects the answer.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: hardware advances - a different perspective

Post by bob »

mhull wrote:
bob wrote:My thought is that software needs to be applied to search, evaluation, and chess-related stuff. Not to software changes needed just to take advantage of new hardware, when it provides no new functionality with respect to chess. Organizing variables to give them spatial locality when they have temporal locality is not a software improvement. It is modifying the software to take advantage of how cache works today vs in 1995.
How difficult is it to re-work a 1995 crafty version to take full advantage of modern hardware, like replacing compact-attacks and inline functions, for example?

Either that, or "dumbing down" the modern version to use only the software search and eval techniques of the 1995 version?
Does the term "PITA" mean anything to you? :)

I renumbered the bits to match BSF/BSR. That changed _everything_: different bit patterns for eval stuff, move generation, etc. I had already thought of that. Then I copied inline64.h without thinking. One thermonuclear explosion later, I realized this was not going to work. :)
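The renumbering pain comes down to which end of the 64-bit word counts as bit 0. A BSF-style index is a trailing-zero count; the opposite numbering shown here is just to illustrate why every bit-number-keyed table in the engine shifts when the convention changes (the function names are made up for this sketch):

```c
#include <stdint.h>

/* BSF-style bit index: bit 0 is the least significant bit.
   __builtin_ctzll is the GCC/Clang trailing-zero-count builtin,
   which compiles down to BSF/TZCNT on x86.  Undefined for b == 0. */
static int lsb_index(uint64_t b) {
    return __builtin_ctzll(b);
}

/* A hypothetical opposite convention (bit 0 at the MSB end).  Every
   precomputed table indexed by bit number differs between the two. */
static int flipped_index(uint64_t b) {
    return 63 - __builtin_ctzll(b);
}
```

The same single-bit board maps to index 5 under one convention and 58 under the other, which is why the switch rippled through eval tables and move generation alike.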