hardware advances - a different perspective

bob · Post by **bob** » Fri Sep 10, 2010 9:11 pm

rbarreira wrote:
bob wrote:
rbarreira wrote:
bob wrote:
rbarreira wrote:I agree that it's misleading to use nodes per second without adjusting for loss of efficiency from more threads. This makes the contribution of hardware smaller than it seems.

What's the efficiency of current programs with 6 threads?
Define "efficiency"? If you mean what is the NPS with 6 compared to the NPS with 1, for Crafty it is about 6x faster. However, using anything else for comparison is even harder. Clearly there are some hardware advances that most are not taking advantage of, say the MMX/SSE/etc stuff that is barely used at all. Yet it is there.

I still think about the best we can hope for is to take an old SMP-capable program, run it on 1995 hardware and see how strong it is. Then run the same program on today's hardware and see how strong it is. Then take one of today's programs and run it on today's hardware to see how strong it is. The difference between old-prog+new-hardware and new-prog+new-hardware would have to be software. The difference between old-prog+old-hardware and old-prog+new-hardware would clearly be hardware. Clearly the old program on new hardware is not going to take advantage of all the hardware improvements, since it was written before they came along. But it would get us into the right ballpark, even if it is not anywhere near perfect.
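
As a rough sketch of the three-run comparison (nothing here comes from Crafty): assume each run is summarized by one rating on a common scale such as Elo, and read the hardware share as what the old program gains when moved to new hardware. The name `split_gain` and its parameters are made up for illustration.

```python
def split_gain(old_prog_old_hw: float,
               old_prog_new_hw: float,
               new_prog_new_hw: float) -> tuple[float, float]:
    """Split the total rating gain into (hardware, software) parts.

    Assumption: all three ratings are on the same scale (e.g. Elo) and
    were measured against the same opponents under the same conditions.
    """
    # Same program, different hardware: the gain is attributed to hardware.
    hardware = old_prog_new_hw - old_prog_old_hw
    # Same hardware, different program: the gain is attributed to software.
    software = new_prog_new_hw - old_prog_new_hw
    return hardware, software


# Placeholder numbers only, not measurements from any real engine.
hw_gain, sw_gain = split_gain(2000.0, 2300.0, 2500.0)
print(f"hardware: {hw_gain:+.0f}, software: {sw_gain:+.0f}")
```

The hardware figure is a lower bound in this setup, since the old program cannot use hardware features that did not exist when it was written.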

I am still trying to get something old to work, but it is a challenge.
Imagine a single-threaded search takes 10 seconds on average to reach a given depth, and a 4-thread search takes 5 seconds on average to reach the same depth. That is an efficiency of 50%, since the theoretical ideal for 4 threads would be 2.5 seconds if adding threads added no waste or overhead (i.e. 10 seconds divided by the number of threads).

It's just a simple way of measuring the penalty of having more threads. I guess you could measure it with a test suite of positions, or by playing actual games with fixed depth searches.
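
To make the arithmetic concrete, here is a minimal sketch of that efficiency measure, assuming time to depth is the quantity being compared. The function names are invented for illustration.

```python
def speedup(single_thread_seconds: float, multi_thread_seconds: float) -> float:
    """Real speedup: how many times faster the same depth is reached."""
    return single_thread_seconds / multi_thread_seconds


def parallel_efficiency(single_thread_seconds: float,
                        multi_thread_seconds: float,
                        threads: int) -> float:
    """Speedup divided by thread count; 1.0 would mean no parallel overhead."""
    return speedup(single_thread_seconds, multi_thread_seconds) / threads


# The numbers from the example: 10 s with 1 thread, 5 s with 4 threads.
ideal_seconds = 10.0 / 4                        # 2.5 s if threads were free
eff = parallel_efficiency(10.0, 5.0, 4)         # 2x speedup over 4 threads
print(f"ideal: {ideal_seconds} s, efficiency: {eff:.0%}")
```

For a test suite, the same function could be fed total or average times over all positions; which aggregate is fairest is a separate choice that this sketch does not settle.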
I have _plenty_ of applications that will run 8x faster on 8 cores. 16x faster on 16 cores. Just not chess. So is that a hardware issue or a software issue? That's one reason I don't see an easy way to measure what hardware gives vs what software gives. Don's test is completely ignoring 5/6 of a modern microprocessor's hardware improvements. Measuring NPS overstates the hardware gain in some sense, although one could say that NPS represents the actual hardware gain, while search efficiency shows how far the software lags behind in obtaining that potential gain.

On 6 cores, I'd be surprised if any decent SMP search produces less than a 4x real speedup (not NPS but time to solution or time to depth). But the hardware gain is there if we can just figure out how to get the rest of it. For the question being asked here, it is not so much "how much has the hardware progressed" but rather "how much of the difference between a 1995 program and a 2010 program is due to hardware, and how much is due to software".

My thought is that software needs to be applied to search, evaluation, and chess-related stuff. Not to software changes needed just to take advantage of new hardware, when it provides no new functionality with respect to chess. Organizing variables to give them spatial locality when they have temporal locality is not a software improvement. It is modifying the software to take advantage of how cache works today vs in 1995.
I thought the question was "how much did hardware contribute to chess strength", in which case efficiency of SMP software is definitely something that affects the answer.

But not since 1995. In 1988 the parallel search done in Cray Blitz was as good as or better than anything done today (I am talking shared memory SMP architecture, not clusters). In fact, a few have partially implemented DTS, but not fully (Cozzie was one IIRC). I chose to go a simpler route in Crafty, rather than rewriting my clean recursive search into an iterated (loopy) search.

So since 1995, I see zero SMP improvements, which was my point. That was already a well-understood problem in the 1980s, much less by 1995.

mhull · Post by **mhull** » Fri Sep 10, 2010 9:25 pm

bob wrote:
mhull wrote:
bob wrote:My thought is that software needs to be applied to search, evaluation, and chess-related stuff. Not to software changes needed just to take advantage of new hardware, when it provides no new functionality with respect to chess. Organizing variables to give them spatial locality when they have temporal locality is not a software improvement. It is modifying the software to take advantage of how cache works today vs in 1995.
How difficult is it to re-work a 1995 crafty version to take full advantage of modern hardware, like replacing compact-attacks and inline functions, for example?

Either that, or "dumbing down" the modern version to use only the software search and eval techniques of the 1995 version?
Does the term "PITA" mean anything to you?

I renumbered the bits to match BSF/BSR. That changed _everything_. Different bit patterns for eval stuff, move generation, etc. Already thought of that. Copied inline64.h without thinking. One thermo-nuclear explosion later, realized this was not going to work.

How problematic is the dumbing-down of the current version to match the 1995 version's search and eval? The hardware optimization would already be in place.

bob · Post by **bob** » Fri Sep 10, 2010 10:19 pm

mhull wrote:
bob wrote:
mhull wrote:
bob wrote:My thought is that software needs to be applied to search, evaluation, and chess-related stuff. Not to software changes needed just to take advantage of new hardware, when it provides no new functionality with respect to chess. Organizing variables to give them spatial locality when they have temporal locality is not a software improvement. It is modifying the software to take advantage of how cache works today vs in 1995.
How difficult is it to re-work a 1995 crafty version to take full advantage of modern hardware, like replacing compact-attacks and inline functions, for example?

Either that, or "dumbing down" the modern version to use only the software search and eval techniques of the 1995 version?
Does the term "PITA" mean anything to you?

I renumbered the bits to match BSF/BSR. That changed _everything_. Different bit patterns for eval stuff, move generation, etc. Already thought of that. Copied inline64.h without thinking. One thermo-nuclear explosion later, realized this was not going to work.
How problematic is the dumbing-down of the current version to match the 1995 version's search and eval? The hardware optimization would already be in place.

The most obvious solution would be to graft at least the search and eval from old crafty into current. The most obvious problem is that same PITA, since the bits are numbered differently. At least half of the lines in evaluate.c get changed and the debugging when I changed them last time was the PITA to end all PITAs...
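
To illustrate why a bit renumbering touches so much code, here is a small sketch. It assumes, hypothetically, that the older layout counted squares from the most significant bit; the posts do not say which layout it actually was. The helper names are invented, and `bsf`/`bsr` only emulate what the x86 instructions report.

```python
def bsf(bb: int) -> int:
    """Index of the lowest set bit, as x86 BSF reports it."""
    if bb == 0:
        raise ValueError("BSF is undefined for an empty bitboard")
    return (bb & -bb).bit_length() - 1


def bsr(bb: int) -> int:
    """Index of the highest set bit, as x86 BSR reports it."""
    if bb == 0:
        raise ValueError("BSR is undefined for an empty bitboard")
    return bb.bit_length() - 1


def mask_lsb0(square: int) -> int:
    """Square n lives in bit n, so BSF/BSR return the square directly."""
    return 1 << square


def mask_msb0(square: int) -> int:
    """Hypothetical older layout: square n lives in bit 63 - n."""
    return 1 << (63 - square)


# The same eight squares (8..15) give different constants in each layout,
# so every hard-coded mask in eval and move generation has to change.
rank_lsb0 = sum(mask_lsb0(sq) for sq in range(8, 16))
rank_msb0 = sum(mask_msb0(sq) for sq in range(8, 16))
print(hex(rank_lsb0), hex(rank_msb0))

# With the BSF-matching layout the scan result is the square itself;
# with the other layout every scan needs a 63 - x correction.
print(bsf(mask_lsb0(12)), 63 - bsf(mask_msb0(12)))
```

Mixing code written for one layout with masks written for the other compiles fine and silently produces wrong squares, which is consistent with the debugging pain described above.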

Don · Post by **Don** » Sun Sep 12, 2010 1:54 am

rbarreira wrote:I agree that it's misleading to use nodes per second without adjusting for loss of efficiency from more threads. This makes the contribution of hardware smaller than it seems.

I'm trying to parse your statement and I think you said it backwards but I'm not sure.

If you are running on 4 cores, the hardware contributes less to chess strength than if you are running on 1 core that is 4 times faster. Is that what you meant?

What's the efficiency of current programs with 6 threads? (meaning, the actual speedup for a suite of test positions divided by 6)

rbarreira · Post by **rbarreira** » Sun Sep 12, 2010 1:42 pm

Don wrote:
rbarreira wrote:I agree that it's misleading to use nodes per second without adjusting for loss of efficiency from more threads. This makes the contribution of hardware smaller than it seems.
I'm trying to parse your statement and I think you said it backwards but I'm not sure.

If you are running on 4 cores, the hardware contributes less to chess strength than if you are running on 1 core that is 4 times faster. Is that what you meant?

What's the efficiency of current programs with 6 threads? (meaning, the actual speedup for a suite of test positions divided by 6)

Yes, that is what I meant. Reading my post again, the second sentence isn't well written for the context.
