#1. I have not seen ANYONE get that kind of speedup today. Back then the trees were pretty much fixed-depth, with no forward pruning, no reductions, nor any of the other things that make today's trees much more irregular and unstable. So until you find ANY current program that can produce those kinds of speedups, we have to use today's numbers. Crafty is nowhere NEAR the DTS numbers, and I have not seen any program yet that beats Crafty's numbers, since everyone is basically using the SAME parallel search approach that Crafty uses. Ergo, 12% is meaningless.

syzygy wrote:
I'm looking at what you wrote, and what you wrote is that it doesn't work for Crafty, with a strong suggestion that therefore it can't work for other engines.

bob wrote:
You DO realize that I run multi-threaded programs regularly? I have posted results of cluster testing with threads, using things like Stockfish, to measure Elo improvement.

syzygy wrote:
Let's see:

bob wrote:
And maybe ONE day a little science will work its way into these discussions. (hint: "ONE PROGRAM"?) And based on 1K games, with ONE program, HT is a good idea for all?

bob wrote:
Regardless of urban legend, I have NEVER seen one example where using hyper-threading improves the performance of a chess engine. Not a single one.

If it doesn't work for Crafty, it can't work for any other engine? (Well, except for "poorly implemented" engines like Houdini, I guess.)

bob wrote:
There's something badly wrong with your testing. I can post a ton of data relative to Crafty and SMT (hyper-threading). And it has ALWAYS been worse on than off, including the recent test on my macbook dual i7 with SMT enabled (I can't turn it off).
Btw, I do agree that Mike's results are not statistically significant because of the small sample size. But is it much different for the old papers on which most "conventional knowledge" is based? Just an example: 24 test positions?
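Just to put a number on "small sample size", here is a quick sketch of the error bar on a 1000-game match, ignoring draws (which only shrink the variance, so this is conservative):

/* Rough 95% confidence interval on an Elo estimate from n games,
   treating each game as a win/loss Bernoulli trial. */
#include <math.h>
#include <stdio.h>

static double to_elo(double s) {          /* logistic score -> Elo */
    return -400.0 * log10(1.0 / s - 1.0);
}

int main(void) {
    double n  = 1000.0;                   /* games played          */
    double s  = 0.52;                     /* observed score        */
    double se = sqrt(s * (1.0 - s) / n);  /* std. error of score   */

    printf("Elo %+.1f, 95%% CI [%+.1f, %+.1f]\n",
           to_elo(s), to_elo(s - 1.96 * se), to_elo(s + 1.96 * se));
    return 0;
}

A 52% score over 1000 games still carries an error bar of roughly plus or minus 20 Elo, and 24 test positions is an even smaller sample.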
That everyone else was doing the same doesn't change the fact that results from 24 positions are statistically insignificant. Let today be the day that some science gets into this discussion...

bob wrote:
I chose the "24 test positions" because that was what everyone else was using at the time, and it made our results directly comparable, even if not as accurate as might be desired. In my DTS paper I didn't use those positions; I changed the idea to following a real game, move by move, one that Cray Blitz had actually played...
Let's do the math. Directly from your DTS paper: 4 processors give a speedup of 3.7, 8 processors give a speedup of 6.6.

bob wrote:
There was no significant locking overhead in Cray Blitz. NPS scaled perfectly up to the 32-CPU T90 we used. A 12% increase in NPS will NOT offset the 30% increase in nodes searched, however...
This means that on a 4-core machine with HT off, the speedup is 3.7. This is easy.
With 8 processors the speedup is 6.6, but that includes the 2x NPS speedup compared to 4 processors. So on a 4-core machine with HT on, assuming 12% higher NPS, the speedup is (6.6 / 2) x 1.12 ≈ 3.7. This is still elementary.
So with more than 12% higher NPS due to HT, HT is a win.
Again, this is based directly on your DTS paper. The math is not difficult.
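Or, as code (the 12% NPS gain is the assumption; the speedups come straight from the DTS paper):

/* The 8-processor DTS speedup already contains a 2x NPS factor over
   4 processors, so halve it and apply the assumed HT NPS gain instead. */
#include <stdio.h>

int main(void) {
    double speedup_4   = 3.7;   /* DTS paper, 4 processors   */
    double speedup_8   = 6.6;   /* DTS paper, 8 processors   */
    double ht_nps_gain = 1.12;  /* assumed NPS gain from SMT */

    double ht_speedup = (speedup_8 / 2.0) * ht_nps_gain;  /* = 3.696 */
    printf("HT off: %.2f   HT on: %.2f\n", speedup_4, ht_speedup);
    return 0;
}

Anything above a 12% NPS gain tips the comparison in HT's favor.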
#2. My measurements, over hundreds of positions, repeated dozens of times (you can find discussions of that here in past threads, where others took my raw data (log files) and computed the speedups for themselves to verify my methodology), show that each additional thread increases the tree by about 30% for that thread. More clearly: each added thread searches about 70% useful stuff and 30% completely wasted stuff. Hence the 1.7x speedup I typically see for 2 threads, or the 3.1x speedup I typically see for 4 threads. The HT speedup has to offset that 30% for each thread (except for the first) just to break even, and I've not seen numbers like that.

I have repeatedly said that most tests for me (with HT on) are break-even at best, but doubling the threads really ramps up the variability of the results (time required to reach a given depth). When I first got my PIV box, I thought it was worthwhile, barely. Our first PIV cluster test, however, showed it was actually playing a little worse with SMT on, and at that point I stopped using it. If, by some magic bullet, SMT suddenly becomes more efficient, that might change. But my macbook is an i7, and the tests I ran and posted here a few weeks back suggest it still is not a "winner". Granted, it might not always be a loser either. But I have not seen any evidence that it produces a measurable improvement.
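To make that model explicit, a quick sketch (the 70% useful-work fraction is the measured average quoted above, not a universal constant):

/* Overhead model: each thread beyond the first does about 70% useful
   work, so effective speedup is 1 + 0.7*(n-1).  This reproduces the
   1.7x (2 threads) and 3.1x (4 threads) figures. */
#include <stdio.h>

int main(void) {
    const double useful = 0.70;  /* useful fraction per extra thread */
    for (int n = 1; n <= 8; n *= 2)
        printf("%d threads: effective speedup %.1f\n",
               n, 1.0 + useful * (n - 1));
    return 0;
}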
If you want to measure the overhead, it is pretty simple to do: just run a position to a fixed depth with one cpu, then two, and so on. Here are one- and two-cpu runs for a few positions, on my macbook:
log.001: time=32.29 mat=0 n=140444828 fh=93% nps=4.3M
log.002: time=24.72 mat=0 n=179333906 fh=93% nps=7.3M

log.001: time=1:27 mat=0 n=350299195 fh=93% nps=4.0M
log.002: time=54.90 mat=0 n=374227506 fh=93% nps=6.8M

log.001: time=6.14 mat=0 n=36634945 fh=94% nps=6.0M
log.002: time=4.03 mat=0 n=41061761 fh=94% nps=10.2M

log.001: time=27.45 mat=0 n=121794283 fh=92% nps=4.4M
log.002: time=32.32 mat=0 n=248567560 fh=92% nps=7.7M
As you can see, in each case the second run (two threads) searches more nodes than the first run (one thread) to reach the same depth.
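Turning a pair of those runs into numbers is simple arithmetic; a quick sketch using the first pair above:

/* Search overhead and time-to-depth speedup from one fixed-depth
   pair of runs (the first pair of log lines above). */
#include <stdio.h>

int main(void) {
    double t1 = 32.29, n1 = 140444828.0;  /* 1 thread  (log.001) */
    double t2 = 24.72, n2 = 179333906.0;  /* 2 threads (log.002) */

    printf("search overhead:       %+.1f%%\n", 100.0 * (n2 / n1 - 1.0));
    printf("time-to-depth speedup: %.2fx\n", t1 / t2);
    return 0;
}

That pair works out to about 28% extra nodes for a 1.31x speedup. Note the last pair above, where two threads actually took longer than one; that is the variability I mentioned.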
This is the killer for parallel search performance: imperfect move ordering. Crafty counts "aborts", which are directly related to overhead (abort this search, the results are not needed, we are failing high on another move at what we thought was an ALL node).
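For anyone who hasn't looked inside a YBW search, the abort mechanism is roughly this (a generic sketch, not Crafty's actual code):

/* Generic YBW-style split point: helper threads poll a shared stop
   flag while searching; when one move fails high, the others abort,
   and the nodes they already searched become pure overhead. */
#include <stdatomic.h>
#include <stdio.h>

typedef struct {
    atomic_bool stop;          /* set when a sibling move fails high */
    atomic_long aborted_nodes; /* wasted work, i.e. the overhead     */
} SplitPoint;

/* polled in each helper's search loop */
static int should_abort(SplitPoint *sp) {
    return atomic_load_explicit(&sp->stop, memory_order_relaxed);
}

/* called by the thread whose move failed high at this split point */
static void signal_fail_high(SplitPoint *sp) {
    atomic_store_explicit(&sp->stop, 1, memory_order_relaxed);
}

int main(void) {
    SplitPoint sp = { 0 };
    signal_fail_high(&sp);          /* one move fails high...         */
    if (should_abort(&sp))          /* ...a helper sees the flag and  */
        atomic_fetch_add(&sp.aborted_nodes, 42); /* charges its nodes */
    printf("aborted nodes: %ld\n", atomic_load(&sp.aborted_nodes));
    return 0;
}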
#3. I generally quote Crafty results because I know what it does and how it does it. I let others report their parallel search results (few do, however) since it is their code. Crafty's behavior in a parallel search is not going to be that different from what anyone else gets, for obvious reasons: everyone is using YBW as the basic idea, and many are using a near-identical approach to Crafty's, since its parallel search source has been available for almost 20 years now.