Daniel Shawul wrote:ABDADA is the clear winner here even though it started out losing to lazy on 1 CPU.
Btw, where are the 288 extra bytes coming from? My TT table entry size is the same as before, and I think I needed only 1 bit to make ABDADA work.
My TT entry size did not increase either. I just changed the size of the depth field from 10 to 9 bits.
In the pseudocode I posted before, I used two arrays:
* One 256-bit "searched[]" array to keep track of which moves were searched in the first ABDADA pass.
* One 256-byte "lmrVec[]" array to keep track of the LMR reductions computed in the first ABDADA pass, to ensure that moves delayed to the second pass use the same LMR reduction as they would have used if they had been searched in the first pass. If I just recomputed the LMR in the second pass, I might get a different result, because the calculation depends on the number of legal moves already tested, not just on the move's index in the sorted move list.
However, neither of those arrays is really needed, because my Move class already has a 4-byte score field, so I can just encode that information as a special score range within the move list. I am currently running tests with that modification (and some other tweaks). It looks promising, but the tests are not finished yet.
Daniel Shawul wrote:I am surprised that your DTS-like implementation loses to lazy on 4 cores. My YBW (also DTS-like because it is peer-to-peer; however, idle threads still get picked up rather than looking for work themselves) beats lazy by a heavy margin, and I am puzzled about what my implementation is missing for other people to see big wins with lazy. I don't do the depth+1 trick and the others, and I also synchronize the threads every iteration, but I still expected to see a significant gain with pure SHT.
I am going to try all the lazy tricks and also increase selectivity until I get the Elo gains other people are getting with lazy.
I usually prefer to base what I write on concrete tests, but currently the list of tests I want to run grows faster than I can run them, so I will make an exception.
{begin speculation}
1. It might be the case that lazy SMP (and SHT) work better if you have a per-thread history table. The late moves in ALL nodes are mostly ordered by the history heuristic, and if the history table is thread-local, there is a decent chance that it diverges between threads. The effect would be that different threads often don't search the same moves at the same time, and can therefore pick up results from other threads via the transposition table.
2. If you do heavy LMR/LMP-type reductions, and the reduction a move receives depends on the move order, a side effect of a thread-local history table is that different threads will use different reduction amounts for the same move. Since the moves searched early by each thread typically get a lower reduction, and those results are later picked up through the hash table by other threads, the net effect is as if the average LMR/LMP reduction were lower than in a single-threaded search. It is already known that LMR/LMP is a big loss in fixed-depth games but a big gain in fixed-time games, so I expect this effect would cause some Elo increase, although not as big as with an SMP algorithm that avoids duplicated work without decreasing LMR/LMP.
3. If you do neither 1 nor 2, SHT/lazy SMP may give very little gain, which could explain why SHT has historically been very inefficient in many programs, even though lazy SMP is quite efficient in many modern ones.
{end speculation}
I intend to eventually run tests to see if any of those conjectures are actually correct.