My program, running on a Core 2 Duo, was sped up by 10% -- not a bad result for a few minutes' work. Although other hardware may not benefit to the same extent, Core 2 is very common. Somewhat to my surprise, even a 32-bit build was improved, but only by 3%.
That only works for the LSB, yet there are lots of occasions where you want the 1 bit on the other end. And for a black/white-neutral program, this is a further problem.
I have no idea why it was a 10% improvement, however. It should be minimal, more like 1% at best -- unless you stick it in a loop and only test it by running repeated calls for a minute or so. In a real program, that is a tiny part of the total work being done.
bob wrote:
That only works for the LSB, yet there are lots of occasions where you want the 1 bit on the other end. And for a black/white-neutral program, this is a further problem.
I have no idea why it was a 10% improvement, however. It should be minimal, more like 1% at best -- unless you stick it in a loop and only test it by running repeated calls for a minute or so. In a real program, that is a tiny part of the total work being done.
Some programs don't care about the scan order. The advantage is to make the ls1b bitscan and the ls1b reset independent of each other. Whether this saves 10% depends on how much time was spent serializing bitboards before, with memory reads dependent on the bsf result. If it was, let's say, 15..20% before, 10% might be reasonable.
; input: rcx = target bitboard to serialize
;        rdi = i.e. move destination
    test rcx, rcx        ; anything to serialize?
    jz   over
l1:
    lea  rdx, [rcx - 1]  ; prepare ls1b reset mask
    bsf  rax, rcx        ; index of ls1b
    ...                  ; eventually "or" from square
    and  rcx, rdx        ; reset ls1b, independent of the bsf result
    ...                  ; mov or stos
    jnz  l1
over:
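The same loop can be sketched in C. This is an illustrative reconstruction, not code from any engine mentioned in the thread; it assumes a GCC/Clang compiler so that `__builtin_ctzll` maps to bsf. The point is the one Gerd makes: the ls1b reset (`b &= b - 1`) does not depend on the bitscan result, so the two dependency chains can overlap.

```c
#include <stdint.h>

/* Serialize a bitboard into a list of square indices.
   Hypothetical sketch: "moves" and "count" stand in for whatever
   per-move work the real engine does at the "..." lines above. */
static void serialize(uint64_t b, int *moves, int *count)
{
    *count = 0;
    while (b) {
        int sq = __builtin_ctzll(b); /* bsf: index of least significant 1 bit */
        b &= b - 1;                  /* reset ls1b, independent of sq */
        moves[(*count)++] = sq;      /* e.g. store move destination */
    }
}
```

Because the reset no longer reads the bsf output, the loop-carried dependency is just the cheap `and`, while the scan and any scan-dependent memory reads run alongside it.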
I know some don't, but most would probably care, for debugging reasons, unless you are one of those that flip the entire board each ply so it is always the same side to move. I want to search the moves in the same order whether I start from black or white in the initial position, which lets me detect asymmetry issues, not to mention improving move ordering and such.
The measured 10% speedup is for actual search, not some contrived test.
My program uses LSB scans only, and rather a lot of them. I incrementally update attack boards with scan-rich code, and my positional scoring does up to 12 scans. Thus circumstances are favourable for the optimization.
Disassembly:
micron wrote: The measured 10% speedup is for actual search, not some contrived test.
My program uses LSB scans only, and rather a lot of them. I incrementally update attack boards with scan-rich code, and my positional scoring does up to 12 scans. Thus circumstances are favourable for the optimization.
Disassembly:
My only comment is that if this is true, you need to seriously re-evaluate how you are doing things. There's absolutely no reason for the LSB/MSB stuff to take 10% of your time, even if you use a loop / shift to find these bits rather than the a & (a-1) trick. If I run on an i7 and use the hardware popcnt instruction, it is difficult to measure any improvement in speed, as I am simply careful where I use them.
10% is _way_ too high. 1% is too high. And if speeding this up improves overall speed by 10%, that means that MSB/LSB stuff is taking _more_ than 10% of total time. There's a free performance boost waiting to be grabbed...
Don't speed 'em up, get rid of 'em completely, if they are taking that much time. Another way of thinking about this is that if you could make everything but MSB/LSB code run in zero time, you would speed the program up by less than a factor of 10... not good.
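For reference, the tricks bob names can be sketched as below. This is a hedged illustration, assuming GCC/Clang builtins; the function names are mine, not from any engine in the thread. With -mpopcnt, `__builtin_popcountll` compiles to the hardware popcnt instruction he mentions.

```c
#include <stdint.h>

/* The "a & a-1" trick: clear the least significant 1 bit. */
static inline uint64_t clear_lsb(uint64_t a) { return a & (a - 1); }

/* Hardware (or fallback) population count. */
static inline int popcount64(uint64_t a) { return __builtin_popcountll(a); }

/* Counting set bits using only the reset trick, with no bitscan at all;
   one iteration per set bit, regardless of where the bits sit. */
static inline int popcount_loop(uint64_t a)
{
    int n = 0;
    while (a) { a = clear_lsb(a); n++; }
    return n;
}
```

The broader point stands either way: if such primitives really account for 10%+ of runtime, restructuring to call them less often buys more than micro-optimizing each call.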