My Raspberry Pi has arrived!

Re: My Raspberry Pi has arrived!

Post by wgarvin

Maybe with a few careful optimizations, performance would be better.

The CPU documentation includes instruction latencies (see the "Cycle Timings and Interlock Behaviour" section).

[Edit: the "Interlock Behaviour" refers to a pipeline stall when you try to use a register too soon after writing it, very much like the AGI stall on 486 and Pentium 1 processors. Just avoid having one long dependence chain and it shouldn't be a problem.]
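To illustrate the "one long dependence chain" point, here is a rough sketch in C. The function name sum_two_chains is made up for this example, and whether the split actually helps on a given chip is something to measure rather than assume. The idea is simply that two independent accumulators give the pipeline something to do while the other result is still in flight.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum an array with two independent accumulators
   so that consecutive adds do not all depend on the same register. */
static int32_t sum_two_chains(const int32_t *v, size_t n)
{
    assert(v != NULL || n == 0);

    int32_t a = 0;   /* chain 1: even indices */
    int32_t b = 0;   /* chain 2: odd indices  */
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        a += v[i];       /* does not depend on b */
        b += v[i + 1];   /* does not depend on a */
    }
    if (i < n)
        a += v[i];       /* leftover element when n is odd */

    return a + b;        /* the chains are joined only once, at the end */
}
```

The same trick applies to eval code that accumulates many terms into a single score variable.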

The CLZ (Count Leading Zeros) instruction should be usable for bit-scans, and the BIC (Bit Clear) instruction may be useful for clearing the bits as CLZ reports them. ARM's compiler supports a 32-bit __clz intrinsic for the CLZ instruction; GCC offers __builtin_clz for the same purpose.
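A minimal sketch of a CLZ-based bit-scan for a 64-bit bitboard on a 32-bit CPU, assuming GCC or Clang (it uses __builtin_clz rather than ARM's __clz). The names msb_index and pop_msb are made up for this example, and it assumes unsigned int is 32 bits, which holds on these ARM targets. Whether the compiler really emits CLZ and BIC needs to be confirmed in the generated assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Index (0..63, bit 0 = least significant) of the highest set bit.
   The bitboard is split into two 32-bit halves because CLZ is 32-bit. */
static inline int msb_index(uint64_t bb)
{
    assert(bb != 0);   /* __builtin_clz is undefined for a zero argument */

    uint32_t hi = (uint32_t)(bb >> 32);
    if (hi != 0)
        return 63 - __builtin_clz(hi);
    return 31 - __builtin_clz((uint32_t)bb);
}

/* Return the index of the highest set bit and clear it in *bb. */
static inline int pop_msb(uint64_t *bb)
{
    int sq = msb_index(*bb);
    /* AND with the complement of a mask; on ARM this may compile to BIC. */
    *bb &= ~(UINT64_C(1) << sq);
    return sq;
}
```

A move generator would typically loop with while (bb) { int sq = pop_msb(&bb); ... }. Note that this scans from the high bit down, so squares come out in descending order.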

It looks like 32-bit multiplies have longer latencies than 16-bit multiplies, so if you do a lot of multiplies (e.g. in eval scoring) and your values are small enough, it might be worth trying to cast things to a 16-bit type before multiplying them.
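Something like the following is what I mean by casting to a 16-bit type first. This is only a sketch: scaled_term is a hypothetical eval helper, and the compiler is free to ignore the hint, so check the assembly to see whether a 16x16 multiply (such as SMULBB) was actually chosen.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical eval term: weight * count, with both known to be small. */
static inline int32_t scaled_term(int weight, int count)
{
    /* The narrowing below is only safe if both values fit in 16 bits. */
    assert(weight >= INT16_MIN && weight <= INT16_MAX);
    assert(count >= INT16_MIN && count <= INT16_MAX);

    int16_t w = (int16_t)weight;
    int16_t c = (int16_t)count;

    /* The operands are widened again for the multiply, so the 32-bit
       product cannot overflow; the 16-bit types just tell the compiler
       that only the low halves of the registers matter. */
    return (int32_t)w * (int32_t)c;
}
```

If the values can exceed the 16-bit range, skip this and keep the plain 32-bit multiply.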

These ARM11* chips supposedly have a longer pipeline than earlier ARMs like the ARM9TDMI, but mispredictions should still only cost you in the 5-7 cycle range. Note that the BX R14 instruction (the equivalent of x86's RET) takes 4 cycles even if the return address is correctly predicted. So make sure you aren't using small helper functions without inlining them--check if your compiler settings allow inlining, and check the assembly code to make sure things are being inlined.
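As a sketch of forcing a small helper inline, assuming GCC or Clang (always_inline is a compiler-specific attribute, and test_bit and count_set_squares are made-up names for this example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical small helper: is square sq set in the bitboard?
   always_inline asks GCC/Clang to inline it even at low optimization
   levels, so no call and return sequence is emitted for it. */
static inline __attribute__((always_inline))
int test_bit(uint64_t bb, int sq)
{
    assert(sq >= 0 && sq < 64);
    return (int)((bb >> sq) & 1u);
}

/* Hypothetical caller: count how many of the listed squares are occupied. */
static int count_set_squares(uint64_t bb, const int *squares, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; i++)
        count += test_bit(bb, squares[i]);   /* expected to be inlined */
    return count;
}
```

Compiling with gcc -O2 -S and looking for a BL to test_bit in the output is a quick way to check whether the call was really removed.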

Using the Thumb instruction set may result in slightly slower code--the instruction timings are the same, but the instructions themselves are less expressive, so it might take more of them to accomplish the same calculation.