Maybe with a few careful optimizations, performance would be better.
See here for CPU documentation that includes instruction latencies (see the "Cycle Timings and Interlock Behaviour" section).
[Edit: the "Interlock Behaviour" refers to a pipeline stall when you try to use a register too soon after writing it, very much like the AGI stall on 486 and Pentium 1 processors. Just avoid having one long dependence chain and it shouldn't be a problem.]
The CLZ (Count Leading Zeros) instruction should be usable for bit-scans. Maybe the BIC (Bit Clear) instruction would be useful for clearing the bits as CLZ reports them. ARM's compiler supports a 32-bit __clz intrinsic for the CLZ instruction, and GCC has an equivalent, __builtin_clz.
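Here is a rough sketch of what a CLZ-based bit-scan loop might look like with GCC's __builtin_clz. The helper names (pop_highest_bit, collect_bits) are made up for illustration, and whether the compiler really emits CLZ and BIC depends on the target flags, so check the generated assembly.

```c
#include <stdint.h>

/* Hypothetical helper: returns the index (31 = most significant bit) of the
   highest set bit in *bits and clears that bit. *bits must be non-zero,
   because the result of __builtin_clz(0) is undefined. */
static inline int pop_highest_bit(uint32_t *bits)
{
    int lz = __builtin_clz(*bits);           /* GCC builtin; may compile to CLZ */
    *bits &= ~(UINT32_C(0x80000000) >> lz);  /* AND with complement: a candidate for BIC */
    return 31 - lz;
}

/* Hypothetical example: store the indices of all set bits, highest first,
   and return how many were found. */
static int collect_bits(uint32_t bits, int out[32])
{
    int n = 0;
    while (bits != 0)
        out[n++] = pop_highest_bit(&bits);
    return n;
}
```

For a 64-bit bitboard you would presumably run this over the two 32-bit halves, since the intrinsic only covers 32 bits.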
It looks like 32-bit multiplies have longer latencies than 16-bit multiplies, so if you do a lot of multiplies (e.g. in eval scoring) and your values are small enough, it might be worth trying to cast things to a 16-bit type before multiplying them.
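As a sketch of the 16-bit idea: if both operands are declared as int16_t, the compiler knows they fit in 16 bits and is free to pick one of the 16x16 multiply instructions. The function names here (mul16, weighted_sum) are hypothetical, and there is no guarantee the compiler actually chooses the shorter multiply, so look at the assembly and measure.

```c
#include <stdint.h>

/* Hypothetical helper: 16-bit x 16-bit multiply with a 32-bit result.
   The product of two int16_t values always fits in an int32_t. */
static inline int32_t mul16(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* Hypothetical eval-style loop: sum of counts[i] * weights[i].
   Only valid if every count and weight really fits in 16 bits. */
static int32_t weighted_sum(const int16_t *counts, const int16_t *weights, int n)
{
    int32_t total = 0;
    for (int i = 0; i < n; i++)
        total += mul16(counts[i], weights[i]);
    return total;
}
```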
These ARM11* chips supposedly have a longer pipeline than earlier ARMs like the ARM9TDMI, but mispredictions should still only cost you in the 5-7 cycle range. Note that the BX R14 instruction (the equivalent of x86's RET) takes 4 cycles even if the return address is correctly predicted. So make sure you aren't using small helper functions without inlining them--check if your compiler settings allow inlining, and check the assembly code to make sure things are being inlined.
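For example, with GCC you can mark tiny helpers static inline and, if the optimizer still refuses, force the issue with the always_inline attribute. The helpers below (file_of, rank_of, same_file_or_rank) are invented for illustration and assume squares numbered 0..63.

```c
/* Hypothetical board helpers, squares numbered 0..63 (a1 = 0, h8 = 63).
   always_inline is a GCC attribute that requests inlining of these helpers
   regardless of the optimizer's own heuristics. */
static inline __attribute__((always_inline)) int file_of(int sq) { return sq & 7; }
static inline __attribute__((always_inline)) int rank_of(int sq) { return sq >> 3; }

/* Returns 1 if the two squares share a file or a rank, 0 otherwise.
   After inlining there should be no call/return pair for the helpers. */
static int same_file_or_rank(int a, int b)
{
    return file_of(a) == file_of(b) || rank_of(a) == rank_of(b);
}
```

Compiling with something like gcc -O2 -S and searching the output for branches to file_of or rank_of is a quick way to confirm they were inlined.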
Using the Thumb instruction set may result in slightly slower code--the instruction timings are the same, but the instructions themselves are less expressive, so it might take more of them to accomplish the same calculation.
My Raspberry Pi has arrived!