Gerd Isenberg wrote: diep wrote: Don wrote: hgm wrote: What do you mean by 'without memory access', then? Surely indexing a table counts as a memory access...
I meant without memory access but I would settle for anything that produced a speedup on modern processors.
The mailbox programs would index a much smaller array by using (destination - origin) as the index. To make that work with a 6-bit square, a little bit twiddling would do the job. So if having a small array is a win, it might still be a win despite some additional bit twiddling to convert a 6-bit square to a 7-bit square, but it's hard to believe that would make any difference.
Why would a 64x64 array be slower than executing more code with smaller array access?
The real problem of modern processors is mainly decode speed of instructions.
Really? I thought still branches and L1, L2 misses - at least in chess programs.
Sometimes smaller memory pays off, 32K vs. < 2K, likely more often in L1. A few additional instructions may improve IPC and hide latencies. How much is an L1 miss on today's processors?
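As a sketch of the delta-index trick Don describes in the quote above: widen the 6-bit square (0..63) to a 7-bit 0x88 square, and (to - from) then becomes unique per geometric direction and distance, so one roughly 240-entry table can replace a 64x64 one. This is a minimal, illustrative C version (knight moves only; the names and table layout are my own, not from any particular engine):

```c
#include <string.h>

/* Widen a 6-bit square (rank*8 + file) to a 7-bit 0x88
   square (rank*16 + file): sq & ~7 equals rank*8.      */
static int to88(int sq) { return sq + (sq & ~7); }

/* On a 0x88 board the difference (to88 - from88) is unique per
   direction and distance, so a 240-entry table indexed by
   (diff + 0x77) answers "can this piece type go from->to".    */
static unsigned char attack_tab[240];

enum { KNIGHT_FLAG = 1 };

static void init_attack_tab(void) {
    static const int knight_delta[8] = { 14, 18, 31, 33, -14, -18, -31, -33 };
    memset(attack_tab, 0, sizeof attack_tab);
    for (int i = 0; i < 8; i++)
        attack_tab[0x77 + knight_delta[i]] |= KNIGHT_FLAG;
}

/* Under 2K of table instead of a 64x64 array per piece type. */
static int knight_can_reach(int from, int to) {
    return attack_tab[to88(to) - to88(from) + 0x77] & KNIGHT_FLAG;
}
```

Whether the smaller footprint actually wins over a plain 64x64 lookup is exactly the open question in the quote: a few extra instructions versus less L1 pressure.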
From what I understand from the hardware engineers posting, the problem nowadays at the fastest CPUs is the decoding.
There is a lot of truth to this if you think about it.
Making caches faster or predicting branches even better just isn't going to speed you up significantly, as they are already near the limits there if you ask me.
More threads, however, will speed you up easily.
Those extra threads can only run on those cores if the cores manage to decode more instructions.
For Diep it's totally trivial that the problem is the instruction stream.
If they managed to decode more instructions, you could run 2x more threads, as the processors have plenty of execution units nowadays.
The question is how motivated Intel is to progress fast with so little competition.
I understood from Joel Hruska that AMD's new design would split the 1x4 decoding into 2x2 decoding.
That should speed up the IPC, yet the real limitation is that it's still decoding 4 instructions a clock in total, whereas Diep's IPC there already is around 1.7.
So by decoding 2x3 instructions instead of the current 1x4, it would be possible, for a small increase in transistor count, to run 4 threads rather than 2 at each module.
The same logic applies to Intel as well, though they would need more of a redesign to benefit from this.
We see that on Intel's latest HPC CPU, Larrabee, they now run 4 threads at each core as well. Of course that's a vector design, so it's not directly comparable.
Yet if they decoded more instructions, more threads would simply make sense.
Let's look objectively at the i7: the IPC already is nearly 1.8 for Diep. If I fix things like changing 'int' to 'unsigned int' and remove branches by doing a single compare in an 'if' instead of several, I'm sure I could get it up to 2.2 if I really put in the effort.
This is where the CPU really has problems delivering 4 instructions a clock.
The theoretical max would be somewhere around 60% or so for code like this, so that's an IPC of 2.4.
A few rare codes that the designers tuned for achieve over 70%, yet those are throughput codes; let's not compare with those.
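To make the single-compare point concrete, here is a generic C illustration (not Diep's actual code) of folding several signed compares into one unsigned compare per value:

```c
/* Four compares, four branches the predictor can miss: */
static int on_board_slow(int rank, int file) {
    if (rank < 0) return 0;
    if (rank > 7) return 0;
    if (file < 0) return 0;
    if (file > 7) return 0;
    return 1;
}

/* Casting to unsigned turns a negative value into a huge one, so the
   two-sided range check collapses into one compare per value, and
   using '&' instead of '&&' keeps the combination branch-free.     */
static int on_board_fast(int rank, int file) {
    return ((unsigned)rank <= 7u) & ((unsigned)file <= 7u);
}
```

Fewer branches and fewer instructions in the hot path is exactly what helps when the decoder, not the execution units, is the bottleneck.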
Of course what you can achieve with bitboards, without vectorisation that is, is probably lower than what I can achieve with Diep, as there is only 1 execution unit that can do certain instructions and only 2 execution units that can shift.
That will also drag down the average IPC, probably to not much over 1.5 with 2 threads on 1 core.
See how close Diep virtually already is there, despite its huge number of branches, if we include the possible improvements, which I actually might go for one day (the last systematic speedup of Diep's code was in 1998 or so, so another sprint there makes sense).
So the statement of the hardware engineers really makes sense. Also, it wasn't contradicted by any expert, and in such forums silence means agreement. Meaning they would fix the rest of the CPU if it were able to decode more!