PINE A64+ 1.2 GHz quad core 64 bit: US$19

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: PINE A64+ vs Raspberry Pi B2

Post by matthewlai »

There is also the ODROID-XU4 (http://www.hardkernel.com/main/products ... 3452239825)

It's quite a bit more expensive at $74, but it has 4 A15 cores (2 GHz) and 4 A7 cores (1.5 GHz).

Looking at just the A15 cores: the A15 has 3.5-4.1 DMIPS/MHz and the A7 (the core in the RPi 2) has 1.9 DMIPS/MHz, so the XU4 has roughly a 2x advantage over the RPi 2 in instructions per clock, and another 2x advantage in clock speed.

It also has eMMC, USB 3.0, and Gigabit ethernet (though unclear if that's over USB).

Then there are the A7 cores, though it's unclear how they can be used in our applications.

It's probably a better building block for a cluster than RPi 2.

Cortex-A53 gives you 2.3 DMIPS/MHz, so an A53 @ 1.2 GHz is quite a bit slower than an A15 @ 2 GHz, even after allowing for the cost of emulating 64-bit operations on the 32-bit A15.
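
To make the comparison concrete, here is a rough sketch of the arithmetic behind those figures. It multiplies the DMIPS/MHz numbers quoted above by the clock speed. The 900 MHz clock for the RPi 2 is an assumption that is not stated in this thread, and the low end of the A15 range is used to stay conservative. Dhrystone is only a crude proxy for chess engine speed.

```python
# Per-core Dhrystone estimate: DMIPS = (DMIPS/MHz) * clock in MHz.
# DMIPS/MHz values are the ones quoted in the post above.
# The RPi 2 clock (900 MHz) is an assumption, not taken from the thread.
CORES = {
    "ODROID-XU4 A15": (3.5, 2000),  # low end of the 3.5-4.1 range
    "PINE A64+ A53": (2.3, 1200),
    "RPi 2 A7 (assumed 900 MHz)": (1.9, 900),
}


def dmips(dmips_per_mhz, mhz):
    """Estimated single-core DMIPS for a given efficiency and clock."""
    return dmips_per_mhz * mhz


estimates = {name: dmips(*spec) for name, spec in CORES.items()}

for name, value in estimates.items():
    ratio = estimates["ODROID-XU4 A15"] / value
    print(f"{name}: {value:.0f} DMIPS per core (A15 is {ratio:.1f}x this)")
```

With these inputs a single A15 core comes out at roughly 2.5x an A53 core and roughly 4x an RPi 2 core, before any penalty for handling 64-bit bitboards on 32-bit registers.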
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Bitboard efficiency

Post by sje »

Bitboard efficiency

In terms of calculations per unit energy, a program which spends a significant portion of its time on 64 bit calculations runs less efficiently on 32 bit hardware than on 64 bit hardware. This is true at least with current hardware offerings.

The quad core 32 bit Raspberry Pi comes close, but still falls short. Older 32 bit desktop machines fall way, way behind.

An examination of the compiler's generated assembly language source shows why. Handling 64 bit operands takes twice as many instructions for most operations. It's even worse for shift and rotate operations which require some merging of the intermediate results of each pair of 32 bit operations.
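
As an illustration of the merging step, here is a minimal sketch that models a 64 bit left shift using only 32 bit halves. It is written in Python for readability and is not actual compiler output; the function names (`split`, `join`, `shl64_on_32`, `and64_on_32`) are made up for this example.

```python
MASK32 = 0xFFFFFFFF


def split(x):
    """Split a 64 bit value into (low word, high word)."""
    return x & MASK32, (x >> 32) & MASK32


def join(lo, hi):
    """Rebuild a 64 bit value from two 32 bit words."""
    return (hi << 32) | lo


def and64_on_32(a_lo, a_hi, b_lo, b_hi):
    """Bitwise AND: two independent 32 bit operations, no merging needed."""
    return a_lo & b_lo, a_hi & b_hi


def shl64_on_32(lo, hi, n):
    """Left shift of (hi:lo) by n bits, 0 <= n < 64, on 32 bit words."""
    if n == 0:
        return lo, hi
    if n >= 32:
        # The whole low word moves into the high word.
        return 0, (lo << (n - 32)) & MASK32
    # Bits shifted out of the low word must be merged into the high word.
    carry = lo >> (32 - n)
    new_hi = ((hi << n) & MASK32) | carry
    new_lo = (lo << n) & MASK32
    return new_lo, new_hi


# Pushing a full rank of pawns one rank up crosses the word boundary.
rank4 = 0x00000000FF000000
pushed = join(*shl64_on_32(*split(rank4), 8))
print(hex(pushed))
```

The AND needs two operations where a 64 bit CPU needs one, while the shift needs two shifts plus the extra shift and OR for the carry, which is the merging cost described above.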

The only advantage of a 32 bit CPU is less code for arbitrary addressing, in exchange for a 4 GiB limit on the address space. But most 64 bit CPUs can use 32 bit addressing offsets. There are still some savings with a 32 bit CPU in that pointers are only four bytes long, but this doesn't make up for the cost of the extra code for handling 64 bit operands.

----

The Pine A64+ should become widely available by summer 2016. But by then, I expect it to have some real competition as developers start to feel the limitations of 32 bit CPUs with respect to handling the large address spaces needed by multimedia applications.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Bitboard efficiency

Post by matthewlai »

sje wrote:Bitboard efficiency

On the other hand, CPU energy consumption is almost proportional to bit width. A CPU's dynamic power is proportional to number of transistors that switch per clock cycle, and a 64-bit CPU would have twice as wide ALUs, twice as wide buses, and twice as wide pretty much everything. They would all need to have about twice the number of transistors.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Bitboard efficiency

Post by sje »

matthewlai wrote:On the other hand, CPU energy consumption is almost proportional to bit width. A CPU's dynamic power is proportional to number of transistors that switch per clock cycle, and a 64-bit CPU would have twice as wide ALUs, twice as wide buses, and twice as wide pretty much everything. They would all need to have about twice the number of transistors.
I don't see it. The half length registers and paths in a 32 bit CPU have to work twice as much compared to a 64 bit CPU, so the heat will be about the same.
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Bitboard efficiency

Post by Carey »

matthewlai wrote:
On the other hand, CPU energy consumption is almost proportional to bit width. A CPU's dynamic power is proportional to number of transistors that switch per clock cycle, and a 64-bit CPU would have twice as wide ALUs, twice as wide buses, and twice as wide pretty much everything. They would all need to have about twice the number of transistors.
Caches don't have to be increased. Instruction decoder circuitry doesn't have to be increased. FPU circuitry doesn't have to change. General CPU management, interrupt hardware, bus controller, etc. can stay the same, too.

Only the core operations themselves need to be widened, and some of them (such as multiply & divide) could be done in stages and don't get used much in chess anyway. It's a performance / power trade-off.

How many more transistors did x64 add to the 32 bit x86? I could be wrong (and very often am) but I want to say only about an extra 10% to actually add the 64 bit versions of the instructions.

Of course, ARM could be very different since it doesn't have the x86 instruction set to deal with, which is a major flaw and power drain.
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Bitboard efficiency

Post by stegemma »

sje wrote:
matthewlai wrote:On the other hand, CPU energy consumption is almost proportional to bit width. A CPU's dynamic power is proportional to number of transistors that switch per clock cycle, and a 64-bit CPU would have twice as wide ALUs, twice as wide buses, and twice as wide pretty much everything. They would all need to have about twice the number of transistors.
I don't see it. The half length registers and paths in a 32 bit CPU have to work twice as much compared to a 64 bit CPU, so the heat will be about the same.
That's true for 64 bit software, but sometimes you only need part of a register, and in those cases an 8/16 bit CPU would be more efficient: say, a loop variable for fewer than 256 iterations, a piece value that fits in 8/16 bits, and so on. If you had an 8 bit CPU with the power of a Pentium, you could get faster (and stronger) software even with a mailbox chess engine than with 64 bit bitboard software at the same power consumption, because of the smaller number of transistors you would need. Of course, no one wants to design an 8 bit CPU today with the advanced logic of a Pentium and similar CPUs, so this comparison can't actually be made.
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Throughput/power ratio

Post by sje »

Throughput/power ratio

My eight core, 64 bit Core i7-5960 might use 30 times the power of a Raspberry Pi, but it also has 300 times the throughput. (Estimates, but close.) Overall, while my most power hungry box with its US$1,049 3.0 GHz i7-5960 CPU and its US$700 64 GiB 2.4 GHz RAM is the most expensive computer I own, it also is the most efficient. If I could beat it with a Raspberry Pi cluster, then I would.
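
The arithmetic behind that estimate can be sketched as follows. It uses the 30x power and 300x throughput figures from this post, and it assumes a Raspberry Pi cluster scales perfectly with the number of boards, which is optimistic for a parallel chess search.

```python
def relative_efficiency(throughput_ratio, power_ratio):
    """Work per unit energy, relative to the baseline machine."""
    return throughput_ratio / power_ratio


# Figures from the post: the i7 has 300x the throughput at 30x the power.
i7_vs_pi = relative_efficiency(300, 30)

# Assumption: a cluster of n boards gives n times the throughput at
# n times the power (perfect scaling, no network or host overhead).
boards_to_match_i7 = 300
cluster_power_vs_i7 = boards_to_match_i7 / 30

print(f"i7 does {i7_vs_pi:.0f}x the work per unit energy")
print(f"a matching Pi cluster would draw {cluster_power_vs_i7:.0f}x the i7's power")
```

Under these assumptions, matching the i7 would take about 300 boards drawing about ten times the power, so the cluster cannot win on efficiency.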
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Throughput/power ratio

Post by Dann Corbit »

sje wrote:Throughput/power ratio

My eight core, 64 bit Core i7-5960 might use 30 times the power of a Raspberry Pi, but it also has 300 times the throughput. (Estimates, but close.) Overall, while my most power hungry box with its US$1,049 3.0 GHz i7-5960 CPU and its US$700 64 GiB 2.4 GHz RAM is the most expensive computer I own, it also is the most efficient. If I could beat it with a Raspberry Pi cluster, then I would.
The most important measure would be applied GHz/Dollar.

We could measure a proportional version by (e.g.):
cost_effectiveness =(20_ply_search_from_root_time)/total_cost_of_system;
If we wanted to get more technically accurate, we could also measure electricity cost of operation and add that in.

It seems to be approximately what you are doing, but I think it would be important to measure it for the actual operation in question.
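
One possible way to turn that into something runnable is sketched below. Note that it swaps the time/cost ratio above for 1/(time x cost), so that a faster search and a cheaper system both raise the score. That is my reading of the idea, not a definition from this thread. All prices, wattages and timings are hypothetical placeholders.

```python
def total_cost(hardware_cost, watts, hours, price_per_kwh):
    """Purchase price plus electricity over the period of use."""
    energy_kwh = watts / 1000.0 * hours
    return hardware_cost + energy_kwh * price_per_kwh


def cost_effectiveness(search_seconds, cost):
    """Search speed (1 / time) per unit of money; higher is better."""
    return 1.0 / (search_seconds * cost)


# Hypothetical systems: time for the same fixed-depth search from the
# root position, measured on the actual engine in question.
small_board = cost_effectiveness(
    search_seconds=600.0,
    cost=total_cost(hardware_cost=40.0, watts=5.0, hours=1000.0, price_per_kwh=0.2),
)
desktop = cost_effectiveness(
    search_seconds=10.0,
    cost=total_cost(hardware_cost=2000.0, watts=150.0, hours=1000.0, price_per_kwh=0.2),
)

print(f"small board: {small_board:.3e}")
print(f"desktop:     {desktop:.3e}")
```

The point of the sketch is only that the search time has to come from running the real workload on each system; the numbers here are made up.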
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Throughput/power ratio

Post by stegemma »

Dann Corbit wrote:[...]

The most important measure would be applied GHz/Dollar.

We could measure a proportional version by (e.g.):
cost_effectiveness =(20_ply_search_from_root_time)/total_cost_of_system;
If we wanted to get more technically accurate, we could also measure electricity cost of operation and add that in.

It seems to be approximately what you are doing, but I think it would be important to measure it for the actual operation in question.
Yes, and maybe you should treat it like a car and count all of the costs. I have an old Mazda 2 that is 11 years old and has run 374000 km with very few mechanical problems over all these years. The initial cost was 15500 euros and it has a maximum speed of a little more than 140 km/h (in Italy you can't go faster than 130 km/h, so I really don't know). It does about 20 km/l, so I've used about 18000 liters of diesel, at varying prices over the years... let's say about 18000 euros. The overall cost could be about 40000 euros in 11 years, so the cost was about 300 euros/month.

The same could be done for a PC running chess engines, but it is still a hobby. Maybe a faster PC is better even if it is more expensive, because we may only use it at weekends or for a couple of hours at night. The satisfaction of building your own cluster of cheap boards can be enough, even if it turns out more expensive or more power-hungry. Who knows? What's important is that everybody knows about all the aspects involved, to avoid illusions and disillusionment.