Core Port Saturation

bob · Post by **bob** » Tue Apr 15, 2014 10:39 pm

BeyondCritics wrote:Thank you for your hints.

I have one question: What happens if I call a function indirectly through a pointer, as often happens when virtual methods are in play? Is there some sort of penalty, and how big is it?

Oliver

The call is problematic. Luckily, the branch target buffer helps by predicting where the branch will go. But if the target changes frequently, the prediction will always be wrong. The only good thing is that the return will be predicted correctly most of the time, thanks to the call/return prediction stack the CPU maintains internally...

With a conditional branch, you have only two possibilities: taken or not taken. With an indirect branch, the possibilities are HUGE. And HUGELY unpredictable.

Think of every such indirect function call as an instant "pipeline drain" instruction.
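To make the question concrete, here is a minimal C++ sketch of the kind of call site being discussed. The type and function names (`Evaluator`, `Material`, `Mobility`, `sum_scores`, and the two `make_*` helpers) are hypothetical and exist only for illustration. The loop makes one indirect call per element; in the first container the target never changes, in the second it alternates. How well a particular CPU predicts either pattern, and whether the compiler devirtualizes the call, depends on the hardware and toolchain, so treat this as something to benchmark rather than as a statement of cost.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical types, for illustration only (not from the thread).
struct Evaluator {
    virtual ~Evaluator() = default;
    virtual int score(int x) const = 0;
};

struct Material : Evaluator {
    int score(int x) const override { return x * 2; }
};

struct Mobility : Evaluator {
    int score(int x) const override { return x + 7; }
};

// One indirect call per element. The call target depends on the dynamic
// type of each object, which is what the branch predictor has to guess.
int sum_scores(const std::vector<std::unique_ptr<Evaluator>>& evals, int x) {
    int total = 0;
    for (const auto& e : evals) {
        total += e->score(x);  // indirect call through the vtable
    }
    return total;
}

// Every element has the same dynamic type: the target never changes.
std::vector<std::unique_ptr<Evaluator>> make_monomorphic(std::size_t n) {
    std::vector<std::unique_ptr<Evaluator>> v;
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(std::make_unique<Material>());
    }
    return v;
}

// The dynamic type alternates: the target changes on every call.
std::vector<std::unique_ptr<Evaluator>> make_alternating(std::size_t n) {
    std::vector<std::unique_ptr<Evaluator>> v;
    for (std::size_t i = 0; i < n; ++i) {
        if (i % 2 == 0) {
            v.push_back(std::make_unique<Material>());
        } else {
            v.push_back(std::make_unique<Mobility>());
        }
    }
    return v;
}
```

Timing `sum_scores` over a large container of each kind (and over one with randomly mixed types) is one way to measure the penalty on a given machine.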

wgarvin · Post by **wgarvin** » Tue Apr 15, 2014 11:40 pm

bob wrote: One quibble. When you say "much kinder to integer alignment" it depends on what kind of integer. I've been working on some ASM code of late where I have written a small library to be used in my 330 class. Wanted to give them some easy code to do basic input and output, and for some things, I wanted to use the C library. For example (read or _read depending on system). Turns out that code in the library uses XMM registers. Which is NOT forgiving of alignment errors, at least on my mac. If the stack alignment is not exactly correct, _read crashes on an xmm load. Pain in the a$$. It was quite educational to figure out just how the stack had to be aligned for each library routine call.

Oops.. yeah. I'm used to a couple of non-x86 platforms where I think in terms of the three types of registers available: "integer", "float" and "vector" (SIMD).

When I said "integer", what I was actually thinking of is what x86/x64 calls "general-purpose registers", i.e. eax/edx/ecx/ebx, or rax/rdx/rcx/rbx etc. Those are very forgiving of misaligned accesses. Floating-point is usually done with SIMD registers anyway nowadays, and I can't remember the exact alignment rules there, but it's probably a hassle. The regular 16-byte XMM move instruction (MOVAPS?) requires 16-byte alignment. There's also an "unaligned" 16-byte move instruction, but on many x86 and older x64 chips it's actually slower than doing two separate 8-byte reads using MOVLPS/MOVHPS or something like that. Since it wouldn't be atomic anyway, I guess there's not much difference. Anyway, for general-purpose C code meant for x86, avoid misaligned float* or double* because those can get you into trouble. [Edit: I believe that 4- and 8-byte loads and stores involving XMM registers are allowed to be misaligned and are approximately as efficient as misaligned loads and stores with general-purpose registers, but I might be wrong about that. I mean, a small penalty for crossing a cacheline and no penalty if it's entirely within one cacheline. But if you are writing code that actually depends on this, better look it up in the Intel and AMD docs for sure.]
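One common way to follow the "avoid misaligned float*" advice is to never form the misaligned pointer at all and copy the bytes instead. The sketch below is an illustration under that assumption; the helper names (`is_aligned`, `load_float_unaligned`, `store_float_unaligned`) are hypothetical. Compilers typically turn a fixed-size `memcpy` like this into a plain load or store, but that is a code-generation detail worth checking rather than a guarantee.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// True if p is a multiple of `alignment` (which must be a power of two).
bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Read a float from an arbitrary byte address without ever creating a
// misaligned float*. The bytes are copied into a properly aligned local.
float load_float_unaligned(const unsigned char* bytes) {
    float value;
    std::memcpy(&value, bytes, sizeof value);
    return value;
}

// Write a float to an arbitrary byte address, again via a byte copy.
void store_float_unaligned(unsigned char* bytes, float value) {
    std::memcpy(bytes, &value, sizeof value);
}
```

For example, with a packed buffer `buf`, `store_float_unaligned(buf + 1, 1.5f)` followed by `load_float_unaligned(buf + 1)` round-trips the value even though `buf + 1` is not 4-byte aligned.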

Another thing I forgot to mention that is pretty nice about x86: it lets you manipulate 8-, 16-, 32- and nowadays 64-bit quantities efficiently. e.g. you can load 8- or 16-bit values into a 32- or 64-bit register, and the zero-extension or sign-extension is usually free (via MOVZX or MOVSX, which support the same addressing modes as a regular MOV). On most other platforms it would cost you an extra instruction in there somewhere to extend a value to fill a larger register.
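In C++ terms, the widening loads described above look like the sketch below. The function names are hypothetical. On x86 a compiler will usually implement the unsigned versions with MOVZX and the signed versions with MOVSX, folding the memory access into the same instruction, but the exact instruction selection is up to the compiler.

```cpp
#include <cstdint>

// Widening an unsigned 8-bit value zero-extends it
// (typically a single MOVZX from memory on x86).
std::uint32_t widen_u8(const std::uint8_t* p) { return *p; }

// Widening a signed 8-bit value sign-extends it
// (typically a single MOVSX from memory on x86).
std::int32_t widen_s8(const std::int8_t* p) { return *p; }

// The same idea for 16-bit values going into a 64-bit result.
std::uint64_t widen_u16(const std::uint16_t* p) { return *p; }
std::int64_t widen_s16(const std::int16_t* p) { return *p; }
```

Comparing the generated assembly for these functions on x86 and on another target is a quick way to see whether the extension costs a separate instruction there.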

... An interesting thing is now happening in the game industry, because both of the "next-generation" consoles by Microsoft and Sony have x64 CPU cores in them now. Once developers have stopped making games for the older generation of consoles (360, PS3, and Wii/WiiU) then all of their target platforms (XBox One, PS4 and PC) are going to be x64! Finally, we can stop doing low-level nonsense to avoid Load-Hit-Stores. Finally, monomorphic virtual method calls will actually be as cheap as direct calls!

bob · Post by **bob** » Wed Apr 16, 2014 3:30 am


I suppose as long as I am still learning something new, I am not "dead", which is a good thing. But I was quite surprised to see a library routine for read (the read(fd, buff, size) type of read) using xmm registers. It was doing 5 consecutive loads/stores, which I suppose is a fast way to move 80 bytes around. But since it was moving to the stack, the stack had better be aligned right. Same for write (or _write, as it seems Apple and Linux don't agree on the leading underscore) and other library calls. Annoying that it calls this a segmentation fault, since it is anything but that. (Mac OS X)

ZirconiumX · Post by **ZirconiumX** » Wed Apr 16, 2014 10:36 am

wgarvin wrote: ... An interesting thing is now happening in the game industry, because both of the "next-generation" consoles by Microsoft and Sony have x64 CPU cores in them now. Once developers have stopped making games for the older generation of consoles (360, PS3, and Wii/WiiU) then all of their target platforms (XBox One, PS4 and PC) are going to be x64!

Thank AMD for that.

Matthew:out

mar · Post by **mar** » Wed Apr 16, 2014 11:07 am

wgarvin wrote:... An interesting thing is now happening in the game industry, because both of the "next-generation" consoles by Microsoft and Sony have x64 CPU cores in them now. Once developers have stopped making games for the older generation of consoles (360, PS3, and Wii/WiiU) then all of their target platforms (XBox One, PS4 and PC) are going to be x64! Finally, we can stop doing low-level nonsense to avoid Load-Hit-Stores. Finally, monomorphic virtual method calls will actually be as cheap as direct calls!

A bit of OT

I was shocked when I recently discovered that the PS3 had only 256 MB of RAM! I wonder how game devs battled memory fragmentation on such platforms (plus how they minimized memory requirements in general, because modern game engines must be very complex beasts).
I guess they had to use some wizardry to make it work, as I doubt consoles support page swapping.

I was recently forced to write my own simple heap manager because HeapAlloc (used by the CRT) on Windows (for 32-bit processes) has problems managing large blocks and causes fragmentation of the VA space (thumbs down for Microsoft!!).
I thought this was a relic of the past, but on some systems it obviously remains.

64-bit programs don't suffer from the same problem, but still,
if memory managers were tailored for specific platforms (and one expects them to work optimally out of the box),
people would not have to spend time writing low-level memory managers themselves and could focus on important things instead ...
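The post does not say how that heap manager was designed, so the following is not a reconstruction of it. It is only a minimal sketch of one common approach to avoiding address-space fragmentation: grab a single large block up front and hand out pieces of it with a bump pointer. The `Arena` class and its members are hypothetical names, and `std::malloc` stands in for whatever platform-specific reservation call a real implementation would use.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical bump arena: one big allocation, sub-allocated linearly.
// Individual blocks are never freed; the whole arena is reset at once.
class Arena {
public:
    explicit Arena(std::size_t capacity)
        : base_(static_cast<unsigned char*>(std::malloc(capacity))),
          capacity_(base_ ? capacity : 0),
          used_(0) {}

    ~Arena() { std::free(base_); }

    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // Returns `size` bytes aligned to `alignment` (a power of two),
    // or nullptr if the arena does not have enough room left.
    void* allocate(std::size_t size,
                   std::size_t alignment = alignof(std::max_align_t)) {
        const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(base_);
        const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
        const std::uintptr_t aligned = (base + used_ + mask) & ~mask;
        const std::size_t offset = static_cast<std::size_t>(aligned - base);
        if (offset > capacity_ || size > capacity_ - offset) {
            return nullptr;
        }
        used_ = offset + size;
        return base_ + offset;
    }

    // Releases every block at once by rewinding the bump pointer.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    unsigned char* base_;
    std::size_t capacity_;
    std::size_t used_;
};
```

The trade-off is that memory comes back only when the whole arena is reset, which suits per-level or per-frame data better than long-lived objects with unrelated lifetimes.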

wgarvin · Post by **wgarvin** » Wed Apr 16, 2014 3:26 pm

PS3 has 256MB of system RAM, but also 256MB of video RAM. The operating system also permanently reserves some of both kinds of RAM for itself. In addition to GPU resources (textures, shaders, vertex and index buffers etc.), some games also use the video RAM to store some of the game engine's infrequently-used data, but reading it from the CPU is brutally slow, so you have to use the SPUs or a GPU shader to copy it back to system RAM when the engine needs to access it. It's similar to what GameCube games used to do with its 16MB of 'audio RAM' and 24MB of system RAM (imagine porting a game from the PS2, with its 32MB of RAM, to the GameCube... to avoid seriously downgrading the game's assets, you had to somehow make use of that ARAM. Storing animations in it was a common thing, I believe).

Anyway, Xbox 360 has 512MB of RAM shared by the GPU and the CPU, so it's a bit more flexible there, but similar to the PS3 in overall capability. The PS4 and Xbox One have way more memory available to the games, and that's one of the main reasons they will look nicer. Also, the 360/PS3 are like 8 years old now, with DX9-class rendering hardware in them. The new consoles obviously have more powerful GPUs in them.

wgarvin · Post by **wgarvin** » Wed Apr 16, 2014 3:37 pm

I spent a lot of the last 8 months trying to reduce the memory usage of an upcoming game on 360/PS3, in any way possible. It's amusing (amazing?), but it's 2014 now and spending all day trying to optimize a pile of code to save 100 KB is still a thing game developers do!
