Re: Performance: linux vs Windows vs Mac OS X

Posted: Wed Dec 02, 2009 7:40 am
by abulmo
shiv wrote:For PGO optimization to work, the profiling run should be typical of normal program execution (in terms of the branching decisions made). If the program is not branch intensive or if the profiling run is not typical of the actual run, PGO probably will not help. Wonder if this holds for your program.
Many intensive tasks (evaluation function, move generation) of my program indeed use branchless algorithms.
shiv wrote:I apologize for stating the obvious, but the fact that PGO makes performance worse does appear somewhat odd.
You're right, my data showing worse performance were old, and I may have tested PGO poorly at the time. I retried the idea, and all compilers now get better performance with PGO, but the speed increase is really small, about 1%, which is within the variability of my performance test (the parallel search gives less reproducible results).
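To illustrate what "branchless" means here, a minimal sketch in C follows. It is not taken from the program discussed in this thread; the function names max_branchy and max_branchless are hypothetical. The first version contains a conditional that a compiler may turn into a jump (the kind of decision a PGO profile can inform), while the second computes the same result with a mask, so there is little branching behaviour left for a profile to describe.

```c
#include <stdint.h>

/* Branchy version: the compiler may emit a conditional jump here,
 * and profile data can tell it which way the branch usually goes. */
int max_branchy(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* Branchless version (hypothetical illustration): select the result
 * with a bit mask instead of a jump. */
int max_branchless(int a, int b)
{
    /* (a > b) is 1 or 0, so mask is all ones when a > b, else all zeros. */
    int mask = -(a > b);
    return (a & mask) | (b & ~mask);
}
```

Both functions return the same value for any pair of ints. Modern compilers often generate a conditional move for the first version as well, so the difference is not guaranteed to show up in the generated code.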

Re: Performance: linux vs Windows vs Mac OS X

Posted: Wed Dec 02, 2009 7:57 am
by bob
abulmo wrote:
shiv wrote:For PGO optimization to work, the profiling run should be typical of normal program execution (in terms of the branching decisions made). If the program is not branch intensive or if the profiling run is not typical of the actual run, PGO probably will not help. Wonder if this holds for your program.
Many intensive tasks (evaluation function, move generation) of my program indeed use branchless algorithms.
shiv wrote:I apologize for stating the obvious, but the fact that PGO makes performance worse does appear somewhat odd.
You're right, my data showing worse performance were old, and I may have tested PGO poorly at the time. I retried the idea, and all compilers now get better performance with PGO, but the speed increase is really small, about 1%, which is within the variability of my performance test (the parallel search gives less reproducible results).
There is much to consider in a parallel search. Cache issues can be paramount: separate caches (such as the two or four L1 caches on a dual- or quad-core box) can generate a ton of inter-cache forwarding traffic if you are sloppy about how you lay things out in memory, so that two processors frequently access data that lies within the same cache line. PGO can't do much for a program other than keep the cache from prefetching unused data, by rearranging code as needed to avoid fetching blocks of instructions that are generally skipped over. Anything else is up to the normal optimizer, which won't be improved at all by PGO.
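The layout problem described above (two processors touching data in the same cache line) can be sketched in C11 as follows. This is only an illustration under stated assumptions: a 64-byte cache line, four threads, and hypothetical names (packed_counters, padded_counter, search_counters, same_cache_line) that do not come from any program in this thread.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumption: common on x86, check your CPU */
#define MAX_THREADS 4       /* hypothetical number of search threads */

/* Sloppy layout: the per-thread counters sit next to each other, so
 * several of them share one cache line and every update by one thread
 * can invalidate that line in the other cores' caches. */
struct packed_counters {
    uint64_t nodes[MAX_THREADS];
};

/* Padded layout: aligning the member forces each counter onto its own
 * cache line, at the cost of some wasted memory. */
struct padded_counter {
    alignas(CACHE_LINE_SIZE) uint64_t nodes;
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "each padded counter should occupy exactly one cache line");

struct padded_counters {
    struct padded_counter per_thread[MAX_THREADS];
};

/* Hypothetical global: one node counter per search thread. */
struct padded_counters search_counters;

/* Returns nonzero when both addresses fall in the same cache line. */
int same_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE_SIZE == (uintptr_t)b / CACHE_LINE_SIZE;
}
```

With this layout, thread i would update only search_counters.per_thread[i].nodes, and same_cache_line can be used in a debug build to check that two threads' data really ended up on different lines.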