undo move vs. Position Cloning

Karlo Bala · Post by **Karlo Bala** » Wed Sep 23, 2009 1:13 pm

stegemma wrote:
Dann Corbit wrote:
...It seems like everything in chess programming comes down to "try it"....
Yes, that's true but i would add that you can foresee what could happen if you look at the assembly code. Despite from cache and pipeline issue, just seeing 60 assembly istructions (or a loop of 3 istructions repeated 20 times) instead of 20 istructions could give you an advise that a function "should be" wrong.

I just suggest a simple quantitative method to know if sonething could be good or not: just count the assembly istructions!!! (of course i know about clock cycles involved in istructions, pipeline, caching, stall and so on... but i don't ask so much from a pure C programmer).
I'm not sure that memcpy is right solution. Memcpy is good when compiler doesn't know in advance how many bytes to copy. In this particular case I think it is better to unroll loop by using just MOV instruction or similar SSE instructions. But, as Bob mentioned, it is not about copy performance, but about lot of data terrific through the cache that pollute other useful information in cache.

mcostalba · Post by **mcostalba** » Wed Sep 23, 2009 1:39 pm

stegemma wrote: I just suggest a simple quantitative method to know if sonething could be good or not: just count the assembly istructions!!!

It is the wrong way to go. Thank you

One single assembly instruction can be slower then 1000 ones (for instance if accessing a memory location not in cache).

So or the method is solid or it is better to avoid it entirely. Half baked tricks are a BIG source of wasted time and misleading understandings.

stegemma · Post by **stegemma** » Thu Sep 24, 2009 9:22 am

mcostalba wrote:
stegemma wrote: ...One single assembly instruction can be slower then 1000 ones (for instance if accessing a memory location not in cache)....
Yes but if you don't know what any istruction does... having 60 istructions instead of 20 means that statistically there are more probability to find one istruction that is 1000 times slower than those in the other solution.

Just for sample, in one of my C++ library i have used try-catch block almost in any function. Because that library was used in a lot of programs i've searched a way to optimize and looked at assembly code generated by compiler. I've seen that try-catch was compiled with a packet of assembly istructions. Just throwing away try-catch gives me a good increase in performance (about 10/20%), because the functions were very short.

The most of the assembly istructions takes just one or a little more clock cycles, in Pentium. In the average, a 20 istructions function takes 1/3 of the time of a 60 istructions function. This is just a first look that can help C programmers, i'd never said that it is the only and better way to optimize code. In my opinion [IMHO IMHO IMHO ] the only and better way it is to write the whole engine in assembly... even if my weak programs Drago and Raffaela seems to contradict mself

mcostalba · Post by **mcostalba** » Thu Sep 24, 2009 9:39 am

stegemma wrote:
mcostalba wrote:
stegemma wrote: ...One single assembly instruction can be slower then 1000 ones (for instance if accessing a memory location not in cache)....
Yes but if you don't know what any istruction does... having 60 istructions instead of 20 means that statistically there are more probability to find one istruction that is 1000 times slower than those in the other solution.
Sorry Stefano, but why do not measure directly with a speed test ? It is easy in this case, reliable and does not work by luck...ehmm...sorry...I meant "by statistic"

stegemma wrote: Just for sample, in one of my C++ library i have used try-catch block almost in any function. Because that library was used in a lot of programs i've searched a way to optimize and looked at assembly code generated by compiler. I've seen that try-catch was compiled with a packet of assembly istructions. Just throwing away try-catch gives me a good increase in performance (about 10/20%), because the functions were very short.
Well, exception handling is a well known source of slow down, that's the reason in almost any compiler you can switch off exceptions when compiling your binary.

stegemma wrote: The most of the assembly istructions takes just one or a little more clock cycles, in Pentium. In the average, a 20 istructions function takes 1/3 of the time of a 60 istructions function. This is just a first look that can help C programmers, i'd never said that it is the only and better way to optimize code. In my opinion [IMHO IMHO IMHO ] the only and better way it is to write the whole engine in assembly... even if my weak programs Drago and Raffaela seems to contradict mself
The point is that if we don't have _anything_ else then looking at instruction count _could_ be a way to have some hints (no more then that) about performance. But we have much better ways to test performance. Speed test is the best and is easy in this case (easier then count instructions BTW). Profiling with a proper tool is another way.

I also disagree that writing in assembly is better. Especially the whole engine. If you write in assembly you are stuck with very simple alghoritms otherwise it become too complex and too tedious to debug. And another giant limitation is that you are more constrained and have less freedom to test new ideas: if implementation of a "possible good" idea takes few hours or few days you can try it, if it takes a week you are very careful on what ideas to try and at the end you try only a subset of all possible improvments and this makes you program weaker because it is like developing with an hand bound behind your back.

No matter what langhage you use (well....actually matters ) but any language you use must be instrumental and be flexible to follow your ideas, not the contrary.

Ideas must be the first tool, language has to follow easily and do not bottleneck your mind.

From when you have a new idea to when you have build up the implementation to test it, in between it is only a waste of time.

stegemma · Post by **stegemma** » Thu Sep 24, 2009 9:58 am

mcostalba wrote: ...and this makes you program weaker because it is like developing with an hand bound behind your back.

Wow... i really like what you've said! In fact programming in assembly is a very time comsuming activity but not to write the program itself but to debug it. To avoid or limit this, one should write some debugging function, to show the state of the chessboard and other structure. This could help a lot, for me.

I don't agree that you need more time to add new idea to your assembly program than in C or other language. If you use a structured programming style, you can easly change any algorithm in assembly the same as you can do it in C. Maybe in assembly you have to write more istructions than in C... and this could statistically give you more ways to add bugs... but we now agree that statistic is not the better way to analize programs and programming

bob · Post by **bob** » Fri Sep 25, 2009 7:29 pm

stegemma wrote:
mcostalba wrote: ...and this makes you program weaker because it is like developing with an hand bound behind your back.
Wow... i really like what you've said! In fact programming in assembly is a very time comsuming activity but not to write the program itself but to debug it. To avoid or limit this, one should write some debugging function, to show the state of the chessboard and other structure. This could help a lot, for me.

I don't agree that you need more time to add new idea to your assembly program than in C or other language. If you use a structured programming style, you can easly change any algorithm in assembly the same as you can do it in C. Maybe in assembly you have to write more istructions than in C... and this could statistically give you more ways to add bugs... but we now agree that statistic is not the better way to analize programs and programming

Either you are not a very good assembly language programmer, or else you have not written any complex program. The assembly language tricks, such as instruction lifting (to mimimize latency from when you do a load until you actually need the data) or trying to re-order instructions to produce more ILP for the microprocessor, all go to make the program much harder to write, and then after you have carefully optimized it, it becomes _very_ difficult to modify it to test new ideas. In Cray Blitz we had over 20,000 lines of CAL (Cray Assembly Language). Fortunately we had a complete FORTRAN implementation as well, so that we could first test new ideas on the FORTRAN, and if they worked, then commit the week or two needed to take a couple of hours of FORTRAN programming and convert it to CAL and then merge it into the existing code. I _hated_ doing that part. Debugging was hard, and errors were legion.

I've done a ton of assembly language programming. For some things, you might be able to write assembly language that is easy to modify. But if you write it for speed, and modifiy it further for speed, the result is anything but easy to modify. Whether it be an old IBM 1620, IBM /360 or /370, a Xerox Sigma 9, Motorola 680x0, 6502, NC32032, a Vax, a Cray, a CDC, or X86/X86-64. I've done 'em all, And optimized assembly is a bear to work on.

stegemma · Post by **stegemma** » Sat Sep 26, 2009 10:22 am

bob wrote:...Either you are not a very good assembly language programmer, or else you have not written any complex program. The assembly language tricks, such as instruction lifting (to mimimize latency from when you do a load until you actually need the data) or trying to re-order instructions to produce more ILP for the microprocessor, all go to make the program much harder to write, and then after you have carefully optimized it, it becomes _very_ difficult to modify it to test new ideas. In Cray Blitz we had over 20,000 lines of CAL (Cray Assembly Language). Fortunately we had a complete FORTRAN implementation as well, so that we could first test new ideas on the FORTRAN, and if they worked, then commit the week or two needed to take a couple of hours of FORTRAN programming and convert it to CAL and then merge it into the existing code. I _hated_ doing that part. Debugging was hard, and errors were legion.

I've done a ton of assembly language programming. For some things, you might be able to write assembly language that is easy to modify. But if you write it for speed, and modifiy it further for speed, the result is anything but easy to modify. Whether it be an old IBM 1620, IBM /360 or /370, a Xerox Sigma 9, Motorola 680x0, 6502, NC32032, a Vax, a Cray, a CDC, or X86/X86-64. I've done 'em all, And optimized assembly is a bear to work on.

My experience in assembly goes from simple games (for sample: a 3D maze for the ZX81 in 1Kb RAM, in far '83) to chess programs (Drago, Raffaela and Freccia) up to an operating system (GemDI: all assembly code, windows like interface, internet connection through modem, mouse, network port handlig and so on... in just 37 Kb of executable!). I'm not a novice but still i'm not experienced as you are... but who could be???

So, IMHO, for "any" program, things are surely as you say but for chess programs i think that we're not talking about a "complex program". Just to provide you with some comparable data, on my PC (Vista 32 bit) the perft 6 from initial position gives this value:

Code: Select all

_____
unable to open book file [./book.bin].
book is disabled
unable to open book file [./books.bin].

Initializing multiple threads.
System is SMP, not NUMA.

Crafty v23.0 (1 cpus)

White(1): perft 6
total moves=119060324  time=14.26
White(1):
_____
Freccia:
1: perft=20 ms=1 nodi=1 n/s=1000 mosse=20 m/s=20000
2: perft=400 ms=1 nodi=21 n/s=21000 mosse=420 m/s=420000
3: perft=8902 ms=1 nodi=421 n/s=421000 mosse=9322 m/s=9322000
4: perft=197281 ms=11 nodi=9323 n/s=847545 mosse=206603 m/s=18782091
5: perft=4865609 ms=248 nodi=206604 n/s=833081 mosse=5072212 m/s=20452468
6: perft=119060324 ms=6240 nodi=5072213 n/s=812855 mosse=124132536 
m/s=19893035

So my perft 6 it's not the faster one in the world but is still very fast, even if i haven't optimized yet the check control. It is double faster than Crafty v23: 6,24 seconds against 14.26. Obviously Crafty on 64 bit CPU is another thing, i think, but i haven't a 64 bit version of my engine to compare with.

Still i think that almost all of the standard algorithm are well know, today, so how one can get better performance? With new algorithms (i'm working on genethical ones... in assembly) or with strong assembly implementation and optimization. That's very hard? Yes... but simple things are not interesting, for me

Don · Post by **Don** » Sat Sep 26, 2009 2:42 pm

stegemma wrote: So my perft 6 it's not the faster one in the world but is still very fast, even if i haven't optimized yet the check control. It is double faster than Crafty v23: 6,24 seconds against 14.26. Obviously Crafty on 64 bit CPU is another thing, i think, but i haven't a 64 bit version of my engine to compare with.

Still i think that almost all of the standard algorithm are well know, today, so how one can get better performance? With new algorithms (i'm working on genethical ones... in assembly) or with strong assembly implementation and optimization. That's very hard? Yes... but simple things are not interesting, for me

But then why do you like assembly language programming? It's the simplest of all. Only a small handful of very simple concepts to grasp and the rest comes out of the manual or op-code reference.

I think you actually think simple things are more interesting or you would not avoid the "complexities" of high level languages.

bob · Post by **bob** » Sat Sep 26, 2009 4:38 pm

stegemma wrote:
bob wrote:...Either you are not a very good assembly language programmer, or else you have not written any complex program. The assembly language tricks, such as instruction lifting (to mimimize latency from when you do a load until you actually need the data) or trying to re-order instructions to produce more ILP for the microprocessor, all go to make the program much harder to write, and then after you have carefully optimized it, it becomes _very_ difficult to modify it to test new ideas. In Cray Blitz we had over 20,000 lines of CAL (Cray Assembly Language). Fortunately we had a complete FORTRAN implementation as well, so that we could first test new ideas on the FORTRAN, and if they worked, then commit the week or two needed to take a couple of hours of FORTRAN programming and convert it to CAL and then merge it into the existing code. I _hated_ doing that part. Debugging was hard, and errors were legion.

I've done a ton of assembly language programming. For some things, you might be able to write assembly language that is easy to modify. But if you write it for speed, and modifiy it further for speed, the result is anything but easy to modify. Whether it be an old IBM 1620, IBM /360 or /370, a Xerox Sigma 9, Motorola 680x0, 6502, NC32032, a Vax, a Cray, a CDC, or X86/X86-64. I've done 'em all, And optimized assembly is a bear to work on.
My experience in assembly goes from simple games (for sample: a 3D maze for the ZX81 in 1Kb RAM, in far '83) to chess programs (Drago, Raffaela and Freccia) up to an operating system (GemDI: all assembly code, windows like interface, internet connection through modem, mouse, network port handlig and so on... in just 37 Kb of executable!). I'm not a novice but still i'm not experienced as you are... but who could be???

So, IMHO, for "any" program, things are surely as you say but for chess programs i think that we're not talking about a "complex program". Just to provide you with some comparable data, on my PC (Vista 32 bit) the perft 6 from initial position gives this value:
Code: Select all
_____
unable to open book file [./book.bin].
book is disabled
unable to open book file [./books.bin].

Initializing multiple threads.
System is SMP, not NUMA.

Crafty v23.0 (1 cpus)

White(1): perft 6
total moves=119060324  time=14.26
White(1):
_____
Freccia:
1: perft=20 ms=1 nodi=1 n/s=1000 mosse=20 m/s=20000
2: perft=400 ms=1 nodi=21 n/s=21000 mosse=420 m/s=420000
3: perft=8902 ms=1 nodi=421 n/s=421000 mosse=9322 m/s=9322000
4: perft=197281 ms=11 nodi=9323 n/s=847545 mosse=206603 m/s=18782091
5: perft=4865609 ms=248 nodi=206604 n/s=833081 mosse=5072212 m/s=20452468
6: perft=119060324 ms=6240 nodi=5072213 n/s=812855 mosse=124132536 
m/s=19893035
So my perft 6 it's not the faster one in the world but is still very fast, even if i haven't optimized yet the check control. It is double faster than Crafty v23: 6,24 seconds against 14.26. Obviously Crafty on 64 bit CPU is another thing, i think, but i haven't a 64 bit version of my engine to compare with.

Still i think that almost all of the standard algorithm are well know, today, so how one can get better performance? With new algorithms (i'm working on genethical ones... in assembly) or with strong assembly implementation and optimization. That's very hard? Yes... but simple things are not interesting, for me

I'm not sure this proves anything. Perft is not a speed test, as I implemented it, it is a _correctness_ test. I don't hash the perft search, I didn't try to optimize the way it works to avoid legality tests where possible, etc. It was only written to measure node counts for given positions to catch move generator issues.

How does the comparison look for a real search? That's what actually counts.

As far as chess goes, I would call any program that is over 10,000 lines of code "complex". And that is C. When you turn it into assembly language, it becomes 2-3x (at least) bigger. My comments were from a software-engineering perspective. How easy is the language to use? How easy is it to debug? and most importantly, how easy is it to modify?

The only advantage for assembly language is none of those, it is purely performance, because we know things about the program that the compiler can't deduce, which lets us optimize better, but which also makes it harder to understand the program later.

stegemma · Post by **stegemma** » Sat Sep 26, 2009 4:57 pm

bob wrote: ...
The only advantage for assembly language is none of those, it is purely performance, because we know things about the program that the compiler can't deduce, which lets us optimize better, but which also makes it harder to understand the program later.

In fact assembly language gives you the "power" to optimize better. Because chess is a competitive game and chess programming lead us to competition... having the strongest tools could give you some advantage. At end only doing true games between programs will provide a measure of the goodness of the tools, algorithm, implementation and so on.

The scope of the test reported, based on perft, was done only to show how we can get a faster program even at the first stage of developing. Maybe i was wrong comparing my perft with those of Crafty because my perft use the same procedures of the true program and is "implemented" only throwing away things from alfa-beta (no tree cutting, no hash tables...). I was thinking that even Crafty perft was done in the same way but now you say that is'nt true, so i'm wrong on that.

About the lines of code, Freccia engine is now:

- move generator: 991 lines
- alfa-beta 197
- data + debug routines: 692

For now i think that the program could be understood later, almost by me... but i will say that for sure only after some year from today.

The last aspect is very programmer depending. The same problem becomes from any language, i think.

undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning