Results of testing Crafty

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

nczempin

Results of testing Crafty

Post by nczempin »

bob wrote:I think the thing that is getting on everyone's nerves is the fact that it appears that the number of games required to make sensitive decisions is far larger than was originally thought. Large enough that it is actually very difficult to play enough unless you have some wild hardware resources.

I (and others) have said for years that events like the WCCC don't prove a thing about which program is best, unless you have a run like Chess 3.x/4.x had, where they won almost every year for nearly 10 years, or Deep Thought, which did the same. If you win enough, it becomes convincing, but even that just identifies when one program is far superior to the others. But to test for determining whether a change is good or not requires far more games, and that is a bit disappointing... Particularly when quite a few (myself included) have been using what is obviously way too few games in the past...
Did you mean "sensitive" or "sensible" decisions? Both could make sense, although I would find "sensitive" a bit unusual.

Of course small tournaments don't prove anything; they are still fun, even in disciplines where there is inherently more variation than in chess, such as sports.

What I don't understand is that some of your stronger competitors have also been using these apparently problematic test methods, yet they still come out on top even after a much larger number of games. Are you perhaps focusing too much on testing small changes rather than finding some bigger improvements? Some of the stronger engines are actually open-source too. I'm sure they have all learned large amounts of stuff from you and Crafty, but somehow they managed to surpass you. Are you deliberately ignoring their sources, or is your (or their) approach just too different for any of their ideas to apply to your engine?

Just one example comes to mind, because it was discussed here some time last year: generating legal moves. I seem to recall that you are adamant that your method of just letting the king be captured and then invalidating the previous move is better or at least not worse than "checking for check". Tord said he did it differently, and I think Fruit does, too. Only circumstantial evidence, of course, and only an example.

I also see you reply to some ideas with "I have tried that, didn't show any improvement". Was this before or after you started realizing that for your situation more games are needed (BTW I can say with quite some confidence now that for Eden's level, far fewer games are needed)?

And regarding the small changes you test for: Isn't it possible that trying to isolate each small change can mislead your test methodology? What if n small changes, each one by itself, would show no improvement, but the combination of them would be much bigger?

Also, there are certain changes that can never be small. Changing the position representation from 0x88 to bitboards, or to a hybrid, or vice versa would impact most engines' source code severely. Perhaps this particular example is only an "outlier", but can you think of other examples?

Sorry to ask so many questions at once, but I've been fairly quiet recently (except when I made a fool of myself :-) ) and they built up.
Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: Results of testing Crafty

Post by Pradu »

nczempin wrote:Are you perhaps focusing too much on testing small changes rather than finding some bigger improvements? Some of the stronger engines are actually open-source too. I'm sure they have all learned large amounts of stuff from you and Crafty, but somehow they managed to surpass you. Are you deliberately ignoring their sources, or is your (or their) approach just too different for any of their ideas to apply to your engine?
Just my opinion, but I think it is very difficult to learn from open-source engines. To show my point, imagine someone without any knowledge of alpha-beta pruning. He or she then takes a look at an open-source engine with all kinds of zero-window searches, null-move pruning, node types... and gets thoroughly confused. I myself have tried to read the evaluation functions of Glaurung and Fruit; however, they are extremely complex. For example, Fruit's king-safety shelter code: did anyone understand that without much difficulty? In my opinion you could learn far more if the authors, or someone who actually manages to understand the complex parts of Fruit's or Glaurung's eval, published a paper explaining how it works. A paper would be far more useful than source code. Just my opinion...
Last edited by Pradu on Fri Sep 28, 2007 2:16 am, edited 1 time in total.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Results of testing Crafty

Post by bob »

nczempin wrote:
bob wrote:I think the thing that is getting on everyone's nerves is the fact that it appears that the number of games required to make sensitive decisions is far larger than was originally thought. Large enough that it is actually very difficult to play enough unless you have some wild hardware resources.

I (and others) have said for years that events like the WCCC don't prove a thing about which program is best, unless you have a run like Chess 3.x/4.x had, where they won almost every year for nearly 10 years, or Deep Thought, which did the same. If you win enough, it becomes convincing, but even that just identifies when one program is far superior to the others. But to test for determining whether a change is good or not requires far more games, and that is a bit disappointing... Particularly when quite a few (myself included) have been using what is obviously way too few games in the past...
Did you mean "sensitive" or "sensible" decisions? Both could make sense, although I would find "sensitive" a bit unusual.
"sensitive" as in "sensitive enough to accurately detect small changes..."



Of course small tournaments don't prove anything; they are still fun, even in disciplines where there is inherently more variation than in chess, such as sports.

What I don't understand is that some of your stronger competitors have also been using these apparently problematic test methods, yet they still come out on top even after a much larger number of games. Are you perhaps focusing too much on testing small changes rather than finding some bigger improvements? Some of the stronger engines are actually open-source too. I'm sure they have all learned large amounts of stuff from you and Crafty, but somehow they managed to surpass you. Are you deliberately ignoring their sources, or is your (or their) approach just too different for any of their ideas to apply to your engine?
Not so fast. :)

First, I have only been testing and playing with this stuff on our cluster for a year or so. Prior to that I didn't have any significant number of machines to test in parallel on.

Second, have you ever seen explanations by Ed Schroeder, or other commercial authors, where they have discussed how they test? A very few have discussed the idea. Ed used to have a group of machines set aside solely for auto232-type matches when testing new versions. He certainly couldn't run as many games as I ran, but he did do weeks of runs at a time.

However, the most challenging part of developing a chess engine is determining whether a new change is good or bad. It is important enough that most current commercial types won't discuss how they test, or how they evaluate the results, because that is yet another competitive edge. 95% or more of amateurs are just making ad hoc changes, and if it seems to play better, off they go to the next change. I did that for years myself. It becomes a hindrance rather than a help, however... and you have to develop better methods.


Just one example comes to mind, because it was discussed here some time last year: generating legal moves. I seem to recall that you are adamant that your method of just letting the king be captured and then invalidating the previous move is better or at least not worse than "checking for check". Tord said he did it differently, and I think Fruit does, too. Only circumstantial evidence, of course, and only an example.
I was _never_ "adamant" that my way was better. I was certainly positive that my way worked better for _my_ program, and I have been very consistent in that kind of statement over the years, with rare exceptions. In fact, I am not sure you aren't taking that out of context, because Crafty has checked for "in check" in the main search since version 9.0, on account of null-move: you can't do a null move when the side on move is in check, or it will fail high and hide mates every time. I don't check for "in check" in my q-search, and let capturing the king cause me to back up one ply. And that does work for me, because most captures are not king captures, and the in_check test takes time at _every_ node, whereas the king-capture test only fires at very rare nodes.
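
To make the trade-off concrete, here is a minimal sketch of the two strategies in C. The types (Position, MoveList) and helper names (in_check, evaluate, generate_captures, captured_piece, make_move/unmake_move, KING, MATE) are hypothetical placeholders for illustration, not Crafty's actual code:

[code]
/* Strategy A: full-width search with null-move.  The in-check test is paid
 * at every node, because a null move is only legal when the side to move
 * is not in check (otherwise it fails high and hides mates). */
int search(Position *pos, int alpha, int beta, int depth) {
    int incheck = in_check(pos, pos->side_to_move);   /* cost at every node */
    if (!incheck && depth >= 2 /* plus the usual null-move conditions */) {
        /* ... try a null move here ... */
    }
    /* ... normal move loop ... */
    return alpha;
}

/* Strategy B: quiescence search.  Skip the in-check test and instead let a
 * king capture prove that the previous ply's move was illegal, backing up
 * one ply with a mate-like score. */
int qsearch(Position *pos, int alpha, int beta) {
    int stand = evaluate(pos);
    if (stand >= beta) return beta;
    if (stand > alpha) alpha = stand;

    MoveList moves;
    generate_captures(pos, &moves);
    for (int i = 0; i < moves.count; i++) {
        if (captured_piece(pos, moves.move[i]) == KING)
            return MATE;            /* previous move was illegal; back up */
        make_move(pos, moves.move[i]);
        int score = -qsearch(pos, -beta, -alpha);
        unmake_move(pos, moves.move[i]);
        if (score >= beta) return beta;
        if (score > alpha) alpha = score;
    }
    return alpha;
}
[/code]

The trade-off is visible in where each test sits: Strategy A pays the in_check() call once per node, while Strategy B only pays anything extra when a capture list happens to contain the king.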


I also see you reply to some ideas with "I have tried that, didn't show any improvement". Was this before or after you started realizing that for your situation more games are needed (BTW I can say with quite some confidence now that for Eden's level, far fewer games are needed)?
Depends. Some ideas proposed here are not about changing the evaluation; they are related to speed. And for speed, whatever is fastest is best, assuming there are no trade-offs. I have known for at least a few years that there was significant variability in results between the same two programs. But since most of my testing (actually all) involved books and such, I never spent any time trying to track down where the variability comes from. So for many ideas, you can test for speed, or you can test on a particular type of tactical position, to determine whether the change was better or worse. Eval changes and minor search tweaks, however, require games to validate. Lots of games... The smaller the change, the more games you need...
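
As a rough back-of-the-envelope illustration of the "lots of games" point (a simplified model, not an actual test procedure): treat each game as an independent sample, ignore draws, and ask how many games it takes for a two-sigma error bar on the match score to shrink below the score shift a given Elo gain produces.

[code]
/* Rough estimate of how many independent games are needed before a given
 * Elo improvement stands out from the noise (2-sigma criterion).  This is
 * a simplified model: games are assumed independent and draws are ignored,
 * which overstates the per-game variance somewhat. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double elo[] = { 50.0, 20.0, 10.0, 5.0 };
    for (int i = 0; i < 4; i++) {
        double d = elo[i];
        /* expected score against an equal opponent given a +d Elo edge */
        double p = 1.0 / (1.0 + pow(10.0, -d / 400.0));
        /* require 2 * sqrt(p(1-p)/N) < p - 0.5  =>  N > 4 p(1-p)/(p-0.5)^2 */
        double n = 4.0 * p * (1.0 - p) / ((p - 0.5) * (p - 0.5));
        printf("+%4.0f Elo -> expected score %.3f, roughly %6.0f games\n",
               d, p, ceil(n));
    }
    return 0;
}
[/code]

Under this model a 50-Elo jump stands out after roughly two hundred games, a 10-Elo change needs on the order of five thousand, and a 5-Elo change roughly twenty thousand.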


And regarding the small changes you test for: Isn't it possible that trying to isolate each small change can mislead your test methodology? What if n small changes, each one by itself, would show no improvement, but the combination of them would be much bigger?
I try not to depend on serendipity to make progress. I have a good idea of how each new change is going to affect the search. I want to test to be sure that (a) the change has a positive influence, as I suspect, and (b) there are no negative side-effects that make a good idea seem bad until they are fixed. The chance of two unrelated changes producing no benefit individually, but somehow magically interacting to produce a better evaluation, is just beyond remote. I have a very clear understanding of what we are evaluating and why. The problem is exposing the unexpected behavior, which is what kills most ideas, rather than just adding something that is bad in and of itself. We are good enough not to do that often, if at all. But side-effects in a complex eval or search are a different matter... yet they must be discovered.


Also, there are certain changes that can never be small. Changing the position representation from 0x88 to bb or a hybrid or vice versa would impact most engines' source code severely. Perhaps this particular example is only an "outlier", but can you think of other examples?
I don't worry about that kind of change here. First, what would be the reason for it? If you believe one representation is faster than the one you use, you can make that change and validate it without playing games. For example, in Crafty version 21.0 I renumbered the bits in a bitboard to match the native way Intel numbers them for BSF/BSR. It was a big change, and the way I verified I had done it correctly was to run a large set of positions through the old and new versions, searched to the same fixed depth, and check that the node counts and scores/PVs did not change at all. It took a couple of months. No need to play games for that kind of change. When I added reductions, I first tested against positions, comparing old to new on positional and tactical test cases. Once the stuff looked to be better, then I started playing large matches to determine how best to tune it, since that is not so obvious compared to just adding the code to do reductions or extensions.
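
That fixed-depth comparison lends itself to a very simple harness. Here is a sketch of the idea under assumed conventions (each version writes one log line per position containing nodes, score and PV; the file names and format are made up for illustration, not Crafty's actual output):

[code]
/* Compare two fixed-depth runs over the same position suite.  Assumes each
 * version wrote one line per position ("<nodes> <score> <pv...>") to
 * old_version.log / new_version.log -- file names and line format are
 * assumptions for illustration only. */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *old_log = fopen("old_version.log", "r");
    FILE *new_log = fopen("new_version.log", "r");
    if (!old_log || !new_log) {
        fprintf(stderr, "missing log file\n");
        return 1;
    }
    char a[512], b[512];
    int line = 0, mismatches = 0;
    while (fgets(a, sizeof a, old_log) && fgets(b, sizeof b, new_log)) {
        line++;
        /* node counts, scores and PVs must match byte for byte */
        if (strcmp(a, b) != 0) {
            mismatches++;
            printf("position %d differs:\n  old: %s  new: %s", line, a, b);
        }
    }
    fclose(old_log);
    fclose(new_log);
    printf("%d of %d positions differ\n", mismatches, line);
    return mismatches != 0;
}
[/code]

Any mismatch means the rewrite changed search behaviour rather than just speed.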

And I am pretty sure my current extension amounts are non-optimal, and before long I am going to smoke my cluster varying all of those values up and down to see which set produces the best result, regardless of how that set does on tactical test sets (which I predict it will do worse on, yet do better in real games). But for this I need to test.
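
For a sense of the size of such a sweep, here is a toy enumeration of a candidate grid; the parameter names and values (in quarter-ply units) are hypothetical examples, not the real extension set:

[code]
/* Toy enumeration of an extension-tuning sweep.  The parameters and their
 * candidate values are hypothetical examples for illustration only. */
#include <stdio.h>

int main(void) {
    const int check_ext[] = { 2, 3, 4 };   /* 1/2, 3/4 and 1 ply */
    const int one_reply[] = { 0, 2, 4 };
    const int recapture[] = { 0, 2, 4 };
    int jobs = 0;

    for (int a = 0; a < 3; a++)
        for (int b = 0; b < 3; b++)
            for (int c = 0; c < 3; c++)
                printf("job %2d: check=%d/4 one_reply=%d/4 recapture=%d/4\n",
                       ++jobs, check_ext[a], one_reply[b], recapture[c]);

    printf("%d settings, each needing thousands of games\n", jobs);
    return 0;
}
[/code]

Each printed line would correspond to one large match, which is why this kind of tuning wants a cluster.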

So it all depends on what you are trying to measure. Faster is _always_ better, everything else remaining equal, so that is easy to test without games. It is the other, more subtle stuff: should I extend checks by 1 ply, or 3/4 ply, or what? You can get one answer on tactical positions, and another answer in games. I care more about games.


Sorry to ask so many questions at once, but I've been fairly quiet recently (except when I made a fool of myself :-) ) and they built up.
Not a problem... hope I answered them in a way that is not too obscure...
Vempele

Re: Results of testing Crafty

Post by Vempele »

Pradu wrote:In my opinion you could learn far more if the authors, or someone who actually manages to understand the complex parts of Fruit's or Glaurung's eval, published a paper explaining how it works. A paper would be far more useful than source code. Just my opinion...
Here's how Toga's eval works.
nczempin

Re: Results of testing Crafty

Post by nczempin »

Pradu wrote:
nczempin wrote:Are you perhaps focusing too much on testing small changes rather than finding some bigger improvements? Some of the stronger engines are actually open-source too. I'm sure they have all learned large amounts of stuff from you and Crafty, but somehow they managed to surpass you. Are you deliberately ignoring their sources, or is your (or their) approach just too different for any of their ideas to apply to your engine?
Just my opinion, but I think it is very difficult to learn from open-source engines. To show my point, imagine someone without any knowledge of alpha-beta pruning. He or she then takes a look at an open-source engine with all kinds of zero-window searches, null-move pruning, node types... and gets thoroughly confused. I myself have tried to read the evaluation functions of Glaurung and Fruit; however, they are extremely complex. For example, Fruit's king-safety shelter code: did anyone understand that without much difficulty? In my opinion you could learn far more if the authors, or someone who actually manages to understand the complex parts of Fruit's or Glaurung's eval, published a paper explaining how it works. A paper would be far more useful than source code. Just my opinion...
I am sure Bob would cope.