An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hristo

Re: An objective test process for the rest of us?

Post by hristo »

hgm wrote:I thought they just took pills!? Or was this the weight lifters?
;-)
As a former swimmer I never took pills, but I also rarely lifted weights, since it is actually counterproductive for long-distance swimming.
To answer your inquiry: "I don't know." I'm sure that some people take pills, just like some computers get overclocked. ;-)

Regards,
Hristo

p.s.
Your observation about the limited number of initial positions in long gauntlets, and the bias and possible engine-tuning mistakes that result, makes more sense to me as a form of criticism than "disable 90% of your engine if you want to make it better".
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Note that this actually was a conditional statement: "if it helps to make it better". Of course most of my engines do not even contain 10% of code that could be disabled without them ceasing to be able to do anything at all. And in uMax the aim is to keep that percentage at 0%. But the message is that I would not hesitate to switch off anything that interferes with testing. The logic: "it makes testing more difficult, and it is easy to switch it off, but I keep it switched on anyway, because I need it in competition" just is no logic at all.

As for the athletes, I guess the difference is whether you want to train for sustained aerobic ability, which is only possible through prolonged intensive use and loading of the heart-lung system, or whether you want to train for explosive loading, which mainly requires muscle mass. Especially in the latter case there are more effective ways to train the required muscles. The critical effort in the high jump is just a single step in a comparatively lengthy sequence that has the tendency to wreck your back when you finally come down. It is much more effective, time-wise, to hop around your house on one leg, where every step counts and you cannot break anything through an unlucky fall. For prolonged loading there is usually no way to intensify the training of the involved parts, as the action itself already loads them to the sustainable maximum. That doesn't mean there can't be alternatives, and other good reasons to use them: long-distance ice skaters often train in the summer by riding a bicycle...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

jwes wrote:
bob wrote:
jwes wrote: I think you missed the point. If your search is deterministic, you can test with any timing algorithm and then use the number of nodes searched to recreate the search while debugging.
No you can't. Here's why. You test with pondering on. You set a target time of 180 seconds. Your opponent fails low and uses more time. And searches for a total of 390.441 seconds and makes a move. You move instantly. You have _zero_ chance to re-create that timing the next time around.
Then you look in your log file, see that you searched 453823632 nodes including pondering and set your program to terminate searching after 453823632 nodes. Why would that not be the same search tree if your program is deterministic and does not use information from prior searches?

This goes in circles. I have explained _many_ times previously that I can't do that... When I am playing games, which _always_ means using a parallel search, then at any random instant in the game I cannot determine exactly how many nodes I have searched. Different threads have their own node counters, and as they split and re-split, more and more of these counters are created. At any specific instant, some counters hold redundant information while split points are being cleaned up, scores are being backed up, and so on. Bottom line: the only time I have an exact count is when an iteration finishes, because only then have all split points been cleaned up. So the node-count approach does not work for me...
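Roughly, in C-like pseudo-code (a sketch of the idea only, not the actual Crafty source), the situation looks like this: each thread bumps its own counter, and the per-thread counts can only be folded into one exact total once an iteration has finished and every split point has been cleaned up.

#include <stdint.h>

#define MAX_THREADS 64

typedef struct {
    uint64_t nodes;      /* nodes searched by this thread alone          */
    /* ... per-thread split-point bookkeeping would live here ...        */
} thread_state_t;

thread_state_t threads[MAX_THREADS];
int num_threads = 8;

/* called from the inner search loop of each thread */
static inline void count_node(thread_state_t *t)
{
    t->nodes++;
}

/* Only meaningful between iterations: while threads are still splitting
   and backing up scores, some counters hold partial or redundant
   information, so a mid-search sum would not be exact. */
uint64_t total_nodes_after_iteration(void)
{
    uint64_t sum = 0;
    for (int i = 0; i < num_threads; i++)
        sum += threads[i].nodes;
    return sum;
}

So a "stop after N nodes" limit would have to be checked against a number that simply is not known precisely while the parallel search is running.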

bob wrote:But the other stuff I wouldn't consider. Not carrying hash entries from one search to the next. Not pruning based on hash information. Clearing killer and history counters even though they contain useful information. Etc... I don't see any advantage to doing any of that at all. Because when I get a move from a real search in a real game, I am not going to be able to reproduce it anyway some of the time.

Another is "In what positions will the program make poor moves?". Here, it is obviously valuable to be able to exactly recreate the search tree.
Yes. But you are really asking "in what positions will it make poor moves when significant parts of the search are made inoperative?"
To a large extent, these search changes should not change the results of the search, only the time the search takes.
bob wrote:Again, that is wrong. Any tiny timing change in the search has several influences, from what is stored in the transposition table to what is stored in the killer moves and history counters.

The problem is that the "deterministic requirement" would be most useful precisely where it is not available. I carefully watch games that are being played in tournaments, move by move, second by second, as the game progresses, and that is where I see the things that become the subject of analysis later. And there, I have _everything_ turned on, including the SMP search, perhaps on hardware I can't even use to test later when I have time.

I don't do a lot of that on these cluster matches I play. Too much data. There I am only interested in a quantitative good/bad indication for whatever I am testing...

I find it tough to consider modifying various parts of my search so that I can deterministically play moves, and then debug all of that to make sure it doesn't break something unexpectedly, and then realize that I won't have this stuff turned off during real games, which is where I am most likely going to notice something that I want to look at later...
This is just your personal preference, as you play tens of thousands of test games for every tournament game and errors should be equally likely. (Except for the demo effect, where occurrence of bugs is directly related to the importance of the occasion, e.g. Windows crashing when Bill Gates first demoed it.)
What I was trying to say is that if there are problems with evaluation or extensions/reductions, they should show up roughly as often in deterministic searches as normal ones.
If I didn't agree with that I wouldn't be testing. But that's a far distance from being able to _reproduce_ the error once it does show up somewhere...
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Well, one more reason not to test under SMP.

That you have a crappy SMP implementation that might at any moment reveal new, hitherto undiscovered bugs is another issue. There is no reason at all why you should hunt for those bugs under the same conditions as for testing eval changes.

This seems in fact a very obvious example of why testing under tournament conditions is undesirable: under tournament conditions you would play in a configuration that makes it very unlikely that new bugs would reveal themselves, while under testing you would like them to manifest themselves abundantly. To catch the bugs I would play with hundreds of processors (virtual ones, of course; just run 128 processes per core) to force more split points, more communication, more of everything, to stress the system to its breaking point. That is how you find bugs, so that they won't surprise you later.
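Just to illustrate the kind of oversubscription I mean (all numbers and names made up for the example, not any particular engine's test harness): start far more search threads than there are cores, so that the split-point and communication code is exercised much harder than it ever would be in a tournament configuration.

#include <pthread.h>
#include <stdio.h>

#define CORES            4
#define THREADS_PER_CORE 128            /* deliberately absurd: stress, not speed */
#define NTHREADS         (CORES * THREADS_PER_CORE)

/* stand-in for the engine's search-thread entry point */
static void *search_thread(void *arg)
{
    long id = (long)arg;
    (void)id;
    /* ... iterative deepening, splitting, merging results, etc. ... */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, search_thread, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("ran %d threads on %d cores\n", NTHREADS, CORES);
    return 0;
}

The point is not speed; it is forcing the synchronization code through as many rare interleavings as possible, so that the bugs show up in testing rather than in a match.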
hristo

Re: An objective test process for the rest of us?

Post by hristo »

hgm wrote:Note that this actually was a conditional statement: "if it helps to make it better". Of course most of my engines do not even contain 10% of code that could be disabled without them ceasing to be able to do anything at all. And in uMax the aim is to keep that percentage at 0%. But the message is that I would not hesitate to switch off anything that interferes with testing.
Yes, this is a good message and should be applied whenever applicable.
hgm wrote: The logic: "it makes testing more difficult, and it is easy to switch it off, but I keep it switched on anyway, because I need it in competition" just is no logic at all.
If this was the only reason then yes, you would be correct. However, I believe the reason (situation) is slightly different than the one stated (implied) above.

Removing terms from the evaluation function (or parts of the program that influence that evaluation) might help in judging the quality of the changes being introduced, but only with respect to the altered state of the engine; it tells you nothing (or very little) about the engine as a whole. Once all parts of the engine are re-enabled, the entire test must be performed again to ensure that the new change doesn't interfere with the engine as a whole.

The testing time will be significantly higher if integration proceeds by gradually turning some of the features back 'on'. (Which would make sense if you are looking for a full, rigorous integration cycle.)

From a practical standpoint (call it holistic) I believe that the testing procedure greatly depends on the state of the engine and the developer's goals; therefore at some point it seems perfectly reasonable to introduce new features while testing them only against the entirety of the engine and not against some of its parts.

Regards,
Hristo
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

But we are not talking at all about strongly interwoven parts like that. We are merely talking about disabling features whose major function is to provide a speedup of ~10% (remembering the leftovers of the information from the previous search, instead of deriving them again from scratch), or searching the tree sequentially instead of doing it in parallel in a way designed to emulate the sequential results as closely as possible.

In lowest order the presence of these features is supposed to have no effect at all on the search and evaluation (provided the slowdown is compensated by giving extra time). Thus there can only be small second-order effects on the size of the tree searched, which again in lowest order is supposed to hardly affect the relative strength of the evaluation. (The strength difference between two engines is hardly a function of the amount of search time given to them.)

As for the idea that iterations have to be interrupted to do realistic testing, this is simply inconsistent. No matter how long that iteration took, there would always have been a time control where exactly that amount of time would have been available for that move, so that it would not have to be interrupted. Assuming that the measurement is invalid because you finished the iteration is equivalent to claiming that there are time controls for which the strength of an engine cannot be determined.

Plus, despite all the sweet talking by Bob, I still have not seen a shred of evidence that interrupting the iterations actually provides any strength at all. "This is the way I would play chess as a human" is simply no valid scientific argument.
hristo

Re: An objective test process for the rest of us?

Post by hristo »

hgm wrote:But we are not talking at all about strongly interwoven parts like that. We are merely talking about disabling features whose major function is to provide a speedup of ~10% (remembering the leftovers of the information from the previous search, instead of deriving them again from scratch), or searching the tree sequentially instead of doing it in parallel in a way designed to emulate the sequential results as closely as possible.
In theory it is all well and good, but in practice
Antony wrote:Well unfortunately we missed the win against Rybka today. Zappa cut out due to a lack of time and I really feel that with just another 20 seconds it could have found Rxg7. When I clear the hash tables, the 8x machine finds Rxg7 in under 1 minute. Oh well.
... here is where I found this.

Regardless of what we (you, I, others...) might think of the 10% increase in speed (hash tables) and the "triviality" of it all, the ramifications of those aspects of the engine can have a huge influence on the game results during a match -- and consequently on the perceived strength of the engine.

Some of these interactions can qualify as genuine bugs, but they also might end up being a result of the overall chess-engine state, over which there is no control (during tournament conditions).
hgm wrote: In lowest order the presence of these features is supposed to have no effect at all on the search and evaluation (provided the slowdown is compensated by giving extra time).
It shouldn't, but it sometimes does, and then you may lose a bunch of money. ;-)
hgm wrote: Thus there can only be small second-order effects on the size of the tree searched, which again in lowest order is supposed to hardly affect the relative strength of the evaluation. (The strength difference between two engines is hardly a function of the amount of search time given to them.)
I don't know, it might be. In fact I strongly suspect that engines with higher (better) developed evaluation functions depend on time in a very direct and unforgiving way. So, your statement, if I understand it correctly, is wrong. ;-)
(This never happens)

Regards,
Hristo
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

hristo wrote:In theory it is all well and good, but in practice
Antony wrote:Well unfortunately we missed the win against Rybka today. Zappa cut out due to a lack of time and I really feel that with just another 20 seconds it could have found Rxg7. When I clear the hash tables, the 8x machine finds Rxg7 in under 1 minute. Oh well.
I don't see any point here. So Zappa could have come up with better moves given longer time. Who can't? What bearing does this have on whether improvements still show up as improvements when you test them with a fast program at time control X, or with a 10% slower program at time control X+10%? (Or, in fact, at time control X-30%...)
Regardless of what we (you, I, others...) might think of the 10% increase in speed (hash tables) and the "triviality" of it all, the ramifications of those aspects of the engine can have a huge influence on the game results during a match -- and consequently on the perceived strength of the engine.
For one, it was never suggested anywhere that you should switch off anything during a match that can earn you $100,000. It seems to me that the choice between slightly faster improvement of your engine during two weeks and $100k is easily made. But even then, I would not call a 1% lower win probability a "huge influence". That in hindsight they are now blaming the loss of half a point on not being 10% (or whatever) faster is total bullshit. On a faster machine they would most likely have played a different move early on, and played a totally different game, which they might very well have lost. Only the impact on the probability can be ascribed to speed and program changes. The actual outcome is caused by chance, and programs with a slightly smaller win probability often win where the version with a higher win probability would have lost. So you can never blame the engine+hardware for the difference between the materialized score and the score in some virtual parallel universe that exists only in your wildest, optimistically biased imagination.
I don't know, it might be. In fact I strongly suspect that engines with higher (better) developed evaluation functions depend on time in a very direct and unforgiving way. So, your statement, if I understand it correctly, is wrong. ;-)
(This never happens)
This sounds extremely alien to me. Do you know even a single engine that behaves like this?
hristo

Re: An objective test process for the rest of us?

Post by hristo »

hgm wrote:
I don't know, it might be. In fact I strongly suspect that engines with higher (better) developed evaluation functions depend on time in a very direct and unforgiving way. So, your statement, if I understand it correctly, is wrong. ;-)
(This never happens)
This sounds extremely alien to me. Do you know even a single engine that behaves like this?
My response was to the following.
hgm wrote: Thus there can only be small second-order effects on the size of the tree searched, which again in lowest order is supposed to hardly affect the relative strength of the evaluation. (The strength difference between two engines is hardly a function of the amount of search time given to them.)
I believe that most strong engines will show fluctuations in their relative strength to one another as a function of time given to them (some engines will perform relatively better depending on the time control).

However, I do believe that this is rather obvious and so I suspect that you meant something else when you said "(The strength difference between two engines is hardly a function of the amount of search time given to them.)".

The point is (previous message as well) that dismissing the importance of time is not constructive when trying to make the engine play stronger chess.

Regards and Good morning,
Hristo
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hristo wrote:
hgm wrote:
I don't know, it might be. In fact I strongly suspect that engines with higher (better) developed evaluation functions depend on time in a very direct and unforgiving way. So, your statement, if I understand it correctly, is wrong. ;-)
(This never happens)
This sounds extremely alien to me. Do you know even a single engine that behaves like this?
My response was to the following.
hgm wrote: Thus there can only be small second-order effects on the size of the tree searched, which again in lowest order is supposed to hardly affect the relative strength of the evaluation. (The strength difference between two engines is hardly a function of the amount of search time given to them.)
I believe that most strong engines will show fluctuations in their relative strength to one another as a function of time given to them (some engines will perform relatively better depending on the time control).

However, I do believe that this is rather obvious and so I suspect that you meant something else when you said "(The strength difference between two engines is hardly a function of the amount of search time given to them.)".

The point is (previous message as well) that dismissing the importance of time is not constructive when trying to make the engine play stronger chess.

Regards and Good morning,
Hristo
I completely agree with that. Searching to some fixed depth or fixed number of nodes removes that issue, yet it influences games in significant ways that I want to observe. If I miss a move, and study shows that a little more time would help, then can I find a way to use more time there? I have two examples where I have done so.

1. A fail low triggers more time. I can recall Slate/Atkin bouncing in their chairs wondering "we just discovered this move sucks... are we going to have enough time to find that move Y is better, as Levy pointed out?" I don't have that problem: when X fails low, I am going to search _hard_ to try to find Y, and use a lot more time than usual to do so if possible.

2. I have an OK move, but my program is about to change its mind to a new, better move. I can tell this because I display the current move being searched once every 15 seconds, just to let the operator know how long I have been searching and what I am searching on. Normally the ply-1 moves go by quickly (after the first one), and the farther down the list I get, the faster they go by, unless I run into a move that looks more promising; there it can take a lot of time to dismiss it (and the extra nodes it requires trigger my ply-1 move reordering, which moves it near the top of the list so the next iteration examines it more carefully, earlier in the search). But once I get into searching a move that will become the new best move, I don't worry about having enough time to complete the search, because I never do a time abort in the middle of an iteration until all currently active ply-1 moves are completed (since I do a parallel search and also split at the root, I can have N ply-1 moves in various stages of completion).
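For the first idea, the fail-low time extension, the logic is roughly this (a simplified sketch with made-up names and margins, not the actual Crafty time-control code):

typedef struct {
    int target_time;    /* normal time budget for this move, in ms        */
    int max_time;       /* hard ceiling that may never be exceeded, in ms */
    int prev_score;     /* score of the previous iteration                */
} time_control_t;

int adjust_target_on_fail_low(time_control_t *tc, int root_score)
{
    const int FAIL_LOW_MARGIN = 30;   /* centipawns, purely illustrative */

    if (root_score <= tc->prev_score - FAIL_LOW_MARGIN) {
        /* the best move just "failed low": allow several times the normal
           budget to hunt for a replacement, but never exceed the hard limit */
        int extended = tc->target_time * 3;
        tc->target_time = extended < tc->max_time ? extended : tc->max_time;
    }
    return tc->target_time;
}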

By playing with real time as the limiting factor, the search time-abort code has evolved to handle those cases quite well, without the mental duress on the operator that is caused by "will it have enough time?"
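The second idea comes down to two pieces working together, sketched below with an invented structure (again, a sketch of the principle, not the actual code): root moves are re-sorted between iterations by the size of the subtree they required, so a move that suddenly needed a lot of work gets examined earlier on the next iteration, and a time abort is only honored once no root move is still being searched by any thread.

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int      move;      /* encoded move                                    */
    uint64_t nodes;     /* nodes spent on this root move last iteration    */
    int      active;    /* nonzero while some thread is still searching it */
} root_move_t;

static int by_nodes_desc(const void *a, const void *b)
{
    const root_move_t *x = a, *y = b;
    if (x->nodes == y->nodes) return 0;
    return x->nodes < y->nodes ? 1 : -1;
}

/* between iterations: expensive subtrees float toward the front,
   keeping the PV move in slot 0 */
void reorder_root_moves(root_move_t *rm, int n)
{
    if (n > 1)
        qsort(rm + 1, n - 1, sizeof *rm, by_nodes_desc);
}

/* during an iteration: even if time is up, only stop once no root move
   is still in flight on any thread */
int may_abort_now(const root_move_t *rm, int n, int time_exceeded)
{
    if (!time_exceeded)
        return 0;
    for (int i = 0; i < n; i++)
        if (rm[i].active)
            return 0;
    return 1;
}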

Why would I want to miss such ideas, and avoid testing the ones I have?

There are other tricks I have learned, and some I have not. Deep Blue had a novel way of triggering extra time that I have not yet fully understood, since it was never explained clearly. But I know the idea is out there, waiting to be rediscovered...