An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

jwes wrote:
bob wrote:
hgm wrote:
bob wrote:You overlook _one_ important detail. Lose X elo due to clearing the hash, lose Y elo by using a deterministic timing algorithm, lose Z elo by clearing killer moves. Before long you have lost a _bunch_. A 10 elo improvement is not exactly something one can throw out if he hopes to make progress...
And after finding and removing 10 bugs, you flip a switch, and tataa...!

Suddenly all the lost Elo points are back!
Not if you don't have the code to do any of those things effectively...
The point is that by deferring implementation of full hash tables, he makes all his bugs deterministic, which means that anytime his program plays a bad move in a test, he can easily recreate it while debugging.
And when do you implement a "full hash table"??? I'm already there... And have been there for 30 years...

And hash tables are not the only thing we are talking about. We already have cut out pondering, which certainly adds non-determinism but also adds characteristics that need testing (what to do when you accumulate extra (saved) time? What about instant moves and easy moves?). Each and every thing we remove from the program reduces the experimental value of the test to some degree. If you remove too much, then interactions between the omitted features are totally ignored. Yet I have spent months looking at SMP issues in certain odd cases. If I don't test it, I won't even know the odd cases exist until they become critical cases, and I blow them when I can't change something while a game is in progress...
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: An objective test process for the rest of us?

Post by jwes »

bob wrote:
jwes wrote:
bob wrote:
hgm wrote:
bob wrote:You overlook _one_ important detail. Lose X elo due to clearing the hash, lose Y elo by using a deterministic timing algorithm, lose Z elo by clearing killer moves. Before long you have lost a _bunch_. A 10 elo improvement is not exactly something one can throw out if he hopes to make progress...
And after finding and removing 10 bugs, you flip a switch, and tataa...!

Suddenly all the lost Elo points are back!
Not if you don't have the code to do any of those things effectively...
The point is that by deferring implementation of full hash tables, he makes all his bugs deterministic, which means that anytime his program plays a bad move in a test, he can easily recreate it while debugging.
And when do you implement a "full hash table"??? I'm already there... And have been there for 30 years...
I thought we were talking about Uri's program.
bob wrote:And hash tables are not the only thing we are talking about. We already have cut out pondering, which certainly adds non-determinism but also adds characteristics that need testing (what to do when you accumulate extra (saved) time? What about instant moves and easy moves?). Each and every thing we remove from the program reduces the experimental value of the test to some degree. If you remove too much, then interactions between the omitted features are totally ignored. Yet I have spent months looking at SMP issues in certain odd cases. If I don't test it, I won't even know the odd cases exist until they become critical cases, and I blow them when I can't change something while a game is in progress...
Timing is not a problem. If you have an option to stop the search after a given number of nodes, you can use it to exactly reproduce your search tree. SMP is a whole 'nother can of worms. No one is saying you should always test this way. There are various things to test for. One is "Will version B play better than version A under game conditions?". Here you obviously want to test as close as practicable to game conditions. (BTW, have you checked that results in your 40-position matches are highly correlated with results under game conditions?) Another is "In what positions will the program make poor moves?". Here, it is obviously valuable to be able to exactly recreate the search tree.
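To make that concrete, here is a minimal sketch of such a node-count cutoff (invented names, not code from Crafty or any other engine discussed here): stopping on a fixed node budget produces the same tree on every run, while stopping on the clock does not.

Code: Select all

#include <stdint.h>

/* Hypothetical node-count cutoff -- a sketch of the idea described above. */
typedef struct {
    uint64_t nodes;       /* nodes searched so far                  */
    uint64_t node_limit;  /* 0 = no node limit, fall back to time   */
    int      stop;        /* set when the search must unwind        */
} SearchInfo;

static void check_limits(SearchInfo *si)
{
    if (si->node_limit && si->nodes >= si->node_limit)
        si->stop = 1;     /* deterministic: same tree on every run  */
    /* else compare elapsed time to the allocated budget, which is
     * exactly where the run-to-run variation comes from            */
}

static int search(SearchInfo *si, int alpha, int beta, int depth)
{
    si->nodes++;
    if ((si->nodes & 4095) == 0)   /* poll the limits periodically  */
        check_limits(si);
    if (si->stop || depth == 0)
        return alpha;              /* placeholder: evaluate/unwind  */
    /* ... generate moves, make/unmake, recurse ...                 */
    (void)beta;
    return alpha;
}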
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

jwes wrote:
bob wrote:
jwes wrote:
bob wrote:
hgm wrote:
bob wrote:You overlook _one_ important detail. Lose X elo due to clearing the hash, lose Y elo by using a deterministic timing algorithm, lose Z elo by clearing killer moves. Before long you have lost a _bunch_. A 10 elo improvement is not exactly something one can throw out if he hopes to make progress...
And after finding and removing 10 bugs, you flip a switch, and tataa...!

Suddenly all the lost Elo points are back!
Not if you don't have the code to do any of those things effectively...
The point is that by deferring implementation of full hash tables, he makes all his bugs deterministic, which means that anytime his program plays a bad move in a test, he can easily recreate it while debugging.
And when do you implement a "full hash table"??? I'm already there... And have been there for 30 years...
I thought we were talking about Uri's program.
bob wrote:And hash tables are not the only thing we are talking about. We already have cut out pondering, which certainly adds non-determinism but also adds characteristics that need testing (what to do when you accumulate extra (saved) time? What about instant moves and easy moves?). Each and every thing we remove from the program reduces the experimental value of the test to some degree. If you remove too much, then interactions between the omitted features are totally ignored. Yet I have spent months looking at SMP issues in certain odd cases. If I don't test it, I won't even know the odd cases exist until they become critical cases, and I blow them when I can't change something while a game is in progress...
Timing is not a problem. If you have an option to stop the search after a given number of nodes, you can use it to exactly reproduce your search tree. SMP is a whole 'nother can of worms. No one is saying you should always test this way. There are various things to test for. One is "Will version B play better than version A under game conditions?". Here you obviously want to test as close as practicable to game conditions. (BTW, have you checked that results in your 40-position matches are highly correlated with results under game conditions?)
Yes. Although SMP changes things since some of the opponents can't use 2 processors and the results against them are far better when SMP is on.

But time allocation is a major component of how a program behaves. How does it allocate its time per move? (Some spend more time right out of book, some end up spending more time right before the time control because they have saved up a surplus, some don't extend on fail lows, some do, etc.) Each and every one of those ideas exerts an influence on the program, and each can cause problems in unexpected ways. I have already mentioned that I did the "trivial test" of using a specific number of nodes, but that does not produce results consistent with using time at all, because now there are no extended searches, no "easy moves" that save some of that wasted time to use later, no time variation (we use more time on the first few moves out of book), etc.

So I can easily produce reproducible results with the specific-number-of-nodes approach, but the results are significantly different from normal playing results, which are in turn significantly different from real game results, which would include book, pondering, and SMP on top of the normal time variation...
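A toy sketch of the kind of time-allocation heuristics being described here (the constants and field names are invented for illustration; this is not Crafty's actual time control):

Code: Select all

/* Toy sketch of common time-allocation heuristics.  Invented names and
 * constants, for illustration only.                                    */
typedef struct {
    int remaining_ms;   /* time left on the clock                     */
    int moves_to_go;    /* moves until the next time control          */
    int out_of_book;    /* moves played since leaving the book        */
    int fail_low;       /* current iteration failed low at the root   */
    int easy_move;      /* one move clearly dominates                 */
} ClockState;

static int target_time_ms(const ClockState *c)
{
    int t = c->remaining_ms / (c->moves_to_go + 2);  /* base allocation   */
    if (c->out_of_book < 5) t += t / 4;  /* think longer just out of book */
    if (c->fail_low)        t *= 2;      /* extend when the score drops   */
    if (c->easy_move)       t /= 3;      /* bank time on obvious replies  */
    return t;
}

A fixed node budget bypasses every one of these branches, which is exactly why the two kinds of runs diverge.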

But the other stuff I wouldn't consider. Not carrying hash entries from one search to the next. Not pruning based on hash information. Clearing killer and history counters even though they contain useful information. Etc... I don't see any advantage to doing any of that at all. Because when I get a move from a real search in a real game, I am not going to be able to reproduce it anyway some of the time.

Another is "In what positions will the program make poor moves ?". Here, it is obviously valuable to be able to exactly recreate the search tree.
Yes. But you are really asking "in what positions will it make poor moves when significant parts of the search are made inoperative?"
User avatar
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:A simple question here: "what is wrong with you?"

There is effort involved in making the changes. There is effort involved in verifying that the changes don't have bugs. There is effort involved in analyzing how all of this affects the rest of the search. So I am not arguing anything different at all. I just added explanation as to several reasons why I don't want to do things to my program that won't be used in real games...
I can't imagine anyone missing the connection that "more bugs" equates to "more effort". In addition to the effort to make the initial changes...
So developing and properly testing an engine is an effort. Therefore one should not do it. Strange logic... :?
I do not care now, have never cared, and will never care about trying to make the search deterministic so that when something strange happens in a game, I can reproduce that result exactly. First, it isn't possible due to pondering and SMP issues, both of which affect hashing issues and other search issues that rely on data that carries from move to move. So why add code to make parts of the thing deterministic, when it is completely impossible to make the most important parts deterministic????

Talk about "surreal" this really is it...
Well, not all engines are SMP. I would even be so bold as to claim that virtually no engines are SMP. :lol: But even if one did have an SMP engine, what you say here merely tells you that it is a bad idea to test evaluation changes under SMP conditions, just as it is a bad idea to test eval or search changes under keep-hash conditions. Of course SMP has to be tested when you develop it, and retested now and then to make sure that you haven't inadvertently broken anything. But there is no need whatsoever to retest it for every tiny evaluation change you make.
Ever wonder where that idea came from? You will find it in Crafty.
Well, I never look at source code of other engines, so I cannot exclude that others do it too. It just seemed the obvious way to do it. Quibbling about priority does not seem of much relevance. The point was that it is a totally trivial change, that I would not even bother to test separately.
Except I don't use it to "invalidate" anything whatsoever, which is a lousy idea. I use it to prevent an old position from an earlier search from living in the hash table too long (because of its deep draft) wasting that entry with unnecessary data. I've _never_ "cleared" the hash table in any program I have since Cray Blitz days. The "id" flag was something I used prior to 1980 to avoid stomping thru the large hash tables I could create back then. So I have never heard such a silly statement in my life since I already do that. I am talking about code to _clear_ entries so that they won't be used to influence the next search or whatever. And no I don't do that. And no it would not be a couple of minutes worth of work. And yes, it could introduce errors. And yes the code would be worthless because the very idea is worthless...
Well, uMax uses replace-always, so there an entry cannot affect the search in any way. More advanced replacement algorithms typically use some aging field in the hash to prevent the accumulation of the deep results that you describe above. Often they store something like Draft+SearchNr there, and use that quantity as the basis for replacement decisions. In such a design it would be a trivial change to switch to using the SearchNr itself instead, as Uri does. That would still be a trivial change, and the claim that it is an "effort" to implement and test it cannot be taken seriously. Those who consider such an option useful will happily make this 0.000001% effort.
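A rough sketch of that aging scheme (field names invented; not taken from uMax, Crafty, or Uri's engine):

Code: Select all

#include <stdint.h>

/* Illustrative aging-based replacement, as described above. */
typedef struct {
    uint64_t key;
    int16_t  score;
    uint8_t  depth;   /* draft of the stored result              */
    uint8_t  age;     /* search number when the entry was stored */
    /* ... best move, bound type, ... */
} HashEntry;

static uint8_t search_nr;   /* bumped once per root search */

/* Replacement preference: an entry from an older search counts as
 * shallower, so deep leftovers cannot clog the table indefinitely. */
static int keep_value(const HashEntry *e)
{
    return e->depth - 4 * (uint8_t)(search_nr - e->age);
}

/* The fully deterministic variant attributed to Uri above: treat any
 * entry not written during the current search as empty.             */
static int usable(const HashEntry *e)
{
    return e->age == search_nr;
}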

The rest is merely your opinion, that has been discussed so much already that I won't comment on it.
Sorry, but that is _not_ why we build labs. We build labs so that we can control everything that is controllable. But physics labs don't control atomic motion any more than I do.
Sure we do. This is why physicists measure cross sections of chemical or nuclear reactions using atomic or molecular beams of well-defined velocity, intersecting each other at a well-defined angle.
And if it is my goal to design a process that is going to work 10K meters beneath the surface of the ocean, I am not going to be doing my testing in a vacuum tank. Yes I'd like to eliminate any outside influence that I can ignore, but no I don't want to eliminate any component that is going to be an inherent part of the environment that the thing has to work in...

You are putting way too much into what you _think_ I am saying, rather than addressing what I really _am_ saying.

Simply:

I am not interested in designing, writing or testing something that I am not going to use in real games. It wastes too much time. There are exceptions. If I see a need for a debugging tool in my program, I add it, using conditional compilation. And I deal with the testing and debugging because I feel that the resulting code will be worth the effort. But for what has been discussed here, it would change the basic search environment significantly, so that I could no longer be sure I am testing the same algorithm that is used in real games. Carrying or not carrying hash table entries from search to search can be a big deal. Whether you ponder and start at the previous depth minus 1, or start at ply 1, can be a big deal. And I certainly want to include those factors in my testing since they are an integral part of how Crafty searches... and they influence the branching factor, time to move, etc...
I don't think I missed any of that. It just doesn't sound convincing to me. Sure, the hash table has an impact on performance. But there should not be any interaction between the hashing and evaluation changes, so I would prefer to test them in isolation. If you empirically optimize the total package, you run the risk of compensating one wrong with another wrong, adding a flawed evaluation term or adopting a bad search strategy just because it makes your search more robust against hash errors.
I call your approach more alchemy than science, because of the testing requirements needed to make sure your "test probe" is very high-impedance.
Well, if you want to make an electronics metaphor... This is only what I would do to check whether a given piece of equipment was working according to given specifications. When I was developing something new, or repairing something that was faulty, I always started by breaking all feedback loops, isolating all logical blocks from each other, and then making sure that they all performed as I thought they should.
I can say with no reservation that good programs of the future are going to be non-deterministic from the get-go, because they _will_ be parallel. Given that, this entire thread is moot...
Except, of course, that you won't have to test their eval with parallel search. There is absolutely nothing to be gained by doing that. If you have 8 cores, it is better to play 8 independent games, one on each of them, at an 8 times slower time control. Or just 6 times slower, to reach the same depth, as the SMP speedup will probably not be 100%. So you can play more games of the same quality with the same resources.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:A simple question here: "what is wrong with you?"

There is effort involved in making the changes. There is effort involved in verifying that the changes don't have bugs. There is effort involved in analyzing how all of this affects the rest of the search. So I am not arguing anything different at all. I just added explanation as to several reasons why I don't want to do things to my program that won't be used in real games...
I can't imagine anyone missing the connection that "more bugs" equates to "more effort". In addition to the effort to make the initial changes...
So developing and properly testing an engine is an effort. Therefore one should not do it. Strange logic... :?
Apples and Oranges. Effort to develop the engine is necessary. Effort to add code to intentionally disable important parts of an engine might or might not be necessary, depending on what is being done. I have some extra code in Crafty for debugging certain parts of the program. I did that because the effort to develop the new debugging tools was less than the effort to debug without them. And I debugged that code as well, along with the conditional compilation stuff, which also introduces glitches. But developing code that intentionally shuts off significant contributors to overall search efficiency is not something I want to do, because I want to test everything possible so that when I hit a tournament, I don't have bugs suddenly show up because I am running different code than when I was testing...


I do not care now, have never cared, and will never care about trying to make the search deterministic so that when something strange happens in a game, I can reproduce that result exactly. First, it isn't possible due to pondering and SMP issues, both of which affect hashing issues and other search issues that rely on data that carries from move to move. So why add code to make parts of the thing deterministic, when it is completely impossible to make the most important parts deterministic????

Talk about "surreal" this really is it...
Well, not all engines are SMP. I would even be so bold as to claim that virtually no engines are SMP. :lol: But even if one did have an SMP engine, what you say here merely tells you that it is a bad idea to test evaluation changes under SMP conditions, just as it is a bad idea to test eval or search changes under keep-hash conditions. Of course SMP has to be tested when you develop it, and retested now and then to make sure that you haven't inadvertently broken anything. But there is no need whatsoever to retest it for every tiny evaluation change you make.
"virtually no engines are SMP" :)

Better start counting, and you won't stop at 10 or 20 or even 30. I see many on ICC. I see questions from many fairly new programs that are now doing parallel search. It is the way of the future thanks to dual/quad-core becoming "the norm".
Ever wonder where that idea came from? You will find it in Crafty.
Well, I never look at source code of other engines, so I cannot exclude that others do it too. It just seemed the obvious way to do it. Quibbling about priority does not seem of much relevance. The point was that it is a totally trivial change, that I would not even bother to test separately.
Except I don't use it to "invalidate" anything whatsoever, which is a lousy idea. I use it to prevent an old position from an earlier search from living in the hash table too long (because of its deep draft) wasting that entry with unnecessary data. I've _never_ "cleared" the hash table in any program I have since Cray Blitz days. The "id" flag was something I used prior to 1980 to avoid stomping thru the large hash tables I could create back then. So I have never heard such a silly statement in my life since I already do that. I am talking about code to _clear_ entries so that they won't be used to influence the next search or whatever. And no I don't do that. And no it would not be a couple of minutes worth of work. And yes, it could introduce errors. And yes the code would be worthless because the very idea is worthless...
Well, uMax uses replace-always, so there an entry cannot affect the search in any way. More advanced replacement algorithms typically use some aging field in the hash to prevent the accumulation of the deep results that you describe above. Often they store something like Draft+SearchNr there, and use that quantity as the basis for replacement decisions. In such a design it would be a trivial change to switch to using the SearchNr itself instead, as Uri does. That would still be a trivial change, and the claim that it is an "effort" to implement and test it cannot be taken seriously. Those who consider such an option useful will happily make this 0.000001% effort.

Replace-always doesn't matter here. I am talking about searching for white's move 21, and making that move on the board. Then searching for white's move 22, and using the stuff from the hash table that was put there during the move 21 search. Why would I not want to use that? I can't think of any possible reason. It helps me order moves, cut off subtrees completely, avoid null-move searches that will be pointless, tell me about certain extensions that should apply here (or not). Turning that off would be a major factor in my search, so that my "test version" would produce far different results from my "production version" which we use in tournaments. I don't want that....
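Purely for illustration (this is not Crafty's code, and the names are invented), the three uses listed above look roughly like this at the top of a node:

Code: Select all

#include <stdint.h>

/* Generic illustration of what a carried-over hash entry buys you:
 * move ordering, cutting the subtree, and skipping a pointless
 * null-move search.                                                 */
typedef enum { BOUND_NONE, BOUND_EXACT, BOUND_LOWER, BOUND_UPPER } Bound;

typedef struct {
    int   score;
    int   depth;   /* draft of the stored search            */
    Bound bound;
    int   move;    /* best move found by the earlier search */
} ProbeResult;

int probe_hash(uint64_t key, ProbeResult *r);   /* assumed to exist elsewhere */

/* Returns 1 if the stored result lets us cut off immediately. */
static int use_hash_entry(uint64_t key, int alpha, int beta, int depth,
                          int *hash_move, int *skip_null, int *score)
{
    ProbeResult r;
    if (!probe_hash(key, &r))
        return 0;
    *hash_move = r.move;                    /* 1: try this move first        */
    if (r.depth >= depth) {
        if (r.bound == BOUND_EXACT ||
            (r.bound == BOUND_LOWER && r.score >= beta) ||
            (r.bound == BOUND_UPPER && r.score <= alpha)) {
            *score = r.score;               /* 2: cut the whole subtree      */
            return 1;
        }
    }
    if (r.bound == BOUND_UPPER && r.score < beta)
        *skip_null = 1;                     /* 3: null-move search is futile */
    return 0;
}

Entries left over from the move-21 search feed all three branches during the move-22 search, which is the carry-over effect being argued about.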



The rest is merely your opinion, that has been discussed so much already that I won't comment on it.
Sorry, but that is _not_ why we build labs. We build labs so that we can control everything that is controllable. But physics labs don't control atomic motion any more than I do.
Sure we do. This is why physicists measure cross sections of chemical or nuclear reactions using atomic or molecular beams of well-defined velocity, intersecting each other at a well-defined angle.
And if it is my goal to design a process that is going to work 10K meters beneath the surface of the ocean, I am not going to be doing my testing in a vacuum tank. Yes I'd like to eliminate any outside influence that I can ignore, but no I don't want to eliminate any component that is going to be an inherent part of the environment that the thing has to work in...

You are putting way too much into what you _think_ I am saying, rather than addressing what I really _am_ saying.

Simply:

I am not interested in designing, writing or testing something that I am not going to use in real games. It wastes too much time. There are exceptions. If I see a need for a debugging tool in my program, I add it, using conditional compilation. And I deal with the testing and debugging because I feel that the resulting code will be worth the effort. But for what has been discussed here, it would change the basic search environment significantly, so that I could no longer be sure I am testing the same algorithm that is used in real games. Carrying or not carrying hash table entries from search to search can be a big deal. Whether you ponder and start at the previous depth minus 1, or start at ply 1, can be a big deal. And I certainly want to include those factors in my testing since they are an integral part of how Crafty searches... and they influence the branching factor, time to move, etc...
I don't think I missed any of that. It just doesn't sound convincing to me. Sure, the hash table has an impact on performance. But there should not be any interaction between the hashing and evaluation changes, so I would prefer to test them in isolation. If you empirically optimize the total package, you run the risk of compensating one wrong with another wrong, adding a flawed evaluation term or adopting a bad search strategy just because it makes your search more robust against hash errors.
My point is this: it is important to test what you are going to run in tournaments. I've not had any "bugs" show up in tournament games in years. How many programs have you seen lose games on time, or crash repeatedly, or whatever? Not Crafty. Because it has been thoroughly tested, completely as it runs, not in a crippled mode...
I call your approach more alchemy than science, because of the testing requirements needed to make sure your "test probe" is very high-impedance.
Well, if you want to make an electronics metaphor... This is only what I would do to check whether a given piece of equipment was working according to given specifications. When I was developing something new, or repairing something that was faulty, I always started by breaking all feedback loops, isolating all logical blocks from each other, and then making sure that they all performed as I thought they should.
I do assume you use a very high-impedance probe while testing that thing, so that the testing itself doesn't bias the behavior? That is exactly what I am talking about. A chess engine (perhaps better to say a mature chess engine) is a very complex system. Shutting off one part can have many side effects that are unexpected. I want to make sure everything works in the way it is intended...

I can say with no reservation that good programs of the future are going to be non-deterministic from the get-go, because they _will_ be parallel. Given that, this entire thread is moot...
Except, of course, that you won't have to test their eval with parallel search. There is absolutely nothing to be gained by doing that. If you have 8 cores, it is better to play 8 independent games, one on each of them, at an 8 times slower time control. Or just 6 times slower, to reach the same depth, as the SMP speedup will probably not be 100%. So you can play more games of the same quality with the same resources.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote: "virtually no engines are SMP" :)

Better start counting, and you won't stop at 10 or 20 or even 30. I see many on ICC. I see questions from many fairly new programs that are now doing parallel search. It is the way of the future thanks to dual/quad-core becoming "the norm".
Just for some perspective, here's a list of WB and/or UCI engines; the last count was 399.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote: "virtually no engines are SMP" :)

Better start counting, and you won't stop at 10 or 20 or even 30. I see many on ICC. I see questions from many fairly new programs that are now doing parallel search. It is the way of the future thanks to dual/quad-core becoming "the norm".
Just for some perspective, here's a list of WB and/or UCI engines; the last count was 399.
OK, so at least one of every 10 is SMP. Is that "virtually no"?

And others are actively working on parallel search right now. Most commercial programs are SMP, because they want to be competitive. The amateurs have to catch up or else fall much farther behind.

When Crafty became the first micro-program to run a parallel search in 1996, it was a big novelty and everyone kept asking "why?" Now they know why, and many are following up on that concept...

I honestly do not know how many are working on this. Some email me with questions. Some programs just show up on ICC with the "experimental SMP" comment in their notes. Clearly it is not just a very few programs doing this however. And it requires a _ton_ of testing to screen out all the bugs.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: An objective test process for the rest of us?

Post by diep »

Hi Bob,

Could you show us a picture from a tunneling electron microscope, just so we know that you have at least seen a picture of one after writing your story about it.

Vincent
User avatar
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:Apples and Oranges. Effort to develop the engine is necessary. Effort to add code to intentionally disable important parts of an engine might or might not be necessary, depending on what is being done. I have some extra code in Crafty for debugging certain parts of the program. I did that because the effort to develop the new debugging tools was less than the effort to debug without them. And I debugged that code as well, along with the conditional compilation stuff, which also introduces glitches. But developing code that intentionally shuts off significant contributors to overall search efficiency is not something I want to do, because I want to test everything possible so that when I hit a tournament, I don't have bugs suddenly show up because I am running different code than when I was testing...
Well, I see no difference. We are talking about a tiny amount of code Uri adds because it helps him with debugging.
Replace-always doesn't matter here. I am talking about searching for white's move 21, and making that move on the board. Then searching for white's move 22, and using the stuff from the hash table that was put there during the move 21 search. Why would I not want to use that? I can't think of any possible reason. It helps me order moves, cut off subtrees completely, avoid null-move searches that will be pointless, tell me about certain extensions that should apply here (or not). Turning that off would be a major factor in my search, so that my "test version" would produce far different results from my "production version" which we use in tournaments. I don't want that....
All you do is argue that KeepHash provides some strength. (How much is that? A head start of N-2 ply in an engine with an effective branching ratio of 3 would save you 10% time, which should translate to 10 Elo.)
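One way to reconstruct that estimate, under the assumptions that the cost per iteration grows geometrically with branching ratio b = 3 and that a doubling of effective speed is worth roughly 70 Elo (both assumptions mine, added for illustration), lands in the same ballpark as the 10 Elo quoted above:

Code: Select all

% head start of N-2 ply relative to a full search to depth N:
\[
  \frac{\sum_{i=0}^{N-2} b^{\,i}}{\sum_{i=0}^{N} b^{\,i}}
  \approx \frac{1}{b^{2}} = \frac{1}{9} \approx 11\%,
  \qquad
  \Delta\text{Elo} \approx 70 \cdot \log_{2}\!\frac{1}{1 - 0.11} \approx 12 .
\]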

I don't see the relevance for testing changes, as the only thing that interests you is in how strong A is compared to A'. Are you afraid that A would lose 10 Elo by ClearHash, and A' perhaps 11, so that you would be off 1 Elo if you would measure with ClearHash? Or can it be that A would gain 10 Elo with KeepHash, while A' would lose 10 Elo by it?

Almost everything you say applies just as much to the opening book. It makes you stronger during a match, it adds extra noise to the test results, it takes extra code to read in a FEN instead, which has to be debugged, and your book code is not used during testing, so you might run into unpleasant bugs during the tournament. Yet for the book you argue the other way around. It just doesn't make any sense.
My point is this: it is important to test what you are going to run in tournaments. I've not had any "bugs" show up in tournament games in years. How many programs have you seen lose games on time, or crash repeatedly, or whatever? Not Crafty. Because it has been thoroughly tested, completely as it runs, not in a crippled mode...
Yes, you don't want to appear in a tournament with untested code. But testing it once (in a 100,000-game gauntlet, or whatever you normally use) is enough. No reason to test it 100 times. And between tournaments you might want to test 100 eval changes. So if that can be made easier by switching off parts of your code (like the book...), it is well worth the effort to switch it off.
I do assume you use a very high-impedance probe while testing that thing, so that the testing itself doesn't bias the behavior? That is exactly what I am talking about. A chess engine (perhaps better to say a mature chess engine) is a very complex system. Shutting off one part can have many side effects that are unexpected. I want to make sure everything works in the way it is intended...
Yes, once. I gather all the (limited) information I can get with that, and then I take the thing apart to measure the rest. I would get nowhere if I limited myself to the non-invasive measurement.
hgm wrote: Except, of course, that you won't have to test their eval with parallel search. There is absolutely nothing to be gained by doing that. If you have 8 cores, it is better to play 8 independent games, one on each of them, at an 8 times slower time control. Or just 6 times slower, to reach the same depth, as the SMP speedup will probably not be 100%. So you can play more games of the same quality with the same resources.
I would still like to hear your comment on this. Do you think you would need a different evaluation when playing on 8 cores as when playing on a single core? What kind of terms improve play in the SMP version, but are counter-productive on a single core?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:Apples and Oranges. Effort to develop the engine is necessary. Effort to add code to intentionally disable important parts of an engine might or might not be necessary, depending on what is being done. I have some extra code in Crafty for debugging certain parts of the program. I did that because the effort to develop the new debugging tools was less than the effort to debug without them. And I debugged that code as well, along with the conditional compilation stuff, which also introduces glitches. But developing code that intentionally shuts off significant contributors to overall search efficiency is not something I want to do, because I want to test everything possible so that when I hit a tournament, I don't have bugs suddenly show up because I am running different code than when I was testing...
Well, I see no difference. We are talking about a tiny amount of code Uri adds because it helps him with debugging.
Again, it is not a "tiny amount of code" in my case. But even if it were, would I really want to test in an even more degraded mode than I already use (no pondering, no book, no SMP search most of the time, no time allocation code, crippled hash implementation, no history/killer stuff carried from search to search)?

Replace-always doesn't matter here. I am talking about searching for white's move 21, and making that move on the board. Then searching for white's move 22, and using the stuff from the hash table that was put there during the move 21 search. Why would I not want to use that? I can't think of any possible reason. It helps me order moves, cut off subtrees completely, avoid null-move searches that will be pointless, tell me about certain extensions that should apply here (or not). Turning that off would be a major factor in my search, so that my "test version" would produce far different results from my "production version" which we use in tournaments. I don't want that....
All you do is argue that KeepHash provides some strength. (How much is that? A head start of N-2 ply in an engine with an effective branching ratio of 3 would save you 10% time, which should translate to 10 Elo.)
1. I haven't measured it with the degree of accuracy I can measure with today, so I am not certain. Intuitively it is the correct way to do things, because it is absolutely the way I search as a human. I have sat down with IM Mike Valvo at ACM events to adjudicate games, and our analysis would often end with "OK, we saw this position a few minutes ago and found it was won for white, so..." So intuitively it is the right way to do this.

2. 10 points here, 5 points there, first thing you know you are talking major points. But 10 points is significant enough for me. Of course we'd all like bigger numbers, but even 5 is worth working for.




I don't see the relevance for testing changes, as the only thing that interests you is in how strong A is compared to A'. Are you afraid that A would lose 10 Elo by ClearHash, and A' perhaps 11, so that you would be off 1 Elo if you would measure with ClearHash? Or can it be that A would gain 10 Elo with KeepHash, while A' would lose 10 Elo by it?

Almost everything you say applies just as much to the opening book. It makes you stronger during a match, it adds extra noise to the test results, it takes extra code to read in a FEN instead, which has to be debugged, and your book code is not used during testing, so you might run into unpleasant bugs during the tournament. Yet for the book you argue the other way around. It just doesn't make any sense.
Since I can't test my opening book during these tests (how would I get access to my opponent's tournament book, for example?), I don't think that is important at all. I do test some with pondering, and will soon complete enough runs to determine if pondering is causing any increase in non-determinism over what I get by using time limits already... And I have run hundreds of thousands of games using SMP as well. There I know the number of games required goes up, and I choose not to test that way _all_ the time. But I do test that way enough to keep myself confident that I have not broken anything, or if I did, I will see it somewhere.


My point is this: it is important to test what you are going to run in tournaments. I've not had any "bugs" show up in tournament games in years. How many programs have you seen lose games on time, or crash repeatedly, or whatever? Not Crafty. Because it has been thoroughly tested, completely as it runs, not in a crippled mode...
Yes, you don't want to appear in a tournament with untested code. But testing it once (in a 100,000-game gauntlet, or whatever you normally use) is enough. No reason to test it 100 times. And between tournaments you might want to test 100 eval changes. So if that can be made easier by switching off parts of your code (like the book...), it is well worth the effort to switch it off.

No argument about the book. And since I can't test against my opponent's book anyway, that would not show me anything useful, and it would start me off in a significantly larger variety of positions than my current set of 40. So I avoid the book to keep the runs shorter, since testing the book is futile without my opponent's best book available.

Ditto for SMP, although it needs more than one big run here and there, as it is very easy to introduce a simple bug that does not show up until you have multiple threads. For example, mis-type a temp variable name and you might get a compiler error, or you might get a global variable you had forgotten about. Changing/using it in one thread might be safe, but with two or more it is going to cause problems. If I don't test properly, I miss those, and yes, I have seen them happen to me.
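A contrived example of that forgotten-global failure mode (pthreads, invented names; this is not code from Crafty):

Code: Select all

#include <pthread.h>

/* A variable that is global "by accident" is harmless single-threaded,
 * but races under SMP.                                                 */
static int best_root_move;   /* the forgotten global */
static pthread_mutex_t root_lock = PTHREAD_MUTEX_INITIALIZER;

/* Single-threaded this is fine.  With two threads searching different
 * root moves, both write best_root_move and one result silently
 * clobbers the other -- exactly the kind of bug that never shows up
 * in non-SMP testing.                                                  */
void report_root_move_buggy(int move)
{
    best_root_move = move;          /* data race under SMP */
}

/* One conventional repair: serialize the update (or make the variable
 * per-thread and merge results afterwards).                            */
void report_root_move_fixed(int move)
{
    pthread_mutex_lock(&root_lock);
    best_root_move = move;
    pthread_mutex_unlock(&root_lock);
}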

I do assume you use a very high-impedance probe while testing that thing, so that the testing itself doesn't bias the behavior? That is exactly what I am talking about. A chess engine (perhaps better to say a mature chess engine) is a very complex system. Shutting off one part can have many side effects that are unexpected. I want to make sure everything works in the way it is intended...
Yes, once. I gather all the (limited) information I can get with that, and then I take the thing apart to measure the rest. I would get nowhere if I limited myself to the non-invasive measurement.
hgm wrote: Except, of course, that you won't have to test their eval with parallel search. There is absolutely nothing to be gained by doing that. If you have 8 cores, it is better to play 8 independent games, one on each of them, at an 8 times slower time control. Or just 6 times slower, to reach the same depth, as the SMP speedup will probably not be 100%. So you can play more games of the same quality with the same resources.
It isn't nearly that simple, as I mentioned above. Playing purely non-SMP matches can easily overlook an SMP bug, even if you didn't modify the SMP search code at all.

I would still like to hear your comment on this. Do you think you would need a different evaluation when playing on 8 cores as when playing on a single core? What kind of terms improve play in the SMP version, but are counter-productive on a single core?
I don't see why that would happen. I have seen the _opposite_ happen, that tuning on a slow machine, but running on a fast machine can cause problems. That was the 1986 WCCC bug I mentioned. A program that can reach a significant depth on real hardware, but very reduced depth on test hardware can require two different evaluations, or at least two different tunings...

I treat the 8-core case as two issues. (1) I will be over 6x faster, which translates into at least a ply or two depending on the branching factor, which can vary from 1.5 to 10+. (2) It offers significantly more potential for timing interactions where two threads improperly modify something without locking. The more cores, the more opportunity for a tiny "window of opportunity" timing bug to rear its head. That happened to me more than once on Cray Blitz when we tested on a 4-CPU system and ran on a 16-CPU system where each processor was 2x faster. I found interesting timing bugs every time I changed to a platform that offered much higher speed and more processors...