Hi everyone,
I've lately been trying to find a better way to test my engine's development, given the limited testing resources that I have. I decided to run depth-1 matches against a set of opponents from a fixed set of positions, to measure the accuracy of my evaluation and my quiescence search between versions before I start refining the search algorithm.
I noticed that programs that prune losing captures in the quiescence search, like Fruit 2.1, lose badly against my engine versions in which I disabled that pruning. My own versions that prune losing captures are also very weak in these matches. I also noticed that Rybka 1.0 Beta is very strong in this setup, which might suggest that it has a very accurate evaluation or a very accurate quiescence search. What do you think?
I have not tested much, but it seems to be a good way to test evaluation changes faster.
Does anyone have experience with this setup? Is it productive, or are there much better testing methodologies?
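[For what it's worth, the setup described above can be sketched as a small driver that replays a fixed position set with colors reversed and tallies game points. This is a minimal sketch in Python; `run_fixed_position_match` and the `play_game` callback are hypothetical names standing in for whatever actually runs a fixed-depth game between two engines.]

```python
def run_fixed_position_match(positions, engine_a, engine_b, play_game):
    """Tally engine_a's score over a fixed set of start positions.

    Each position is played twice with colors reversed, so neither side
    benefits from unbalanced openings.  `play_game(white, black, pos)` is
    assumed to return +1 (White wins), 0 (draw) or -1 (Black wins).
    """
    score = 0.0
    for pos in positions:
        result = play_game(engine_a, engine_b, pos)   # engine_a plays White
        score += (result + 1) / 2                     # +1 -> 1.0, 0 -> 0.5, -1 -> 0.0
        result = play_game(engine_b, engine_a, pos)   # colors reversed
        score += (1 - result) / 2                     # engine_a's points as Black
    return score                                      # out of 2 * len(positions)
```

[A quick sanity check with a stub `play_game` where White always wins should give each engine exactly half the available points, confirming that the color reversal keeps the match fair.]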
Edsel Apostol (Twisted Logic)
Testing methodology
Moderators: hgm, Rebel, chrisw
- Posts: 803
- Joined: Mon Jul 17, 2006 5:53 am
- Full name: Edsel Apostol

- Posts: 373
- Joined: Wed Mar 22, 2006 10:17 am
- Location: Novi Sad, Serbia
- Full name: Karlo Balla
Re: Testing methodology
Rybka 1.0 is strong in this setup because at depth 1 Rybka actually searches to depth 3 plus a quiescence search (I don't know exactly what Rybka searches in QS).
The point is that with a simple QS (not including checking moves, etc.) a program searches deeper and compensates for the QS weakness.
I think it is a good way of testing. Once you resolve the horizon problem you can put your energy into implementing the search.
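[To make the "simple QS" concrete: a capture-only quiescence search with a stand-pat cutoff can be sketched as below. This is a generic negamax sketch over a toy game-tree interface (`evaluate` and `captures` are assumed callbacks), not anyone's actual engine code; real engines add checks, delta pruning, SEE pruning, and so on.]

```python
def quiesce(node, alpha, beta, evaluate, captures):
    """Capture-only quiescence search (negamax, fail-hard).

    `evaluate(node)` returns a static score from the side to move's view;
    `captures(node)` yields the child nodes reachable by captures.
    """
    stand_pat = evaluate(node)
    if stand_pat >= beta:
        return beta                      # static eval is already good enough
    alpha = max(alpha, stand_pat)        # the side to move may "stand pat"
    for child in captures(node):
        score = -quiesce(child, -beta, -alpha, evaluate, captures)
        if score >= beta:
            return beta                  # refutation found
        alpha = max(alpha, score)
    return alpha
```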
Best Regards,
Karlo Balla Jr.
- Posts: 892
- Joined: Sun Nov 19, 2006 9:16 pm
- Location: Russia
Re: Testing methodology
Fixed-depth search is not chess. Tuning a chess evaluation using non-chess game results is pointless.
IMHO a small set of test positions with a limited solution time is closer to real chess than fixed-depth chess.
- Posts: 626
- Joined: Sun May 13, 2007 9:55 pm
- Location: Bay Area, CA USA
- Full name: Mike Adams
Re: Testing methodology
I think he said he is testing the eval and wants a low depth like depth 1 because he can see (apparently) what the position looks like to the eval.
What I do to test the eval is run a FEN position through the eval with no search. I think this gives the best information.
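[A no-search eval run like this can be as simple as feeding FENs straight into the evaluation function. As an illustration only, here is a toy material-only "evaluation" computed directly from the board field of a FEN; the piece values are the usual centipawn conventions, not anyone's tuned numbers.]

```python
PIECE_VALUES = {"P": 100, "N": 320, "B": 330, "R": 500, "Q": 900, "K": 0}

def material_eval(fen):
    """Material balance in centipawns from White's point of view, no search.

    Only the first (board) field of the FEN is inspected; digits are empty
    squares, '/' separates ranks, uppercase is White, lowercase is Black.
    """
    board = fen.split()[0]
    score = 0
    for ch in board:
        if ch.isalpha():
            value = PIECE_VALUES.get(ch.upper(), 0)
            score += value if ch.isupper() else -value
    return score
```

[Running a file of FENs through a function like this before and after an eval change gives exactly the kind of no-search snapshot described above.]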
Re: Testing methodology
Hi Aleks,
My initial purpose was to test only the changes I made to the evaluation, so I decided to play depth-1 matches, where the search contributes little or nothing to playing strength.
I think that if I played the matches at a time control, it would take longer to determine whether the changes to my evaluation are good or bad, as the search could affect the result.
If you think that this is not a good idea, please enlighten me.
Aleks wrote: "Tuning chess evaluation using non chess game results is pointless."
I am wondering what you mean by this. When I play a depth-1 game between two engines, they are playing chess, so how come it is a non-chess game result?
Last edited by Edsel Apostol on Thu Nov 29, 2007 3:48 am, edited 1 time in total.
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
Re: Testing methodology
Hi Karlo,
I knew about Rybka reporting misleading depth, but I overlooked it when I used it in my test matches with this setup. Thanks for reminding me; I will not use Rybka again for this specific test.
One thing I noticed is that pruning losing captures is bad in this test, but in real games it helps, so I don't really know whether this affects my assessment of the evaluation.
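[On pruning losing captures: the usual tool for classifying a capture as losing is a static exchange evaluation (SEE). Below is a simplified sketch assuming each side always recaptures with its cheapest remaining attacker on a single exchange square, and may stop when continuing loses material; a real SEE must also handle x-ray attacks, pins, and promotions. The function names and list-based interface are illustrative, not from any particular engine.]

```python
def exchange(target, attackers, defenders):
    """Best material the side to move can win on the square, or 0 if it
    declines to capture.  `target` is the value of the piece currently on
    the square; `attackers`/`defenders` are the piece values each side can
    still bring to bear.
    """
    if not attackers:
        return 0
    remaining = sorted(attackers)
    cheapest = remaining.pop(0)          # always recapture with the cheapest piece
    # After we take, the opponent faces the same decision against `cheapest`.
    return max(0, target - exchange(cheapest, sorted(defenders), remaining))

def see_capture(victim, attacker, defenders, other_attackers=()):
    """Expected material outcome of capturing `victim` with `attacker`.

    A negative result marks a losing capture, i.e. the kind of move that
    losing-capture pruning would skip in the quiescence search.
    """
    return victim - exchange(attacker, list(defenders), list(other_attackers))
```

[With this, `see_capture(...) < 0` is the pruning condition being discussed: at depth 1 such pruning can hide real defensive resources, while at longer time controls the speed-up usually pays for itself.]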
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
Re: Testing methodology
Hi Mike,
Your name sounds like an English grandmaster's.
By the way, you said that you run your tests from FEN positions. How do you determine from this whether the evaluation is good or bad? I mean, are you analyzing the eval of each FEN position manually?
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
Re: Testing methodology
Edsel Apostol wrote: "One thing I noticed is that pruning losing captures is bad at this test but in real games, it helps."
Of course: you give them away for free.
In your test setup, searching all non-capture moves for the first 5 plies of quiescence will give good results.
Tony
Re: Testing methodology
Edsel Apostol wrote: "My initial purpose was to test only the changes I made with the evaluation so I decided to do depth 1 matches where the search has little or no contribution to playing strength at all."
IMO it is exactly the opposite: you reduce the importance of the evaluation -- changes in it will be hidden by "novice"-level tactical vision.
Edsel Apostol wrote: "When I play a depth 1 game between two engines, they are playing chess, so how come it is a non chess game result?"
Computers generally play master-level chess because of their tactical strength; we need a master-level evaluation function to make them play like grandmasters. If you artificially destroy the computer's primary strength, the engine will not be able to exploit advanced evaluation ideas. Long-term plans, like a mating attack on the king or creating a passed pawn, become counter-productive in the evaluation because of the short-term "do nothing, avoid complications, wait for a tactical blunder by the opponent" tactics by which a novice player wins. (Even Zappa defeated Rybka by "doing nothing" in one well-known match game.)
Re: Testing methodology
This is just one way I test; it's a snapshot: if I change the eval, what is the change in a no-search FEN eval? Also, if your eval should be symmetric, this will find out whether it really is.
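[The symmetry check mentioned here can be automated: color-flip each test position and assert that the evaluation is the exact negation. A sketch, assuming a white-relative static eval that only reads the board field of a FEN; side to move, castling, and en passant are ignored, so this only covers the board-dependent terms. The helper names are illustrative.]

```python
def color_flip(board):
    """Color-flip the board field of a FEN: mirror the ranks vertically and
    swap the case (i.e. the color) of every piece letter."""
    return "/".join(rank.swapcase() for rank in reversed(board.split("/")))

def is_symmetric(evaluate, board):
    """True if the eval of the color-flipped board is the exact negation."""
    return evaluate(board) == -evaluate(color_flip(board))
```

[Run over the whole test-position file, any position where `is_symmetric` is false pinpoints an evaluation term that treats White and Black differently; a pure material counter passes trivially, while a white-only bonus is caught immediately.]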