On engine testing again!

Discussion of chess software programming and technical issues.

Moderator: Ras

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: On engine testing again!

Post by michiguel »

Don wrote:
michiguel wrote:
Kempelen wrote:In my case I do 1250 games: 25 opponents and 50 games against each of them. I don't repeat positions, but choose for each game a random position from Bob's suite (3891 different positions). I don't know what others may think about this setup, but for me it works very well; I have even noted more precise results.

Regards,
Fermin
IMHO, from a statistical point of view, your setup is *excellent*: as many opponents as you can, as much diversity of positions as you can.

Miguel
Miguel,

From an efficiency point of view, Kempelen is using half his resources testing everybody else's program. One must ask whether it's worth doing this. The really strong programming teams are not doing this.
Efficiency is one thing, statistics is another. My message related to the latter. There are chess issues, too...

Just to digress a bit, I have seen in the past members of strong programming teams making statements that made me think they did not understand some statistical or chess issues. Of course, they are good and made progress, but you can make progress with flawed procedures too (science is progressing with flawed procedures all the time!).

There is a way out if you do not have a really strong program. Find N opponents who are much stronger than your program and handicap them appropriately. To get the most "bang for the buck" you want your opponents to be close in ELO to you. By using stronger opponents you can cut down on their thinking time and thus use your resources more wisely.
Correct, but you can still have many opponents and as many different games as possible. I believe sometimes you need the opposite! See below.

I used this principle in my testing. Rybka is one of my sparring partners, but Rybka is so strong that I set it to play much faster than Doch, which means my tester is spending most of its CPU time testing MY program, not Rybka.

If the program is weaker, you can give it more time to equalize. I argue that you shouldn't test against programs that are too much weaker, but if you do, it's probably best to eat the time and give the opponent more time so that things equalize. It requires a lot more games to accurately rate against an opponent 300 ELO weaker, for instance. I like to keep everyone within 100 ELO of my program.
I agree, but I think it is important to play a fraction of your games against opponents that are a bit weaker. Maybe -100 would do it. An important part of chess strength is to know how to execute won positions. Otherwise, you risk tuning for an excellent defensive program only.
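
A rough sketch of the time-handicapping idea discussed above. The "Elo per doubling of thinking time" constant is only a commonly quoted rule of thumb (figures of roughly 50-100 Elo are thrown around), not something measured in this thread, so treat the numbers as illustrative only.

Code:

ELO_PER_DOUBLING = 70.0  # assumed rule of thumb, not a measured value

def handicap_time(my_time_s, opponent_elo_gap):
    """Thinking time to give an opponent that is opponent_elo_gap Elo
    stronger than my engine (negative gap = weaker), so that the match
    comes out roughly even."""
    return my_time_s / (2.0 ** (opponent_elo_gap / ELO_PER_DOUBLING))

# My engine thinks 60 s/game; a partner ~200 Elo stronger gets ~8 s,
# one ~100 Elo weaker gets ~160 s under this rule of thumb.
print(round(handicap_time(60.0, 200.0), 1), round(handicap_time(60.0, -100.0), 1))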

There are issues that are not statistical in nature, but more related to chess. When your engine is weak, like mine, direct observation is important (but you need to be a chess player). Here, the opposite of what is good for statistics should be done. The concept is closer to a "debugging spirit". For instance, you play a small number of positions (say, Silver) against a large number of opponents. Then you check which positions have a significantly low scoring percentage. Then you go and look at the problems. Games need to be as fast as possible to expose the problems and to make sure that the search does not hide them. That is why, when I do this, I want a GUI. I sit and watch games played at 40 moves/20 s for a while, and many times I see patterns that are huge red flags. Besides, this is fun.

Miguel
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: On engine testing again!

Post by mcostalba »

Don wrote: To me a far greater issue is that some improvements are really regressions at longer time controls, or vice versa.
Yes, this is a good point. Especially with stuff like futility pruning it is easy to introduce changes biased to low search depths.

But going to long time controls is not practical, so the only way to escape from this is with the help of the hardware: you need fast machines to get a quick (if 2-3 days can be considered quick ;-) ) and scalable enough test result.

Another point that we follow, though it is painful, is to never test a combination of ideas in one go, but always one by one. For instance, just to take a recent case, we are testing Dann's idea without using Dann's code. We have split the combination of ideas in his formula (reduction at high depth, reduction based on position value, early jump into qsearch, etc.) into many simple, single-focused patches, and we are going to test them one by one. It is _very_ long and "boring" to do it this way, but in the past we have had the experience of a single "bundled" patch that ended up being good only because one single component of it was good, while another 1-2 components were not useful or even slightly worse but went in just the same because they were bundled with the good one.
Uri Blass
Posts: 10892
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: On engine testing again!

Post by Uri Blass »

mcostalba wrote:
Don wrote: To me a far greater issue is that some improvements are really regressions at longer time controls, or vice versa.
Yes, this is a good point. Especially with stuff like futility pruning it is easy to introduce changes biased to low search depths.

But going to long time controls is not practical, so the only way to escape from this is with the help of the hardware: you need fast machines to get a quick (if 2-3 days can be considered quick ;-) ) and scalable enough test result.

Another point that we follow, though it is painful, is to never test a combination of ideas in one go, but always one by one. For instance, just to take a recent case, we are testing Dann's idea without using Dann's code. We have split the combination of ideas in his formula (reduction at high depth, reduction based on position value, early jump into qsearch, etc.) into many simple, single-focused patches, and we are going to test them one by one. It is _very_ long and "boring" to do it this way, but in the past we have had the experience of a single "bundled" patch that ended up being good only because one single component of it was good, while another 1-2 components were not useful or even slightly worse but went in just the same because they were bundled with the good one.
I think that a possible way to escape from this may be to optimize for a test suite. I am not thinking of a tactical test suite, and the time per position that I suggest is 1 minute per move.

A lot of effort is needed to build a good test suite, but once it is done it may be useful for many programs, and results on the test suite should usually give an estimate of a program's CCRL rating with an error of less than 100 Elo.

If you ask how to build the test suite, then I suggest taking random positions from games; the main problem is to decide how to give scores to the programs that analyze them.

People may need to do a lot of analysis to get a score for every root move in each position after an hour of search, but I guess that Bob could do it relatively fast with his cluster (and relatively fast means after a few weeks, not after years).

Later, programs may get a score for every position based on the move that they suggest after a minute, where we need to define some formula based on that move (choosing a move that is only 0.2 pawns worse than the best move should get a higher score than choosing a move that is 1 pawn worse than the best move).

We can test different formulas based on their success to predict CCRL rating of chess programs.
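
One possible shape for such a formula, as a minimal sketch. The exponential decay and the 0.5-pawn scale are my assumptions, chosen only so that the best move scores 1.0 and a 0.2-pawn mistake scores higher than a 1-pawn mistake, as described above.

Code:

import math

def move_score(loss_pawns, scale=0.5):
    """Score in [0, 1] for the move an engine chose on a test position,
    given how many pawns worse it is than the best move (0 = best move).
    Any monotonically decreasing function would fit the description; the
    exponential shape and the 0.5-pawn scale are arbitrary here."""
    return math.exp(-max(loss_pawns, 0.0) / scale)

def suite_score(losses_in_pawns):
    """Average score of an engine over a whole test suite."""
    return sum(move_score(x) for x in losses_in_pawns) / len(losses_in_pawns)

# Best move: 1.0, a 0.2-pawn mistake: ~0.67, a 1-pawn mistake: ~0.14
print(move_score(0.0), round(move_score(0.2), 2), round(move_score(1.0), 2))

Candidate formulas could then be compared, as Uri suggests, by how well suite_score correlates with (or predicts, via a simple regression) the known CCRL ratings of a set of engines.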

Uri
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: On engine testing again!

Post by Don »

michiguel wrote:
Don wrote:
michiguel wrote:
Kempelen wrote:In my case I do 1250 games: 25 opponents and 50 games against each of them. I don't repeat positions, but choose for each game a random position from Bob's suite (3891 different positions). I don't know what others may think about this setup, but for me it works very well; I have even noted more precise results.

Regards,
Fermin
IMHO, from a statistical point of view, your setup is *excellent*: as many opponents as you can, as much diversity of positions as you can.

Miguel
Miguel,

From an efficiency point of view, Kempelen is using half his resources testing everybody else's program. One must ask whether it's worth doing this. The really strong programming teams are not doing this.
Efficiency is one thing, statistics is another. My message related to the latter. There are chess issues, too...
Yes, I tried to make it clear that I was talking about something different from you, didn't I?

Just to digress a bit, I have seen in the past members of strong programming teams making statements that made me think they did not understand some statistical or chess issues. Of course, they are good and made progress, but you can make progress with flawed procedures too (science is progressing with flawed procedures all the time!).
This is not a question of flawed procedures. Testing against only computer programs or only 10 computer programs is also flawed. Not testing at 50 different time controls is flawed. It's a matter only of degree.

I don't question the idea that if you are playing 10,000 games, it's better if those 10,000 games are being tested against a variety of opponents. That is not for a moment in question as far as I'm concerned.

But let me put it another way. There is no question whatsoever that if you are going to search 20 ply deep, it's FAR better to do it without LMR and without null move pruning. LMR and null move pruning are flawed; they just do not give you the same quality as brute force techniques.

But having the ultra conservative attitude that you must do brute force is pretty foolish. Same with testing. If you are perfectly willing to throw out most of your processing resources then you have the luxury to make the tests slightly more accurate.

What is not obvious to me is this: if I had 4 quads to test with, why should I devote 2 of them to testing other people's programs? I believe it would return slightly more accurate results, but not nearly enough to compensate for what you are giving up.

There is a way out if you do not have a really strong program. Find N opponents who are much stronger than your program and handicap them appropriately. To get the most "bang for the buck" you want your opponents to be close in ELO to you. By using stronger opponents you can cut down on their thinking time and thus use your resources more wisely.
Correct, but you can still have many opponents and as many different games as possible. I believe sometimes you need the opposite! See below.
This is a way to have your cake and eat it too, that's my point. Try to make the tests much faster and still get the variety. Then you are still spending most of the CPU time testing YOUR program, not "theirs."

I used this principle in my testing. Rybka is one of my sparring partners, but Rybka is so strong that I set it to play much faster than Doch, which means my tester is spending most of its CPU time testing MY program, not Rybka.

If the program is weaker, you can give it more time to equalize. I argue that you shouldn't test against programs that are too much weaker, but if you do, it's probably best to eat the time and give the opponent more time so that things equalize. It requires a lot more games to accurately rate against an opponent 300 ELO weaker, for instance. I like to keep everyone within 100 ELO of my program.
I agree, but I think it is important to play a fraction of your games against opponents that are a bit weaker. Maybe -100 would do it. An important part of chess strength is to know how to execute won positions. Otherwise, you risk tuning for an excellent defensive program only.
Is this a feeling or something backed up by fact? I have found that my own personal superstitions about testing are almost always wrong. I used to believe it was a sin to self-test and that it was just as important to test against weak programs as strong ones. Perhaps there is some point to this, but if you start testing with opponents much weaker or stronger, you need MUCH more computing time to resolve the actual strength of your program.

I get the feeling that you don't give any consideration to CPU resources when you test. I just cannot express to you how much of a bottleneck testing is for me; I have a fast quad to test with and I test much more efficiently than you do, yet I am still always waiting on a test to complete.

There are issues that are not statistical in nature, but more related to chess. When your engine is weak, like mine, direct observation is important (but you need to be a chess player).
This is something we can definitely agree on. But even that must be viewed with some objectivity, and personal observations are rarely very objective. I cannot tell you how many times I get an email about Doch saying it has a huge problem in some aspect of the game, based on a single observation that turns out to be completely unrelated to what the person thinks. Still, I think it's very important to watch your program play chess and be aware of its strengths and weaknesses.
Here, the opposite of what is good for statistics should be done. The concept is closer to a "debugging spirit". For instance, you play a small number of positions (say, Silver) against a large number of opponents. Then you check which positions have a significantly low scoring percentage. Then you go and look at the problems. Games need to be as fast as possible to expose the problems and to make sure that the search does not hide them. That is why, when I do this, I want a GUI. I sit and watch games played at 40 moves/20 s for a while, and many times I see patterns that are huge red flags. Besides, this is fun.

Miguel
Kempelen
Posts: 620
Joined: Fri Feb 08, 2008 10:44 am
Location: Madrid - Spain

Re: On engine testing again!

Post by Kempelen »

Don wrote:
michiguel wrote:
Kempelen wrote:In my case I do 1250 games: 25 opponents and 50 games against each of them. I don't repeat positions, but choose for each game a random position from Bob's suite (3891 different positions). I don't know what others may think about this setup, but for me it works very well; I have even noted more precise results if the score difference is very large between tournaments.

Regards,
Fermin
IMHO, from a statistical point of view, your setup is *excellent*: as many opponents as you can, as much diversity of positions as you can.

Miguel
Miguel,

From an efficiency point of view, Kempelen is using half his resources testing everybody else's program. One must ask whether it's worth doing this. The really strong programming teams are not doing this.

There is a way out if you do not have a really strong program. Find N opponents who are much stronger than your program and handicap them appropriately. To get the most "bang for the buck" you want your opponents to be close in ELO to you. By using stronger opponents you can cut down on their thinking time and thus use your resources more wisely.

I used this principle in my testing. Rybka is one of my sparring partners, but Rybka is so strong that I set it to play much faster than Doch, which means my tester is spending most of its CPU time testing MY program, not Rybka.

If the program is weaker, you can give it more time to equalize. I argue that you shouldn't test against programs that are too much weaker, but if you do, it's probably best to eat the time and give the opponent more time so that things equalize. It requires a lot more games to accurately rate against an opponent 300 ELO weaker, for instance. I like to keep everyone within 100 ELO of my program.
Well, I am actually testing with opponents which range from +150 or +200 down to -100, and on average my engine scores around 40% against them. With this I try to test mine and not others' engines :)
In this configuration I make an exception for one of them: the last release version of my engine. That way I have 50 games against it in each test run for possible comparisons or conclusions; they are not many, but enough to keep an eye on how the comparison is going.

When I said I saw "more precise results" with this setup, I meant that I have sometimes repeated the same tourneys and I get the same results more often than with my previous configuration (12 engines, 50 games). Of course this is only a subjective view, based on tournament repetitions I like to do from time to time to check that everything is working properly.
Fermin Serrano
Author of 'Rodin' engine
http://sites.google.com/site/clonfsp/
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: On engine testing again!

Post by michiguel »

Don wrote:
michiguel wrote:
Don wrote:
michiguel wrote:
Kempelen wrote:In my case I do 1250 games: 25 opponents and 50 games against each of them. I don't repeat positions, but choose for each game a random position from Bob's suite (3891 different positions). I don't know what others may think about this setup, but for me it works very well; I have even noted more precise results.

Regards,
Fermin
IMHO, from a statistical point of view, your setup is *excellent*: as many opponents as you can, as much diversity of positions as you can.

Miguel
Miguel,

From an efficiency point of view, Kempelen is using half his resources testing everybody else's program. One must ask whether it's worth doing this. The really strong programming teams are not doing this.
Efficiency is one thing, statistics is another. My message related to the latter. There are chess issues, too...
Yes, I tried to make it clear that I was talking about something different from you, didn't I?
Yes, you did. But apparently I did not make my points clear.

Just to digress a bit, I have seen in the past members of strong programming teams making statements that made me think they did not understand some statistical or chess issues. Of course, they are good and made progress, but you can make progress with flawed procedures too (science is progressing with flawed procedures all the time!).
This is not a question of flawed procedures. Testing against only computer programs or only 10 computer programs is also flawed. Not testing at 50 different time controls is flawed. It's a matter only of degree.
Yes.

I don't question the idea that if you are playing 10,000 games, it's better if those 10,000 games are being tested against a variety of opponents. That is not for a moment in question as far as I'm concerned.

But let me put it another way. There is no question whatsoever that if you are going to search 20 ply deep, it's FAR better to do it without LMR and without null move pruning. LMR and null move pruning are flawed; they just do not give you the same quality as brute force techniques.

But having the ultra conservative attitude that you must do brute force is pretty foolish.
I never said that brute force is always the way to go. In fact, I said that many times it is the opposite.
Same with testing. If you are perfectly willing to throw out most of your processing resources then you have the luxury to make the tests slightly more accurate.

What is not obvious to me is this: if I had 4 quads to test with, why should I devote 2 of them to testing other people's programs? I believe it would return slightly more accurate results, but not nearly enough to compensate for what you are giving up.
Why do you say that you may have to devote 2 quads to test other programs?

There is a way out if you do not have a really strong program. Find N opponents who are much stronger than your program and handicap them appropriately. To get the most "bang for the buck" you want your opponents to be close in ELO to you. By using stronger opponents you can cut down on their thinking time and thus use your resources more wisely.
Correct, but you can still have many opponents and as many different games as possible. I believe sometimes you need the opposite! See below.
This is a way to have your cake and eat it too, that's my point. Try to make the tests much faster and still get the variety. Then you are still spending most of the CPU time testing YOUR program, not "theirs."

I used this principle in my testing. Rybka is one of my sparring partners, but Rybka is so strong that I set it to play much faster than Doch, which means my tester is spending most of its CPU time testing MY program, not Rybka.

If the program is weaker, you can give it more time to equalize. I argue that you shouldn't test against programs that are too much weaker, but if you do, it's probably best to eat the time and give the opponent more time so that things equalize. It requires a lot more games to accurately rate against an opponent 300 ELO weaker, for instance. I like to keep everyone within 100 ELO of my program.
I agree, but I think it is important to play a fraction of your games against opponents that are a bit weaker. Maybe -100 would do it. An important part of chess strength is to know how to execute won positions. Otherwise, you risk tuning for an excellent defensive program only.
Is this a feeling or something backed up by fact?
Based on theory. Whenever you test, you want every single piece of your software to be examined. When you are about to win, you use some muscles that you don't use when you are fighting for a draw. You increase the probability of exposing bugs too. This is not superstition. The reason why you want results close to 50-50 is statistical: you will get the smallest "error bar" this way. That is the reason why I want opponents that are within +-100 Elo points. But the ones at -100 are as important as the ones at +100. That was my point.
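
A quick numerical check of that statistical point, as a sketch (win/loss only; ignoring draws is my simplification and overstates the error bar slightly): propagating the binomial standard error of the score through the Elo curve shows the Elo error bar is indeed smallest near a 50% score.

Code:

import math

def elo_error_bar(p, n):
    """Approximate 1-sigma error bar in Elo for an observed score
    fraction p over n games, ignoring draws: the binomial standard
    error sqrt(p*(1-p)/n) is pushed through D(p) = 400*log10(p/(1-p))."""
    se_p = math.sqrt(p * (1.0 - p) / n)
    dD_dp = 400.0 / (math.log(10.0) * p * (1.0 - p))
    return dD_dp * se_p

# For 1000 games: about 11 Elo at a 50% score, 13 at 75%, 18 at 90%.
for p in (0.50, 0.75, 0.90):
    print(p, round(elo_error_bar(p, 1000), 1))
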
I have found that my own personal superstitions about testing are almost always wrong. I used to believe it was a sin to self-test and that it was just as important to test against weak programs as strong ones. Perhaps there is some point to this, but if you start testing with opponents much weaker or stronger, you need MUCH more computing time to resolve the actual strength of your program.
Weaker does not mean 600 points weaker. Just 100.

I think that self-testing is fine as a quicker screen. Tord is right that it emphasizes the effect of a change, but I believe it is a terrible idea for weak engines. It is like the blind testing how well the blind can drive. Huge holes in the evaluation are completely ignored, or bad new evaluation terms seem good just because they overcome a bigger hole in an indirect way. I have seen this. So self-testing may be good for strong engines.

I get the feeling that you don't give any consideration to CPU resources when you test.
No idea why you get this feeling... I do not have lots of resources.

I just cannot express to you how much of a bottleneck testing is for me; I have a fast quad to test with and I test much more efficiently than you do, yet I am still always waiting on a test to complete.

There are issues that are not statistical in nature, but more related to chess. When your engine is weak, like mine, direct observation is important (but you need to be a chess player).
This is something we can definitely agree on. But even that must be viewed with some objectivity, and personal observations are rarely very objective.
They are symptoms. They may not be the disease, but you should not ignore the symptoms. I believe that a good CC programmer should be able to fight the infection rather than the fever. But there are two types of mistake: 1) fighting the fever, and 2) ignoring it because "we know that the fever is not the problem".

Miguel

I cannot tell you how many times I get an email about Doch saying it has a huge problem in some aspect of the game, based on a single observation that turns out to be completely unrelated to what the person thinks. Still, I think it's very important to watch your program play chess and be aware of its strengths and weaknesses.
Here, the opposite of what is good for statistics should be done. The concept is closer to a "debugging spirit". For instance, you play a small number of positions (say, Silver) against a large number of opponents. Then you check which positions have a significantly low scoring percentage. Then you go and look at the problems. Games need to be as fast as possible to expose the problems and to make sure that the search does not hide them. That is why, when I do this, I want a GUI. I sit and watch games played at 40 moves/20 s for a while, and many times I see patterns that are huge red flags. Besides, this is fun.

Miguel
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: On engine testing again!

Post by bob »

Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:Let's say that due to limited resources one can only play 1200 games per engine version/setting.

Which testing method is better and why?

A. 120 games against each of the 10 opponents
B. 240 games against each of the 5 opponents
C. 300 games against each of the 4 opponents
D. 400 games against each of the 3 opponents
E. 600 games against each of the 2 opponents
F. 1200 games against a single opponent
Difficult question.

First, one opponent will lead you to tuning against that opponent, which may well hurt you against others.

However, using many opponents forces you to reduce the number of starting test positions, which makes the choice of positions critical.

If you are trying to measure 10-20 Elo changes, 1200 games is really hopeless, however. This is a painful issue, no doubt...
That's why I don't target 10 to 20 Elo changes for now; that will be for tuning later, when the engine is already really strong. I'm more into trying out ideas that may give or take at least 30 Elo.

I agree about the few starting positions being critical.

Well, at least 1200 games is better than nothing. If a version/setting is good it will show in the rating list no matter how few the games.
Actually it won't. I've posted lots of results where a new version was way worse after 200-500 games, and then by 40,000 games was better. And vice-versa. The eye-opening test here is to run a 100 game match, then change _nothing_ and rerun it. The difference can be startling. Run it enough times and you develop a real appreciation for why the error bar is so high after 100 games.
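
A toy simulation of the experiment Bob describes, with two literally identical engines; the 32% draw rate is an arbitrary assumption just to make the games look plausible.

Code:

import random

def match_score_pct(games=100, draw_rate=0.32):
    """Percentage score of engine A against an identical engine B."""
    score = 0.0
    for _ in range(games):
        r = random.random()
        if r < draw_rate:
            score += 0.5                              # draw
        elif r < draw_rate + (1.0 - draw_rate) / 2.0:
            score += 1.0                              # win
        # else: loss, no points
    return 100.0 * score / games

random.seed(1)
print([round(match_score_pct(), 1) for _ in range(10)])
# Nothing changes between runs, yet the scores typically spread over
# several percentage points, i.e. tens of Elo.
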
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: On engine testing again!

Post by Edsel Apostol »

bob wrote:
Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:Let's say that due to limited resources one can only play 1200 games per engine version/setting.

Which testing method is better and why?

A. 120 games against each of the 10 opponents
B. 240 games against each of the 5 opponents
C. 300 games against each of the 4 opponents
D. 400 games against each of the 3 opponents
E. 600 games against each of the 2 opponents
F. 1200 games against a single opponent
Difficult question.

First, one opponent will lead you to tuning against that opponent, which may well hurt you against others.

However, using many opponents forces you to reduce the number of starting test positions, which makes the choice of positions critical.

If you are trying to measure 10-20 Elo changes, 1200 games is really hopeless, however. This is a painful issue, no doubt...
That's why I don't target 10 to 20 Elo changes for now; that will be for tuning later, when the engine is already really strong. I'm more into trying out ideas that may give or take at least 30 Elo.

I agree about the few starting positions being critical.

Well, at least 1200 games is better than nothing. If a version/setting is good it will show in the rating list no matter how few the games.
Actually it won't. I've posted lots of results where a new version was way worse after 200-500 games, and then by 40,000 games was better. And vice-versa. The eye-opening test here is to run a 100 game match, then change _nothing_ and rerun it. The difference can be startling. Run it enough times and you develop a real appreciation for why the error bar is so high after 100 games.
For Elo differences of somewhere around 20 I would agree, but for an Elo difference of, for example, 100 I doubt that after 200 games the stronger version would still be behind the weaker version.

It really depends on what you are trying to measure. The smaller the difference, the more games you need. In my engine, I don't try tuning much. Tuning, I think, when an engine is not that strong yet, will only find a local maximum. I'm more into trying out new ideas.
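
A back-of-the-envelope check of that intuition (normal approximation, win/loss only; draws would only make a reversal less likely, and this sketch is mine, not Edsel's or Bob's):

Code:

import math

def upset_probability(elo_diff, games):
    """P(the stronger side scores below 50%) after the given number of
    games, using the logistic expected score and a normal approximation."""
    p = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))   # expected score
    se = math.sqrt(p * (1.0 - p) / games)           # std. error of the mean score
    z = (0.5 - p) / se
    return 0.5 * math.erfc(-z / math.sqrt(2.0))     # Phi(z)

print(upset_probability(100.0, 200))  # ~1e-5: a 100 Elo edge rarely reverses
print(upset_probability(20.0, 200))   # ~0.2: a 20 Elo edge reverses quite often
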
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: On engine testing again!

Post by Sven »

Could someone please post a table (or formula) listing typical error bars we can expect for 500, 1000, 1500, ..., 5000 games (most of us can't play more games within reasonable time), together with an explanation of how the number of opponents and possibly other major factors affect the error bars? That would be great.

Sven
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: On engine testing again!

Post by zamar »

Sven Schüle wrote:Could someone please post a table (or formula) listing typical error bars we can expect for 500, 1000, 1500, ..., 5000 games (most of us can't play more games within reasonable time), together with an explanation of how the number of opponents and possibly other major factors affect the error bars? That would be great.

Sven
If I recall correctly, H.G. Mueller posted an approximate formula for the error bar of the winning percentage: sigma = 40 / sqrt(number of games)

To get an approximation in Elo, multiply by 7.

The number of opponents does not affect the error bar. But when comparing two gauntlets (with an equal number of games) one must multiply the error bar by sqrt(2).
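
Plugging that approximation into the 500-5000 game range Sven asked about gives the following sketch (using exactly the constants quoted above):

Code:

import math

for games in range(500, 5001, 500):
    sigma_pct = 40.0 / math.sqrt(games)         # error bar of the winning percentage
    sigma_elo = 7.0 * sigma_pct                 # rough conversion to Elo
    two_gauntlets = sigma_elo * math.sqrt(2.0)  # comparing two equal-sized gauntlets
    print(f"{games:5d} games: +/-{sigma_pct:.1f}% ~ +/-{sigma_elo:.1f} Elo "
          f"(gauntlet vs gauntlet: +/-{two_gauntlets:.1f} Elo)")

For example, 1000 games comes out at roughly +/-1.3% or +/-9 Elo for a single gauntlet.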

Someone please correct me if I got something wrong. I'm not an expert in statistics :)
Joona Kiiski