Efficiency is one thing, statistics is another. My message related to the latter. There are chess issues, too...
Don wrote:
Miguel,
michiguel wrote:
IMHO, from a statistical point of view, your setup is *excellent* = As many opponents as you can, as much diversity of positions as you can.
Kempelen wrote:
In my case I do 1250 games: 25 opponents and 50 games against each of them. I don't repeat positions; for each game I choose a random position from Bob's suite (3891 different positions). I don't know what others may think about this setup, but for me it works very well, and I have even noticed more precise results.
Regards,
Fermin
Miguel
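For anyone who wants to try a setup like Kempelen's, here is a minimal Python sketch of such a schedule. It assumes "I don't repeat positions" means no start position is used twice across the whole run; the engine names, the integer position IDs, and the function name build_schedule are all made up for illustration.

```python
import random

def build_schedule(opponents, positions, games_per_opponent, seed=None):
    """Give every opponent its own set of start positions, with no position reused."""
    total = len(opponents) * games_per_opponent
    if total > len(positions):
        raise ValueError("not enough positions to avoid repeats")
    rng = random.Random(seed)
    picks = rng.sample(positions, total)  # sampling without replacement
    schedule = []
    for i, opp in enumerate(opponents):
        chunk = picks[i * games_per_opponent:(i + 1) * games_per_opponent]
        schedule.extend((opp, pos) for pos in chunk)
    return schedule

# Hypothetical inputs: 25 opponents, 3891 position IDs standing in for the suite.
opponents = [f"engine_{i:02d}" for i in range(25)]
positions = list(range(3891))
schedule = build_schedule(opponents, positions, 50, seed=1)
print(len(schedule))  # 25 * 50 = 1250 games
```

Colors are not assigned here; if you want each position played from both sides, that has to be layered on top.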
From an efficiency point of view, Kempelen is using half his resources testing everybody else's program. One must ask whether it's worth doing this. The really strong programming teams are not doing this.
Just to digress a bit, I have seen in the past members of strong programming teams making statements that made me think they did not understand some statistical or chess issues. Of course, they are good and made progress, but you can make progress with flawed procedures too (science is progressing with flawed procedures all the time!).
Correct, but you can still have many opponents and as many different games as possible. I believe sometimes you need to do the opposite! See below.
There is a way out if you do not have a really strong program. Find N opponents who are much stronger than your program and handicap them appropriately. To get the most "bang for the buck" you want your opponents to be close in ELO to you. By using stronger opponents you can cut down on their thinking time and thus use your resources more wisely.
I agree, but I think it is important to play a fraction of the games against opponents that are a bit weaker. Maybe -100 would do it. An important part of chess strength is knowing how to convert won positions. Otherwise, you risk tuning only for an excellent defensive program.
I used this principle in my testing. Rybka is one of my sparring partners, but Rybka is so strong that I set it to play much faster than Doch, which means my tester is spending most of its CPU time testing MY program, not Rybka.
If the opponent is weaker, you can give it more time to equalize. I argue that you shouldn't test against programs that are too much weaker, but if you do, it's probably best to eat the cost and give the opponent more time so that the strengths even out. It requires a lot more games to accurately rate against an opponent 300 ELO weaker, for instance. I like to keep everyone within 100 ELO of my program.
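A rough way to see why a lopsided pairing costs more games, as a Python sketch. It assumes the standard logistic Elo curve, independent games, and no draws (draws lower the per-game variance, so real numbers will differ); the function names and the 10 Elo target error are just illustrative choices.

```python
import math

def expected_score(elo_diff):
    """Expected score of a player rated elo_diff above the opponent (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, target_se_elo):
    """Approximate games needed so the measured Elo difference has the given standard error.

    Assumption: win/loss games only, so per-game variance is p * (1 - p).
    """
    p = expected_score(elo_diff)
    slope = math.log(10) / 400.0 * p * (1.0 - p)  # change in score per Elo point
    var_per_game = p * (1.0 - p)
    return math.ceil(var_per_game / (slope * target_se_elo) ** 2)

for diff in (0, 100, 300):
    print(diff, round(expected_score(diff), 3), games_needed(diff, 10.0))
```

Under these simplifying assumptions, a 300 Elo gap needs roughly twice as many games as an even match for the same error bar, because the score curve flattens out as the expected score moves away from 50%.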
There are issues that are not statistical in nature, but more related to chess. When your engine is weak, like mine, direct observation is important (but you need to be a chess player). Here, the opposite of what is good for statistics should be done. The concept is more related to a "debugging spirit". For instance, you play a small number of positions (say, Silver) against a large number of opponents. Then you check which positions have a significantly low scoring percentage. Then you go and look at the problems. Games need to be as fast as possible to expose the problems and make sure that the search does not hide them. That is why, when I do this, I want a GUI. I sit and watch for a while games that are played at 40 moves/20 s, and many times I see patterns that are huge red flags. Besides, this is fun.
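The "which positions score badly" step can be automated before sitting down at the GUI. A minimal Python sketch, assuming you already have (position, score) pairs from your own engine's point of view; the function name, the thresholds, and the position labels A/B/C are hypothetical, and the z-test against the overall scoring rate is just one simple way to define "significantly low".

```python
import math
from collections import defaultdict

def flag_weak_positions(results, z_threshold=2.0, min_games=10):
    """results: iterable of (position_id, score) with score in {0, 0.5, 1}."""
    by_pos = defaultdict(list)
    for pos, score in results:
        by_pos[pos].append(score)
    all_scores = [s for scores in by_pos.values() for s in scores]
    overall = sum(all_scores) / len(all_scores)
    flagged = []
    for pos, scores in by_pos.items():
        n = len(scores)
        if n < min_games:
            continue  # too few games to say anything
        mean = sum(scores) / n
        # Binomial-style standard error; draws only shrink the true variance,
        # so this errs on the side of flagging fewer positions.
        se = math.sqrt(overall * (1.0 - overall) / n)
        if se > 0 and (mean - overall) / se < -z_threshold:
            flagged.append((pos, mean, n))
    return sorted(flagged, key=lambda t: t[1])  # worst first

# Made-up results: 20 games per position.
results = (
    [("A", 1)] * 12 + [("A", 0)] * 8
    + [("B", 1)] * 2 + [("B", 0)] * 18
    + [("C", 1)] * 11 + [("C", 0)] * 9
)
flagged = flag_weak_positions(results)
print(flagged)
```

The flagged list only tells you where to look; the actual diagnosis still comes from watching the games.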
Miguel