New testing thread

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

Uri Blass wrote:The reason for your result is not that time measurements are never identical, but that their distribution changes after many games.
This is simply wrong. The distribution changes all during the match. I posted several 3 second runs to show that, where the last two were the same. I will try to run maybe 64 since that is the number of distinct 4-game sets I run for a given position, and see how the nodes searched are distributed... will post this later.

Note that even if you could get the same distribution, the test is simply not good for testing small changes, because the number of positions is too small.
That may well be the issue that is hiding here. However, again, this is no different than testing done by others. So if it is bad for one, then it is also bad for all, which has been my point. The only remaining question that needs answering before this can be tested is "how many are enough and what are they?"


The main problem is that some change may give +5 Elo in the Silver suite at the specific time control that you use, but -10 Elo at a slightly different time control.

The best solution is simply to have more positions in your suite.

It is better to test 13,000 different positions once with white and once with black than to test the same positions again and again, if you want to evaluate small changes.

I remember reading that the Rybka team uses a big set of positions, not a small one, for their tests at 1 second per game.

In practice, to get 26,000 positions you can take some big PGN file and use the first 26,000 different positions (after removing duplicates) after move 10 as starting points.

Uri
I can probably make that happen easily enough.
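For illustration only, a minimal sketch of the extraction Uri describes (walk a big PGN file and keep the first unique positions reached with White to move at move 10). It assumes the python-chess library; the file names are placeholders, and the 26,000 target is just the number from the post.

Code:

import chess
import chess.pgn

TARGET = 26000                      # number of positions wanted
seen = set()                        # EPD strings already kept (removes doubles)

with open("big_games.pgn") as pgn, open("openings.epd", "w") as out:
    while len(seen) < TARGET:
        game = chess.pgn.read_game(pgn)
        if game is None:            # ran out of games
            break
        board = game.board()
        for move in game.mainline_moves():
            board.push(move)
            # the position where White is about to play move 10
            if board.fullmove_number == 10 and board.turn == chess.WHITE:
                epd = board.epd()
                if epd not in seen:
                    seen.add(epd)
                    out.write(epd + "\n")
                break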
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

hgm wrote:
bob wrote:Got anything useful to offer? That's not what this was about. We were testing to see if round-robin offered more reliable ratings than just C vs world.
Why would you want to test such a thing, which trivially follows from first principles? Would you also play a game of Chess twice, and then use BayesElo to 'test' if 1+1 equals 2???
Are you following the discussion? Someone posed the question: "If you run a round robin, how will that compare to the results of just C vs World, since it will not (according to Remi) affect the rating of Crafty itself?" And I simply ran a test to answer that question. A later question was "would this stabilize the numbers better and reduce the variation?" So I ran the test four times to see. The answer is that there is still a ton of variation.

So that is the discussion _I_ was trying to deal with. Someone made a reasonable request for new data, and I tried to provide it. I am now off to try to generate a huge set of starting positions, once I come up with a way to make sure that the set of positions is fairly broad and doesn't just include a certain opening system such as the Sicilian, while excluding other popular choices. That is a bit more complicated, as the PGN collection has to cover everything before I can extract opening positions that do.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

Uri Blass wrote:
hgm wrote:
bob wrote:Got anything useful to offer? That's not what this was about. We were testing to see if round-robin offered more reliable ratings than just C vs world.
Why would you want to test such a thing, which trivially follows from first principles? Would you also play a game of Chess twice, and then use BayesElo to 'test' if 1+1 equals 2???

I suggest not insulting Bob.

It is obvious to me and to you that games between non-Crafty versions are not going to help give Crafty a more accurate rating, but it seems it is not obvious to Bob and the people who suggested that he play games between non-Crafty versions.

I noticed that other people commented nonsense, and based on my memory I commented something against the test, but I understood that I had no chance of convincing other people about it unless they saw the results, so I stopped commenting about it.

Note that I did not study the exact way the rating program works, but it is obvious that a smaller error for Crafty after games between non-Crafty versions suggests some bug in the rating program, because I see no way this information can help produce a better estimate of Crafty's rating.

Uri
You might notice that the ratings _did_ change. The entire range was compressed, and in some cases the order changed a bit. So from some perspective, things were a bit different.

However, actual _suggestions_ are pretty rare in this forum. More often it is just the old "stampee foot, bad test, stampee foot, etc." sort of reply. Since it didn't take any of my time, I thought it worthwhile to run the test as asked. I'm now going to try to take a suggestion made by others and yourself, namely to try a larger set of starting positions. Just got to figure out how to get a set of positions that measures what I want to measure. I don't want to just test with 1. e4 c5 or some other specific opening, and I don't want to test using off-the-wall openings like 1. f3 e5 2. Kf2 and such, since I don't care how Crafty does in such an opening. So picking the positions is an interesting problem and I'm looking at it.
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Correlated data discussion

Post by Zach Wegner »

bob wrote:Not even close. Here are some samples from Crafty. Same position. Target time of 3 seconds, using just one CPU to minimize the node variation. These results are run in a "conservative mode" where it checks the time more frequently than in real games:

log.001: time=3.00 mat=0 n=6814217 fh=94% nps=2.3M
log.002: time=3.01 mat=0 n=6786713 fh=94% nps=2.3M
log.003: time=3.01 mat=0 n=6768568 fh=94% nps=2.2M
log.004: time=3.00 mat=0 n=6814217 fh=94% nps=2.3M
log.005: time=3.08 mat=0 n=6948342 fh=94% nps=2.3M
log.006: time=3.06 mat=0 n=6948342 fh=94% nps=2.3M

The problem is that the clock advances in "spurts" and the sample instant can occur anywhere along that spurt. I use the NPS to help control how often I sample the time, and the NPS is non-constant as well, which introduces more randomness into when the time is checked, which in turn controls how much time can be used in a search.
Well, your feedback loop of checking the time based on NPS does affect it, as well as your "conservative mode". I ran the same test:

Code:

search: nodes=2040325 q=79.1% time=3.004 nps=679202
search: nodes=2030322 q=79.1% time=3.005 nps=675647
search: nodes=2040325 q=79.1% time=3.006 nps=678750
search: nodes=2040325 q=79.1% time=3.002 nps=679655
search: nodes=2045326 q=79.1% time=3.005 nps=680640
search: nodes=2035325 q=79.1% time=3.003 nps=677763
search: nodes=2045326 q=79.1% time=3.003 nps=681094
search: nodes=2050326 q=79.1% time=3.005 nps=682304
search: nodes=2040325 q=79.1% time=3.007 nps=678525
search: nodes=2050326 q=79.1% time=3.003 nps=682759
I should have been more accurate in my last post. I don't check timeout every 10000 nodes, but rather every 10000 iterations of my iterative search, which intuitively works out to close to 5000 nodes. But even so, there are just a few different outcomes.

Regardless of how unpredictable they are, I would not assume that they obey a random distribution well enough to be statistically valid. If I were running the same massive tests that you are, I'd probably change to randomized node counts.
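As a sketch of what randomized node counts could look like (my illustration, not anyone's actual engine code; the roughly 5000-node mean interval is the figure mentioned above), the idea is to draw the next time-check interval at random so that identical searches do not all poll the clock at the same node counts:

Code:

import random
import time

MEAN_INTERVAL = 5000                 # average nodes between clock polls (assumed)

class TimeCheck:
    def __init__(self, seconds):
        self.deadline = time.monotonic() + seconds
        self.nodes_until_check = self._draw()

    def _draw(self):
        # uniform jitter of +/- 50% around the mean interval
        return random.randint(MEAN_INTERVAL // 2, MEAN_INTERVAL * 3 // 2)

    def out_of_time(self):
        """Call once per searched node; True once the time budget is spent."""
        self.nodes_until_check -= 1
        if self.nodes_until_check > 0:
            return False
        self.nodes_until_check = self._draw()
        return time.monotonic() >= self.deadline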

Any comments on the rest of my post?
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: Correlated data discussion

Post by xsadar »

bob wrote:
Uri Blass wrote:The reason for your result is not that time measurements are never identical, but that their distribution changes after many games.
This is simply wrong. The distribution changes all during the match. I posted several 3 second runs to show that, where the last two were the same. I will try to run maybe 64 since that is the number of distinct 4-game sets I run for a given position, and see how the nodes searched are distributed... will post this later.

Note that even if you could get the same distribution, the test is simply not good for testing small changes, because the number of positions is too small.
That may well be the issue that is hiding here. However, again, this is no different than testing done by others. So if it is bad for one, then it is also bad for all, which has been my point. The only remaining question that needs answering before this can be tested is "how many are enough and what are they?"
This discussion tends to confuse me a bit, but this seems the most likely possibility to me. Chess engines are very deterministic things. When using the same starting position, even if you never play the exact same game more than once, it seems likely to me that you will still duplicate significant portions of many games, and that you might also find various permutations (or near permutations) of a move sequence occurring in different games. Have you looked for duplicate move sequences and positions in the resulting games?
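For what it's worth, one way to check for such duplication is to count how often the same opening line shows up in the result PGN. This is only a sketch: it assumes the python-chess library, a placeholder file name, and an arbitrary 30-ply prefix length.

Code:

from collections import Counter
import chess.pgn

PREFIX_PLIES = 30                    # compare the first 15 moves of each game
prefixes = Counter()

with open("match_results.pgn") as pgn:
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:
            break
        moves = tuple(m.uci() for m in game.mainline_moves())
        prefixes[moves[:PREFIX_PLIES]] += 1

shared = sum(1 for count in prefixes.values() if count > 1)
print(f"{shared} opening lines appear in more than one game")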

As for how many different positions, I would guess that you would want enough positions so that Crafty plays from the same position against each engine exactly twice -- once as each color.

Which positions to use seems like a much more difficult question though, as you would probably want them to be both as distinct and as typical as possible.

The main problem is that some change may give +5 Elo in the Silver suite at the specific time control that you use, but -10 Elo at a slightly different time control.

The best solution is simply to have more positions in your suite.

It is better to test 13,000 different positions once with white and once with black than to test the same positions again and again, if you want to evaluate small changes.

I remember reading that the Rybka team uses a big set of positions, not a small one, for their tests at 1 second per game.

In practice, to get 26,000 positions you can take some big PGN file and use the first 26,000 different positions (after removing duplicates) after move 10 as starting points.

Uri
I can probably make that happen easily enough.
krazyken

Re: Correlated data discussion

Post by krazyken »

bob wrote:
Uri Blass wrote:The reason for your result is not that time measurements are never identical, but that their distribution changes after many games.
This is simply wrong. The distribution changes all during the match. I posted several 3 second runs to show that, where the last two were the same. I will try to run maybe 64 since that is the number of distinct 4-game sets I run for a given position, and see how the nodes searched are distributed... will post this later.

Note that even if you could get the same distribution, the test is simply not good for testing small changes, because the number of positions is too small.
That may well be the issue that is hiding here. However, again, this is no different than testing done by others. So if it is bad for one, then it is also bad for all, which has been my point. The only remaining question that needs answering before this can be tested is "how many are enough and what are they?"


The main problem is that some change may give +5 Elo in the Silver suite at the specific time control that you use, but -10 Elo at a slightly different time control.

The best solution is simply to have more positions in your suite.

It is better to test 13,000 different positions once with white and once with black than to test the same positions again and again, if you want to evaluate small changes.

I remember reading that the Rybka team uses a big set of positions, not a small one, for their tests at 1 second per game.

In practice, to get 26,000 positions you can take some big PGN file and use the first 26,000 different positions (after removing duplicates) after move 10 as starting points.

Uri
I can probably make that happen easily enough.
I think 13,000 positions might be extreme. To approach it in a controlled fashion, my instinct would be to start by doubling the starting positions to see how results from 80 positions compare. Then go to 160; if the number of positions is a correlating factor, you should start to see a trend and be able to predict the minimal useful number of starting positions for match size N.
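For a rough sense of that trend, here is a back-of-the-envelope sketch (my own illustration, not a result from the thread) of the 1-sigma Elo error bar as a function of the number of games, treating each game as an independent win/loss trial; draws would shrink the error somewhat.

Code:

import math

def elo_sigma(score, games):
    """Approximate 1-sigma Elo error for a mean score over `games` games."""
    var = score * (1.0 - score)            # per-game variance, win/loss model
    se = math.sqrt(var / games)            # standard error of the mean score
    # derivative of elo(s) = 400*log10(s/(1-s)) with respect to s
    return 400.0 / (math.log(10.0) * var) * se

for n in (80, 160, 320, 640, 1280, 6750, 33750):
    print(f"{n:6d} games: about +/- {elo_sigma(0.5, n):4.1f} Elo")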
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: ugh ugh ugh

Post by bob »

Tord Romstad wrote:
bob wrote:
Tord Romstad wrote:By the way, I strongly recommend upgrading from Glaurung 2-ε/5 to Glaurung 2.1. As the name indicates, 2-ε/5 was just a beta version. The current version is less buggy, and much stronger.
I will do that once I get the kinks worked out on the Elo testing. I don't particularly care whether I use the strongest or not, just that all the opponents are perfectly consistent for the various test runs so the results are comparable.
I also frequently use old versions. The reason for my message was that you were not just using an old version, but an unstable and incomplete version. If you had used version 2.0 (which is also somewhat oldish, but complete and stable), I wouldn't have mentioned it.

Tord
OK. At least it plays well, and doesn't seem to crash/hang/etc. I don't see it losing on time at all, unlike some of the others that drop a game for no reason here and there.

I have created a new opening test set of 3375 positions. I took them (at Uri's suggestion) from a PGN collection by just dumping each position with WTM at move 10, and then removing any position not played more than 3 times. The source was the PGN we use for our normal "wide" opening book (not from "enormous", which has lots of junk openings as well as the usual ones). Once I make a couple of runs with the new set of positions and see how they look, I may well re-vamp the entire test suite of programs, since I will be "starting over" in the way I test (if this works, that is, without so much unpredictability).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

No other comments at present. I have created a new starting-position EPD file with 3375 positions. I am going to run two tests, Crafty vs 5 opponents (no RR yet), playing each position once as white and once as black, per opponent. If that produces pretty consistent results, I will start cutting the number of positions by 1/2 to see what is a reasonable minimum, that is, if 3375 is enough to give decent stability...

Once this runs, I will be back and ready to discuss options further...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

xsadar wrote:
bob wrote:
Uri Blass wrote:The reason for your result is not that time measurements are never identical, but that their distribution changes after many games.
This is simply wrong. The distribution changes all during the match. I posted several 3 second runs to show that, where the last two were the same. I will try to run maybe 64 since that is the number of distinct 4-game sets I run for a given position, and see how the nodes searched are distributed... will post this later.

Note that even if you could get the same distribution, the test is simply not good for testing small changes, because the number of positions is too small.
That may well be the issue that is hiding here. However, again, this is no different than testing done by others. So if it is bad for one, then it is also bad for all, which has been my point. The only remaining question that needs answering before this can be tested is "how many are enough and what are they?"
This discussion tends to confuse me a bit, but this seems the most likely possibility to me. Chess engines are very deterministic things. When using the same starting position, even if you never play the exact same game more than once, it seems likely to me that you will still duplicate significant portions of many games, and that you might also find various permutations (or near permutations) of a move sequence occurring in different games. Have you looked for duplicate move sequences and positions in the resulting games?

As for how many different positions, I would guess that you would want enough positions so that Crafty plays from the same position against each engine exactly twice -- once as each color.

Which positions to use seems like a much more difficult question though, as you would probably want them to be both as distinct and as typical as possible.

The main problem is that some change may give +5 Elo in the Silver suite at the specific time control that you use, but -10 Elo at a slightly different time control.

The best solution is simply to have more positions in your suite.

It is better to test 13,000 different positions once with white and once with black than to test the same positions again and again, if you want to evaluate small changes.

I remember reading that the Rybka team uses a big set of positions, not a small one, for their tests at 1 second per game.

In practice, to get 26,000 positions you can take some big PGN file and use the first 26,000 different positions (after removing duplicates) after move 10 as starting points.

Uri
I can probably make that happen easily enough.
I have done a first cut. I took a good opening-book PGN file we use and had Crafty save the 10th white-to-move position from each game. I then used the usual Unix utilities to sort and uniq -c them, and culled any position with a duplicate count of 3 or less. That gave 3375, which seems doable (3375 * 2 * 5 = 33,750 games, where I was doing 25,000, so the order of magnitude of the work is about the same).
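For reference, the same filtering step can be done without the shell pipeline; this is just a sketch with placeholder file names, keeping any position that was dumped more than 3 times.

Code:

from collections import Counter

with open("move10_positions.epd") as f:
    counts = Counter(line.strip() for line in f if line.strip())

# keep positions that occurred more than 3 times, i.e. cull counts of 3 or less
popular = sorted(epd for epd, count in counts.items() if count > 3)

with open("starting_positions.epd", "w") as out:
    for epd in popular:
        out.write(epd + "\n")

print(len(popular), "positions kept")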
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

krazyken wrote:
bob wrote:
Uri Blass wrote:The reason for your result is not that time measurements are never identical, but that their distribution changes after many games.
This is simply wrong. The distribution changes all during the match. I posted several 3 second runs to show that, where the last two were the same. I will try to run maybe 64 since that is the number of distinct 4-game sets I run for a given position, and see how the nodes searched are distributed... will post this later.

Note that even if you could get the same distribution, the test is simply not good for testing small changes, because the number of positions is too small.
That may well be the issue that is hiding here. However, again, this is no different than testing done by others. So if it is bad for one, then it is also bad for all, which has been my point. The only remaining question that needs answering before this can be tested is "how many are enough and what are they?"


The main problem is that some change may give +5 Elo in the Silver suite at the specific time control that you use, but -10 Elo at a slightly different time control.

The best solution is simply to have more positions in your suite.

It is better to test 13,000 different positions once with white and once with black than to test the same positions again and again, if you want to evaluate small changes.

I remember reading that the Rybka team uses a big set of positions, not a small one, for their tests at 1 second per game.

In practice, to get 26,000 positions you can take some big PGN file and use the first 26,000 different positions (after removing duplicates) after move 10 as starting points.

Uri
I can probably make that happen easily enough.
I think 13,000 positions might be extreme. To approach it in a controlled fashion, my instinct would be to start by doubling the starting positions to see how results from 80 positions compare. Then go to 160; if the number of positions is a correlating factor, you should start to see a trend and be able to predict the minimal useful number of starting positions for match size N.
I had the same thought, but using an inverse process. I have 3375 starting positions already. I'm going to run 4 times using those and compare the results for stability. If they look decent, I am going to use just the first half and try again, and continue until things start to get too variable... But first I have to make sure that 3375 is enough; if not, I have to go back and produce more.