Here are some results from a test in progress. This is just one set of 64 matches against the same opponent; so far only 13 matches are completely done, plus some additional partial results that the program I use does not tally until a match has a full 80 games in it.
glaurung results ->
15 distinct runs (944 games) found
match: win/ draw/ loss (score)
1: 34/ 22/ 24 ( 10)
2: 26/ 11/ 43 (-17)
3: 26/ 14/ 40 (-14)
4: 39/ 12/ 29 ( 10)
5: 32/ 18/ 30 ( 2)
6: 37/ 9/ 34 ( 3)
7: 35/ 11/ 34 ( 1)
8: 38/ 13/ 29 ( 9)
9: 29/ 16/ 35 ( -6)
10: 34/ 17/ 29 ( 5)
11: 34/ 13/ 33 ( 1)
12: 32/ 17/ 31 ( 1)
13: 35/ 14/ 31 ( 4)
Again, those are the first 13 completed matches; there will be 51 more when it is done. They are given in the order they were played. Each line represents 40 positions, with each side playing black and white, for a total of 80 games.
The overall score for the previous version at this fairly fast time control was -7. So I am trying to determine whether the new version is better or not.
If I stopped after the first 80 games, I would certainly conclude "yes": that is a 17-point improvement. If I took just the second test, I would conclude "no", as that is a 10-point-worse result. If I had run a 160-game match (which adds the first two lines), I would conclude "the two versions are equal". So far the change looks good, as the average is around zero, which is better than -7. I have no idea where the final result will end up, however, but I suspect it will say "good", "bad" or "equal", with a lot more weight than the first few results carry.
Again, this is not a worst case, nor a best case, nor hand-selected. It is just the first 13 80-game matches that were completed. It starts off on a roller-coaster and settles down a bit; whether it will stay settled or start to oscillate heavily again is unknown until the run finishes.
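For anyone who wants to play with these numbers, here is a minimal Python sketch (not part of the actual test harness) that tallies the 13 completed mini-match scores listed above and prints the running average, which shows the early roller-coaster flattening out:

```python
# Minimal sketch: per-match scores (wins minus losses) taken from the
# table above, followed by the running average after each mini-match.
import statistics

scores = [10, -17, -14, 10, 2, 3, 1, 9, -6, 5, 1, 1, 4]

total = 0
for i, s in enumerate(scores, start=1):
    total += s
    print(f"after match {i:2d}: score {s:+3d}, running average {total / i:+.1f}")

print(f"grand average so far: {statistics.mean(scores):+.2f}")
```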
more test data
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: more test data
Grand average 0.7, observed SD of the mini-match results 8.1. The prediction according to the 0.8*sqrt(M) rule of thumb is 7.2 (for M = 80 games per mini-match). Note, however, that the draw percentage is low (18%), so the pre-factor really should be sqrt(1-18%) = sqrt(0.82) = 0.9, giving a prediction of about 8.1. So the observed SD is smack on what we would expect for independent games. Nothing special in the 13-match result.
None of the individual deviations is exceptional: the largest, -17.7, corresponds to 2.18*sigma, which (two-sided) should have a frequency of 2.8%. In 13 samples it should thus occur on average 13*2.8% = 0.364 times. The probability that it occurs exactly once is thus 0.364/1! * exp(-0.364) = 25.3%. This is about as high as it gets. (One seldom sees things that have 100% probability, as that would mean that there could be no variation in what you see at all...) So not very remarkable.
If one wanted to point at anything unlikely in this data, it would be that the 4 largest deviations, although not very exceptional in themselves, all occur in the first 4 samples. One observation of this is not significant enough to draw any conclusions from, though.
Now if you had several such data sets, and they all had significantly larger deviations in the first ~4 mini-matches, or at least passed some objective test showing that these early variances are significantly larger on average than the others, it would be very interesting. Because the first mini-matches can only be different from the later ones if they somehow _know_ that they belong to the first four. And there should be no way they could know that, as that would violate the independence of the games.
So if such above-average variances were to occur systematically in the first few mini-matches, it would point to some defect in the setup subverting the independence. My guess would still be that this is merely a coincidence, and that other such data sets would show that the largest deviations are randomly scattered over the data set and do not prefer particular mini-matches.
Or, in summary, nothing remarkable in this data; variances exactly as expected. But watch those early variances.
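For readers who want to check these figures, the following Python sketch reproduces the arithmetic in this post from the 13 scores above. It is only an illustration of the calculation; the 18% draw rate and the 0.8*sqrt(M) rule of thumb are taken as given from the post, and the population SD is used, as in the post.

```python
# Reproduces the numbers in the post above from the 13 mini-match scores.
import math
import statistics

scores = [10, -17, -14, 10, 2, 3, 1, 9, -6, 5, 1, 1, 4]
games_per_match = 80
draw_fraction = 0.18                       # ~18% draws, as stated in the post

mean = statistics.mean(scores)             # grand average, ~0.7
sd_obs = statistics.pstdev(scores)         # observed SD of the mini-matches, ~8.1

# Rule-of-thumb SD for an M-game match score: 0.8*sqrt(M), refined to
# sqrt(1 - draw_fraction)*sqrt(M) when the draw rate is low.
sd_rule = 0.8 * math.sqrt(games_per_match)                           # ~7.2
sd_draw = math.sqrt(1 - draw_fraction) * math.sqrt(games_per_match)  # ~8.1

# Largest deviation from the grand average, in units of the observed SD.
z = abs(min(scores) - mean) / sd_obs                 # ~2.18 sigma
p_two_sided = math.erfc(z / math.sqrt(2))            # two-sided tail, ~2.8-2.9%

# Expected count of such deviations among 13 samples, and the Poisson
# probability of seeing exactly one of them.
lam = len(scores) * p_two_sided                      # ~0.36
p_exactly_one = lam * math.exp(-lam)                 # ~25%

print(f"grand average             : {mean:.1f}")
print(f"observed SD               : {sd_obs:.1f}")
print(f"0.8*sqrt(M) prediction    : {sd_rule:.1f}")
print(f"draw-corrected prediction : {sd_draw:.1f}")
print(f"largest deviation         : {min(scores) - mean:.1f} ({z:.2f} sigma)")
print(f"two-sided tail probability: {p_two_sided:.1%}")
print(f"P(exactly one in 13)      : {p_exactly_one:.1%}")
```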
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: more test data
hgm wrote: Grand average 0.7, observed SD of the mini-match results 8.1. [...]
The wild cases don't always come first. They don't usually come last. They just "come", which makes using a small sample unreliable. Nothing more, nothing less.
-
- Posts: 10788
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: more test data
bob wrote: The wild cases don't always come first. They don't usually come last. [...]
The fact that the 4 largest deviations all occur in consecutive samples X, X+1, X+2, X+3 makes me suspect that something is wrong in your testing, even apart from the fact that X=1 here.
I do not know what is wrong. Maybe there is something you did not think about that causes correlation between the results of the matches.
One possibility I thought of, besides the possibility that one program is slowed down by a significant factor, is that for some reason you do not use the same exe in different samples, or you use the same exe with different parameters, so the exe shows different behaviour in different matches.
Uri
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: more test data
Uri Blass wrote: The fact that the 4 largest deviations all occur in consecutive samples X, X+1, X+2, X+3 makes me suspect that something is wrong in your testing. [...]
First, the highest deviation so far was not in the first 4. The next two were worse (14 and 15).
Second, you can think something is wrong all you want. The worst runs are not always first. The big set I published last week had the biggest variances farther down.
There is absolutely no difference in how things are executed. Each program is started exactly the same way each time, exactly the same options, exactly the same hash, nobody else running on that processor. And I monitor the NPS for each and every game inside Crafty.
Unfortunately the "other programs" can't produce logs because they just use one log file as opposed to crafty that can log for each separate game that is played.
But forget about the idea of things changing from game to game. I have one giant command file, produced automatically, one line per game, and I fire these off to individual computers one at a time until all machines are busy, and then as one completes I send that node the next game after it re-initializes...
You have to take this data for what it is: significant random variance in some matches, apparent stability in others. For the matches that produce more stable results, that is only because I am using the total score for the 80 games. If you were to look at the individual game strings, you would find lots of variance even though the final scores are nearly the same. So the variance is still there; it is just not expressed in the final result...
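As an illustration only (this is not Bob's actual cluster scripts; the node names, the `run_on_node` helper, and the ssh call are hypothetical stand-ins), the setup described above amounts to a plain work queue: read the command file, hand one line to each free machine, and give a machine its next line as soon as it finishes.

```python
# Hypothetical sketch of the dispatcher described above: one command file
# with one game per line, a fixed pool of nodes, and each node pulling the
# next line as soon as it finishes its current game.
import queue
import subprocess
import threading

NODES = ["node01", "node02", "node03", "node04"]   # made-up node names

def run_on_node(node: str, command: str) -> None:
    # Stand-in for "log in to the node and run one game" (e.g. via ssh).
    subprocess.run(["ssh", node, command], check=False)

def worker(node: str, jobs: "queue.Queue[str]") -> None:
    # Keep this node busy until the command file is exhausted.
    while True:
        try:
            command = jobs.get_nowait()
        except queue.Empty:
            return
        run_on_node(node, command)
        jobs.task_done()

def dispatch(command_file: str) -> None:
    jobs: "queue.Queue[str]" = queue.Queue()
    with open(command_file) as f:
        for line in f:                     # one line per game
            if line.strip():
                jobs.put(line.strip())
    threads = [threading.Thread(target=worker, args=(n, jobs)) for n in NODES]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# dispatch("games.cmd")   # hypothetical command-file name
```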
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: more test data
Well, if the largest deviations tend to be distributed evenly over the set, then there is nothing remarkable in this data (quite unlike the notorious gang of four...). The maximum deviation is about what one would expect for this size, and that it occurs early is apparently just coincidence.
The last thing you mention is self-evident. The variance in the match totals only grows as the square root of the number of games, so for smaller subsets of games (in particular individual games) the relative SD must always be higher. Otherwise they could never produce the observed variance in their total.
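A tiny simulation makes that square-root scaling concrete. It assumes nothing beyond two equal opponents and the roughly 18% draw rate seen in the data, so it is an illustration rather than a model of the actual engines:

```python
# Illustration: the SD of an N-game total grows like sqrt(N), so the
# per-game spread of larger totals is smaller even though each individual
# game is just as noisy.
import math
import random

random.seed(1)

def play_game() -> int:
    # +1 win, 0 draw, -1 loss; ~18% draws, the rest split evenly.
    r = random.random()
    if r < 0.18:
        return 0
    return 1 if r < 0.59 else -1

def match_score(n_games: int) -> int:
    return sum(play_game() for _ in range(n_games))

for n in (1, 80, 1040):
    samples = [match_score(n) for _ in range(2000)]
    mean = sum(samples) / len(samples)
    sd = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
    predicted = math.sqrt(1 - 0.18) * math.sqrt(n)
    print(f"{n:5d} games: SD of total ~ {sd:5.1f} "
          f"(predicted {predicted:5.1f}), per-game spread ~ {sd / n:.3f}")
```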
Re: more test data
bob wrote: First, the highest deviation so far was not in the first 4. The next two were worse (14 and 15). [...]
Would that mean that an engine that has learning enabled, and (accidentally) gets a fast win, will start a new game sooner than the other machines, and with the extra knowledge (learn file) have a bigger chance of another fast win, which will go into the learning file, and so on?
Whereas if there is no accidental fast win, this won't happen?
Tony
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: more test data
Tony wrote: Would that mean that an engine that has learning enabled, and (accidentally) gets a fast win, will start a new game sooner than the other machines, and have a bigger chance of another fast win? [...]
I don't know if that could happen when using learning or not. But in this case, it can't possibly happen, for several reasons:
(1) learning is explicitly disabled (position learning is all that could be used since there are no opening books used at all).
(2) no files are carried over between individual games. Each game is played in a completely "sterile" environment.
(3) to further rule that out, I have played these same tests using a fixed number of nodes for both opponents, and there the games are identical run after run. Even with fixed node counts, position learning would still change scores, and moves, particularly for the side that lost. But with this kind of test, 64 matches produce 64 sets of 80 games that match perfectly, move for move. The only variance is very slight time differences (one move takes 5.21 seconds; in the next match it might take 5.23 or 5.19), but since time is not used to end searches in that test, it makes no difference at all.
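A minimal sketch (in no way Crafty's actual search code; the class and node limit are made up) of why a fixed node budget removes the timing variance: the stop condition depends only on the node counter, never on the clock, so a given position always produces the same tree and the same move no matter how fast or loaded the machine is.

```python
# Toy illustration: the search terminates on a node count, not on elapsed
# time, so repeated runs are move-for-move identical; only the clock varies.
import time

NODE_LIMIT = 100_000          # hypothetical fixed budget per move

class FixedNodeSearch:
    def __init__(self, node_limit: int):
        self.node_limit = node_limit
        self.nodes = 0

    def out_of_budget(self) -> bool:
        # Deterministic stop condition: depends only on nodes searched.
        return self.nodes >= self.node_limit

    def search(self, position) -> str:
        best_move = "none"
        while not self.out_of_budget():
            self.nodes += 1
            best_move = self.expand_one_node(position)
        return best_move

    def expand_one_node(self, position) -> str:
        return "e2e4"         # stand-in for real tree expansion

start = time.time()
move = FixedNodeSearch(NODE_LIMIT).search(position=None)
print(f"best move {move} after {NODE_LIMIT} nodes in {time.time() - start:.2f}s "
      "(the time varies from run to run, the move does not)")
```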
Re: more test data
bob wrote: I have played these same tests using a fixed number of nodes for both opponents, and there the games are identical run after run. [...]
OK, for me the big difference is strange (between fixed nodes and normal mode). So either the result is wrong, or my expectation is.
OTOH:
Flipping a coin should give 50%, with expected deviations etc.
Assuming the result of the coin flip depends on the force and angle with which I flip it, fixing that force and angle would make the experiment repeatable.
But would that improve the measurement? There would be a big difference from the non-fixed flips. But how do I know my fixed variables aren't biased?
Did you find a noticeable difference between fixed-node matches and non-fixed ones (in the results or the SD)?
If the results are "equal" but the SD is lower, fixed nodes could be an alternative to the lengthy matches.
Tony
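A toy version of that coin-flip analogy, purely as an illustration (the seed stands in for the "fixed force and angle"): fresh randomness gives the usual binomial spread across repeats, while fixed conditions repeat exactly, which is reproducible but not necessarily unbiased.

```python
# Coin-flip illustration: repeated 80-flip "matches" with fresh randomness
# spread out; with a fixed seed every repeat is identical (and may still
# happen to sit away from 50%).
import random
import statistics

FLIPS, REPEATS = 80, 20

def heads(rng: random.Random) -> int:
    return sum(rng.random() < 0.5 for _ in range(FLIPS))

free = [heads(random.Random()) for _ in range(REPEATS)]        # new randomness each repeat
fixed = [heads(random.Random(12345)) for _ in range(REPEATS)]  # same "force and angle" each repeat

print("free flips :", free, " SD =", round(statistics.pstdev(free), 2))
print("fixed flips:", fixed, " SD =", round(statistics.pstdev(fixed), 2),
      "(perfectly repeatable, possibly biased)")
```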
-
- Posts: 1056
- Joined: Fri Mar 10, 2006 6:07 am
- Location: Basque Country (Spain)
Re: more test data
I imagine you are running this test on the cluster at the university. Have you tried running the test on your laptop to see whether the results are similar?
I have not found an opponent for my engine where an 80-game test can be won by 10 and the next one lost by 17. When I repeat a test, I normally find a difference of 0-3 points.
Perhaps Crafty is a more irregular engine than most in its results.
I have only tried my engine against Crafty on one occasion, with opening books. After 40 games the result was even, which surprised me very much, although some people commented that the version of Crafty I tested was weaker than previous ones (I believe it was version 20.1). In the following games Crafty began to win clearly; at the time I thought this was due to learning, and because my engine tended to repeat the same openings. Perhaps it was not the learning, and Crafty really is more irregular in its results than other engines.