testing, again. Glaurung 2 change

bob · Post by **bob** » Sun Aug 24, 2008 12:09 am

Tord:

A while back you mentioned that I should move from the older 2.0 epsilon whatever to the most recent. I didn't change at the time because I didn't want to alter a constant opponent that was represented in a lot of old data.

With the new testing approach, I am now in the process of re-evaluating the opponents, and perhaps adding a few more (playing somewhat fewer games per opponent to keep the total computation about the same).

One oddity I found is this:


crafty-22.2R5
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   115    9    9  3894   70%   -34   20% 
   2 Fruit 2.1               68    9    9  3894   64%   -34   24% 
   3 opponent-21.7           20    8    8  3894   58%   -34   34% 
   4 Glaurung 1.1 SMP        14    9    9  3894   57%   -34   20% 
   5 Crafty-22.2            -34    5    5 19470   44%     7   23% 
   6 Arasan 10.0           -184    9    9  3894   30%   -34   20% 

Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.1        95   11   11  2271   65%   -22   18% 
   2 Fruit 2.1           52   11   11  2267   60%   -22   24% 
   3 Glaurung 1.1 SMP    27   11   11  2263   57%   -22   21% 
   4 opponent-21.7       16   11   10  2269   56%   -22   35% 
   5 Crafty-22.2        -22    5    6 11344   46%     4   24% 
   6 Arasan 10.0       -169   11   11  2274   30%   -22   20%

While the current test has not completed, I did run one complete run but threw it out because I accidentally replaced the wrong Glaurung with the newest version. What I noticed is that, at least against Crafty, the new Glaurung is not doing quite as well as the previous version (the old one scored 70% vs. Crafty, the new one 65%). I will post the complete run when it finishes, but I thought it was interesting. Whether it suggests that some change was not so good in general, or just not so good against Crafty, I am not sure.

I could, if you are interested, run a round-robin so that everybody plays everybody a ton of games to see how the old and new compare?

It is always possible that the results will change when all games are played, so I will post a follow-up when it finishes, probably late tonight.

Cluster is currently at 1/2 speed with half the nodes powered down until the A/C problem is resolved.
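
As a rough cross-check on the 70% versus 65% figures, a score fraction can be converted to a rating difference with the standard logistic Elo expectation formula. This is only a sketch: whatever tool produced the tables above may use a different model (for example, one that treats draws separately), so the numbers will not match the tables exactly. The function name is made up for illustration.

```python
import math

def elo_diff_from_score(score: float) -> float:
    """Rating difference implied by a score fraction under the logistic Elo model.

    Assumption: plain logistic Elo curve, not necessarily the model behind the tables.
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))

old = elo_diff_from_score(0.70)  # Glaurung 2-epsilon/5 vs. Crafty, from the first table
new = elo_diff_from_score(0.65)  # Glaurung 2.1 vs. Crafty, from the second table
print(f"70% -> {old:+.1f} Elo, 65% -> {new:+.1f} Elo, gap {old - new:.1f} Elo")
```

Under this model the five-point drop in score corresponds to a gap of a few tens of Elo, which is why it stands out even in an incomplete run.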

Uri Blass · Post by **Uri Blass** » Sun Aug 24, 2008 2:14 am

I suspect that something is wrong with the results not only because of glaurung.

Fruit 2.1's rating went down by 16 Elo while Arasan 10.0's went up by 15 Elo.
It seems that results tend to get closer to 50%. If, in the middle of the test, White always won on time or something like that happened, it could explain results that are closer to 50% (I have no idea whether something like that happened, and there may be other explanations).

Hopefully you saved the PGN of both runs, including the times of the games, so people can find out whether something went wrong in a specific time interval.

Uri

BubbaTough · Post by **BubbaTough** » Sun Aug 24, 2008 2:19 am

the new glaurung is not doing quite as well as the previous version

I have noted that in some situations Glaurung 2-epsilon/5 will lose on time, particularly when it is about to checkmate someone. The first thing I would do is check whether there are any losses on time in your test runs (if such a thing is easy for you to do).

-Sam

bob · Post by **bob** » Sun Aug 24, 2008 2:23 am

Uri Blass wrote:
I suspect that something is wrong with the results not only because of glaurung.

Fruit 2.1's rating went down by 16 Elo while Arasan 10.0's went up by 15 Elo.
It seems that results tend to get closer to 50%. If, in the middle of the test, White always won on time or something like that happened, it could explain results that are closer to 50% (I have no idea whether something like that happened, and there may be other explanations).

Hopefully you saved the PGN of both runs, including the times of the games, so people can find out whether something went wrong in a specific time interval.

Uri

1. There are _zero_ time losses in this match so far, or in the previous match I am comparing it to.

2. I did say it was incomplete.

3. There is no time dependency. Already been there. I ran the test 12 times so far and the results have been consistent. It might be that as the match continues, things will even back out since there are 4000 different positions, and they are played in sequence.

I will run it a second time (and yes, the PGN has been saved) to see whether anything changes, though I doubt it will change enough to erase this difference. In the first run, G2.1 finished behind the epsilon version, which is what caught my eye and also revealed that I had replaced the wrong one.

bob · Post by **bob** » Sun Aug 24, 2008 2:27 am

BubbaTough wrote:
the new glaurung is not doing quite as well as the previous version
I have noted that in some situations Glaurung 2-epsilon/5 will lose on time, particularly when it is about to checkmate someone. The first thing I would do is check whether there are any losses on time in your test runs (if such a thing is easy for you to do).

-Sam

It is easy to do as the referee records that in the PGN. E5 has not been losing on time in any of my previous matches unless I go to very short time controls where Fruit starts to "lose it" as well. But in the current time control, these matches have zero time forfeits...
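
A check like this can be scripted. The sketch below assumes the referee writes the standard PGN tag `[Termination "time forfeit"]`; the post does not say how the time loss is actually recorded, so the pattern would need adjusting for a referee that uses a comment or a different tag. The function name and the sample games are made up for illustration.

```python
import re

# Assumption: time losses appear as the standard PGN tag pair
# [Termination "time forfeit"]; adjust the pattern for other referees.
TIME_FORFEIT = re.compile(
    r'^\[Termination\s+"time forfeit"\]', re.IGNORECASE | re.MULTILINE
)

def count_time_forfeits(pgn_text: str) -> int:
    """Count games whose Termination tag reports a time forfeit."""
    return len(TIME_FORFEIT.findall(pgn_text))

# Two invented games, only to exercise the function.
sample = '''[Event "test"]
[White "Crafty-22.2"]
[Black "Fruit 2.1"]
[Result "1-0"]
[Termination "time forfeit"]

1. e4 e5 1-0

[Event "test"]
[White "Fruit 2.1"]
[Black "Crafty-22.2"]
[Result "1/2-1/2"]
[Termination "normal"]

1. d4 d5 1/2-1/2
'''

print(count_time_forfeits(sample))
```

For a real run, the text would come from reading the saved PGN file instead of the inline sample.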

BubbaTough · Post by **BubbaTough** » Sun Aug 24, 2008 2:43 am

But in the current time control, these matches have zero time forfeits...

sigh...the advantage of having enough hardware to run reasonable length test games I guess. Just another thing to be jealous of.

-Sam

bob · Post by **bob** » Sun Aug 24, 2008 2:58 am

BubbaTough wrote:
But in the current time control, these matches have zero time forfeits...
sigh...the advantage of having enough hardware to run reasonable length test games I guess. Just another thing to be jealous of.

-Sam

These are not all that long. 1+1 is 1 minute each plus 1 sec per move increment. So games are something like 4-5 minutes long, sometimes much less. But when playing 20,000 games, considering that there are 1440 minutes in a day, a big cluster helps.
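
The arithmetic behind that remark can be sketched as follows. The 20,000 games and 1440 minutes per day come from the post; the 4.5-minute average is an assumed midpoint of the "4-5 minutes" estimate, and the numbers of simultaneous games are hypothetical, since the cluster size is not given here.

```python
GAMES = 20_000              # from the post
MINUTES_PER_DAY = 24 * 60   # 1440, as stated in the post
AVG_GAME_MINUTES = 4.5      # assumption: midpoint of the 4-5 minute estimate

def days_needed(concurrent_games: int) -> float:
    """Wall-clock days to finish all games at a given level of parallelism."""
    total_minutes = GAMES * AVG_GAME_MINUTES
    return total_minutes / (concurrent_games * MINUTES_PER_DAY)

for n in (1, 16, 128):  # hypothetical numbers of simultaneous games
    print(f"{n:4d} concurrent games: {days_needed(n):6.2f} days")
```

Played one game at a time, the run would take about two months under these assumptions, which is the point about a big cluster helping.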

bob · Post by **bob** » Sun Aug 24, 2008 4:03 am

I just realized what you are saying. You do realize that each of these sets of games is just 4,000 games long? And looking at the results against one program is falling right back into the random result issue as in the past. Look at the error bars for each program (except for Crafty). Nothing looks unusual at all. I was simply pointing out that in the first two of these runs I made, G2.1 was doing somewhat worse.
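
The error-bar comparison can be made explicit with a small sketch. It assumes the `+` and `-` columns are the upper and lower margins around each Elo value, and it only compares the programs that appear in both tables. Ratings in the two runs are relative to different pools, so this is indicative rather than a proper significance test. The dictionary and function names are illustrative.

```python
# (elo, plus, minus) copied from the two tables in the first post.
run_a = {
    "Fruit 2.1": (68, 9, 9),
    "opponent-21.7": (20, 8, 8),
    "Glaurung 1.1 SMP": (14, 9, 9),
    "Crafty-22.2": (-34, 5, 5),
    "Arasan 10.0": (-184, 9, 9),
}
run_b = {
    "Fruit 2.1": (52, 11, 11),
    "opponent-21.7": (16, 11, 10),
    "Glaurung 1.1 SMP": (27, 11, 11),
    "Crafty-22.2": (-22, 5, 6),
    "Arasan 10.0": (-169, 11, 11),
}

def interval(entry):
    """Return (low, high), assuming '+' and '-' are upper and lower margins."""
    elo, plus, minus = entry
    return elo - minus, elo + plus

def overlaps(a, b):
    """True if the two rating intervals share at least one point."""
    lo_a, hi_a = interval(a)
    lo_b, hi_b = interval(b)
    return lo_a <= hi_b and lo_b <= hi_a

results = {name: overlaps(run_a[name], run_b[name]) for name in run_a}
for name, ok in results.items():
    print(f"{name:18s} {'overlap' if ok else 'no overlap'}")
```

Under these assumptions, the intervals of the four opponents overlap between the two runs, while Crafty's do not, which matches the exception noted in the post.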

bob · Post by **bob** » Sun Aug 24, 2008 7:59 am

I had forgotten about the polyglot.ini, and was using the 2.0 e5 .ini file. I am re-running the test with the polyglot.ini from 2.1 when running 2.1.

I will post new results tomorrow if nothing goes wrong tonight. Weather is a bit foul down here due to this tropical storm that is in the area, so we might lose power at some point.

Dirt · Post by **Dirt** » Sun Aug 24, 2008 11:29 am

bob wrote:I had forgotten about the polyglot.ini, and was using the 2.0 e5 .ini file. I am re-running the test with the polyglot.ini from 2.1 when running 2.1.

I will post new results tomorrow if nothing goes wrong tonight. Weather is a bit foul down here due to this tropical storm that is in the area, so we might lose power at some point.

What difference is there in the polyglot.ini files that would cause a substantial change?
