hgm wrote:Well, that more positions is better is to be expected. I don't know how many Nunn positions you used (was this the Nunn-10 or Nunn-25 set?), but the SD in going from 40 to 50 samples is of course only expected to drop by about 10%, and it is questionable whether that would be noticeable. Between 40 and 10 it would be a factor of 2.
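For reference, the arithmetic behind those figures can be checked with a minimal sketch. It assumes the games are independent samples, so the standard deviation of the mean scales as 1/sqrt(N); the function name `relative_sd` is made up for illustration.

```python
import math

def relative_sd(n_from, n_to):
    """Ratio of the SD of the mean with n_to samples to that with n_from
    samples, assuming independent samples (SD proportional to 1/sqrt(N))."""
    return math.sqrt(n_from / n_to)

# Going from 40 to 50 positions: SD shrinks to about 0.894 of its old
# value, i.e. a drop of roughly 10%.
print(round(relative_sd(40, 50), 3))

# Going from 10 to 40 positions: SD is halved, the factor of 2 mentioned above.
print(round(relative_sd(10, 40), 3))
```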
I used a test with 20 positions. Whether it started with 25 and I culled them or what, I don't know. I think Albert published a revised set a year or so ago with 50 positions rather than 40. I will probably switch to them when a good point comes up; I haven't yet, so that all of my comparisons and test results remain directly comparable...
It is indeed annoying that the positions are given as FEN, but as these are common openings it should of course be possible to find the move sequence that leads to them.
If any program is expected to play both sides reasonably, how do you explain the extreme deviations that occur in your data? Some individual positions don't seem to reflect the relative capabilities at all.
If you look at the positions, and look at suggested book lines (Crafty can give the set of book moves played in each position), you will notice that none are simply "won by the side on move". But many are equal positions where things are happening on opposite sides. One side is supposed to attack on the queenside, the other side is supposed to attack on the kingside. Some programs, mine included, don't get this right every time. And some are very passive (gnuchess comes to mind) and will lose both ends: it never attacks on the kingside, and it doesn't understand the requirement for counterplay on the queenside.
So there are plenty of positions about which two good humans would remark "the position is pretty equal, but it is unbalanced." Then it is up to the program to do something with it. I see two distinct cases regularly. (1) Crafty knows more about something than its opponent, which gives it an edge in those kinds of positions. Pawn majorities come to mind, or the classic trapped bishop problem at a2/h2/a7/h7. So it outplays its opponent based on that knowledge. (2) Crafty doesn't know a lot about a position, but the opponent has no clue at all and struggles to shuffle around, and Crafty slowly builds up enough pressure somewhere to break through.
Of course both of those work against Crafty just as easily when playing good opponents. But winning all or losing all doesn't necessarily mean the position is bad or one-sided; it could just mean the programs are one-sided in that particular position, and one of them has plenty of room for improvement.
I am still curious if small improvements that you observe are spread evenly over the positions, or concentrated in just a few.
I don't have the data at present, but in my next test I will save the 80-character result strings to see what happens. However, I suspect that we are going to see just the normal variability in the matches, with an occasional extra win here or there to account for the improved result margin. It will be easier to show once I can run again. We are still dead here, waiting on the maint. bozos to get the new A/C compressor installed, and for our sysadmin to get some new network cards for the ibrix filesystem that has been pretty unreliable.
I'll post the next batch, but it might be a few days based on what I have seen so far...