Questions for the Stockfish team

Discussion of chess software programming and technical issues.

Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Questions for the Stockfish team

Post by Milos »

Houdini wrote: I disagree.
Houdini 1.03's 15-20 Elo increase is entirely due to improved evaluation.
Miguel, I'm on your side :).

I disagree.
You gained some Elo, but mostly thanks to increased speed, and it's in the single-core version.
Still, Houdini 1.02 (single core) has less than a 10 Elo advantage over Robbo g3 x64, and it has a few minor things from recent Ivanhoes that bring some Elo (but under 10 Elo in total).
And Robbo g3 was quite a non-optimized compile.
1.03 brings a bit more, but that's, I presume, thanks to LMR in root, which also brought 10 Elo to Ivanhoe 54/55.
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Post by Houdini »

Milos wrote:
I disagree.
You gained some Elo, but mostly thanks to increased speed, and it's in the single-core version.
Still, Houdini 1.02 (single core) has less than a 10 Elo advantage over Robbo g3 x64, and it has a few minor things from recent Ivanhoes that bring some Elo (but under 10 Elo in total).
And Robbo g3 was quite a non-optimized compile.
1.03 brings a bit more, but that's, I presume, thanks to LMR in root, which also brought 10 Elo to Ivanhoe 54/55.

I was talking about the 15-20 Elo gain of Houdini 1.03 compared to Houdini 1.02; as I said, it comes entirely from changes in the evaluation function.
And contrary to what you presume, no version of Houdini has ever used root node LMR.

The amount of "presumptions" about Houdini on this forum is quite interesting, to say the least...

Robert
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Post by bob »

Daniel Shawul wrote: When you say random eval, do you mean tactics (material eval) + random score for positional terms, or just a random score for everything? In the latter case I can't see how it can reach 1800 Elo. All the engine will do is make random moves, throwing away material. It may try to avoid mates when it sees them through search, but how can it possibly think intelligently when it is fed garbage at the end?

No. I mean this:

int Evaluate() {
  return (random());
}

where random() returns a value between 0 and 100 (0 to 1 pawn).

I've explained how this works; Don Beal first discovered it in the early '90s, I think. It does work. Just ask the people who have been complaining about "skill 1" playing too strongly in Crafty...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Post by bob »

Tord Romstad wrote:
Daniel Shawul wrote: When you say random eval, do you mean tactics (material eval) + random score for positional terms, or just a random score for everything? In the latter case I can't see how it can reach 1800 Elo...
Yes, must be material + random score, unless Crafty is very different from Stockfish in this respect. I just tried a blitz game against Stockfish with random eval, and won very easily just by capturing undefended pieces all over the place. Of course it's impossible to judge the strength after just a single game, but I doubt it's even 800. It did play a little better than you would expect from a program with a completely random eval, though.

[Event "Test game"]
[Site "Oslo"]
[Date "2010.07.20"]
[Round "-"]
[White "tord"]
[Black "Stockfish 100720 64bit"]
[Result "1-0"]
[TimeControl "2+1"]

1. e4 c6 2. d4 d5 3. f3 dxe4 4. fxe4 Qa5+ 5. c3 Nf6 6. Bd3 e6 7. Nf3 Nbd7
8. e5 Qd5 9. exf6 a6 10. fxg7 Bxg7 11. O-O a5 12. Be3 Rg8 13. c4 Qh5 14.
Nc3 Be5 15. dxe5 Rg4 16. Ne4 h6 17. Nd6+ Kd8 18. Bd4 Rxg2+ 19. Kxg2 c5 20.
Bf2 Ra6 21. Bg3 Rxd6 22. exd6 b6 23. Qa4 Bb7 24. Rae1 Kc8 25. Be4 Kd8 26.
Bxb7 Ke8 27. Ne5 b5
{Black resigns} 1-0

No. With skill 1, the score you get from evaluate is this:

score = 0.01 * real_evaluation + .99 * random();

where random() returns a value between 0 and 100 (0 to 1 pawn).

With skill 1, the material/positional score is almost nothing, the remainder of the score is a pure random number.

1 0 is not the best test. That restricts the depth enough that random eval fails, but use something like 1+1 or 2+2 and watch what happens. Suddenly it won't hang material, and plays decent chess...
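
A minimal C sketch of the skill-1 blend described above. The skill-to-weight mapping (skill / 100), the helper names random_score, blended_score and noisy_evaluate, and the use of rand() are all assumptions made for illustration; this is not Crafty's actual source.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: pseudo-random score in [0, 100] centipawns.
 * rand() % 101 has a slight modulo bias, which is ignored here. */
static int random_score(void) {
    return rand() % 101;
}

/* Blend a real evaluation with a noise term.
 * Assumption: skill is a percentage, so skill 1 gives
 * 0.01 * real_eval + 0.99 * noise, matching the formula in the post. */
static double blended_score(int real_eval, int noise, int skill) {
    assert(skill >= 0 && skill <= 100);
    double w = skill / 100.0;
    return w * real_eval + (1.0 - w) * noise;
}

/* Evaluation as the search would call it: fresh noise on every call. */
static double noisy_evaluate(int real_eval, int skill) {
    return blended_score(real_eval, random_score(), skill);
}
```

With skill 1, a line that is a knight down (real evaluation -300) contributes only -3 to the blended score before the noise term is added.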
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Post by lkaufman »

bob wrote:
Tord Romstad wrote:.
No. with skill 1, what you get for a score from evaluate is this:

score = 0.01 * real_evaluation + .99 * random();

where random() returns a value between 0 and 100 (0 to 1 pawn).

With skill 1, the material/positional score is almost nothing, the remainder of the score is a pure random number.

1 0 is not the best test. That restricts the depth enough that random eval fails, but use something like 1+1 or 2+2 and watch what happens. Suddenly it won't hang material, and plays decent chess...

Two points: The game Tord cited was 2'+1", so he already followed your advice in advance. So there is some discrepancy between his findings and yours. The only obvious culprit is the 0.01 weight on real eval; it's not much, but maybe it biases things enough in favor of good moves to make the difference between 800 and 1800. Hard to believe, but you should play a version with zero weight for real eval against some weak program with a rating of maybe 1600 to see what happens. Do you have any other explanation for the huge difference between Crafty random and Stockfish random? With eval not an issue and with LMR and such turned off in Crafty, there can hardly be an explanation in the difference between the programs.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Post by Daniel Shawul »

I tested it and, as I said, it played terribly... not even close to 1000 Elo.
Before we argue further, I noticed two things in your implementation.

One is that you don't have a skill value of 0. Why? Even though it is very
small, one cannot completely disregard the effect of the small real
evaluation that is introduced. For example, with skill=1 you cannot get more than
3 sigma accuracy. Completely random means _completely_ random, not 0.01 of real evaluation,
bla bla...

Another: when the same position is evaluated twice, it gets a different evaluation
because you call random() again. This breaks the TT and eval cache. Is this part of
the trick that I (and probably some others) don't understand? For my test I used
(hash_key % 1000) and then changed to (hash_key % 100) after you suggested
that it should be like that, for reasons that are not obvious to me. Unless, of course,
you want to add it to the real evaluation somehow, which you just did. Scaling shouldn't matter
at all for the original statement you made, that is, I repeat, a completely random eval.
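
The position-keyed variant mentioned above can be sketched in C as follows. Deriving the score from the hash key makes the "random" evaluation repeatable for a given position, so transposition-table and eval-cache entries stay consistent. The function name and the range parameter are assumptions for illustration, not code from any engine discussed here.

```c
#include <assert.h>
#include <stdint.h>

/* Pseudo-random but repeatable score in [0, range - 1] centipawns.
 * The same hash_key always yields the same score, unlike calling
 * random() afresh on every visit to the position. */
static int random_eval_from_key(uint64_t hash_key, int range) {
    assert(range > 0);
    return (int)(hash_key % (uint64_t)range);
}
```

Calling random_eval_from_key(key, 100) corresponds to the (hash_key % 100) test, and passing 1000 corresponds to the earlier (hash_key % 1000) one.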

I can understand that there is a bit of sense in what Marco tried to explain vis-à-vis
approximate mobility evaluation. I accept that it is some weird way of approximating mobility.
Even at that I am not completely sure. If you take random numbers and get the maximum, you get one of
the extreme value distributions (say Gumbel). Now with that, how sure are you that
you get a larger random value if you take a sample of 20 or 15...
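
To put rough numbers on the 20-versus-15 question, here is a small C sketch under a strong simplifying assumption: every score is an independent, continuous, uniform draw, and only a single maximum is taken (no deeper search, no alpha-beta, no integer rounding). The function names are hypothetical.

```c
#include <assert.h>

/* Expected maximum of n independent draws, uniform on [0, range]:
 * range * n / (n + 1). */
static double expected_max_uniform(int n, double range) {
    assert(n > 0);
    return range * n / (n + 1.0);
}

/* Probability that the maximum of a draws exceeds the maximum of b draws,
 * all independent and identically distributed (continuous): the largest of
 * the a + b values is equally likely to be any one of them, so a / (a + b). */
static double prob_larger_max(int a, int b) {
    assert(a > 0 && b > 0);
    return (double)a / (a + b);
}
```

Under these assumptions, with scores on [0, 100], the expected maximum is 93.75 for 15 draws and about 95.24 for 20 draws, and the 20-draw side has the larger maximum with probability 20/35, about 0.57. That is a bias toward the side with more moves, not a guarantee.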

There must be some sensible input to get a sensible result. I accept that
(0.01 eval and a bit of mobility eval added by the search) are
possible improvements. The search doesn't amplify Elo if it doesn't add
something good, as in the mobility case (disguised at first). Otherwise it is
garbage in, garbage out.
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Post by Dann Corbit »

lkaufman wrote:
Two points: The game Tord cited was 2'+1", so he already followed your advice in advance. So there is some discrepancy between his findings and yours. The only obvious culprit is the 0.01 weight on real eval; it's not much, but maybe it biases things enough in favor of good moves to make the difference between 800 and 1800. Hard to believe, but you should play a version with zero weight for real eval against some weak program with a rating of maybe 1600 to see what happens. Do you have any other explanation for the huge difference between Crafty random and Stockfish random? With eval not an issue and with LMR and such turned off in Crafty, there can hardly be an explanation in the difference between the programs.

Something that is probably very relevant is whether the program is still able to recognize draws and wins, despite the awful eval. If (for instance) a program can see that chasing the opposing king around the mulberry bush with its queen will cause a repeated position resulting in a draw, this will collect quite a lot of draws. And if it can perfectly recognize a checkmate in 8 full moves, it will gain a lot of wins that way (even having checkmate recognized at all will result in a lot of wins, because 0.01 * 30000 = 300, so the path will be seen as a good one). If (on the other hand) the eval function collects no information at all, then I expect the performance to be far, far worse.

So, we have these two questions:
Can the crippled program see a draw?
Can the crippled program see a win?
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Post by lkaufman »

Houdini wrote:
The amount of "presumptions" about Houdini on this forum is quite interesting, to say the least...

Robert

I made the presumption that, since you multiply the final score before displaying it by a variable fraction of 0.5 or less, your motivation was to hide the similarity of the eval to Robbo and Rybka 3. I can't think of another reason to make the score meaningless to the human user; a clear pawn up in the opening is valued at 0.4 or less. Please correct me if you had another motivation for doing this; if I don't hear otherwise I'll assume this was correct.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Post by bob »

lkaufman wrote:
Two points: The game Tord cited was 2'+1", so he already followed your advice in advance. So there is some discrepancy between his findings and yours. The only obvious culprit is the 0.01 weight on real eval; it's not much, but maybe it biases things enough in favor of good moves to make the difference between 800 and 1800. Hard to believe, but you should play a version with zero weight for real eval against some weak program with a rating of maybe 1600 to see what happens. Do you have any other explanation for the huge difference between Crafty random and Stockfish random? With eval not an issue and with LMR and such turned off in Crafty, there can hardly be an explanation in the difference between the programs.

Somewhere in there I saw a 1+0. That was what I was basing my opinion on. I don't recall whether it was in the PGN or what; I will try to look back to see.

I just ran a test with pure random eval between 0 and 100 (0 and 1 pawn). No difference at all.

When you think about it, if you play NxN and the program should recapture, you get two choices:

-300 + random(), which will produce a score of -3 + random() at skill level 1. Or you can get 0 + random() when you recapture rather than leave the piece hanging. You really think that +3 matters? In any case, I ran with pure random and the thing plays pretty well. Not well enough that it will beat me, but it doesn't hang material (maybe an occasional pawn, rarely, but that's about it). It never seems to miss a recapture or hung material, most likely because of the random (Beal) effect.
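
As a rough check on how little that 3-centipawn offset does by itself, here is a C sketch of a single comparison between two blended leaf scores, ignoring the search entirely. It assumes both noise terms are independent and uniform on [0, range]; the function names are hypothetical.

```c
#include <assert.h>

/* P(U1 - U2 > d) for U1, U2 independent and uniform on [0, range],
 * with 0 <= d <= range. The difference has a triangular density,
 * which integrates to (1 - d / range)^2 / 2. */
static double prob_diff_exceeds(double d, double range) {
    assert(range > 0.0 && d >= 0.0 && d <= range);
    double t = 1.0 - d / range;
    return 0.5 * t * t;
}

/* Chance that the line that is material_deficit centipawns down still gets
 * the higher blended score:
 *   -weight * deficit + (1 - weight) * U1 > (1 - weight) * U2
 *   <=>  U1 - U2 > weight * deficit / (1 - weight). */
static double prob_piece_down_scores_higher(double material_deficit,
                                            double weight, double range) {
    assert(weight >= 0.0 && weight < 1.0);
    return prob_diff_exceeds(weight * material_deficit / (1.0 - weight), range);
}
```

For a 300-centipawn deficit with weight 0.01 and range 100, this gives about 0.47, against exactly 0.5 with weight 0, so in a single comparison the real-eval term shifts the odds by only about three percentage points.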

I don't see any difference to explain. My results are not based on my testing. We had a thread here a few weeks back about someone complaining that way back, skill 1 was around 800, but it had slowly moved up to almost 1800. I looked to see if I had broken the code, and anyone can look at the 23.2 source code to see what skill 1 will do. I simply verified this by hand. I don't have an easy way to test down on that end of the scale; the lowest-rated opponent I have on my cluster is Glaurung 1.something, and it is in the 2500 range. Testing something you want to be under 1,000 simply can't be done with that as the worst opponent.

Believe me, this is real, not imagined. I tried it myself when it first came up. I had _never_ tested with skill level 1 before, and had assumed that this was going to play beyond ugly. Amazingly it played reasonable-looking chess for the most part.

I now have a solution, but I still have the problem of testing on the low end. I now knock the NPS down so that by the time I get to skill 1, the thing is running about 1K nodes per second, which is on down there. Only question is what is the Elo? We are about ready to release this and let those guys test it and give me feedback...

I did go back and look, and I might well have looked at the 1 0 result and thought that it went with the time control entry on the next line... My laptop has a small font and my eyes are 62 years old now. :)