Houdini wrote: I disagree.
Houdini 1.03's 15-20 Elo increase is entirely due to improved evaluation.
Miguel, I'm on your side.
I disagree.
You gained some Elo, but mostly thanks to increased speed, and that is in the single-core version.
Still, Houdini 1.02 (single core) has less than a 10 Elo advantage over Robbo g3 x64, and it has a few minor things from recent Ivanhoes that bring some Elo (but under 10 Elo in total).
And Robbo g3 was quite a non-optimized compile.
1.03 brings a bit more, but that's, I presume, thanks to LMR at the root, which also brought 10 Elo to Ivanhoe 54/55.
I was talking about the 15-20 Elo gain of Houdini 1.03 compared to Houdini 1.02. Like I said, it comes entirely from changes in the evaluation function.
And contrary to what you presume, no version of Houdini has ever used root node LMR.
The amount of "presumptions" about Houdini on this forum is quite interesting, to say the least...
Daniel Shawul wrote: When you say random eval, do you mean tactics (material eval) + random score for positional terms OR just random score for all... In the latter case I can't see how it can reach 1800 Elo. All the engine will do is make random moves, throwing away material. It may try to avoid mates when it sees them through search, but how can it possibly think intelligently when it is fed garbage at the end?
No. I mean this:
int Evaluate() {
  return random();
}
where random() returns a value between 0 and 100 (0 to 1 pawn).
I've explained how this works. Don Beal first discovered this in the early '90s, I think. It does work. Just ask the people who have been complaining about "skill 1" playing too strongly in Crafty...
Daniel Shawul wrote: When you say random eval, do you mean tactics (material eval) + random score for positional terms OR just random score for all... In the latter case I can't see how it can reach 1800 Elo.
Yes, must be material + random score, unless Crafty is very different from Stockfish in this respect. I just tried a blitz game against Stockfish with random eval, and won very easily just by capturing undefended pieces all over the place. Of course it's impossible to judge the strength after just a single game, but I doubt it's even 800. It did play a little better than you would expect from a program with a completely random eval, though.
No. With skill 1, what you get for a score from evaluate is this:
score = 0.01 * real_evaluation + .99 * random();
where random() returns a value between 0 and 100 (0 to 1 pawn).
With skill 1, the material/positional score is almost nothing, the remainder of the score is a pure random number.
1 0 is not the best test. That restricts the depth enough that random eval fails, but use something like 1+1 or 2+2 and watch what happens. Suddenly it won't hang material, and plays decent chess...
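As a rough illustration of the blend quoted above, here is a minimal C sketch. The names skill1_score and random_0_100 are hypothetical, and rand() stands in for whatever generator the engine really uses; this is not Crafty's actual code.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the post's random(): an integer in 0..100,
   i.e. 0 to 1 pawn in centipawns. rand() is only an assumption here. */
int random_0_100(void) {
    return rand() % 101;
}

/* The skill-1 blend as written in the post:
   1% of the real evaluation plus 99% of the noise term. */
double skill1_score(int real_evaluation, int noise) {
    return 0.01 * real_evaluation + 0.99 * noise;
}
```

With this blend a full piece (300 centipawns) of real evaluation moves the score by only 3, while the noise term spans roughly 0 to 99.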
Two points: The game Tord cited was 2'+1", so he already followed your advice in advance. So there is some discrepancy between his findings and yours. The only obvious culprit is the 0.01 weight on real eval; it's not much, but maybe it biases things enough in favor of good moves to make the difference between 800 and 1800. Hard to believe, but you should play a version with zero weight for real eval against some weak program with a rating of maybe 1600 to see what happens. Do you have any other explanation for the huge difference between Crafty random and Stockfish random? With eval not an issue and with LMR and such turned off in Crafty, there can hardly be an explanation in the differences between the programs.
I tested it and, as I said, it played terribly... not even close to 1000 Elo.
Before we argue further, I noticed two things in your implementation. One is that you don't have a skill value of 0. Why? Even though it is very small, one cannot completely disregard the effect of the small real evaluation that is introduced. For example, with skill=1 you cannot get more than 3 sigma accuracy. Completely random means _completely_ random, not 0.01 of the real evaluation, blah blah...
The other is that when the same position is evaluated twice, it gets a different evaluation, because you call random() again. This breaks the TT and the eval cache. Is this part of the trick that I (and probably some others) don't understand? For my test I used (hash_key % 1000) and then changed to (hash_key % 100) after you suggested that it should be like that, for reasons that are not obvious to me. Unless, of course, you want to add it to the real evaluation somehow, which you just did. Scaling shouldn't matter at all for the original statement you made, which was, I repeat, a completely random eval.
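To make the hash-key idea concrete, here is a small C sketch of a random evaluation that always gives the same position the same score, so cached values stay consistent. The function names are hypothetical, and the bit-mixing step (the SplitMix64 finalizer) is an added assumption; the test described in the post used plain hash_key % 100.

```c
#include <stdint.h>

/* SplitMix64 finalizer: spreads the key's bits before the modulo.
   This mixing step is an assumption, not part of the original test. */
static uint64_t mix64(uint64_t z) {
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

/* Pseudo-random score in 0..99 that depends only on the position's
   hash key, so evaluating the same position twice gives one value. */
int keyed_random_eval(uint64_t hash_key) {
    return (int)(mix64(hash_key) % 100);
}
```

Because the score is a pure function of the key, a transposition table or eval cache entry for a position never disagrees with a fresh evaluation of it.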
I can understand that there is some sense in what Marco tried to explain vis-à-vis approximate mobility evaluation. I accept that it is some weird way of approximating mobility. Even then I am not completely sure. If you take random numbers and keep the maximum, you get one of the extreme value distributions (say Gumbel). Given that, how sure are you that you get a larger random value from a sample of 20 than from a sample of 15?
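One way to put a number on the 20-versus-15 question: for independent draws from a continuous distribution, the largest of all 35 values is equally likely to be any one of them, so the sample of 20 holds the overall maximum with probability 20/35, about 57%. The C sketch below, assuming uniform integer scores in 0..100 and using the hypothetical name expected_max, computes the exact expected maximum of n draws.

```c
/* Chance that all n independent draws from 0..100 (uniform integers)
   are <= k. */
static double all_at_most(int k, int n) {
    double p = (k + 1) / 101.0;
    double result = 1.0;
    for (int i = 0; i < n; i++) {
        result *= p;
    }
    return result;
}

/* Exact expected value of the maximum of n independent draws. */
double expected_max(int n) {
    double e = 0.0;
    for (int k = 1; k <= 100; k++) {
        /* P(max == k) = P(all <= k) - P(all <= k - 1) */
        e += k * (all_at_most(k, n) - all_at_most(k - 1, n));
    }
    return e;
}
```

For n = 1 this gives the plain mean of 50, and the value rises toward 100 as n grows, so the edge of 20 draws over 15 is real but modest.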
There must be some sensible input to get a sensible result. I accept that the 0.01 eval and the bit of mobility eval added by the search are possible improvements. The search doesn't amplify Elo if it doesn't add something good, as in the mobility case (disguised at first). Otherwise it is garbage in, garbage out.
Something that is probably very relevant is whether the program is still able to recognize draws and wins, despite the awful eval. If (for instance) a program can see that chasing the opposing king around the mulberry bush with its queen will cause a repeated position resulting in a draw, this will collect quite a lot of draws. And if it can perfectly recognize a checkmate in 8 full moves, it will gain a lot of wins that way (even having checkmate recognized at all will result in a lot of wins, because 0.01 * 30000 = 300, so the path will be seen as a good one). If (on the other hand) the eval function collects no information at all, then I expect the performance to be far, far worse.
So, we have these two questions:
Can the crippled program see a draw?
Can the crippled program see a win?
Houdini wrote:
The amount of "presumptions" about Houdini on this forum is quite interesting, to say the least...
Robert
I made the presumption that since you multiply the final score before displaying it by a variable fraction of 0.5 or less, your motivation was to hide the similarity of the eval to Robbo and Rybka 3. I can't think of another reason to make the score meaningless to the human user; a clear pawn up in the opening is valued at 0.4 or less. Please correct me if you had another motivation for doing this; if I don't hear otherwise I'll assume this was correct.
Somewhere in there I saw a 1+0. That was what I was basing my opinion on. Don't recall whether it was in the PGN or what, will try to look back to see.
I just ran a test with pure random eval between 0 and 100 (0 and 1 pawn). No difference at all.
When you think about it, if you play NxN and the program should recapture, you get two choices:
-300 + random(), which will produce a score of -3 + random() at skill level 1, or 0 + random() when you recapture rather than leave the piece hanging. Do you really think that +3 matters? In any case, I ran with pure random and the thing plays pretty well. Not well enough that it will beat me, but it doesn't hang material (maybe an occasional pawn, rarely, but that's about it). It never seems to miss a recapture or hung material, most likely because of the random (Beal) effect.
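The size of that 3-point bias at a single leaf can be counted exactly. The C sketch below enumerates every pair of noise values in 0..100 and counts how often one line outscores the other under the skill-1 blend. The function names are hypothetical, and it compares two isolated scores only; it does not model a search.

```c
/* Scores are kept in hundredths of a centipawn so the 0.01 / 0.99
   weights stay exact: 100 * score = real_eval + 99 * noise. */
static long skill1_score_x100(int real_eval, int noise) {
    return (long)real_eval + 99L * noise;
}

/* Number of (noise_a, noise_b) pairs, each in 0..100, for which the
   line with real evaluation real_a scores strictly higher than the
   line with real evaluation real_b. There are 101 * 101 = 10201 pairs. */
long count_a_beats_b(int real_a, int real_b) {
    long wins = 0;
    for (int na = 0; na <= 100; na++) {
        for (int nb = 0; nb <= 100; nb++) {
            if (skill1_score_x100(real_a, na) > skill1_score_x100(real_b, nb)) {
                wins++;
            }
        }
    }
    return wins;
}
```

Here count_a_beats_b(-300, 0) is 4753 of 10201 pairs (about 46.6%), against 5050 (about 49.5%) when both lines have the same real evaluation, so on a single comparison the material term shifts the odds by only about three percentage points.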
I don't see any difference to explain. My results are not based on my testing. We had a thread here a few weeks back about someone complaining that way back, skill 1 was around 800, but it had slowly moved up to almost 1800. I looked to see if I had broken the code, and anyone can look at the 23.2 source code to see what skill 1 will do. I simply verified this by hand. I don't have an easy way to test down on that end of the scale; the lowest-rated opponent I have on my cluster is glaurung 1.something, and it is in the 2500 range. Testing something you want to be under 1,000 simply can't be done with that as the worst opponent.
Believe me, this is real, not imagined. I tried it myself when it first came up. I had _never_ tested with skill level 1 before, and had assumed that this was going to play beyond ugly. Amazingly it played reasonable-looking chess for the most part.
I now have a solution, but I still have the problem of testing on the low end. I now knock the NPS down so that by the time I get to skill 1, the thing is running about 1K nodes per second, which is on down there. Only question is what is the Elo? We are about ready to release this and let those guys test it and give me feedback...
I did go back and look, and I might well have looked at the 1 0 result and thought that it went with the time control entry on the next line... My laptop has a small font and my eyes are 62 years old now.